Early Monday morning, around 12 a.m. PT, Amazon Web Services (AWS) experienced a major outage in its us-east-1 region, one of the busiest cloud regions in the world. The disruption rippled across the internet, affecting companies large and small.
At Block, many of our brands — including Square, Cash App, Afterpay, and TIDAL — depend on AWS to power core services. But we design our systems with the assumption that outages like this will happen. Over the years, our teams have invested heavily in diverse reliability strategies: multi-region architectures for global services like Square, and multi-Availability-Zone deployments in the us-west region for Cash App, chosen specifically for its stability and low failure correlation.
Designing for Resilience Across Block
Each Block brand builds for resilience in a way that fits its product and risk model.
Square operates its most critical workloads — including Payments, Point of Sale, Login, and Authentication — across multiple AWS regions. This architecture allows us to quickly redirect traffic away from a failing region while maintaining service continuity elsewhere. We regularly shift traffic between our active regions for various reasons (e.g., testing, maintenance, disaster exercises, mitigating smaller incidents). Additionally, if requests are failing in one region, they can often be retried automatically in the other, as the sketch below illustrates.
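As a minimal sketch of that per-request retry pattern, the Go snippet below tries a hypothetical secondary regional endpoint when the primary fails. The URLs, service name, and failover order are illustrative assumptions, not Block's actual configuration.

```go
// A minimal sketch of per-request regional failover. The endpoint URLs below
// are hypothetical placeholders, not Block's actual service configuration.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Hypothetical regional base URLs for the same service, in failover order.
var regions = []string{
	"https://payments.us-east-1.example.com", // primary
	"https://payments.us-west-2.example.com", // secondary
}

// getWithRegionalFailover tries each region in order and returns the first
// healthy response, so a request failing in one region is retried in the other.
func getWithRegionalFailover(client *http.Client, path string) (*http.Response, error) {
	var lastErr error
	for _, base := range regions {
		resp, err := client.Get(base + path)
		if err != nil {
			lastErr = err
			continue
		}
		if resp.StatusCode < 500 {
			return resp, nil // success, or a client error not worth retrying elsewhere
		}
		resp.Body.Close()
		lastErr = fmt.Errorf("region %s returned %d", base, resp.StatusCode)
	}
	return nil, fmt.Errorf("all regions failed, last error: %w", lastErr)
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	if _, err := getWithRegionalFailover(client, "/health"); err != nil {
		fmt.Println(err)
	}
}
```

In practice this kind of retry only makes sense for idempotent requests; non-idempotent operations need deduplication or idempotency keys before they can be safely replayed in another region.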
When AWS us-east-1 began experiencing issues, our monitoring systems detected failures quickly. For most key services, automated systems retried requests or seamlessly rerouted traffic to our secondary region. For services that required manual intervention, our on-call teams were notified immediately. Within half an hour, all traffic had been redirected and services had stabilized.
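The routing decision behind that kind of automated shift can be as simple as tracking per-region error rates and moving new traffic once a threshold is crossed. The sketch below illustrates the idea with assumed region names and an assumed threshold; it is not Block's actual routing or monitoring code.

```go
// A simplified sketch of error-rate-driven traffic shifting. Region names and
// the threshold are assumptions for illustration only.
package main

import (
	"fmt"
	"sync"
)

// regionHealth tracks recent request outcomes for one region.
type regionHealth struct {
	mu       sync.Mutex
	failures int
	total    int
}

// record notes whether a request against the region succeeded.
func (h *regionHealth) record(ok bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.total++
	if !ok {
		h.failures++
	}
}

// errorRate returns the fraction of recorded requests that failed.
func (h *regionHealth) errorRate() float64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.total == 0 {
		return 0
	}
	return float64(h.failures) / float64(h.total)
}

// pickRegion keeps traffic on the primary region until its error rate crosses
// the threshold, then shifts new requests to the secondary region.
func pickRegion(primary *regionHealth, threshold float64) string {
	if primary.errorRate() > threshold {
		return "us-west-2" // hypothetical secondary region
	}
	return "us-east-1" // hypothetical primary region
}

func main() {
	primary := &regionHealth{}
	// Simulate a burst of failures in the primary region.
	for i := 0; i < 10; i++ {
		primary.record(i%2 == 0) // every other request fails
	}
	fmt.Println("routing new traffic to:", pickRegion(primary, 0.25))
}
```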
Because we distribute traffic geographically, US-based workloads experienced minimal disruption. Sellers in Europe, whose transactions are primarily routed through us-east-1, saw higher impact during the initial phase — with up to 30% of transactions failing at peak — before recovery completed.
Cash App runs its services across multiple Availability Zones in the us-west region, which provides geographic diversity within a single region and has proven exceptionally reliable. Our teams closely monitored the event so they could respond immediately to any issues.
Incident Response and Areas of Impact
Our centralized reliability and incident-response teams coordinate major events across brands. When AWS us-east-1 began experiencing issues, shared monitoring quickly detected anomalies and alerted service owners. Square’s regional failover systems automatically shifted traffic, while Cash App, operating independently in us-west-2, was mostly unaffected.
Some non-core systems, such as phone support (which relied on AWS Connect) and certain cross-regional workflows, were impacted. These issues were detected, isolated, and mitigated within hours through established incident channels shared across Block engineering teams.
Not all systems were fully shielded from the outage. Some services still rely on single-region infrastructure:
- Phone support in North America was unavailable from 12:00 to 2:20 a.m. PT and again from 8:00 a.m. to 12:45 p.m. PT, due to dependencies on AWS Connect.
- Square refund processing was delayed because those workflows must execute in the same region as the original payment.
- A few background jobs experienced minor delays due to cross-regional dependencies, which are already being reviewed.
Square’s core payment services remained operational at all times, and sellers were able to continue transacting throughout the event. A few Cash App features were briefly impacted but recovered fully later.
Lessons Learned
Even with strong architecture, no system is fully immune to provider-level disruption.
Key lessons from this event apply across Block:
- Regional diversity works. Multi-region and multi-AZ strategies contained the blast radius and kept essential customer experiences online.
- Cross-brand coordination matters. Centralized monitoring and shared incident tooling allowed for faster communication and unified recovery tracking.
- Support systems need equal redundancy. Dependencies such as telephony, messaging, and vendor APIs must follow the same reliability standards as core products.
- Continuous testing pays off. Regular Gamedays and failover drills gave our teams the confidence to act quickly when a real event occurred.
Resilience as a Shared Responsibility
Reliability isn’t owned by one team or brand; it’s a shared discipline across Block. Our engineering groups learn from each other’s incidents, run cross-brand reviews, and continuously refine architectures and processes.
While AWS outages will always make headlines, this event reinforced that resilience comes from preparation, collaboration, and a culture of ownership. Thanks to that, when much of the internet was struggling, Block’s products — from Square to Cash App and beyond — stayed available for the people and businesses who rely on them.

