Cloud resilience refers to the ability of an application to resist or recover from disruptions, including those related to infrastructure, dependent services, misconfigurations, transient network issues, and load spikes. Cloud resilience also plays a critical role in an organization’s broader business resilience strategy, including the ability to meet digital sovereignty requirements. Customers need to know that their workloads in the cloud will continue to operate in the face of natural disasters, network disruptions, and disruptions due to geopolitical crises. Public sector organizations and customers in highly regulated industries rely on AWS to provide the highest level of resilience and security to help meet their needs. AWS protects millions of active customers worldwide across diverse industries and use cases, including large enterprises, startups, schools, and government agencies.
The AWS Global Cloud Infrastructure is designed to enable customers to build highly resilient workload architectures. AWS has made significant investments in building and running the world’s most resilient cloud by designing safeguards into our services and deployment mechanisms and by instilling resilience into our operational culture. We build to guard against outages and incidents, and we account for them in the design of AWS services, so that when disruptions do occur, their impact on customers and the continuity of services is as small as possible. To avoid single points of failure, we minimize interconnectedness within our global infrastructure. The AWS Global Cloud Infrastructure is geographically dispersed, spanning 105 Availability Zones (AZs) within 33 AWS Regions around the world.
Each Region consists of multiple Availability Zones, and each AZ includes one or more discrete data centers with independent and redundant power infrastructure, networking, and connectivity. Availability Zones in a Region are meaningfully distant from each other, up to 60 miles (approximately 100 km) apart, to help prevent correlated failures, but close enough to use synchronous replication with single-digit millisecond latency. AWS is the only cloud provider to offer three or more Availability Zones within each of its Regions, providing more redundancy and better isolation to contain issues. Common points of failure, such as generators and cooling equipment, aren’t shared across Availability Zones, and the Availability Zones are designed to be supplied by independent power substations. To better isolate issues and achieve high availability, customers can partition applications across multiple Availability Zones in the same Region. Learn more about how AWS maintains operational resilience and continuity of service.
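As a rough illustration of partitioning an application across Availability Zones, the sketch below spreads instances round-robin over the zones of one Region and then checks what remains if a single zone is lost. It does not call any AWS API; the function names, instance IDs, and zone names are hypothetical placeholders chosen for the example.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_ids, zones):
    """Assign each instance to a zone in round-robin order (illustrative only)."""
    if not zones:
        raise ValueError("at least one zone is required")
    # cycle() repeats the zone list so instances are distributed evenly.
    return {instance_id: zone for instance_id, zone in zip(instance_ids, cycle(zones))}


def surviving_instances(placement, failed_zone):
    """Return the instances that are not located in the failed zone."""
    return [i for i, zone in placement.items() if zone != failed_zone]


# Hypothetical identifiers, used only for this example.
zones = ["zone-a", "zone-b", "zone-c"]
instances = [f"app-{n}" for n in range(6)]

placement = spread_across_zones(instances, zones)
per_zone = Counter(placement.values())
remaining = surviving_instances(placement, "zone-b")
```

With six instances spread over three zones, the loss of any one zone removes only the instances placed there, so the remaining instances in the other zones can continue to serve traffic.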
Resilience is deeply ingrained in how we design services. At AWS, the services we build must meet extremely high availability targets. We think carefully about the dependencies that our systems take on. Our systems are designed to stay resilient even when those dependencies are impaired; we use a principle called static stability to achieve this level of resilience. This means that systems operate in a static state and continue to operate as normal, without needing to make changes, during a failure or when dependencies are unavailable. For example, in Amazon Elastic Compute Cloud (Amazon EC2), after an instance is launched, it’s just as available as a physical server in a data center. The same property holds for other AWS resources such as virtual private clouds (VPCs), Amazon Simple Storage Service (Amazon S3) buckets and objects, and Amazon Elastic Block Store (Amazon EBS) volumes. Learn more in our Fault Isolation Boundaries whitepaper.
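The idea of static stability can be sketched in a few lines. The example below is a minimal illustration, not how any AWS service is implemented: a component keeps serving its last known good state when a dependency it normally refreshes from becomes unavailable. The class name, the `fetch` callable, and the configuration keys are all hypothetical.

```python
class StaticallyStableConfig:
    """Holds the last known good configuration and keeps serving it on failure."""

    def __init__(self, fetch, initial):
        # `fetch` is a hypothetical callable that contacts a dependency and
        # returns a dict, or raises an exception if the dependency is impaired.
        self._fetch = fetch
        self._last_good = dict(initial)

    def refresh(self):
        """Try to update the state; on any failure, leave it unchanged."""
        try:
            self._last_good = dict(self._fetch())
            return True
        except Exception:
            # The dependency is unavailable: make no changes and keep operating.
            return False

    def get(self):
        """Return a copy of the current state; never touches the dependency."""
        return dict(self._last_good)


# Simulated dependency that can be switched off for the example.
dependency = {"available": True}


def fetch_config():
    if not dependency["available"]:
        raise ConnectionError("dependency unavailable")
    return {"backend": "pool-2"}


config = StaticallyStableConfig(fetch_config, {"backend": "pool-1"})
refreshed_while_up = config.refresh()
dependency["available"] = False
refreshed_while_down = config.refresh()
state_during_outage = config.get()
```

The design choice to note is that the read path (`get`) does not depend on the dependency at all, so an outage of that dependency changes nothing about how the component behaves; only updates are delayed.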