Resilience | AWS Cloud Operations Blog

AWS Unified Operations: Building Resilient Operations for Mission-Critical Workloads

Achieve Mission-Critical Resiliency at Scale with AWS Unified Operations: The Top Tier of AWS Support to Achieve High Availability, Faster Migrations, and Accelerated Incident Resolution. The Shift-Left Paradigm: From Reactive Firefighting to Proactive Prevention. Organizations running mission-critical workloads face three critical operational gaps that undermine resilience and slow cloud adoption. Skills gaps make cloud-native […]

Guide to AWS Cloud Resilience sessions at re:Invent 2025

If you’re attending AWS re:Invent with the goal of learning how to prevent costly downtime for your organization, you can look forward to more than 150 breakout sessions, workshops, chalk talks, builders’ sessions, and code talks that will help you improve the resilience of your critical applications. New this year, we’ll also be hosting two […]

Learn from AWS Fault Injection Service team’s approach to Game Days

In today’s digital world, availability and reliability are crucial competitive advantages. For DevOps and SRE teams, the ability to respond quickly and effectively to incidents can mean the difference between a minor issue and a major disruption of service that impacts millions of customers. Teams must have clear-cut runbooks and appropriate observability to be ready […]

Simulating partial failures with AWS Fault Injection Service

Modern distributed systems must be resilient to unexpected disruptions to maintain availability, performance, and stability. Chaos engineering helps teams uncover hidden weaknesses by deliberately injecting faults into a system and observing how it recovers. While traditional testing validates expected behavior, chaos engineering tests system resilience during failures. AWS Fault Injection Service (AWS FIS) is a […]

Maximizing Multi-Region Resilience with AWS Resilience Hub

In today’s fast-paced digital world, business continuity isn’t just a goal — it’s an achievable reality. As organizations continue to innovate and grow, their cloud-based applications have become the beating heart of modern business operations, delivering value to customers around the clock. Companies are taking their cloud strategy to the next level by embracing multi-Region […]

Scaling AWS Fault Injection Service Across Your Organization And Regions

In the first two parts of our series, we explored how to scale AWS Fault Injection Service (FIS) across AWS Organizations. Part one focused on implementing FIS in a single AWS account environment, introducing the concept of standardized IAM roles and Service Control Policies (SCPs) as guardrails for controlled chaos engineering experiments, particularly in centralized […]

Scaling AWS Fault Injection Service Across Your Organization And Accounts

Welcome to part two of our series where we focus on scaling AWS Fault Injection Service (FIS) within your organization. In part one, we learned how customers can enable individual accounts within organizations by introducing a Service Control Policy (SCP) rule to run network experiments when operating with a centralized networking infrastructure. In this blog, […]

Scaling AWS Fault Injection Service Across Your Organization Using Account Controls

AWS Fault Injection Service (FIS) empowers you to adopt chaos engineering at scale within your AWS environment. Chaos engineering injects real-world, controlled failures into a system to verify resilience and reliability, ultimately improving the customer experience. This proactive, resilience-focused approach increases your confidence in a system’s ability to respond to adverse conditions in production. You […]

New AWS Fault Injection Service recovery action for zonal autoshift

We’re excited to announce that AWS Fault Injection Service (FIS) now supports a recovery action for Amazon Application Recovery Controller (ARC) zonal autoshift. With this integration, you can perform more comprehensive testing by creating disruptive events and triggering a zonal autoshift as part of the same experiment. That way, you can observe how your application […]

Streamlining the Correction of Errors process using Amazon Bedrock

Generative AI can streamline the Correction of Errors process, saving time and resources. By combining large language models with the Correction of Errors process, businesses can expedite the identification and documentation of the cause of errors. Purpose and set-up: The purpose of this blog is […]
