AWS Cloud Operations Blog
Learn from AWS Fault Injection Service team’s approach to Game Days
In today’s digital world, availability and reliability are crucial competitive advantages. For DevOps and SRE teams, the ability to respond quickly and effectively to incidents can mean the difference between a minor issue and a major disruption of service that impacts millions of customers. Teams must have clear-cut runbooks and appropriate observability to be ready to respond to failure scenarios. How do you build the muscle memory needed to respond effectively under pressure? At AWS, we regularly conduct organization-wide game days that help verify the resilience of all our services. Our team, the AWS Fault Injection Service (FIS) team, takes this further by using our own service to run additional team-focused game days to strengthen our operational resilience. In this blog post, we’ll share our approach to running game days using FIS to help you implement some of these best practices for your organization.
Our approach to validating our own service through operational readiness exercises
FIS is a managed service for running controlled fault injection experiments. As the team behind FIS, we don’t just build the service—we, along with other AWS service teams, rely on it to test our systems, train our engineers, and improve our operational resilience. This practice, known as “dogfooding,” allows us to experience our products exactly as our customers do, while also strengthening our operational capabilities.
Our team runs game day exercises regularly as part of the AWS Well-Architected Framework reliability best practices, creating events in production-like environments to test systems, processes, and team responses using FIS. Our game days serve dual purposes: validating new features before launch and training on-call operators. Each feature release in FIS requires a game day exercise. We perform these game days to validate assumptions about dependency failures as well as our alarm settings and incident response procedures. With this proactive approach to operational preparedness, engineers gain confidence in handling actual incidents, runbooks are refined through practical use, and systems evolve to be more resilient against common failure modes.
A structured framework for using FIS to test and improve our incident response
Our game day approach follows a well-defined framework with standardized templates and processes. Standardizing helps to eliminate guesswork and ensure reliable, repeatable results so that our teams can focus on learnings rather than reinventing testing procedures. The high-level framework is shown below, and we will dive into each step.
![Overview of a framework with sequential steps for running game days. There are phases for preparation, game day, and post game day, before cycling back to preparation.](https://d2908q01vomqb2.cloudfront.net/972a67c48192728a34979d9a35164c1295401b71/2025/07/08/framework-diagram.png)
Figure 1: Framework for running game days
Game day preparation
Before each game day, we create a comprehensive test plan for each scenario, and keep track of each of these in an execution tracker. A sample of our execution tracker is shown below.

Sample game day tracker for illustrative purposes
To create the test plan, we first need to understand the failure scenarios to test. With distributed systems, there could be countless failure scenarios, ranging from simple component issues to complex multi-system events such as an Availability Zone (AZ) dependency impact. With limited time and resources, we prioritize our most critical services, and prioritize scenarios based on likelihood of occurrence and the severity of potential customer impact. Testing issues in key service dependencies and fault boundaries, and reviewing previous incidents for common failure patterns, are appropriate starting points. If you’re testing operational response, a runbook that has not been exercised in a while is also a good place to start.
For each scenario, we define the following (a sketch of how such a plan could be captured in code appears after this list):
- A clear purpose statement: A specific description of what we’re testing and why, such as “simulate a failure to reach one of FIS’s dependencies in us-east-1b and evaluate if alarms are triggered” or “test our response to a controlled and gated bug where a specific customer usage pattern triggers internal errors”
- Specific hypothesis: What we expect to happen during the fault injection. This should include the specific event or issue you’re testing, the expected impact on your services, and any expected alarms or tickets that should be generated.
- Event/issue details: The exact failure scenario being simulated. For feature validation, we focus on testing dependencies, runbooks, and alarming thresholds—essentially using game days as additional integration tests for our monitoring and alarming systems. For training operators, we design scenarios that build familiarity with possible failure modes and our response procedures.
- Expected effects: How we expect the application to respond, such as impacts on ongoing FIS experiments and API impact, as well as on operational response, such as alarms and tickets created.
- Relevant runbooks: Links to documentation operators are expected to consult during the game day.
- Success criteria: Specific outcomes that define a successful response, such as “issue mitigated within 15 minutes”
- Technical requirements: Environment, accounts, and resources needed. We do our testing in a pre-production environment that is representative of production so that we don’t impact customers.
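To keep test plans consistent from one scenario to the next, it can help to capture these fields in a structured record. The sketch below is a hypothetical illustration in Python; the field names and sample values are ours, not part of FIS:

```python
# Hypothetical sketch of a game day test plan as a structured record.
# Field names and sample values are illustrative, not part of FIS.
from dataclasses import dataclass
from typing import List


@dataclass
class GameDayScenario:
    purpose: str                   # what we're testing and why
    hypothesis: str                # expected impact, alarms, and tickets
    event_details: str             # the exact failure being simulated
    expected_effects: List[str]    # application and operational effects
    runbooks: List[str]            # documentation operators should consult
    success_criteria: List[str]    # e.g., "issue mitigated within 15 minutes"
    environment: str = "pre-production"  # representative, non-customer-impacting


plan = GameDayScenario(
    purpose="Simulate a failure to reach a dependency in one AZ and check that alarms trigger",
    hypothesis="Calls to the dependency fail in the impaired AZ and the error-rate alarm pages the operator",
    event_details="Network disruption between the service and one dependency in a single AZ",
    expected_effects=["Elevated dependency error rate", "High-severity ticket created"],
    runbooks=["<link to dependency-failure runbook>"],
    success_criteria=["Issue mitigated within 15 minutes"],
)
```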
This test plan is then reviewed with key stakeholders before scheduling the game day, assigning roles, and preparing a dedicated environment. We’ve found that clearly defined roles are essential for effective game days. In our process, we assign three distinct roles:
- Author: Responsible for writing the test plan and defining the scenario
- Runner: Executes the game day, performs necessary actions, and answers questions from the operator
- Operator: Receives tickets and works on resolution, similar to a real on-call situation, without prior knowledge of the test plan’s contents.
While the Author and Runner can be the same person, we recommend that the Operator role be filled by someone different—ideally someone who will participate in the on-call rotation but who did not directly work on the particular component being tested. This separation ensures the Operator experiences a realistic scenario and provides an opportunity to identify gaps in knowledge transfer and documentation. We treat these exercises as actual events: if a scenario would page an escalation manager, the actual on-call escalation manager is notified.
During the game day
On the day of the event, we take the following steps:
- Setup: The Runner prepares the environment and necessary resources
- Notification: The Runner notifies the participants
- Execution: The Runner executes the test plan, which includes multiple scenarios (see the sketch after this list)
- Observation and response: The Runner observes how the Operator navigates tools and ticket communications, follows runbooks, and makes decisions
- Documentation: The Runner documents the response in real time
- Verification: The Runner measures and compares the actual response against the initial hypothesis
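As one concrete illustration of the execution step, a Runner could start an experiment from an existing FIS experiment template and watch it until it reaches a terminal state. This is a minimal sketch using boto3; the template ID, Region, and tag values are placeholders:

```python
# Minimal sketch: start a fault injection experiment from a pre-reviewed
# FIS experiment template and poll it until it finishes.
# The template ID, Region, and tag values are placeholders.
import time

import boto3

fis = boto3.client("fis", region_name="us-east-1")

experiment = fis.start_experiment(
    experimentTemplateId="EXT_EXAMPLE123",            # placeholder template ID
    tags={"GameDay": "dependency-failure-exercise"},  # tag for tracking and cleanup
)["experiment"]

# Poll while the Operator works the scenario; stop when the experiment ends.
while True:
    state = fis.get_experiment(id=experiment["id"])["experiment"]["state"]
    print(f"Experiment {experiment['id']}: {state['status']}")
    if state["status"] in ("completed", "stopped", "failed"):
        break
    time.sleep(30)
```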
For new hires, these exercises provide invaluable hands-on experience before they take their first on-call shift. The controlled environment allows them to build muscle memory around tools and processes while identifying gaps in documentation that experienced team members might overlook.
Post-game day activities
After the event, we do the following:
- Document results, detailing for each scenario:
  - Summary of the event
  - Customer/user impact assessment
  - Root cause analysis
  - Mitigation steps taken
  - Analysis of metrics and alarms
  - Runbook effectiveness
  - Key learnings
  - Action items
- Clean up resources: Remove any test resources used during the exercise (a cleanup sketch follows this list)
- Create action items: Generate tickets for any identified gaps in monitoring, documentation, or procedures
- Measure confidence: For operator training, we collect confidence scores before and after the exercise on a scale of 1-10
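As an example of the cleanup step, the Runner might stop any experiment still in flight and delete a one-off experiment template. The sketch below uses boto3 with placeholder IDs; the actual resources to remove depend on what your scenarios create:

```python
# Hypothetical cleanup sketch: halt any in-flight experiment and remove the
# throwaway experiment template. IDs are placeholders.
import boto3

fis = boto3.client("fis", region_name="us-east-1")

experiment_id = "EXP_EXAMPLE456"   # placeholder experiment ID
template_id = "EXT_EXAMPLE123"     # placeholder template ID

status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
if status not in ("completed", "stopped", "failed"):
    fis.stop_experiment(id=experiment_id)       # stop any ongoing fault injection

fis.delete_experiment_template(id=template_id)  # remove the one-off template
```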
Lessons learned
Here are a few lessons we have learned along the way:
- The value is in both execution and analysis: Gaps in runbooks, monitoring, and procedures are discovered during the execution itself as operators actively work through the incident in real time. The post-execution analysis then helps us systematically understand what prevented us from meeting our hypotheses and identify monitoring improvements that could reduce mitigation time going forward. For example, after finding incomplete dependency monitoring, we created comprehensive dashboards showing client calls and responses for each dependency and updated our runbooks to review these dashboards during an incident. This change has reduced mitigation time by helping operators quickly identify whether an issue is internal or related to a dependency.
- Unexpected discoveries in runbooks: As services evolve, runbooks can quickly become outdated. Game days allow us to identify these discrepancies before outdated runbooks impact operational responses in production. During our exercises, we’ve found instances where runbooks referenced outdated log groups, contained incorrect commands, or missed critical steps.
- Measuring improvement in response time: One of the most tangible benefits we’ve observed is the dramatic reduction in on-call trainees’ incident response times. Through repeated practice, runbook improvements, and incorporating feedback from each exercise, we reduced response times from hours to minutes. The consistent practice allowed new operators to build familiarity with tools and procedures, while the post-execution analysis helped us streamline our response processes.
- Building operator confidence across experience levels: We measured operator confidence using a subjective “operator confidence score” on a scale of 1-10, collected before and after each game day. We observed an average 2.7 point improvement as operators gained hands-on experience with our systems under stress that shadowing alone was not able to provide. Importantly, this confidence-building benefit extends to all team members, including tenured and senior engineers. For experienced team members who might be prone to working from memory rather than following runbooks, game days provide a structured reminder of the importance of documented procedures.
- Validating alerting and communication procedures: By simulating failures in controlled environments, we’ve been able to verify that the right alerts trigger at the right thresholds and that the information provided is actionable (a sketch of one such check follows this list). We extend our game days to include the full escalation chain, including paging managers, to ensure our communication procedures work end-to-end. This has helped us identify gaps in our escalation runbooks, such as the need to clearly document the reason for escalation, quantify customer impact, assess risk to other customers, identify root cause, and suggest mitigations.
- Evolving hypotheses through iteration: Building effective hypotheses is an iterative process. Our initial assumptions about how systems would respond to specific failure modes can be incomplete or incorrect. By repeating failure scenarios, we’ve been able to refine our understanding of system behavior and improve our response procedures.
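For the alerting validation above, one simple check after an injection is to confirm that the expected CloudWatch alarms actually transitioned to the ALARM state. The sketch below is one way such a check might be scripted; the alarm names are placeholders for your own service alarms:

```python
# Hypothetical post-injection check: did the expected CloudWatch alarms fire?
# Alarm names are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

expected_alarms = ["dependency-error-rate-high", "api-p99-latency-high"]  # placeholders

response = cloudwatch.describe_alarms(AlarmNames=expected_alarms)
for alarm in response["MetricAlarms"]:
    fired = alarm["StateValue"] == "ALARM"
    print(f"{alarm['AlarmName']}: {'fired' if fired else 'did NOT fire'} "
          f"(last state change: {alarm['StateUpdatedTimestamp']})")
```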
Best practices for implementing game days in your organization
Based on our experience, we’ve developed a set of best practices that can help you implement effective game days in your organization:
- Start simple and scale: With hundreds of potential failure modes, start with simple ones, clear hypotheses, and a focus on learning. As your team gains experience, you can gradually increase the complexity and scope.
- Limit operator visibility into scope and plans: We recommend keeping operators uninformed about the specific scenarios they’ll face. While they should know that a game day is occurring, they shouldn’t know the details of what will fail or how. This approach creates a more realistic experience and better tests the operator’s ability to diagnose and respond to unexpected issues.
- Run frequently: The return on investment for game days improves dramatically when they’re run frequently. We’ve optimized our processes, including by using repeatable FIS experiment templates (see the sketch after this list), and over time we’ve reduced execution time from days to hours, enabling a weekly game day cadence.
- Measure and track improvement: To quantify the value of your program, establish metrics that track improvement over time. While some metrics may be subjective, they can still provide valuable insights when tracked consistently. Useful metrics include time to resolution or mitigation, operator confidence scores (before and after), the number of runbook gaps identified, and the number of monitoring improvements implemented.
- Prioritize business critical services: Focus on services where issues would have the greatest impact to maximize the value of your resilience testing program.
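To illustrate the repeatable-template point, the sketch below defines a reusable FIS experiment template with boto3. The IAM role, CloudWatch alarm, tags, and the chosen action are placeholders; the value of the template is that the same targets, actions, and stop conditions can be reused from one game day to the next:

```python
# Hypothetical sketch of a reusable FIS experiment template.
# Role ARN, alarm ARN, tags, and the chosen action are placeholders.
import boto3

fis = boto3.client("fis", region_name="us-east-1")

template = fis.create_experiment_template(
    description="Game day: stop half of the tagged test instances",
    roleArn="arn:aws:iam::111122223333:role/fis-game-day-role",  # placeholder role
    # Guardrail: stop the experiment automatically if this alarm fires.
    stopConditions=[{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:game-day-guardrail",
    }],
    targets={
        "test-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"GameDay": "true"},   # only pre-production test instances
            "selectionMode": "PERCENT(50)",
        }
    },
    actions={
        "stop-half-the-fleet": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "test-instances"},
        }
    },
    tags={"Team": "game-day"},
)["experimentTemplate"]

print(f"Created reusable template {template['id']}")
```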
FIS provides a platform for running controlled resilience experiments, but the true value comes from the process, analysis, and continuous improvement that follow.
Conclusion
By applying the learning outcomes and best practices we’ve shared, you can build a game day program that not only improves your system’s resilience but also enhances your team’s operational capabilities and confidence. You can create scenarios using FIS, and we recommend the FIS workshop if you need help getting started. The result will be more reliable systems, more confident operators, and ultimately, a better experience for your customers. These improvements create a virtuous cycle where each game day builds on the lessons from previous exercises, continuously enhancing operational excellence. While your specific implementation may differ based on your services and organizational structure, the presented framework provides a foundation for building a successful game day program.