AWS Public Sector Blog
Chaos engineering made clear: Generate AWS FIS experiments using natural language through Amazon Bedrock
This post explores how combining AWS Fault Injection Service (FIS) with the power of generative AI through Amazon Bedrock can streamline chaos engineering practices in the public sector. Using large language models (LLMs) to generate and customize application-specific AWS FIS experiments from natural language allows organizations to significantly lower the barrier to entry for chaos engineering while maintaining the rigorous testing standards needed for government workloads. This approach enables teams to more effectively validate their Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) while meeting compliance-mandated disaster recovery requirements.
The solution demonstrates how Amazon Bedrock interprets natural language descriptions of plausible failure scenarios for an application, and automatically generates the relevant AWS FIS experiment templates. This capability empowers teams across the organization, from DevOps engineers to compliance officers, to contribute to resilience testing without needing deep technical expertise in chaos engineering methodologies. Democratizing access to chaos engineering tools allows public sector organizations to build a more comprehensive and collaborative approach to system reliability.
As public sector organizations continue their digital transformation journeys, the ability to systematically verify application resilience becomes increasingly critical to maintaining public trust and ensuring continuous delivery of essential services. By using Amazon Bedrock to generate resilience tests for AWS FIS, organizations can implement chaos engineering practices that are realistic, accessible, and quick to set up, helping them gain confidence in their disaster recovery plans and demonstrate evidence of testing for compliance.
Solution overview
Chaos engineering isn’t just about improving reliability. It’s also essential for generating the testing evidence required by compliance and regulatory mandates. Organizations should be able to continually prove that their systems preserve data integrity and maintain service availability during failures. AWS FIS is a valuable tool for proactively testing data preservation and recovery methods, such as automated failover mechanisms.
Amazon Bedrock is a fully managed service that provides access to leading foundation models (FMs) through a single API. For this solution, we use LLMs through Amazon Bedrock to turn natural language descriptions of failure scenarios into AWS FIS experiment templates. These templates, which are JSON documents, define the application-specific chaos engineering scenarios. This approach streamlines the creation of AWS FIS experiments while maintaining precise control over the testing parameters and safety guardrails.
To demonstrate the power and accessibility of this approach, we use the Amazon Bedrock playground, a web-based interface that allows immediate experimentation with FMs without requiring additional infrastructure or application development. This enables organizations to quickly prototype and validate AWS FIS experiments before implementing them in production environments, which in turn significantly reduces the time and resources needed to begin chaos engineering practices. To start, it helps to understand the composition of an AWS FIS experiment:
- Actions run specific AWS service disruptions (such as stopping Amazon Elastic Compute Cloud (Amazon EC2) instances or rebooting Amazon Relational Database Service (Amazon RDS) clusters) that can be executed either in sequence or simultaneously to replicate the symptoms of real-world failures.
- Targets define which AWS resources (identified by Amazon Resource Names (ARNs) or tags) are affected by the experiment’s actions, such as specific EC2 instances, Amazon Elastic Container Service (Amazon ECS) tasks, or Amazon RDS databases within an application.
- Stop conditions use Amazon CloudWatch alarms to automatically roll back experiments when performance thresholds are breached, serving as a critical safety mechanism to prevent unintended consequences.
- Experiment role is an AWS Identity and Access Management (IAM) role that follows the principle of least privilege to grant AWS FIS the specific permissions needed to execute actions on your behalf.
- Experiment report configuration enables detailed documentation of test procedures, metrics, and outcomes, providing valuable data for analysis and compliance requirements.
- Experiment options allow customization of template parameters and execution behavior, such as the ability to define whether experiments run across single or multiple AWS accounts.
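To make these components concrete, the following is a minimal sketch (not part of the original walkthrough) of how they map onto the FIS CreateExperimentTemplate API using the AWS SDK for Python (boto3). Every ARN, tag, and resource selection shown is an illustrative placeholder, not a recommendation for a real workload.

import boto3

fis = boto3.client("fis")

# Minimal mapping of the components described above onto CreateExperimentTemplate.
# Every ARN and tag value below is a placeholder -- substitute your own resources.
response = fis.create_experiment_template(
    description="Reboot one tagged EC2 instance to rehearse an instance-level failure",
    # Experiment role: least-privilege IAM role that FIS assumes to run the actions
    roleArn="arn:aws:iam::111122223333:role/FISExperimentRole",
    # Targets: which resources the actions affect, selected here by tag
    targets={
        "webServers": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"env": "test"},
            "selectionMode": "COUNT(1)",
        }
    },
    # Actions: the specific disruption to run against the targets
    actions={
        "rebootInstances": {
            "actionId": "aws:ec2:reboot-instances",
            "targets": {"Instances": "webServers"},
        }
    },
    # Stop condition: CloudWatch alarm that halts the experiment if breached
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:app-error-rate-high",
        }
    ],
    tags={"Name": "RebootWebServerSketch"},
)
print(response["experimentTemplate"]["id"])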
With the structure and purpose of an AWS FIS experiment understood, we can move on to the generative AI part of the solution. The solution uses a system prompt, which is a set of initial instructions that guide the model’s behavior and output format. The system prompt includes detailed specifications about the AWS FIS experiment structure, valid AWS FIS actions, and necessary safety parameters such as stop conditions. This helps ensure that the generated experiments conform to AWS best practices and include all the necessary components, such as IAM role parameters.
Many of the cutting-edge FMs available in Amazon Bedrock can generate valid AWS FIS experiments. Users should select the model that best suits their specific needs based on factors such as response length, accuracy, and processing time. We are using Anthropic Claude 3.7 Sonnet, which at the time of writing excels at understanding complex system architectures and generating detailed AWS FIS experiment templates.
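If you want to check programmatically which Anthropic models are available in your Region before choosing one, a short boto3 sketch (an optional convenience; the console Model access page shows the same information):

import boto3

# List the Anthropic foundation models visible in the current Region --
# the SDK view of what the console Model access page displays.
bedrock = boto3.client("bedrock")
for model in bedrock.list_foundation_models(byProvider="Anthropic")["modelSummaries"]:
    print(model["modelId"], "-", model["modelName"])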
The first part of the system prompt is the background. The model is instructed to be an expert in AWS cloud architecture with a deep understanding of systems and their failure modes. It sets the stage for analyzing the complex architecture from which the AWS FIS experiments are generated. The opening instructions are shown as follows:
You are an expert in building large, complex systems on the AWS cloud with a focus on operational excellence and high availability. You are aware of how large distributed systems fail, and how to mitigate and address such failures. You are also knowledgeable in the details of how such systems are built - databases, storage systems, web servers, application servers, language runtimes, caching, load balancing and other components. The goal is to generate an experiment that meets stringent compliance requirements in both the private and public sector.
Carefully study the description of a specific application architecture between the <arch> </arch> tags, and come up with suggestions for AWS Fault Injection Service (FIS) experiments specifically applicable to this application architecture. Your output should consist of two parts - part 1 should be a concise description of the kinds of FIS experiments that can be run against this architecture, part 2 should be machine-readable (runnable) JSON template for these experiments. Once the experiment is generated, supply an overview of the experiment that includes whether a stop condition exists, where the end user must define specific instance, database, storage, or other necessary names, and the role requirements to run the experiment.
To maintain strict control over the generated experiments, the prompt instructs the model to use only existing, valid FIS actions (you can retrieve the current list with the AWS CLI or SDK, as shown in the sketch after the list):
<valid_fis_actions>
aws:cloudwatch:assert-alarm-state
aws:dynamodb:global-table-pause-replication
aws:ebs:pause-volume-io
aws:ec2:api-insufficient-instance-capacity-error
aws:ec2:asg-insufficient-instance-capacity-error
aws:ec2:reboot-instances
aws:ec2:send-spot-instance-interruptions
aws:ec2:stop-instances
aws:ec2:terminate-instances
aws:ecs:drain-container-instances
aws:ecs:stop-task
aws:ecs:task-cpu-stress
aws:ecs:task-io-stress
aws:ecs:task-kill-process
aws:ecs:task-network-blackhole-port
aws:ecs:task-network-latency
aws:ecs:task-network-packet-loss
aws:eks:inject-kubernetes-custom-resource
aws:eks:pod-cpu-stress
aws:eks:pod-delete
aws:eks:pod-io-stress
aws:eks:pod-memory-stress
aws:eks:pod-network-blackhole-port
aws:eks:pod-network-latency
aws:eks:pod-network-packet-loss
aws:eks:terminate-nodegroup-instances
aws:elasticache:replicationgroup-interrupt-az-power
aws:fis:inject-api-internal-error
aws:fis:inject-api-throttle-error
aws:fis:inject-api-unavailable-error
aws:fis:wait
aws:lambda:invocation-add-delay
aws:lambda:invocation-error
aws:lambda:invocation-http-integration-response
aws:network:disrupt-connectivity
aws:network:route-table-disrupt-cross-region-connectivity
aws:network:transit-gateway-disrupt-cross-region-connectivity
aws:rds:failover-db-cluster
aws:rds:reboot-db-instances
aws:s3:bucket-pause-replication
aws:ssm:send-command
aws:ssm:start-automation-execution
</valid_fis_actions>
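Because this catalog evolves over time, you can regenerate the list yourself. The AWS CLI command is aws fis list-actions; the boto3 equivalent is the short sketch below, which follows pagination in case the catalog grows beyond one response page.

import boto3

# Retrieve the current catalog of FIS action IDs
# (the SDK equivalent of `aws fis list-actions`).
fis = boto3.client("fis")
action_ids, token = [], None
while True:
    page = fis.list_actions(**({"nextToken": token} if token else {}))
    action_ids.extend(action["id"] for action in page["actions"])
    token = page.get("nextToken")
    if not token:
        break

print("\n".join(sorted(action_ids)))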
The next part of the system prompt is referred to as few-shot prompting. Few-shot prompting provides examples of the desired results within the system prompt, helping the model produce valid formatting, respect the guardrails, and make accurate predictions. In our case, we use two example AWS FIS experiments taken directly from the AWS documentation:
Example 1:
{
"tags": {
"Name": "StopEC2InstancesWithFilters"
},
"description": "Stop and restart all instances in us-east-1b with the tag env=prod
in the specified VPC",
"targets": {
"myInstances": {
"resourceType": "aws:ec2:instance",
"resourceTags": {
"env": "prod"
},
"filters": [
{
"path": "Placement.AvailabilityZone",
"values": ["us-east-1b"]
},
{
"path": "State.Name",
"values": ["running"]
},
{
"path": "VpcId",
"values": [ "vpc-aabbcc11223344556"]
}
],
"selectionMode": "ALL"
}
},
"actions": {
"StopInstances": {
"actionId": "aws:ec2:stop-instances",
"description": "stop the instances",
"parameters": {
"startInstancesAfterDuration": "PT2M"
},
"targets": {
"Instances": "myInstances"
}
}
},
"stopConditions": [
{
"source": "aws:cloudwatch:alarm",
"value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:alarm-name"
}
],
"roleArn": "arn:aws:iam::111122223333:role/role-name"
}
Example 2:
{
"description": "ChaosMesh StressChaos example",
"targets": {
"Cluster-Target-1": {
"resourceType": "aws:eks:cluster",
"resourceArns": [
"arn:aws:eks:arn:aws::111122223333:cluster/cluster-id"
],
"selectionMode": "ALL"
}
},
"actions": {
"TestCPUStress": {
"actionId": "aws:eks:inject-kubernetes-custom-resource",
"parameters": {
"maxDuration": "PT2M",
"kubernetesApiVersion": "chaos-mesh.org/v1alpha1",
"kubernetesKind": "StressChaos",
"kubernetesNamespace": "default",
"kubernetesSpec": "{\"selector\":{\"namespaces\":[\"default\"],
\"labelSelectors\":{\"run\":\"nginx\"}},\"mode\":\"all\",\"stressors\": {\"cpu\":
{\"workers\":1,\"load\":50}},\"duration\":\"1m\"}"
},
"targets": {
"Cluster": "Cluster-Target-1"
}
}
},
"stopConditions": [{
"source": "none"
}],
"roleArn": "arn:aws:iam::111122223333:role/role-name",
"tags": {}
}
Combine all parts of the prompt (the background, the list of valid FIS actions, and the few-shot examples) to create the complete system prompt. This prompt is used during the generation of the AWS FIS experiments.
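As a sketch of what combining the parts might look like in practice (the file names are hypothetical; store the snippets however suits your workflow), the pieces are simply concatenated in order: background, valid actions, then the few-shot examples.

from pathlib import Path

# Assemble the complete system prompt from its three parts and save it for reuse.
# The file names are hypothetical placeholders.
parts = [
    Path("prompt_background.txt").read_text(),          # expert persona and output instructions
    Path("prompt_valid_fis_actions.txt").read_text(),   # the <valid_fis_actions> block
    Path("prompt_few_shot_examples.txt").read_text(),   # Example 1 and Example 2 templates
]
system_prompt = "\n\n".join(parts)
Path("system_prompt.txt").write_text(system_prompt)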
Generate the experiments
To run the solution:
- Navigate to Amazon Bedrock within your desired AWS Region.
- In the left navigation pane, scroll down to Model access under Bedrock configurations.
- Choose Enable specific models.
- Enable your desired model. For this post, enable Anthropic Claude 3.7 Sonnet. Choose Next and Submit.
- Choose Chat / Text under Playgrounds.
- Choose Select model and choose Anthropic. Choose Anthropic Claude 3.7 Sonnet and choose Apply.
- Paste the fully combined system prompt into the System prompts text box.
- In the main prompt text box, define an example architecture. In your own environment, this is where you would describe your existing architecture to get a more personalized AWS FIS experiment. An example architecture is defined as follows:
Generate three AWS FIS experiments based on the following AWS architecture:
<arch> We are deploying the WordPress Content Management system on AWS following all applicable best practices. These include (but are not limited to) ensuring high availability by deploying the 3x application and web servers across multiple availability zones in a single region, deploying the RDS MYSQL database using a primary and secondary for failover in different availability zones, using the EFS service to store unchanging/static application-related files on an NFS filesystem that is mounted on the application and web servers, having a regular backup schedule for the database, deploying caching using the ElastiCache Redis service, using an Application Load Balancer to front the deployment and so on. </arch>
- Choose Run.
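The playground is the quickest way to experiment, but the same generation can be scripted. The following is a minimal sketch using the Amazon Bedrock Converse API through boto3; it assumes the system prompt assembled in the earlier sketch, and the model identifier shown is an assumption because the exact Claude 3.7 Sonnet model ID (or cross-Region inference profile) varies by Region and account.

import boto3
from pathlib import Path

bedrock_runtime = boto3.client("bedrock-runtime")

# Assumed model identifier -- confirm the Claude 3.7 Sonnet model ID or
# inference profile available in your Region and account.
MODEL_ID = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

system_prompt = Path("system_prompt.txt").read_text()   # combined prompt from the earlier sketch

architecture_prompt = (
    "Generate three AWS FIS experiments based on the following AWS architecture:\n"
    # Abbreviated here -- use the full <arch> description shown above.
    "<arch> We are deploying the WordPress Content Management system on AWS ... </arch>"
)

response = bedrock_runtime.converse(
    modelId=MODEL_ID,
    system=[{"text": system_prompt}],
    messages=[{"role": "user", "content": [{"text": architecture_prompt}]}],
    inferenceConfig={"maxTokens": 4096, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])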
Review the results
As with all generative AI solutions, results may vary. Responses should look similar to the following output, generated directly from the model, system prompt, and text prompt used throughout this post:
Part 1: Experiment Descriptions
Based on the provided WordPress architecture, I recommend the following FIS experiments to test the resilience of your system:
Experiment 1: RDS MYSQL Failover Resilience Test
This experiment will force a failover of your RDS MYSQL Multi-AZ deployment to verify that WordPress continues functioning with minimal disruption. It tests the application's ability to handle database failovers gracefully.
Experiment 2: Web Server Availability Zone Failure Simulation
This experiment simulates the failure of web servers in a single availability zone by stopping them. It verifies that your Application Load Balancer correctly distributes traffic to healthy instances in the remaining availability zones, maintaining WordPress availability.
Experiment 3: ElastiCache Redis Degradation Test
This experiment introduces network latency between your application tier and the ElastiCache Redis cluster to test how WordPress performs when caching becomes degraded but not completely unavailable.
Part 2: FIS Experiment Templates
Experiment 1: RDS MYSQL Failover Resilience Test
{
"description": "Force RDS MYSQL failover to test WordPress application resilience",
"targets": {
"dbInstances": {
"resourceType": "aws:rds:db",
"resourceArns": [
"arn:aws:rds:us-east-1:111122223333:db:wordpress-primary-db"
],
"selectionMode": "ALL"
}
},
"actions": {
"rebootDBInstance": {
"actionId": "aws:rds:reboot-db-instances",
"description": "Reboot the primary MYSQL instance to trigger a failover",
"parameters": {
"forceFailover": "true"
},
"targets": {
"DBInstances": "dbInstances"
}
}
},
"stopConditions": [
{
"source": "aws:cloudwatch:alarm",
"value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:wordpress-error-rate-high"
}
],
"roleArn": "arn:aws:iam::111122223333:role/FISExperimentRole",
"tags": {
"Name": "RDSFailoverResilienceTest",
"Application": "WordPress"
}
}
Experiment 2: Web Server Availability Zone Failure Simulation
{
"description": "Simulate failure of all WordPress web servers in a single availability zone",
"targets": {
"ec2WebServers": {
"resourceType": "aws:ec2:instance",
"filters": [
{
"path": "Placement.AvailabilityZone",
"values": ["us-east-1a"]
},
{
"path": "State.Name",
"values": ["running"]
},
{
"path": "Tags.Role",
"values": ["wordpress-webserver"]
}
],
"selectionMode": "ALL"
}
},
"actions": {
"stopWebServers": {
"actionId": "aws:ec2:stop-instances",
"description": "Stop all WordPress web servers in us-east-1a to simulate AZ failure",
"parameters": {
"startInstancesAfterDuration": "PT5M"
},
"targets": {
"Instances": "ec2WebServers"
}
}
},
"stopConditions": [
{
"source": "aws:cloudwatch:alarm",
"value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:wordpress-availability-low"
}
],
"roleArn": "arn:aws:iam::111122223333:role/FISExperimentRole",
"tags": {
"Name": "WebServerAZFailureSimulation",
"Application": "WordPress"
}
}
Experiment 3: ElastiCache Redis Degradation Test
{
"description": "Introduce network latency between WordPress app servers and ElastiCache Redis",
"targets": {
"ec2AppServers": {
"resourceType": "aws:ec2:instance",
"filters": [
{
"path": "State.Name",
"values": ["running"]
},
{
"path": "Tags.Role",
"values": ["wordpress-appserver"]
}
],
"selectionMode": "COUNT(1)"
}
},
"actions": {
"addNetworkLatencyToRedis": {
"actionId": "aws:ssm:send-command",
"description": "Add network latency to ElastiCache Redis endpoints",
"parameters": {
"documentArn": "arn:aws:ssm:us-east-1::document/AWS-RunShellScript",
"documentParameters": "{\"commands\":[\"sudo tc qdisc add dev eth0 root handle 1: prio\",\"sudo tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 150ms 50ms distribution normal\",\"sudo tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst your-elasticache-endpoint/32 flowid 1:3\",\"echo \\\"Sleeping for 5 minutes\\\"\",\"sleep 300\",\"sudo tc qdisc del dev eth0 root\"]}"
},
"targets": {
"Instances": "ec2AppServers"
}
}
},
"stopConditions": [
{
"source": "aws:cloudwatch:alarm",
"value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:wordpress-response-time-high"
}
],
"roleArn": "arn:aws:iam::111122223333:role/FISExperimentRole",
"tags": {
"Name": "ElastiCacheLatencyTest",
"Application": "WordPress"
}
}
Experiment Overview
Experiment 1: RDS MYSQL Failover Resilience Test
Stop Condition: Will stop if WordPress error rate exceeds a threshold defined by the CloudWatch alarm.
Required Replacements:
Replace wordpress-primary-db with your actual RDS MYSQL instance identifier
Replace 111122223333 with your AWS account number
Replace the CloudWatch alarm ARN with your actual alarm ARN
Role Requirements: The role must have permissions to perform rds:RebootDBInstance operations on the specified RDS instances
etc...
Running AWS FIS experiments against your application architecture
When you have the Amazon Bedrock-generated AWS FIS experiment templates, you must edit them to add the specific resource ARNs or tags, and the stop conditions, that match your architecture and CloudWatch alarm setup.
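One way to apply those substitutions is with a small script. The following sketch is illustrative only; the file name and every replacement value are hypothetical placeholders for your environment.

import json

# Apply the replacements called out in the experiment overview.
# The file name and all replacement values are hypothetical.
template_text = open("rds-failover-experiment.json").read()   # raw template pasted from the playground

replacements = {
    "111122223333": "123456789012",                              # your AWS account ID
    "wordpress-primary-db": "prod-wordpress-db",                 # your RDS instance identifier
    "wordpress-error-rate-high": "prod-wordpress-error-alarm",   # your CloudWatch alarm name
    "FISExperimentRole": "MyFISExperimentRole",                  # your experiment IAM role name
}
for placeholder, value in replacements.items():
    template_text = template_text.replace(placeholder, value)

json.loads(template_text)   # fails fast if the edited template is no longer valid JSON
open("rds-failover-experiment.json", "w").write(template_text)

With the template updated, you can run the experiments through either the AWS FIS console or the API: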
- Within the console, navigate to AWS FIS and choose Experiment templates.
- Choose Actions, then Import experiment template. Copy and paste one of the experiments generated from the prompt.
- Select Import experiment template and run the experiment.
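Alternatively, for the API route, the following is a minimal boto3 sketch that registers the edited template (using the hypothetical file name from the previous sketch) and starts the experiment.

import json
import boto3

fis = boto3.client("fis")

# Load a generated template that has already been edited with your own
# resource ARNs or tags, CloudWatch alarm ARN, and experiment role ARN.
with open("rds-failover-experiment.json") as f:
    template = json.load(f)

# Register the template with AWS FIS.
created = fis.create_experiment_template(
    description=template["description"],
    targets=template["targets"],
    actions=template["actions"],
    stopConditions=template["stopConditions"],
    roleArn=template["roleArn"],
    tags=template.get("tags", {}),
)
template_id = created["experimentTemplate"]["id"]

# Start the experiment and report its initial state.
experiment = fis.start_experiment(experimentTemplateId=template_id)
print(experiment["experiment"]["id"], experiment["experiment"]["state"]["status"])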
Conclusion
In this post, we demonstrated how to use the generative AI capabilities of Amazon Bedrock to streamline the creation of AWS FIS experiments. Combining natural language processing with chaos engineering allows teams to generate application-specific AWS FIS experiments without deep technical expertise while maintaining compliance requirements and safety guardrails. The solution uses a system prompt to produce valid AWS FIS experiment templates that include proper stop conditions, resource requirements, and IAM permissions. Organizations can get started by using our template, describing their architecture, and generating experiments through Amazon Bedrock. This approach democratizes chaos engineering tools, enabling teams to efficiently validate their disaster recovery capabilities and build confidence in their system’s resilience.
For more information, visit the AWS FIS documentation and Amazon Bedrock developer guide.