AWS Cloud Operations Blog
Category: Developer Tools
Simulating partial failures with AWS Fault Injection Service
Modern distributed systems must be resilient to unexpected disruptions to maintain availability, performance, and stability. Chaos engineering helps teams uncover hidden weaknesses by deliberately injecting faults into a system and observing how it recovers. While traditional testing validates expected behavior, chaos engineering tests system resilience during failures. AWS Fault Injection Service (AWS FIS) is a […]
Best practices for utilizing AWS Systems Manager with AWS Fault Injection Service
In today’s cloud-centric world, ensuring the resilience of mission-critical applications is paramount. The ability to withstand and recover from unexpected failures, including degradation of cloud provider services, can mean the difference between seamless operation and costly downtime. This is where the powerful combination of AWS Systems Manager (SSM) and AWS Fault Injection Service (AWS […]
Tracing ETL Workloads using AWS X-Ray and AWS Distro for OpenTelemetry
Data pipelines are essential for modern data-driven companies to gain critical business insights. However, data pipelines commonly fail when new files or datasets from data sources do not conform to the expected schema, leading to downstream job failures, workflow breakdowns, and delayed insights. Additionally, fluctuating data volumes, from a few gigabytes to multiple terabytes, […]
Scaling AWS Fault Injection Service Across Your Organization And Regions
In the first two parts of our series, we explored how to scale AWS Fault Injection Service (FIS) across AWS Organizations. Part one focused on implementing FIS in a single AWS account environment, introducing the concept of standardized IAM roles and Service Control Policies (SCPs) as guardrails for controlled chaos engineering experiments, particularly in centralized […]
Scaling AWS Fault Injection Service Across Your Organization And Accounts
Welcome to part two of our series where we focus on scaling AWS Fault Injection Service (FIS) within your organization. In part one, we learned how customers can enable individual accounts within organizations by introducing a Service Control Policy (SCP) rule to run network experiments when operating with a centralized networking infrastructure. In this blog, […]
Scaling AWS Fault Injection Service Across Your Organization Using Account Controls
AWS Fault Injection Service (FIS) empowers you to adopt chaos engineering at scale within your AWS environment. Chaos engineering injects real-world, controlled failures into a system to verify resilience and reliability, ultimately improving the customer experience. This proactive, resilience-focused approach increases your confidence in a system’s ability to respond to adverse conditions in production. You […]
New AWS Fault Injection Service recovery action for zonal autoshift
We’re excited to announce that AWS Fault Injection Service (FIS) now supports a recovery action for Amazon Application Recovery Controller (ARC) zonal autoshift. With this integration, you can perform more comprehensive testing by creating disruptive events and triggering a zonal autoshift as part of the same experiment. That way, you can observe how your application […]
Developing an AWS Service Catalog self-managed engine for governance
AWS Service Catalog lets you centrally manage your cloud resources and govern your Infrastructure as Code (IaC) templates at scale. AWS Service Catalog supports AWS CloudFormation natively and allows customers to use other IaC tools such as Terraform Community and Terraform Cloud via the Service Catalog reference engine. We often hear customers asking how to […]
Enabling Self Service for Cloud Custodian policies on AWS using AWS Service Catalog
Customers are increasingly seeking tools and solutions that can help them achieve their desired outcomes more efficiently and effectively. In the context of cloud management, the need for self-service capabilities has become more pronounced as organizations strive to optimize their cloud resources, improve security, and enhance their overall cloud operations. AWS Service Catalog offers the […]
Respond to CloudWatch Alarms with Amazon Bedrock Insights
When operating complex, distributed systems in the cloud, quickly identifying the root cause of issues and resolving incidents can be a daunting task. Troubleshooting often involves sifting through metrics, logs, and traces from multiple AWS services, making it challenging to gain a comprehensive understanding of the problem. So how can you streamline this process […]