Best practices for resilience and availability on Amazon ECS
Building mission-critical applications requires a deep understanding of both high availability and resilience. In our previous post, A Deep Dive into Resilience and Availability on Amazon Elastic Container Service, we defined availability as the probability of a service being operational, and resilience as a service's ability to maintain operations during adverse conditions and quickly recover from failures. The previous post discusses how designing for high availability and resilience shaped the Amazon Elastic Container Service (Amazon ECS) architecture, and it features Amazon Web Services (AWS) best practices such as static stability across AWS Availability Zones (AZs) and workload isolation mechanisms. This post continues that theme by exploring advanced implementation patterns we have found useful for building highly available services: idempotency, resilience to transient failures in your applications, static stability offered by AZs, deployment and rollback safety, and chaos engineering techniques such as fault injection testing. In this post, we describe how you can use these patterns when deploying applications on Amazon ECS.
Idempotency
In service-oriented architectures, where components interact over networks, transient issues can cause requests to fail or time out. A common approach to handling these failures, as highlighted in the Amazon Builders' Library post Timeouts, retries, and backoff with jitter, is to implement retry mechanisms. When interacting with AWS APIs, the AWS SDKs offer various retry strategies, which improve the likelihood of success when transient failures are encountered. However, we have observed that retries can sometimes lead to unintended consequences, described in the following section, especially for operations that create, update, or delete resources.
The unintended consequences of retries
Consider this scenario when starting a task with the Amazon ECS RunTask API:
- A client sends a request to start a task.
- The request times out from the client’s perspective.
- Unknown to the client, the task actually starts in the background.
- The client retries the request, starting a second task.
This situation can result in the following:
- Running (and paying for) more resources/tasks than intended. You may end up hitting your account limits due to unchecked “zombie” resources.
- Unexpected application states, especially for systems sensitive to concurrent executions.
- Higher latency as observed by the client: every retried request repeats work already done for the original request, so the end-to-end latency the client observes is higher.
- Amplification of errors when the service is facing heavy traffic, because client retries drive increased call volume and each retry is treated as a fresh request.
Introducing idempotency: a solution for at-most-once operations
To address these challenges, many AWS services follow the guidance laid out in the Amazon Builders' Library post Making retries safe with idempotent APIs and build idempotency into their APIs. An idempotent API makes sure that an action is performed only once, regardless of the number of times a request is retried. For example, the Amazon Elastic Compute Cloud (Amazon EC2) RunInstances and Amazon ECS RunTask APIs provide an optional "client token" parameter to support exactly-once operations for a certain time period after the first request. If you use an AWS SDK or the AWS Command Line Interface (AWS CLI) to start Amazon ECS tasks, then a client token is set by default if one is not provided explicitly for the Amazon ECS APIs that support idempotency, as shown in the following sequence:
- A client sends a request to start a task.
- The request times out from the client’s perspective.
- Unknown to the client, the task actually starts in the background.
- The client retries the request and gets the started task in the response.
Idempotency serves as a useful mechanism for our users to limit wasted resources and get responses back sooner when retries happen. It also greatly helps the service protect itself against overload by reducing work amplification when the service is already operating in a degraded state. By resuming progress from the original request when handling retries, the service takes on new work only for unique requests, which allows it to recover from degradations faster.
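To make this concrete, the following sketch shows how a client token can be supplied explicitly when calling RunTask through the AWS SDK for Python (Boto3). The cluster, task definition, and subnet names are placeholders; the key idea is that every retry of the same logical request carries the same token, so Amazon ECS returns the originally started task instead of launching a duplicate.

```python
import uuid

import boto3

ecs = boto3.client("ecs")

# Generate one token per logical "start a task" operation and reuse
# it on every retry. ECS then recognizes retries of the same request
# and returns the originally started task instead of a duplicate.
client_token = str(uuid.uuid4())

response = ecs.run_task(
    cluster="my-cluster",                 # placeholder cluster name
    taskDefinition="my-app:1",            # placeholder task definition
    count=1,
    launchType="FARGATE",
    clientToken=client_token,             # pass the SAME token on retries
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],  # placeholder subnet
        }
    },
)
print(response["tasks"][0]["taskArn"])
```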
Static stability
The following sections cover multiple static stability circumstances.
Static stability with multiple AZs
Static stability is a property of distributed systems whereby the system remains operational even when a dependency fails. Systems that exhibit static stability can operate normally, without the need to make changes, when experiencing impairments or failures. You can learn more about the concept in the Amazon Builders' Library post Static stability using Availability Zones.
AWS Regions are composed of multiple isolated locations: AZs. These AZs are both logically and physically separated to make sure that they fail independently in the event of natural disasters, utility failures, and faulty hardware or software. In our experience of operating at scale, using multiple AZs and overprovisioning services to survive a single AZ's failure is a fundamental practice for achieving static stability. We recommend distributing your tasks across three or more AZs and overprovisioning as described later in this section to keep your application statically stable during AZ failures, because no scaling actions would be needed in such events. As described in the previous post, the number of tasks you should provision depends on the number of AZs across which your tasks are spread.
Target Desired Count = Base Desired Count × (AZ spread count / (AZ spread count − 1))
If your tasks are spread across three AZs and your application needs 10 tasks to run in each AZ, then each AZ must be provisioned with 15 tasks to survive an AZ outage.
Task Count per AZ = 10 × (3/2) = 15
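As a quick sketch, the overprovisioning formula above can be captured in a small Python helper; rounding up keeps the service at or above its base capacity after losing a single AZ.

```python
import math

def target_desired_count(base_desired_count: int, az_spread_count: int) -> int:
    """Overprovision so the service still runs base_desired_count
    healthy tasks after losing one AZ (formula from this post)."""
    if az_spread_count < 2:
        raise ValueError("Need at least two AZs to survive an AZ failure")
    return math.ceil(base_desired_count * az_spread_count / (az_spread_count - 1))

# 10 tasks per AZ across 3 AZs -> 30 base tasks overall,
# so the service should run 45 tasks, i.e. 15 per AZ.
assert target_desired_count(30, 3) == 45
assert target_desired_count(10, 3) == 15  # per-AZ view from the example above
```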
AZ failure mitigation for Amazon ECS tasks and services
In the event that an AZ becomes impaired, AWS provides mechanisms to shift your application's traffic away from the affected AZ. One of these mechanisms is the Amazon Application Recovery Controller (ARC) zonal shift feature. This feature allows you to reroute traffic to healthy AZs for supported resources such as Network Load Balancers (NLBs), Auto Scaling groups, and Amazon Elastic Kubernetes Service (Amazon EKS) clusters, which minimizes downtime and makes sure of continuous availability. For services using load balancers with the cross-zone load balancing feature disabled, we recommend setting a minimum healthy target count or percentage for your target groups to make sure that targets in a zone with insufficient tasks do not serve a disproportionate amount of requests. When you configure your Amazon ECS services to span multiple AZs, Amazon ECS automatically routes new task launch requests away from an impaired AZ, without any zonal shift configuration. This intelligent routing of task launches increases the likelihood of successful task launches and helps maintain your workload's availability.
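For illustration, a zonal shift can also be started programmatically through the ARC zonal shift API. The following Boto3 sketch assumes a placeholder load balancer ARN and shifts traffic away from use1-az1 for one hour; zonal shifts expire automatically.

```python
import boto3

arc = boto3.client("arc-zonal-shift")

# Shift traffic away from the impaired AZ for a supported resource
# (for example, a load balancer). The ARN below is a placeholder.
response = arc.start_zonal_shift(
    resourceIdentifier=(
        "arn:aws:elasticloadbalancing:us-east-1:111122223333:"
        "loadbalancer/net/my-nlb/1234567890abcdef"
    ),
    awayFrom="use1-az1",   # the AZ ID to shift traffic away from
    expiresIn="1h",        # shifts are temporary and expire automatically
    comment="Shifting away from impaired AZ",
)
print(response["zonalShiftId"])
```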
Amazon ECS service rebalancing for equal distribution
Imagine a scenario where your Amazon ECS service spans three AZs: use1-az1, use1-az2, and use1-az3, and the use1-az1 zone is experiencing an outage. In these situations, Amazon ECS prioritizes successfully deploying your application and helps your service get to the desired number of running tasks to maintain the best availability posture for your service. If there is an active impairment in use1-az1, then Amazon ECS automatically directs incoming task launch requests to the healthier AZs: use1-az2 and use1-az3.

When the outage is resolved, you might find that use1-az2 and use1-az3 are hosting more tasks than the impaired AZ use1-az1 if your service deployed, scaled up, or replaced tasks to maintain the configured desiredCount during the AZ outage, as shown in the following figures.

Figure: ECS service running three tasks in each of the three AZs during normal conditions.

Figure: ECS service with imbalance across AZs as a result of AZ impairment.
This imbalance can pose a risk to your service’s static stability if another AZ becomes impaired, especially if that AZ hosts a disproportionately high number of tasks. To mitigate this risk, Amazon ECS offers a feature called “service rebalancing.” This automatically redistributes your service tasks across all configured AZs, making sure of a more balanced and stable deployment.
Enabling the service rebalancing feature for your service means that Amazon ECS automatically redistributes tasks across all three AZs. This makes sure of a more even distribution and enhances static stability. You can find detailed documentation on service rebalancing in the Amazon ECS documentation.
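As a minimal sketch, service rebalancing can be enabled on an existing service through the UpdateService API. The cluster and service names below are placeholders, and the availabilityZoneRebalancing parameter requires an SDK version recent enough to include the feature.

```python
import boto3

ecs = boto3.client("ecs")

# Turn on automatic rebalancing so ECS redistributes tasks across
# AZs after an impairment is resolved.
ecs.update_service(
    cluster="my-cluster",    # placeholder cluster name
    service="my-service",    # placeholder service name
    availabilityZoneRebalancing="ENABLED",
)
```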
Application resilience with local container restarts
At AWS, we consider static stability at all layers in the stack. An application can stop working for unexpected reasons such as transient failures, reaching resource limits (for example memory, open file descriptors, number of processes), or unanticipated bugs. A resilient application should be designed to continue to operate normally under adverse conditions, without the need to make changes. Applications built on top of Amazon ECS can use the container restart policies feature, which offers the ability to recover from unexpected container failures by restarting the container. This provides containerized applications with the opportunity to automatically recover without manual intervention or Amazon ECS provisioning a replacement task.
Container restart policies can be applied to both essential and non-essential containers. Although non-essential containers, such as sidecar containers, are not required for your application to run, they often support your service with functions such as collecting metrics or emitting logs. These containers should also be considered in scope for restart policies, because their unexpected failures can impact aspects of your application such as observability.
Container restart policies are based on exit codes emitted when a container's entry point process terminates. By convention, an exit code of zero (0) indicates successful completion, while non-zero codes suggest errors. By default, containers restart regardless of the exit code. When configuring restart policies, consider whether errors are retryable for your specific application, and add only non-retryable exit codes to the restart policy's ignoredExitCodes field. For example, configuration errors such as exit code 127 (command not found) are better handled by failing immediately rather than attempting restarts, and should be included in ignoredExitCodes.
Users can configure the restartAttemptPeriod, which establishes a minimum runtime threshold before a container becomes eligible for restart. When a container starts, it must run successfully for the specified duration (ranging from 60 to 1,800 seconds, defaulting to 300 seconds) before Amazon ECS considers restarting it upon failure. This prevents rapid restart cycles from persistently failing containers, which could otherwise consume unnecessary resources. For example, if you set a restartAttemptPeriod of 180 seconds and your container exits after only 60 seconds of runtime, Amazon ECS doesn't attempt to restart it, because this indicates a potential fundamental problem that needs investigation. Conversely, if the container runs successfully for the full 180 seconds but fails afterward, then Amazon ECS initiates a restart. This behavior helps distinguish between transient issues that can be resolved through a restart and more serious issues needing intervention.
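Putting these settings together, the following Boto3 sketch registers a task definition whose essential container restarts in place on failure, ignores exit code 127, and uses a 180-second restartAttemptPeriod. The family, image, and task sizes are placeholders.

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="my-app",                       # placeholder family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    containerDefinitions=[
        {
            "name": "app",
            "image": "myapplication:1.0",  # placeholder image
            "essential": True,
            "restartPolicy": {
                "enabled": True,
                # Exit code 127 (command not found) signals a
                # configuration error that a restart will not fix.
                "ignoredExitCodes": [127],
                # The container must run 180 seconds before it is
                # eligible for restart (range 60-1,800; default 300).
                "restartAttemptPeriod": 180,
            },
        }
    ],
)
```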
Although container restart policies can help mitigate impact, unexpected container restarts can indicate an underlying issue that may need to be addressed. We recommend alarming on restarts so you know when your team needs to be engaged: restarts might indicate that tasks or containers are not sized properly, a resource leak, or a bug that needs to be fixed. Amazon ECS emits container restart metrics that you can alarm on through Amazon ECS Container Insights, including RestartCount for a cluster, Amazon ECS service, or task definition family. Container restart information is also emitted through the Task Metadata Service (TMDS) so that it can be picked up by any observability agents run alongside your service in your Amazon ECS task.
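As one possible starting point, the following sketch creates a CloudWatch alarm on the Container Insights RestartCount metric. The dimension names assume per-service aggregation and the SNS topic ARN is a placeholder; adjust both to match the metrics your cluster actually emits and your alerting setup.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on any container restarts within a five-minute window;
# tune the threshold to your application's tolerance.
cloudwatch.put_metric_alarm(
    AlarmName="my-service-container-restarts",
    Namespace="ECS/ContainerInsights",
    MetricName="RestartCount",
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-cluster"},   # placeholder
        {"Name": "ServiceName", "Value": "my-service"},   # placeholder
    ],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:oncall"],  # placeholder
)
```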
Deployment and rollback safety
When developers package their applications into containers, they create immutable images that remain identical across test and production environments. Although container images themselves are immutable, container image tags present a challenge. Before pulling the contents of a container image, the tag must first be resolved to a specific image digest by the container registry. For example, the tag myapplication:latest may resolve to one image digest at a given time. However, as newer versions are deployed, myapplication:latest can be updated to point to a different image digest, resulting in unexpected container deployments.
To make sure of version consistency, Amazon ECS records the image digest of the first task launched in each Amazon ECS service deployment. This is referred to as container image resolution. All subsequent service tasks launched by Amazon ECS due to scaling or service health related events use this identical container image digest, even if the tag now points to a newer version. Performing a new Amazon ECS service deployment records a new image digest that subsequent service tasks use. Refer to our earlier post for more details.
If issues are discovered post-deployment, then rollbacks should be one of the first remediation steps to revert to a previous, known-good state. Having Amazon ECS track the service image digest or use tag immutability means that rollbacks also become more predictable and reliable, which helps decrease the time to recover from failures.
For users who store container images in Amazon Elastic Container Registry (Amazon ECR), image tag immutability can be enabled to enforce this best practice when using AWS container orchestrators (Amazon ECS or Amazon EKS). This makes sure that the underlying container image for a tag remains constant, making it easier to track application versions in production environments.
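For example, tag immutability can be turned on for an existing Amazon ECR repository with a single API call, sketched here with Boto3 (the repository name is a placeholder). With immutability enforced, a push that reuses an existing tag is rejected, so a tag can never silently repoint to a new digest.

```python
import boto3

ecr = boto3.client("ecr")

# Enforce tag immutability on an existing repository so a tag such
# as myapplication:latest can never be repointed to a new digest.
ecr.put_image_tag_mutability(
    repositoryName="myapplication",   # placeholder repository name
    imageTagMutability="IMMUTABLE",
)

# New repositories can be created immutable from the start:
# ecr.create_repository(
#     repositoryName="my-new-repo",
#     imageTagMutability="IMMUTABLE",
# )
```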
Chaos testing with AWS FIS
At AWS, we run a suite of tests against all changes to validate expected behavior and reduce the likelihood of introducing regressions. However, a complete testing strategy must go beyond functional testing: it should also validate application resiliency and static stability properties. Recreating adverse conditions to test an application's resiliency and static stability can be challenging, as it often necessitates simulating these scenarios. Chaos engineering practices can help recreate these scenarios to uncover latent application bugs and failure modes. At AWS, we regularly exercise chaos engineering to test our services and incident response processes in the form of gamedays. A gameday is a simulation of a real outage scenario where we get to test our services' resiliency guardrails (such as throttling and load shedding), degraded mode behavior, fault isolation boundaries at scale, and our engineers' preparedness for real-world outages. During a gameday we intentionally inject a fault into one or many of our services. This exercise allows teams to identify and address bottlenecks in the system, discover scaling limits, and build muscle memory by exercising alerting and mitigations before real outages occur. It also gives teams the opportunity to change either software or response procedures to make sure that they are ready for the real thing.
AWS Fault Injection Service (AWS FIS) is a managed AWS service designed for controlled chaos engineering experiments in production-like environments. AWS FIS enables teams to perform fault injection experiments on AWS workloads, creating disruptive events that allow you to observe and analyze application behavior and responses. Containerized applications launched as Amazon ECS tasks can use fault injection directly, using AWS FIS to validate application resilience and static stability. For example, Amazon ECS users can test the response behavior of their application when CPU usage is higher than normal by running a stress test through the aws:ecs:task-cpu-stress action. Arbitrary tasks can be stopped through the aws:ecs:stop-task action to make sure that traffic is routed to remaining tasks through Elastic Load Balancing (ELB). Arbitrary processes in the container can also be killed through the aws:ecs:task-kill-process action to verify the static stability properties of the application.
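To sketch what such an experiment might look like, the following Boto3 snippet creates an AWS FIS experiment template that stresses CPU on half of a service's tasks for five minutes. The IAM role ARN, CloudWatch alarm ARN, and resource tags are assumptions for illustration; consult the AWS FIS documentation for the exact prerequisites (such as agent requirements) for ECS task actions.

```python
import uuid

import boto3

fis = boto3.client("fis")

response = fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="CPU stress on a subset of ECS service tasks",
    roleArn="arn:aws:iam::111122223333:role/fis-experiment-role",  # placeholder
    stopConditions=[
        # Stop the experiment early if a latency or error alarm fires.
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:my-service-p99",
        },
    ],
    targets={
        "service-tasks": {
            "resourceType": "aws:ecs:task",
            "resourceTags": {"service": "my-service"},  # assumed tagging scheme
            "selectionMode": "PERCENT(50)",             # half of matching tasks
        },
    },
    actions={
        "cpu-stress": {
            "actionId": "aws:ecs:task-cpu-stress",
            "parameters": {"duration": "PT5M"},         # ISO 8601 duration
            "targets": {"Tasks": "service-tasks"},
        },
    },
)
print(response["experimentTemplate"]["id"])
```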
Although these features are available to Amazon ECS users of the Amazon EC2 capacity provider, AWS Fargate has recently introduced support for injecting network faults such as dropping inbound or outbound traffic, adding latency and jitter to the task network interface, and stopping and killing tasks. These can be used to simulate connection issues and high latency when communicating with dependencies.
Conclusion
Amazon ECS helps your applications maintain a high availability and resilience posture in the event of outages and transient failures. To build applications that can survive known failure modes, you can adopt best practices such as idempotency, static stability, and deployment safety mechanisms, and use AWS FIS to uncover unknown ones. To learn more, visit Amazon ECS.
About the authors
Shreyansh Gandhi is a Senior Software Development Engineer at Amazon ECS. He joined Amazon in 2016 and has been with AWS since 2022. His current focus is on availability, resiliency and scalability of Amazon Elastic Container Service.
Nick Peters is a Senior Software Development Engineer at Amazon ECS. He joined Amazon in 2014 and has been with AWS since 2020. His current focus is on the fleet, platform, and agents underlying Amazon Elastic Container Service.