AWS Public Sector Blog
Planning for failure: How to make generative AI workloads more resilient
As more public sector organizations deploy generative AI workloads, we are increasingly asked how to make these workloads resilient to failures. Although we’ve published many documents on best practices for creating highly available workloads, generative AI workloads have unique characteristics that need special attention. In this post, we discuss the key factors that mission-based organizations should consider to make their generative AI workloads resilient to failures.
There are many kinds of events that could cause a generative AI workload to become unavailable, such as a failed rollout, large spikes in traffic, or even a ransomware attack. Regardless of the cause, it’s often helpful to think about resilience in terms of five broad categories: redundancy, sufficient capacity, timely output, correct output, and fault isolation. A workload needs all five of these properties to be resilient; when one of them is missing, you are likely to encounter an availability problem.
Redundancy
The redundancy property is all about eliminating single points of failure from a workload. There are a few ways this property appears in generative AI workloads. The advent of agentic AI has moved us toward the Model Context Protocol (MCP). You can think of MCP as an architecture pattern that allows large language models (LLMs) to speak to tools. Tools are external data or services that an LLM can use to complete a requested action. The MCP pattern allows you to decouple tools from the LLM, thus allowing you to scale the tools independently. You should also build your agentic workloads so that they can degrade gracefully in instances where the tools needed by an LLM become unavailable.
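As a minimal illustration of graceful degradation, the following Python sketch answers without a tool when the tool call fails, rather than failing the whole request. The call_llm and call_mcp_tool helpers and the tool name are placeholders for your own model invocation and MCP client code, not a specific SDK.

```python
# A minimal sketch (not a specific MCP SDK) of degrading gracefully when a tool
# that an agent depends on is unavailable. call_llm and call_mcp_tool are
# stand-ins for your own model invocation and MCP client code.

class ToolUnavailableError(Exception):
    """Raised when an MCP server or tool cannot be reached."""

def call_mcp_tool(tool_name: str, args: dict) -> str:
    # Stand-in for a real MCP client call; simulates an unavailable tool.
    raise ToolUnavailableError(f"{tool_name} is unreachable")

def call_llm(prompt: str) -> str:
    # Stand-in for a real model invocation.
    return f"[model response to: {prompt[:60]}...]"

def answer_with_tools(question: str) -> str:
    try:
        context = call_mcp_tool("knowledge_lookup", {"query": question})
        prompt = f"Use this context to answer:\n{context}\n\nQuestion: {question}"
    except ToolUnavailableError:
        # Degrade gracefully: answer without the tool and be transparent about it,
        # rather than failing the whole request.
        prompt = (
            "The knowledge tool is temporarily unavailable. Answer from general "
            f"knowledge and note that live data could not be retrieved.\n\nQuestion: {question}"
        )
    return call_llm(prompt)

print(answer_with_tools("What are our current office locations?"))
```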
Another consideration is the use of cross-Region inference profiles with Amazon Bedrock. Cross-Region inference profiles allow you to seamlessly manage unplanned bursts of traffic by using compute across different AWS Regions. To use a cross-Region inference profile, you call Amazon Bedrock with the inference profile ID. Inference profiles are immutable, meaning new AWS Regions aren’t added to an existing inference profile. Inference profiles for the United States route the request between US Regions, inference profiles for the EU route the request between EU Regions, and inference profiles for APAC route the request between APAC Regions. For more details, go to the supported Regions and models for inference profiles.
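As a minimal sketch, the following call invokes Amazon Bedrock through a cross-Region inference profile by passing the inference profile ID as the model ID in the Converse API. The profile ID shown is only an example; list the inference profiles available in your account and Regions before relying on one.

```python
import boto3

# Invoke Amazon Bedrock through a cross-Region inference profile by passing the
# inference profile ID as the modelId. The ID below is an example; you can list
# the profiles available to your account with
# boto3.client("bedrock", region_name="us-east-1").list_inference_profiles().
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="us.anthropic.claude-3-5-sonnet-20240620-v1:0",  # example US inference profile ID
    messages=[{"role": "user", "content": [{"text": "Summarize our resilience objectives."}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```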
Some mission-based organizations use service control policies (SCPs) to deny access to specific AWS Regions. Restricting AWS Region use is often done for compliance and governance purposes. However, when using SCPs, you must allow the AWS Regions listed in the inference profile so that the Amazon Bedrock service can properly route generative AI traffic between AWS Regions. This post provides an example SCP to allow Amazon Bedrock cross-Region inference.
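The following illustrative SCP denies requests outside an approved Region list that includes the US Regions a US inference profile can route to. We assume us-east-1, us-east-2, and us-west-2 here; confirm the current Region list for your inference profile before applying a policy like this.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideApprovedRegions",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": [
            "us-east-1",
            "us-east-2",
            "us-west-2"
          ]
        }
      }
    }
  ]
}
```

In practice, you would typically also exempt global services from a Region-deny statement; the key point is that every AWS Region the inference profile can route to must remain allowed.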
Sufficient capacity
Sufficient capacity means that the workload has enough resources to function as intended. Resources include the obvious such as memory and CPU, but also the less obvious such as throughput and service quotas.
Amazon Bedrock has a large number of quotas, which can be viewed through the console. You must be aware of the quotas of the models that you are using, and you must be proactive in requesting quota increases. Furthermore, quota defaults may be different across AWS Regions. This is especially important to consider as you enable cross-Region inference profiles. If you use cross-Region inference profiles, then you must make sure that as your workload scales, you proactively increase the appropriate service quotas across all AWS Regions associated with your inference profile.
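As a small sketch, you can check the applied Amazon Bedrock quotas in each Region of your inference profile with the Service Quotas API, and request increases the same way. The Region list and quota filter below are assumptions to illustrate the idea.

```python
import boto3

# Check the applied Amazon Bedrock quotas in each Region that a cross-Region
# inference profile can route to. Adjust the Region list to match your profile,
# and use request_service_quota_increase() to raise a quota proactively.
REGIONS = ["us-east-1", "us-east-2", "us-west-2"]

for region in REGIONS:
    quotas = boto3.client("service-quotas", region_name=region)
    paginator = quotas.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode="bedrock"):
        for quota in page["Quotas"]:
            # Filter to the quotas that matter for your models, for example
            # tokens per minute or requests per minute for on-demand inference.
            if "per minute" in quota["QuotaName"].lower():
                print(region, quota["QuotaName"], quota["Value"])
```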
As discussed in the redundancy section, one of the advantages of the MCP pattern is that it allows independent scaling of the tools from the LLM inference. Depending on your environment, you may be running MCP servers in Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), or even AWS Lambda. Regardless of which compute platform you might be using, make sure that you can effectively scale your servers to support demand.
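For example, if your MCP servers run as an Amazon ECS service, one way to scale them with demand is a target tracking policy through Application Auto Scaling, as in the following sketch. The cluster, service, and threshold values are placeholders.

```python
import boto3

# Register an ECS service that hosts MCP servers as a scalable target and attach
# a target tracking policy on average CPU. Cluster, service, and threshold values
# are placeholders.
autoscaling = boto3.client("application-autoscaling", region_name="us-east-1")

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/mcp-tools-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

autoscaling.put_scaling_policy(
    PolicyName="mcp-tools-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/mcp-tools-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,  # keep average CPU utilization near 60 percent
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleInCooldown": 120,
        "ScaleOutCooldown": 60,
    },
)
```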
If you need a predictable amount of throughput, then you may consider purchasing provisioned throughput with Amazon Bedrock. Provisioned throughput allows you to provision a higher level of throughput for a model at a fixed cost. Provisioned throughput is also needed if you’ve customized a model.
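As a sketch of what a purchase looks like, the control-plane call below provisions model units for a model. The name, model ID, number of model units, and commitment term are placeholders; review the pricing and commitment options before running anything like this.

```python
import boto3

# Purchase provisioned throughput for a model. The name, model ID, number of
# model units, and commitment term are placeholders; review pricing first.
bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_provisioned_model_throughput(
    provisionedModelName="my-workload-provisioned",
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    modelUnits=1,
    commitmentDuration="OneMonth",
)

# Invoke the provisioned capacity by passing this ARN as the modelId in
# bedrock-runtime calls.
print(response["provisionedModelArn"])
```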
Finally, it’s worth considering patterns such as load shedding to prioritize important traffic and reduce the impact of a flood of traffic. Although no one wants to reject traffic, we often need to make hard decisions when prioritizing the availability of a workload. When the workload is overwhelmed with traffic, we can choose to give a small set of users a bad experience by rejecting a subset of traffic, or we can give all of our users a bad experience through a server brownout.
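The following minimal sketch shows the idea behind load shedding: once in-flight work approaches an assumed limit, low-priority requests are rejected so that high-priority traffic keeps succeeding. The limits and priority labels are assumptions.

```python
import threading

# Minimal load shedding sketch: once in-flight requests approach a limit,
# reject low-priority traffic so high-priority traffic keeps succeeding.
MAX_IN_FLIGHT = 100    # assumed total capacity of the workload
SHED_THRESHOLD = 80    # start shedding low-priority traffic at this point

_in_flight = 0
_lock = threading.Lock()

class Overloaded(Exception):
    """Signal the caller to return an HTTP 429 or similar response."""

def handle_request(priority: str, do_work):
    global _in_flight
    with _lock:
        if _in_flight >= MAX_IN_FLIGHT:
            raise Overloaded("at capacity")
        if _in_flight >= SHED_THRESHOLD and priority != "high":
            raise Overloaded("shedding low-priority traffic")
        _in_flight += 1
    try:
        return do_work()
    finally:
        with _lock:
            _in_flight -= 1
```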
Timely output
Timely output concerns the workload generating its expected output within a reasonable amount of time. If the workload takes longer than what is “reasonable”, then users, customers, or members will likely consider the workload to be unavailable.
A best practice here is to determine a metric that allows operators to easily understand the health of the system. These types of metrics go beyond traditional CPU or memory metrics by representing the business or organizational value of the workload. One good example is Netflix’s “starts per second” (SPS). Monitoring a metric that is closely aligned with the workload’s business value helps you identify more quickly when something is amiss. When looking at metrics, consider monitoring both leading indicators, which tell you that the system is approaching an impacting condition, and lagging indicators, which tell you the impact of a failure mode after it has occurred.
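As a simple illustration, a business-aligned health metric such as successful answers can be published to Amazon CloudWatch and alarmed on. The namespace and metric names below are placeholders.

```python
import boto3

# Publish a business-aligned health metric (successful answers) plus a leading
# indicator (latency) to CloudWatch. Namespace and metric names are placeholders.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def record_answer(success: bool, latency_seconds: float) -> None:
    cloudwatch.put_metric_data(
        Namespace="GenAIWorkload",
        MetricData=[
            {"MetricName": "SuccessfulAnswers", "Value": 1.0 if success else 0.0, "Unit": "Count"},
            {"MetricName": "AnswerLatency", "Value": latency_seconds, "Unit": "Seconds"},
        ],
    )
```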
When dealing with LLMs, consider the trade-off between latency and the predictive power of the model. Typically, when you look at a model family (say Anthropic’s Claude family), the model with the most predictive power tends to have the highest latency (in this case Claude Opus), while the model with lower predictive power tends to have lower latency (such as Claude Haiku). In many cases, the best model isn’t necessarily the one with the best predictive power, but rather the one that trades some predictive power for lower latency. Using lower latency models allows you to return results to users or members more quickly. This is especially important as you begin looking at multi-agent collaboration, where the workload latency is the sum of the latencies of all agents used in the workload.
When a request fails within a workload, it’s common to retry with backoff and jitter. This pattern works well for transient failures, because the number of retry requests is usually small. However, in cases where failures are caused by overload, retries make the problem worse by increasing the traffic on the workload even more. A better solution in these cases is to limit retries using a token bucket. In this pattern, retries are allowed only if tokens are available, and tokens usually refill at a slow rate that is aligned with completing successful requests. More guidance is available in the Amazon Builders’ Library.
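The following sketch shows one way to combine backoff and jitter with a retry token bucket: each retry spends a token, and only successful requests slowly refill the bucket, so retries taper off during sustained overload. The capacities and refill rate are assumptions.

```python
import random
import time

# Sketch of a retry token bucket: each retry spends a token, and only successful
# requests slowly refill the bucket, so retries taper off during sustained overload.
class RetryTokenBucket:
    def __init__(self, capacity: float = 10.0, refill_per_success: float = 0.1):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_success = refill_per_success

    def can_retry(self) -> bool:
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # out of tokens: fail fast instead of adding more load

    def record_success(self) -> None:
        self.tokens = min(self.capacity, self.tokens + self.refill_per_success)

def call_with_limited_retries(operation, bucket: RetryTokenBucket, max_attempts: int = 3):
    for attempt in range(max_attempts):
        try:
            result = operation()
            bucket.record_success()
            return result
        except Exception:
            if attempt == max_attempts - 1 or not bucket.can_retry():
                raise
            # Back off with jitter before the next attempt (good for transient failures).
            time.sleep(random.uniform(0, 2 ** attempt))
```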
Correct output
Correct output makes sure that the workload does not return incomplete or incorrect output, which can often be worse than no response at all. When dealing with generative AI, this principle aligns closely with the concept of controllability as it relates to responsible AI. This principle involves having mechanisms in place to steer the output of a generative AI model so that it avoids topics, content, or responses that are not aligned with your mission. Organizations using Amazon Bedrock for generative AI can use Amazon Bedrock Guardrails, a set of tools you can use to help prevent factual errors, avoid specific categories and topics, prevent prompt attacks from overriding system instructions, avoid profanity or custom lists of words and phrases, filter personally identifiable information (PII), and more.
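As a minimal sketch, a guardrail that you have already created in Amazon Bedrock can be applied at inference time through the Converse API. The guardrail ID, version, and model or inference profile ID below are placeholders.

```python
import boto3

# Apply an existing Amazon Bedrock guardrail at inference time. The guardrail ID,
# version, and model or inference profile ID are placeholders.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="us.anthropic.claude-3-5-sonnet-20240620-v1:0",  # example inference profile ID
    messages=[{"role": "user", "content": [{"text": "Tell me about our benefits programs."}]}],
    guardrailConfig={
        "guardrailIdentifier": "abcdefg12345",  # placeholder guardrail ID
        "guardrailVersion": "1",
        "trace": "enabled",  # include intervention details in the response for debugging
    },
)
print(response["output"]["message"]["content"][0]["text"])
```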
At the end of the day, generative AI workloads are still workloads. Therefore, they can be vulnerable to the same types of attacks to which non-generative AI workloads are vulnerable. The Open Web Application Security Project (OWASP) continues to warn that you should validate untrusted user input before using it. Unvalidated user input—passed directly into a generative AI prompt—could be used to poison the prompt and change the model’s results.
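The following lightweight sketch illustrates validating untrusted input before it is placed into a prompt. The length limit and patterns are assumptions and are not a substitute for guardrails or a complete input-validation strategy.

```python
import re

# Lightweight, illustrative checks on untrusted input before it is placed in a
# prompt. The limits and patterns are assumptions, not a complete defense.
MAX_INPUT_CHARS = 2000
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all|previous) instructions", re.IGNORECASE),
    re.compile(r"system prompt", re.IGNORECASE),
]

def validate_user_input(text: str) -> str:
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(text):
            raise ValueError("input rejected by prompt-injection heuristic")
    # Strip control characters that have no place in a prompt.
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
```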
Many generative AI workloads use a knowledge base for a retrieval augmented generation (RAG) architecture. In this architecture, organizational knowledge is stored as both plain text and an embedding representation in a vector database. Knowledge bases can be expensive to create, from curating the content, to determining a chunking strategy, to generating and storing the embeddings. These databases should be backed up regularly, on a cadence that reflects how frequently the knowledge base is updated. Using solutions such as AWS Backup to regularly back up and protect your vector databases allows you to avoid situations where you might lose precious embedding or knowledge base data. Regular backups that are secure and protected can insulate you against operational issues that range from accidental data deletion to malicious destruction through malware or ransomware.
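As an illustrative sketch, an AWS Backup plan can select the resources behind your knowledge base by tag, for example an Aurora PostgreSQL cluster used as a vector store. The vault name, IAM role, schedule, and tags are placeholders, and the resource types you can protect depend on which vector store you use.

```python
import boto3

# Illustrative AWS Backup plan for the resources behind a knowledge base, selected
# by tag (for example, an Aurora PostgreSQL cluster used as a vector store).
# Vault name, IAM role ARN, schedule, and tag values are placeholders.
backup = boto3.client("backup", region_name="us-east-1")

plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "knowledge-base-daily",
        "Rules": [
            {
                "RuleName": "daily",
                "TargetBackupVaultName": "Default",
                "ScheduleExpression": "cron(0 5 * * ? *)",  # align with how often the knowledge base changes
                "Lifecycle": {"DeleteAfterDays": 35},
            }
        ],
    }
)

backup.create_backup_selection(
    BackupPlanId=plan["BackupPlanId"],
    BackupSelection={
        "SelectionName": "vector-store",
        "IamRoleArn": "arn:aws:iam::123456789012:role/BackupServiceRole",  # placeholder
        "ListOfTags": [
            {"ConditionType": "STRINGEQUALS", "ConditionKey": "workload", "ConditionValue": "rag-knowledge-base"}
        ],
    },
)
```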
Finally, you must understand how your generative AI models operate under a variety of conditions. We often see organizations test their models “by vibes”; in other words, they judge the model based on their emotional response to its output. Evaluation tools such as Amazon Bedrock Evaluations allow you to move away from vibe testing and have confidence in your workload’s performance under a variety of different conditions. Amazon Bedrock Evaluations generates quantitative results that can be used to compare different model evaluations. This approach can provide the confidence to change your workload prompts or model without endless cycles of vibe testing.
Fault isolation
Fault isolation is about reducing the scope or impact of failures, so that a failure in one system doesn’t cascade and affect other systems. Fault isolation can become a problem with shared components. In the case of generative AI workloads, you might have a set of tools (often MCP servers) that are shared across a set of generative AI workloads. Each generative AI workload should be configured to gracefully handle situations where the set of tools might not be available. Some patterns useful in these situations include those discussed in the timely output section, specifically implementing backoff and jitter, which are useful for transient errors. The retry pattern with a token bucket can be particularly effective for non-transient errors, because it allows retries only after a large number of successful requests have completed. Another pattern is the circuit breaker pattern, which, when tripped, prevents the calling service from accessing a downstream service that has experienced repeated timeouts or failures. However, this pattern should be used with great caution, because it can introduce bi-modal behavior, or situations where the workload operates one way under normal conditions and another way when operating under a failure mode. Furthermore, a poorly placed circuit breaker can make a small operational problem much worse.
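To make the trade-off concrete, the following minimal circuit breaker sketch fails fast after repeated failures and allows a trial call through after a cool-down. The thresholds and timings are assumptions, and as noted above, the pattern should be placed with care.

```python
import time

# Minimal circuit breaker: after repeated failures the circuit opens and calls
# fail fast; after a cool-down the circuit allows a trial call through.
# Thresholds and timings are illustrative assumptions.
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_seconds:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: let one trial call through.
            self.opened_at = None
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failure_count = 0
        return result
```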
Like traditional workloads, generative AI workloads need to be patched and hardened. These workloads might consist of Amazon EC2 instances, containers, or Lambda functions. You should consider using automated tools such as AWS Systems Manager Patch Manager to automatically patch EC2 instances, or consider creating golden images through tools such as EC2 Image Builder that you redeploy across the workload. For container-based workloads, consider scanning your container images stored in Amazon Elastic Container Registry (Amazon ECR) for OS and programming language vulnerabilities. When vulnerabilities are found, you can kick off a pipeline to rebuild and deploy your container images. For Lambda, you can use Amazon Inspector to continuously scan your functions for vulnerabilities and other deviations from best practices.
Conclusion
Building a resilient system isn’t a one-time activity. Instead, it’s a continuous approach where you set resilience objectives, design and implement those objectives, evaluate and test what was built, operate the workload, and respond and learn from your previous activities.
In this post, we outlined several areas where generative AI workloads need special attention. Furthermore, the unique compliance and security requirements of many mission-based organizations mean that you must build generative AI workloads in ways that are safe, predictable, and resilient to failures.
As a next step, begin evaluating your generative AI workloads against the five resilience properties outlined in this post. Use the guidance provided here to make your workloads more resilient today.