AWS Cloud Operations Blog
Category: Resilience
Streamlining the Correction of Errors process using Amazon Bedrock
Generative AI can streamline the Correction of Errors process. By applying large language models to the Correction of Errors process, businesses can expedite the identification and documentation of the cause of errors, saving time and resources. Purpose and set-up: The purpose of this blog is […]
Creating a correction of errors document
This blog post will walk you through an example of creating a Correction of Errors (COE) document. At Amazon, operational excellence is in our DNA. One best practice that we have learned at Amazon is to have a standard mechanism for post-incident analysis. The COE process facilitates learning from an event to avoid reoccurrences in […]
Detecting gray failures with outlier detection in Amazon CloudWatch Contributor Insights
You may have encountered a situation in the past where a single user or small subset of users of your system was reporting an event that was impacting their experience, but your observability systems didn’t show any clear impact. The discrepancy between the customer’s experience and the system’s observation of its health is referred to […]
Why you should develop a correction of error (COE)
Application reliability is critical. Service interruptions result in a negative customer experience, thereby reducing customer trust and business value. One best practice that we have learned at Amazon is to have a standard mechanism for post-incident analysis. This lets us analyze a system after an incident in order to avoid reoccurrences in the future. These […]


