AWS Cloud Operations Blog

Using Amazon Bedrock and Amazon Nova for AI-Powered Incident Response

In today’s cloud-native world, incident response teams face overwhelming challenges. When critical applications fail, engineers must sift through mountains of observability data across multiple services; all while under intense pressure to restore service quickly. This manual correlation process is time-consuming, error-prone, and often delays resolution, resulting in extended outages and frustrated customers. Traditional monitoring tools alert you to problems but leave complex analysis to humans, creating a significant operational bottleneck.

In this blog, you will learn how to use Amazon Bedrock and Amazon Nova Pro to create an AI-powered incident response system. Amazon Bedrock offers a flexible foundation for building tailored generative AI solutions that can enhance incident response processes by incorporating AWS observability tools data, third-party data sources, and application architecture diagrams. Amazon Bedrock offers a serverless API that gives you access to state-of-the-art foundation models, while Amazon Nova Pro provides advanced multimodal capabilities that can process both text and visual data simultaneously.

By combining these powerful AI services with AWS observability tools, you can develop a system that automatically ingests Amazon CloudWatch metrics, AWS Config changes, AWS X-Ray traces, and architecture diagrams to provide comprehensive incident analysis. The solution not only identifies potential causes of outages but also ranks them by probability, suggests specific troubleshooting steps, and even generates appropriate customer communications; all without requiring deep expertise in machine learning or data science.

Solution overview

Figure 1 – Solution Architecture

Figure 1 – Solution Architecture

At a high level, the solution works through the following process.

  1. Data collection: Collect and correlate data from infrastructure and observability data sources such as Amazon CloudWatch, AWS Config, and AWS X-Ray.
  2. Data storage: When an incident occurs, the `fetch-obsv-data.sh` script captures relevant data for a specific duration during the outage and stores it in an Amazon Simple Storage Service (Amazon S3) bucket for analysis.
  3. AI analysis: The `bedrock-demo-nova-pro.py` script invokes Amazon Bedrock with Amazon Nova Pro to process both textual and visual data, applying advanced AI reasoning to understand the system state.
  4. Insight generation and resolution: The AI model produces comprehensive insights including ranked probable causes, specific troubleshooting steps, and suggested customer communications. Operations teams leverage these AI-generated insights to quickly implement fixes and restore service, dramatically reducing the Mean Time to Resolution.

Prerequisites

  • An AWS account with access to Amazon Bedrock and Amazon Nova Pro.
  • An IAM user/role with required permissions to access Amazon Bedrock, CloudWatch, AWS Config, AWS X-Ray, and Amazon S3.
  • The AWS CLI configured with appropriate credentials.
  • Python 3.x installed locally.
  • boto3 Python library installed.
  • jq (JSON processor) for processing observability data.
  • Amazon S3 bucket to store observability data and architecture diagrams.

Walk-through

This walk-through utilizes PetShop as a sample application. For PetShop deployment, refer to the One Observability Workshop which provides comprehensive guidance on setting up the PetShop sample application.

Step 1: Clone the repository and set up your environment

  • Clone the github repository.
    git clone https://github.com/aws-samples/sample-aiops-nova-demo.git
  • Navigate to the project directory.
    cd sample-aiops-nova-demo.git
  • Install the required Python packages.
    pip install boto3 botocore
  • Install jq.
    sudo apt-get install jq (Linux)
    Use the jq website for binaries and installation instructions for different operating systems

Step 2: Create an Amazon S3 Bucket for storing observability data

  • Replace ‘your-region’ and ‘your-unique-bucket-name’ with your values.
    aws s3 mb s3://your-unique-bucket-name --region your-region

Step 3: Upload your architecture diagram

  • Upload your application architecture diagram to the Amazon S3 bucket.
    aws s3 cp app_diagram.png s3://your-unique-bucket-name/

Step 4: When an incident occurs, run the fetch script to collect data:

  • To simulate an outage, modify the security group attached to the elastic load balancer (ELB) by changing the inbound rule from HTTP (port 80) to a different unused port number. This will effectively block incoming traffic to the PetSite application.
    chmod +x fetch-obsv-data.sh
    ./fetch-obsv-data.sh your-region your-unique-bucket-name

This script will:

  1. Run cwreport.py to collect CloudWatch metrics.
  2. Query AWS Config for configuration changes.
  3. Extract AWS X-Ray traces for the application.
  4. Upload all data to your Amazon S3 bucket.

Step 5: Analyze the incident with Amazon Bedrock

  • Run the Amazon Bedrock script to analysis the collected data.
    python bedrock-demo-nova-pro.py your-region your-unique-bucket-name

This script will:

  1. Download the data from Amazon S3.
  2. Construct a prompt for Amazon Nova Pro that includes all data sources.
  3. Invoke Amazon Bedrock with multimodal input.
  4. Process and display the AI-generated insights.

Step 6: Review AI recommendations

The output provided will include:

  1. Ranked list of probable incident causes
  2. Analysis of recent configuration changes
  3. Specific troubleshooting steps
  4. Suggested customer communications

See the output-example.txt file to see an Amazon Nova Pro model sample response. This approach transforms traditional incident management by automating the most time-consuming aspects of troubleshooting while providing clear, actionable guidance to your operations team.

Resources

You can access the code from this blog at https://github.com/aws-samples/sample-aiops-nova-demo.

Cleaning up

To avoid ongoing charges in your AWS account, you should delete any AWS resources created in following this blog post.

Conclusion

As we’ve demonstrated throughout this blog, combining AWS observability services with generative AI creates a powerful new paradigm for incident response. By automating the analysis of complex, multi-dimensional data, you can dramatically reduce MTTR while improving the quality of their incident communications. This approach doesn’t just solve today’s operational challenges; it scales to meet the growing complexity of modern cloud architecture.

The solution we’ve built represents just the beginning of what’s possible. As foundation models continue to evolve, their ability to understand complex systems and provide actionable insights will only improve. Organizations that embrace these technologies now will be well-positioned to maintain reliable services even as their infrastructure grows in complexity.

Here’s how to get started:

  • Get started with AWS Observability solutions: Enhance your observability foundation by implementing comprehensive monitoring to ensure you’re capturing the data needed for effective analysis.
  • Explore Amazon Bedrock and its foundation model offerings—particularly Amazon Nova Pro’s multimodal capabilities that can process both text and visual information simultaneously.
  • Join the AWS Generative AI community to stay updated on the latest advancements and best practices for applying these technologies to operational challenges.
  • See the One Observability Workshop that provides a hands-on experience for the wide variety of toolsets AWS offers to setup monitoring and observability of your applications.

Acknowledgment: Special thanks to Katreena Mullican (former AWS employee) for her contributions in making this project successful.

Arun Chandapillai

Arun Chandapillai

Arun Chandapillai is a Senior Cloud Architect who is a diversity and inclusion champion. He is passionate about helping his Customers accelerate IT modernization through business-first Cloud adoption strategies and successfully build, deploy, and manage applications and infrastructure in the Cloud. Arun is an automotive enthusiast, an avid speaker, and a philanthropist who believes in ‘you get (back) what you give’.

Garrett Johnson

Garrett Johnson

Garrett Johnson is a Senior Solutions Architect at AWS. He enjoys working with customers to help them build highly available, scalable, and resilient applications. He is currently focused on helping customers leverage cloud native technology to achieve their desired business outcomes.