AWS Cloud Operations Blog
Category: Management Tools
GPU Cost Attribution in Amazon EKS Using Amazon Managed Service for Prometheus, Amazon Managed Grafana, and OpenTelemetry
As organizations scale their AI and machine learning workloads on Amazon Elastic Kubernetes Service (Amazon EKS), GPU instances often represent the largest portion of compute costs. Without granular visibility into how these resources are consumed, teams struggle to attribute costs accurately. Consider a shared EKS cluster where Team A (Research) runs experimental ML models on […]
Introducing native histogram support in Amazon Managed Service for Prometheus
If you run Kubernetes or microservices workloads on AWS, you probably track latency, request durations, and other value distributions with Prometheus histograms. To do that with classic histograms, you predefine a set of bucket boundaries, and Prometheus emits one time series per boundary plus a sum and a count. A single latency histogram with 20 […]
Build a Multi Account Patch Compliance Dashboard with Kiro Specs
Robust patch management is essential for maintaining system security, reliability, and compliance across your IT infrastructure. AWS Systems Manager Patch Manager provides a full-featured patching solution, enabling you to automate the deployment of operating system updates to managed nodes across AWS accounts, on-premises, and multicloud environments. However, as your organization scales across dozens or […]
From Monolith to Multi-Account: Pinterest’s AWS Organization Transformation Journey
Pinterest launched in 2009 with a mission to bring everyone the inspiration to create a life they love. As one of the early cloud pioneers, Pinterest grew to hundreds of thousands of resources and exabytes of data within a single AWS account well before most cloud-native organizations operated at that scale or the best […]
How Honeycomb improved resilience using AWS Fault Injection Service
Building resilience within cloud workloads is an important goal for ISVs to prevent application downtime, increase system reliability, and build customer trust. Honeycomb.io is a fast and collaborative observability platform for software developers and engineering teams to understand and troubleshoot their cloud-native applications. Honeycomb gives you the rich context at sub-second query speeds and AI-assisted […]
AWS Observability ICYMI: Jan-May 2026
Welcome to the first edition of the AWS Observability ICYMI (In Case You Missed It) recap! The first five months of 2026 have been transformational for AWS observability, with over 40 launches across Amazon CloudWatch, AWS X-Ray, Amazon Managed Grafana, and Amazon Managed Service for Prometheus. Two major themes defined this period: OpenTelemetry as the […]
Import Historical data from AWS CloudTrail Lake to Amazon CloudWatch
Organizations managing workloads on AWS rely on AWS CloudTrail to answer the fundamental questions: Who did what, where, and when? Since January 2022, customers have stored their CloudTrail activity logs in CloudTrail Lake, a managed data lake purpose-built for capturing, storing, and querying user and API activity across their AWS environment. As organizations scale across multiple […]
Simplifying Prometheus metrics collection across your AWS infrastructure
If you’re running services such as Amazon EC2 instances, Amazon Elastic Container Service (Amazon ECS) containers, and Amazon Managed Streaming for Apache Kafka (Amazon MSK) clusters in AWS, maintaining separate Prometheus servers for each environment creates significant operational burden. Managing scraper configurations, high availability, scaling, and security distracts you from building great applications. AWS managed […]
Introducing OpenTelemetry and PromQL support in Amazon CloudWatch
If you run Kubernetes or microservices workloads on AWS, your metrics likely carry dozens of labels: namespace, pod, container, node, deployment, replica set, and custom business dimensions. To get a complete picture of your environment, you may be splitting your metrics pipeline: Amazon CloudWatch for AWS metrics, and a separate Prometheus-compatible backend for high-cardinality (many […]
AWS Unified Operations: Building Resilient Operations for Mission-Critical Workloads
Achieve Mission-Critical Resiliency at Scale with AWS Unified Operations – The Top Tier of AWS Support to Achieve High Availability, Faster Migrations, and Accelerated Incident Resolution. The Shift-Left Paradigm: From Reactive Firefighting to Proactive Prevention. Organizations running mission-critical workloads face three critical operational gaps that undermine resilience and slow cloud adoption. Skills gaps make cloud-native […]









