Management & Governance | AWS Cloud Operations Blog

Deploy OpenTelemetry Gateway on AWS: Monitoring Your Observability Pipeline

Deploy an OpenTelemetry gateway on Amazon EKS, export metrics to CloudWatch over native OTLP, and monitor the pipeline’s own health with PromQL dashboards and alarms.

Turn Your Amazon CloudWatch Alarms into Actionable Signals

Your alarm fires at 2 AM. You grab your phone, squint at the notification, and see: “ALARM: my-service-alarm has transitioned to ALARM state.” No context. No application. No hint about which of your 200 instances is the problem, or whether it even matters. I’ve been there. We’ve all been there. Alarm frustration often comes from […]

Using Amazon S3 Server Access Logs with Amazon CloudWatch Logs

TL;DR: What if you could go from raw Amazon S3 server access logs to a complete security dashboard without building a custom pipeline? The dashboard below is deployed using the CloudFormation template provided in this post. (Figure 1: Amazon S3 Server Access Logs Security, Compliance & Audit Dashboard.) Until now, getting security visibility from Amazon […]

Log analysis with facets, correlation, enrichment, and automation in Amazon CloudWatch Log Analytics

Teams working with distributed applications accumulate logs across multiple log groups, including application logs, access logs, and audit trails. When something needs investigating, an engineer opens the console and starts writing queries from scratch. The same query gets written differently by different people. The results lack context because the log event does not contain who […]

GPU Cost Attribution in Amazon EKS Using Amazon Managed Service for Prometheus, Amazon Managed Grafana, and OpenTelemetry

As organizations scale their AI and machine learning workloads on Amazon Elastic Kubernetes Service (Amazon EKS), GPU instances often represent the largest portion of compute costs. Without granular visibility into how these resources are consumed, teams struggle to attribute costs accurately. Consider a shared EKS cluster where Team A (Research) runs experimental ML models on […]

Build a Multi Account Patch Compliance Dashboard with Kiro Specs

Introduction: Robust patch management is essential for maintaining system security, reliability, and compliance across your IT infrastructure. AWS Systems Manager Patch Manager provides a full-featured patching solution, enabling you to automate the deployment of operating system updates to managed nodes across AWS accounts, on-premises, and multicloud environments. However, as your organization scales across dozens or […]

From Monolith to Multi-Account: Pinterest’s AWS Organization Transformation Journey

Introduction: Pinterest launched in 2009 with a mission to bring everyone the inspiration to create a life they love. As one of the early cloud pioneers, Pinterest grew to hundreds of thousands of resources and exabytes of data within a single AWS account well before most cloud-native organizations operated at that scale or the best […]

How Honeycomb improved resilience using AWS Fault Injection Service

Building resilience within cloud workloads is an important goal for ISVs to prevent application downtime, increase system reliability, and build customer trust. Honeycomb.io is a fast and collaborative observability platform for software developers and engineering teams to understand and troubleshoot their cloud-native applications. Honeycomb gives you the rich context at sub-second query speeds and AI-assisted […]

Import Historical data from AWS CloudTrail Lake to Amazon CloudWatch

Organizations managing workloads on AWS rely on AWS CloudTrail to answer the fundamental questions: Who did what, where, and when? Since January 2022, customers have stored their CloudTrail activity logs in CloudTrail Lake, a managed data lake purpose-built for capturing, storing, and querying user and API activity across their AWS environment. As organizations scale across multiple […]

Shift-Left Tag Compliance using AWS Organizations and Terraform

In this post you will learn about AWS Organizations tag policies, the tag_policy_compliance Terraform provider setting, a reusable tagging module that automatically applies required tags, and a test-driven approach that dynamically validates against your organizational policies.
