AWS Cloud Operations Blog

Category: Management & Governance

Log analysis with facets, correlation, enrichment, and automation in Amazon CloudWatch Log Analytics

Teams working with distributed applications accumulate logs across multiple log groups, including application logs, access logs, and audit trails. When something needs investigating, an engineer opens the console and starts writing queries from scratch. The same query gets written differently by different people. The results lack context because the log event does not contain who […]

GPU Cost Attribution in Amazon EKS Using Amazon Managed Service for Prometheus, Amazon Managed Grafana, and OpenTelemetry

As organizations scale their AI and machine learning workloads on Amazon Elastic Kubernetes Service (Amazon EKS), GPU instances often represent the largest portion of compute costs. Without granular visibility into how these resources are consumed, teams struggle to attribute costs accurately. Consider a shared EKS cluster where Team A (Research) runs experimental ML models on […]

Build a Multi Account Patch Compliance Dashboard with Kiro Specs

Robust patch management is essential for maintaining system security, reliability, and compliance across your IT infrastructure. AWS Systems Manager Patch Manager provides a full-featured patching solution, enabling you to automate the deployment of operating system updates to managed nodes across AWS accounts, on-premises, and multicloud environments. However, as your organization scales across dozens or […]

From Monolith to Multi-Account: Pinterest’s AWS Organization Transformation Journey

Pinterest launched in 2009 with a mission to bring everyone the inspiration to create a life they love. As one of the early cloud pioneers, Pinterest grew to hundreds of thousands of resources and exabytes of data within a single AWS account well before most cloud-native organizations operated at that scale or the best […]

How Honeycomb improved resilience using AWS Fault Injection Service

Building resilience within cloud workloads is an important goal for ISVs to prevent application downtime, increase system reliability, and build customer trust. Honeycomb.io is a fast and collaborative observability platform for software developers and engineering teams to understand and troubleshoot their cloud-native applications. Honeycomb gives you the rich context at sub-second query speeds and AI-assisted […]

Import Historical data from AWS CloudTrail Lake to Amazon CloudWatch

Organizations managing workloads on AWS rely on AWS CloudTrail to answer the fundamental questions: Who did what, where, and when? Since January 2022, customers have stored their CloudTrail activity logs in CloudTrail Lake, a managed data lake purpose-built for capturing, storing, and querying user and API activity across their AWS environment. As organizations scale across multiple […]

Shift-Left Tag Compliance using AWS Organizations and Terraform

In this post you will learn about AWS Organizations tag policies, the tag_policy_compliance Terraform provider setting, a reusable tagging module that automatically applies required tags, and a test-driven approach that dynamically validates against your organizational policies.

Simplifying Prometheus metrics collection across your AWS infrastructure

If you’re running services such as Amazon EC2 instances, Amazon Elastic Container Service (Amazon ECS) containers, and Amazon Managed Streaming for Apache Kafka (Amazon MSK) clusters in AWS, maintaining separate Prometheus servers for each environment creates significant operational burden. Managing scraper configurations, high availability, scaling, and security distracts you from building great applications. AWS managed […]

AWS Unified Operations: Building Resilient Operations for Mission-Critical Workloads

Achieve Mission-Critical Resiliency at Scale with AWS Unified Operations – The Top Tier of AWS Support to Achieve High Availability, Faster Migrations, and Accelerated Incident Resolution. The Shift-Left Paradigm: From Reactive Firefighting to Proactive Prevention. Organizations running mission-critical workloads face three critical operational gaps that undermine resilience and slow cloud adoption. Skills gaps make cloud-native […]

Essential security controls to prevent unauthorized account removal in AWS Organizations

When AWS member accounts are compromised, attackers can remove them from your organization, disabling all governance controls. In this post, you’ll learn how to prevent compromised accounts from leaving your organization by using layered security controls, including service control policies, secure account migration, and centralized root access management. AWS secures the infrastructure […]