Containers
Category: Compute
Centralized Amazon ECS task logging with Amazon OpenSearch
As enterprises continue to adopt containerized workloads, the need for robust and scalable logging solutions has become increasingly important. Logging is a crucial element in monitoring and troubleshooting distributed applications, especially in modern containerized environments such as those deployed on Amazon Elastic Container Service (Amazon ECS). As microservices architectures grow in complexity, managing logs across multiple […]
Deep dive into cluster networking for Amazon EKS Hybrid Nodes
In this post, we dive deep into cluster networking configurations for Amazon EKS Hybrid Nodes, exploring different Container Network Interface (CNI) options and load balancing solutions to meet various networking requirements. The post demonstrates how to implement BGP routing with Cilium CNI, static routing with Calico CNI, and set up both on-premises load balancing using MetalLB and external load balancing using AWS Load Balancer Controller.
Under the hood: Amazon EKS ultra scale clusters
This post was co-authored by Shyam Jeedigunta, Principal Engineer, Amazon EKS; Apoorva Kulkarni, Sr. Specialist Solutions Architect, Containers; and Raghav Tripathi, Sr. Software Dev Manager, Amazon EKS. Today, Amazon Elastic Kubernetes Service (Amazon EKS) announced support for clusters with up to 100,000 nodes. With Amazon EC2’s new-generation accelerated computing instance types, this translates to […]
Amazon EKS enables ultra scale AI/ML workloads with support for 100K nodes per cluster
We’re excited to announce that Amazon Elastic Kubernetes Service (Amazon EKS) now supports up to 100,000 worker nodes in a single cluster, enabling customers to scale up to 1.6 million AWS Trainium accelerators or 800K NVIDIA GPUs to train and run the largest AI/ML models. This capability empowers customers to pursue their most ambitious AI […]
Improving Amazon ECS deployment consistency with SOCI Index Manifest v2
Seekable OCI (SOCI) helps Amazon Elastic Container Service (Amazon ECS) customers reduce task launch times by starting containers before their images are fully downloaded. To keep deployments reliable, Amazon ECS software version consistency ensures that the same container image is used throughout an ECS deployment. However, when running ECS tasks with SOCI, there was still […]
Fully Sharded Data Parallel with Ray on Amazon ECS
In this post, we demonstrate how to implement Fully Sharded Data Parallel (FSDP) fine-tuning of the dolly-v2-7b model using Amazon ECS. The solution uses a Ray cluster running on ECS with two services (head and worker) connected to Amazon S3, enabling efficient distributed training across multiple GPUs while abstracting away container orchestration complexities.
Amazon EKS Pod Identity streamlines cross account access
This post was co-authored by Ashok Srirama, Principal Container Specialist SA and George John, Senior Product Manager EKS. Today, we’re excited to announce a significant enhancement to Amazon EKS Pod Identity: streamlined cross-account access for Kubernetes applications. This new feature simplifies the process of granting pods permission to access AWS resources in other accounts. […]
Maximizing GPU Utilization using NVIDIA Run:ai in Amazon EKS
This post was co-authored with Chad Chapman of NVIDIA. In the fast-paced world of artificial intelligence and machine learning, GPU resources are both critical and in high demand. In this blog, we will cover key challenges related to GPU utilization in artificial intelligence and machine learning applications, and how NVIDIA Run:ai fractional GPU technology […]
Deep Dive: Amazon EKS Dashboard for Visibility into Multi-Cluster Operations and Governance
This blog post was jointly authored by Carlos Santana, Sr. Solution Architect, Containers; Sriram Ranganathan, Sr. Product Manager, Kubernetes; Sabari Sawant, Product Marketing Manager, Kubernetes; and Frank Carta, Sr. GTM specialist, Containers. As organizations grow their Kubernetes infrastructure across AWS Regions and accounts, they face increasing challenges in maintaining oversight of their Kubernetes clusters. Without […]
Introducing AI on EKS: powering scalable AI workloads with Amazon EKS
This blog post was jointly authored by Vara Bonthu, Principal OSS Specialist Solutions Architect and Omri Shiv, Senior Open Source ML Engineer. We’re excited to announce the launch of AI on EKS: a new open source initiative from Amazon Web Services (AWS) designed to help customers deploy, scale, and optimize artificial intelligence/machine learning (AI/ML) […]