Containers
Simplify network connectivity using Tailscale with Amazon EKS Hybrid Nodes
This post guides readers through integrating Tailscale with Amazon EKS Hybrid Nodes to simplify and secure network connectivity between on-premises infrastructure and AWS. The integration enables encrypted point-to-point connections using the WireGuard protocol, creating a peer-to-peer mesh network that streamlines the network architecture needed for EKS Hybrid Nodes.
Testing network resilience of AWS Fargate workloads on Amazon ECS using AWS Fault Injection Service
In this post, we demonstrate how to test network resilience of AWS Fargate workloads on Amazon ECS using AWS Fault Injection Service’s new network fault injection capabilities, including network latency, blackhole, and packet loss experiments. Through a sample three-tier application architecture, we show how to conduct controlled chaos engineering experiments to validate application behavior during network disruptions and improve system resilience.
Streamline service-to-service communication during deployments with Amazon ECS Service Connect
When deploying containerized microservices, maintaining reliable service discovery and efficient routing during updates presents significant challenges. Traditional blue/green deployment approaches rely heavily on load balancers for traffic management, which can become complex when dealing with container-based service-to-service communication. This complexity increases the possibility of service disruption and makes it difficult to test new versions in […]
Scaling beyond IPv4: integrating IPv6 Amazon EKS clusters into existing Istio Service Mesh
Organizations are increasingly adopting IPv6 for their Amazon Elastic Kubernetes Service (Amazon EKS) deployments, driven by three key factors: depletion of private IPv4 addresses, the need to streamline or eliminate overlay networks, and improved network security requirements on Amazon Web Services (AWS). In IPv6-enabled EKS clusters, each pod receives a unique IPv6 address from the […]
Centralized Amazon ECS task logging with Amazon OpenSearch
As enterprises continue to adopt containerized workloads, the need for robust and scalable logging solutions has become increasingly important. Logging is a crucial element in monitoring and troubleshooting distributed applications, especially in modern containerized environments such as those deployed on Amazon Elastic Container Service (Amazon ECS). As microservices architectures grow in complexity, managing logs across multiple […]
Deep dive into cluster networking for Amazon EKS Hybrid Nodes
In this post, we dive deep into cluster networking configurations for Amazon EKS Hybrid Nodes, exploring different Container Network Interface (CNI) options and load balancing solutions to meet various networking requirements. The post demonstrates how to implement BGP routing with Cilium CNI and static routing with Calico CNI, and how to set up both on-premises load balancing using MetalLB and external load balancing using AWS Load Balancer Controller.
Under the hood: Amazon EKS ultra scale clusters
This post was co-authored by Shyam Jeedigunta, Principal Engineer, Amazon EKS; Apoorva Kulkarni, Sr. Specialist Solutions Architect, Containers; and Raghav Tripathi, Sr. Software Dev Manager, Amazon EKS. Today, Amazon Elastic Kubernetes Service (Amazon EKS) announced support for clusters with up to 100,000 nodes. With Amazon EC2’s new generation accelerated computing instance types, this translates to […]
Amazon EKS enables ultra scale AI/ML workloads with support for 100K nodes per cluster
We’re excited to announce that Amazon Elastic Kubernetes Service (Amazon EKS) now supports up to 100,000 worker nodes in a single cluster, enabling customers to scale up to 1.6 million AWS Trainium accelerators or 800K NVIDIA GPUs to train and run the largest AI/ML models. This capability empowers customers to pursue their most ambitious AI […]
Improving Amazon ECS deployment consistency with SOCI Index Manifest v2
Seekable OCI (SOCI) helps Amazon Elastic Container Service (Amazon ECS) customers reduce task launch times by starting containers before their images are fully downloaded. To keep deployments reliable, Amazon ECS software version consistency ensures that the same container image is used throughout an ECS deployment. However, when running ECS tasks with SOCI, there was still […]
Fully Sharded Data Parallel with Ray on Amazon ECS
In this post, we demonstrate how to implement Fully Sharded Data Parallel (FSDP) fine-tuning of the dolly-v2-7b model using Amazon ECS. The solution uses a Ray cluster running on ECS with two services (head and worker) connected to Amazon S3, enabling efficient distributed training across multiple GPUs while abstracting away container orchestration complexities.