Category: AWS Trainium
Under the hood: Amazon EKS ultra scale clusters
This post was co-authored by Shyam Jeedigunta, Principal Engineer, Amazon EKS; Apoorva Kulkarni, Sr. Specialist Solutions Architect, Containers; and Raghav Tripathi, Sr. Software Dev Manager, Amazon EKS. Today, Amazon Elastic Kubernetes Service (Amazon EKS) announced support for clusters with up to 100,000 nodes. With Amazon EC2’s new generation of accelerated computing instance types, this translates to […]
Amazon EKS enables ultra scale AI/ML workloads with support for 100K nodes per cluster
We’re excited to announce that Amazon Elastic Kubernetes Service (Amazon EKS) now supports up to 100,000 worker nodes in a single cluster, enabling customers to scale up to 1.6 million AWS Trainium accelerators or 800K NVIDIA GPUs to train and run the largest AI/ML models. This capability empowers customers to pursue their most ambitious AI […]
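The headline figures imply a fixed accelerator density per node. As a quick back-of-the-envelope check (a sketch only; the per-node counts of 16 Trainium chips and 8 GPUs are inferred from the totals quoted above, not stated in the post):

```python
# Back-of-the-envelope check of the scale figures quoted above.
# Assumption: per-node accelerator density is implied by dividing the
# quoted totals by the 100,000-node cluster limit.
MAX_NODES = 100_000

trainium_per_node = 1_600_000 // MAX_NODES  # 1.6M Trainium accelerators
gpus_per_node = 800_000 // MAX_NODES        # 800K NVIDIA GPUs

print(trainium_per_node, gpus_per_node)  # 16 8
```

These densities line up with 16-accelerator Trainium instances and 8-GPU instance types, which is consistent with the totals in the announcement.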
Introducing AI on EKS: powering scalable AI workloads with Amazon EKS
This blog post was jointly authored by Vara Bonthu, Principal OSS Specialist Solutions Architect, and Omri Shiv, Senior Open Source ML Engineer. We’re excited to announce the launch of AI on EKS: a new open source initiative from Amazon Web Services (AWS) designed to help customers deploy, scale, and optimize artificial intelligence/machine learning (AI/ML) […]
Amazon EKS optimized Amazon Linux 2023 accelerated AMIs now available
Earlier this year we announced support for Amazon EKS optimized AL2023 AMIs, which provided many enhancements in security and performance. Amazon Linux 2023 (AL2023) is the next generation of Amazon Linux from Amazon Web Services (AWS) and is designed to provide a secure, stable, and high-performance environment in which to develop and run your […]
Scaling a Large Language Model with NVIDIA NIM on Amazon EKS with Karpenter
Many organizations are building artificial intelligence (AI) applications using Large Language Models (LLMs) to deliver new experiences to their customers, from content creation to customer service and data analysis. However, the substantial size and intensive computational requirements of these models can pose challenges in configuring, deploying, and scaling them effectively on graphics processing units (GPUs). […]