Guidance for Training Protein Language Models (ESM-2) with Amazon SageMaker HyperPod
Overview
This Guidance demonstrates how to streamline and accelerate the training of protein language models using the managed platform of Amazon SageMaker HyperPod. By leveraging NVIDIA GPUs and automated cluster provisioning, researchers can significantly simplify the distributed training process for generative AI models like ESM-2. The solution addresses key challenges in high-performance computing for life sciences, enabling efficient model customization and deployment at scale. This approach helps research teams reduce operational complexity while maximizing computational resources, ultimately accelerating breakthrough discoveries in protein research and drug development.
Benefits
Accelerate ML model training deployment
Streamline ESM-2 model training with pre-configured HyperPod clusters that automatically handle distributed computing requirements. Reduce time-to-market while maintaining operational excellence through automated infrastructure deployment.
Optimize ML infrastructure costs
Reserve compute capacity through Flexible Training Plans and On-Demand Capacity Reservations for predictable pricing. Scale ML training resources efficiently while maintaining cost optimization through managed infrastructure.
Enhance ML operations visibility
Monitor training progress through comprehensive observability tools that provide real-time metrics. Track cluster health and performance indicators in unified dashboards.
How it works
Deploy SageMaker HyperPod cluster with SLURM orchestrator
This reference architecture demonstrates how to deploy Amazon SageMaker AI HyperPod clusters that use SLURM, an HPC workload manager, as the orchestrator.
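A SLURM-orchestrated HyperPod cluster can be created with the AWS CLI's `aws sagemaker create-cluster` command. The sketch below is illustrative, not a drop-in command: the cluster name, instance types and counts, S3 bucket, lifecycle script name, and IAM role ARN are all placeholder assumptions you would replace with values from your environment (the implementation guide and sample repository provide working lifecycle scripts).

```shell
# Hedged sketch: create a HyperPod cluster with a controller group and a
# GPU worker group. All <...> values and the instance sizing are assumptions.
aws sagemaker create-cluster \
  --cluster-name esm2-slurm-cluster \
  --instance-groups '[
    {
      "InstanceGroupName": "controller-group",
      "InstanceType": "ml.m5.xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://<your-bucket>/lifecycle-scripts/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::<account-id>:role/<hyperpod-execution-role>",
      "ThreadsPerCore": 1
    },
    {
      "InstanceGroupName": "worker-group",
      "InstanceType": "ml.p5.48xlarge",
      "InstanceCount": 2,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://<your-bucket>/lifecycle-scripts/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::<account-id>:role/<hyperpod-execution-role>",
      "ThreadsPerCore": 1
    }
  ]'
```

The lifecycle scripts referenced by `SourceS3Uri` run on each node at provisioning time and are what configure SLURM across the cluster.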
Run protein language model distributed training workloads on HyperPod-SLURM clusters
This reference architecture demonstrates how to run distributed ESM-2 model training jobs on a SLURM-based HyperPod cluster.
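On a SLURM-based cluster, a multi-node training run is typically submitted as a batch script that launches one `torchrun` process per node. The sketch below illustrates the pattern under stated assumptions: the script name `train_esm2.py`, its flags, the `/fsx` paths, and the 8-GPUs-per-node sizing are hypothetical placeholders; the actual training entry point and arguments are defined in the Guidance's sample repository.

```shell
#!/bin/bash
# Hedged sketch of a SLURM batch script for multi-node ESM-2 training.
#SBATCH --job-name=esm2-pretrain
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1      # one torchrun launcher per node
#SBATCH --exclusive
#SBATCH --output=logs/%x_%j.out

# Use the first node in the allocation as the rendezvous endpoint
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500

srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" \
  train_esm2.py \
    --model-name esm2_t33_650M_UR50D \
    --data-dir /fsx/uniref50 \
    --output-dir /fsx/checkpoints
```

Submitted with `sbatch train_esm2.sbatch`, this starts one launcher per node, and `torchrun` spawns one worker process per GPU, forming a single distributed process group across all nodes.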
Deploy SageMaker HyperPod cluster with Amazon EKS (Kubernetes) orchestrator
This reference architecture demonstrates how to deploy SageMaker HyperPod clusters that use Amazon EKS (Kubernetes) as the orchestrator.
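With the EKS orchestrator, the same `create-cluster` call attaches the HyperPod compute to an existing EKS control plane via the `--orchestrator` parameter. As before, this is a sketch: the EKS cluster ARN, subnet, security group, bucket, role, and instance sizing are placeholder assumptions.

```shell
# Hedged sketch: create a HyperPod cluster orchestrated by an existing
# Amazon EKS cluster. All <...> values are assumptions for illustration.
aws sagemaker create-cluster \
  --cluster-name esm2-eks-cluster \
  --orchestrator 'Eks={ClusterArn=arn:aws:eks:<region>:<account-id>:cluster/<eks-cluster-name>}' \
  --vpc-config 'SecurityGroupIds=<sg-id>,Subnets=<subnet-id>' \
  --instance-groups '[
    {
      "InstanceGroupName": "worker-group",
      "InstanceType": "ml.p5.48xlarge",
      "InstanceCount": 2,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://<your-bucket>/lifecycle-scripts/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::<account-id>:role/<hyperpod-execution-role>",
      "ThreadsPerCore": 1
    }
  ]'
```

Once provisioned, the HyperPod nodes register with the EKS cluster and appear as Kubernetes nodes, so workloads are scheduled with standard Kubernetes tooling.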
Run protein language model distributed training workloads on HyperPod-EKS clusters
This reference architecture demonstrates how to run distributed ESM-2 training jobs on an Amazon EKS-based HyperPod cluster.
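On an EKS-orchestrated cluster, one common way to run distributed PyTorch training is through the Kubeflow Training Operator's `PyTorchJob` resource. The sketch below assumes that operator is installed on the cluster; the container image URI, entry point, and GPU count are placeholders, and the Guidance's sample repository defines the actual manifests and images to use.

```shell
# Hedged sketch: submit a 2-node distributed ESM-2 training job as a
# Kubeflow PyTorchJob (Training Operator assumed installed). Image and
# command are placeholder assumptions.
kubectl apply -f - <<'EOF'
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: esm2-pretrain
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <account-id>.dkr.ecr.<region>.amazonaws.com/esm2-train:latest
              command: ["torchrun", "--nproc_per_node=8", "train_esm2.py"]
              resources:
                limits:
                  nvidia.com/gpu: 8
EOF
```

Progress can then be followed with `kubectl get pytorchjob esm2-pretrain` and `kubectl logs` on the worker pods.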
Deploy with confidence
We'll walk you through it
Dive deep into the implementation guide for additional customization options and service configurations to tailor to your specific needs.
Let's make it happen
Ready to deploy? Review the sample code on GitHub for detailed instructions to deploy the Guidance as-is or customize it to fit your needs.
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.