
Amazon SageMaker HyperPod customers

Top AI start-ups and organizations of all sizes are training and deploying foundation models at scale on SageMaker HyperPod

Hugging Face

Hugging Face used SageMaker HyperPod to create new open foundation models like StarCoder, IDEFICS, and Zephyr. SageMaker HyperPod's purpose-built resiliency and performance capabilities have enabled their open science team to focus on innovating and publishing important improvements to the ways foundation models are built, rather than managing infrastructure.

Perplexity AI

Perplexity built and fine-tuned the LLMs that power their conversational answer engine, which answers questions along with references provided in the form of citations. With SageMaker HyperPod, they perform model training 40% faster and run experiments twice as fast.

Articul8 AI

Articul8 increased productivity by up to 35% using SageMaker HyperPod.

Coastal Carbon

Coastal Carbon is revolutionizing environmental conservation through artificial intelligence and the cloud. With SageMaker HyperPod, they process thousands of petabytes of historical satellite data in order to create a digital twin and foundation model of the natural world.

EvolutionaryScale

EvolutionaryScale is a pioneering AI startup that enables scientists to understand, imagine, and create proteins. With SageMaker HyperPod, they trained on over 2 billion protein sequences, pushing the limits of protein engineering and drug discovery.

Writer

Writer is pioneering a new era of LLM development. They trained their industry-leading models on HyperPod with faster model training, reduced latency, and optimized AI performance.

Noetik

Noetik is an AI-native biotechnology company leveraging SageMaker HyperPod to discover and develop cancer therapeutics.

Hugging Face

Hugging Face has been using SageMaker HyperPod to create important new open foundation models like StarCoder, IDEFICS, and Zephyr which have been downloaded millions of times. SageMaker HyperPod’s purpose-built resiliency and performance capabilities have enabled our open science team to focus on innovating and publishing important improvements to the ways foundation models are built, rather than managing infrastructure. We especially liked how SageMaker HyperPod is able to detect ML hardware failure and quickly replace the faulty hardware without disrupting ongoing model training. Because our teams need to innovate quickly, this automated job recovery feature helped us minimize disruption during the foundation model training process, helping us save hundreds of hours of training time in just a year.

Jeff Boudier, head of Product at Hugging Face


Perplexity AI

We were looking for the right ML infrastructure to increase productivity and reduce costs in order to build high-performing large language models. After running a few successful experiments, we switched to AWS from other cloud providers in order to use Amazon SageMaker HyperPod. We have been using HyperPod for the last four months to build and fine-tune the LLMs to power the Perplexity conversational answer engine that answers questions along with references provided in the form of citations. Because SageMaker HyperPod automatically monitors cluster health and remediates GPU failures, our developers are able to focus on model building instead of spending time on managing and optimizing the underlying infrastructure. SageMaker HyperPod’s built-in data and model parallel libraries helped us optimize training time on GPUs and double the training throughput. As a result, our training experiments can now run twice as fast, which means our developers can iterate more quickly, accelerating the development of new generative AI experiences for our customers.

Aravind Srinivas, co-founder and CEO at Perplexity AI


Articul8 AI

Amazon SageMaker HyperPod has helped us tremendously in managing and operating our computational resources more efficiently with minimum downtime. We were early adopters of the Slurm-based HyperPod service and have benefitted from its ease-of-use and resiliency features, resulting in up to 35% productivity improvement and rapid scale-up of our GenAI operations. As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us, as it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. It also helps our end customers, as we are now able to package and productize this capability into our GenAI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.

Arun Subramaniyan, Founder and CEO of Articul8 AI


Thomson Reuters

Thomson Reuters has been at the forefront of AI development for over 30 years, and we are committed to providing meaningful solutions that help our customers deliver results faster, with better access to trusted information. To accelerate our innovation in generative AI, in addition to partnering with LLM providers, we are also exploring training custom models more efficiently with our unique and proprietary content and human expertise. SageMaker HyperPod's distributed training libraries help us improve large-scale model training performance. And its resiliency features save us time on monitoring and managing infrastructure. Training our foundation models on SageMaker HyperPod will increase our speed to market and help us provide quality solutions for our customers at pace.

Joel Hron, Head of AI and Labs, Thomson Reuters and John Duprey, Distinguished Engineer, Thomson Reuters Labs


Stability AI

As the leading open-source generative AI company, our goal is to maximize the accessibility of modern AI. We are building foundation models with tens of billions of parameters, which require infrastructure that can scale optimized training performance. With SageMaker HyperPod's managed infrastructure and optimization libraries, we can reduce training time and costs by over 50%. It makes our model training more resilient and performant, so we can build state-of-the-art models faster.

Emad Mostaque, Founder and CEO, Stability AI


Recursal AI

The whole process was streamlined. Using SageMaker HyperPod, we can take advantage of cluster resiliency features that identify and automatically recover training jobs from the last saved checkpoint in the event of a hardware failure. We run very diverse workloads - applications, inference, and training - with Kubernetes as the common thread. For us, Amazon EKS with SageMaker HyperPod just works: the nodes just drop into our cluster.

Nathan Wilce, Infrastructure/data lead, Recursal


Hippocratic AI

Hippocratic AI is an AI company that develops the first safety-focused large language model (LLM) for healthcare. To train its primary LLM and supervisor models, Hippocratic AI required powerful compute resources, which were in high demand and difficult to obtain. Amazon SageMaker HyperPod flexible training plans made it easier for the company to gain access to Amazon Elastic Compute Cloud (Amazon EC2) P5 instances. Hippocratic AI also leverages AWS services such as Grafana to track important GPU utilization metrics. Using Amazon EC2 P5 instances, Hippocratic AI has increased model training speed by four times and scaled its solution to accommodate hundreds of use cases, helping the company secure the required compute resources and train models quickly.


NinjaTech

NinjaTech AI, a generative AI company that provides an all-in-one SuperAgent for unlimited productivity, used Amazon SageMaker HyperPod flexible training plans to accelerate and automate fine-tuning of various internal models, including the Llama 3.1 405B model, and to reduce model training costs. The company aims to provide a seamless experience to users who want access to the various AI agents powering its SuperAgent technology. To achieve this, it needed a model that could automatically predict user intention and determine which AI agent would be a good fit. This mechanism required frequent model updates that incorporate customer feedback and new features iteratively, involving 10M-100M tokens in each round of LoRA fine-tuning. As a startup, acquiring and operating high-performance compute resources is challenging due to steep costs and bandwidth constraints, especially for multi-node clusters that require fast networking and fast storage in addition to accelerated computing. The training process is also time-consuming, involving steps like model downloading, distributed training, checkpointing, monitoring, auto-remediation, merging, and quantization. HyperPod flexible training plans provided the company with reliable and affordable compute reserved in advance of the training run, matching its specific compute and timeline requirements while ensuring efficient model training.


OpenBabylon

Developers and data scientists at OpenBabylon, an AI company that customizes large language models for underrepresented languages, have been using SageMaker HyperPod flexible training plans for a few months to streamline their access to GPU resources for large-scale experiments. Using SageMaker HyperPod's multi-node distributed training capabilities, they conducted 100 large-scale model training experiments, achieving state-of-the-art results in English-to-Ukrainian translation. This breakthrough was achieved on time and cost-effectively, demonstrating SageMaker HyperPod's ability to deliver complex projects on time and on budget.


Salesforce

Researchers at Salesforce were looking for ways to quickly get started with foundation model training and fine-tuning, without having to worry about infrastructure or spend weeks optimizing their training stack for each new model. With Amazon SageMaker HyperPod recipes, researchers at Salesforce can prototype rapidly when customizing FMs. Now, Salesforce's AI Research teams are able to get started in minutes with a variety of pre-training and fine-tuning recipes, and can operationalize frontier models with high performance.


H.AI

" With Amazon SageMaker HyperPod, we built and deployed the foundation models behind our agentic AI platform using the same high-performance compute. This seamless transition from training to inference streamlined our workflow, reduced time to production, and ensured consistent performance in live environments. HyperPod helped us go from experimentation to real-world impact with greater speed and efficiency. "

Laurent Sifre, Co-founder & CTO, H.AI


Datology AI

" We are excited to use Amazon SageMaker HyperPod’s one-click observability solution. Our senior staff members needed insights into how we’re utilizing expensive GPU resources. The pre-built Grafana dashboards will give us exactly what we needed, with immediate visibility into critical metrics - from task-specific GPU utilization to file system (FSx for Lustre) performance - without requiring us to maintain any monitoring infrastructure. As someone who appreciates the power of the Prometheus Query Language, I like the fact that I can write my own queries and analyze custom metrics without worrying about infrastructure problems. "

Josh Wills, Member of Technical Staff, Datology AI


Amazon SageMaker HyperPod partners

Drive innovation and unlock greater business value with AWS partners that have deep technical knowledge and proven customer success

Accenture

" We are extending our partnership with AWS as a launch partner for Amazon SageMaker HyperPod task governance. Our collaboration with AWS will allow us to guide customers towards the latest technological breakthroughs while helping to reduce generative AI application costs. By bringing together centralized governance capabilities in SageMaker HyperPod, and our experience in generative AI projects, we can help companies realize the value of generative AI even faster, improving customer experience and increasing return on investment. "

Jennifer Jackson, Global Lead for Accenture AWS Business Group & Senior Managing Director


Slalom

" We are thrilled to collaborate with AWS as a launch partner for Amazon SageMaker HyperPod task governance. Working with AWS, we can now help our customers rapidly adopt the latest technological advancements and reduce the costs of their generative AI applications. By bringing together centralized governance capabilities in SageMaker HyperPod, with Slalom’s extensive AI and cloud experience, we can deliver exceptional customer experiences alongside increased return on investment. "

Jeff Kempiners, Managing Director of Slalom’s Amazon Center of Excellence (CoE)


Rackspace Technology

" We are excited to collaborate with AWS as a launch partner for SageMaker HyperPod task governance. Together, we can help our customers reduce the costs of generative AI applications, while keeping up with the latest technological advancements. By combining SageMaker HyperPod’s centralized governance capabilities with Rackspace’s deep AI and cloud expertise, we can transform customer experiences and improve their return on investment simultaneously. "

Srini Koushik, President, AI, Technology and Sustainability at Rackspace Technology
