
Amazon SageMaker HyperPod customers

Top AI start-ups and organizations of all sizes are training and deploying foundation models at scale on SageMaker HyperPod

Hugging Face

Hugging Face used SageMaker HyperPod to create new open foundation models like StarCoder, IDEFICS, and Zephyr. SageMaker HyperPod's purpose-built resiliency and performance capabilities have enabled their open science team to focus on innovating and publishing important improvements to the ways foundation models are built, rather than managing infrastructure.

Perplexity AI

Perplexity built and fine-tuned the LLMs that power their conversational answer engine, which answers questions along with references provided in the form of citations. With SageMaker HyperPod, they perform model training 40% faster and run experiments twice as fast.

Articul8 AI

Articul8 increased productivity by up to 35% using SageMaker HyperPod.

Coastal Carbon

Coastal Carbon is revolutionizing environmental conservation through artificial intelligence and the cloud. With SageMaker HyperPod, they process thousands of petabytes of historical satellite data in order to create a digital twin and foundation model of the natural world.

EvolutionaryScale

EvolutionaryScale is a pioneering AI startup that enables scientists to understand, imagine, and create proteins. With SageMaker HyperPod, they trained on over 2 billion protein sequences, pushing the limits of protein engineering and drug discovery.

Writer

Writer is pioneering a new era of LLM development. They trained their industry-leading models on HyperPod with faster model training, reduced latency, and optimized AI performance.

Noetik

Noetik is an AI-native biotechnology company leveraging SageMaker HyperPod to discover and develop cancer therapeutics.

Hugging Face

Hugging Face has been using SageMaker HyperPod to create important new open foundation models like StarCoder, IDEFICS, and Zephyr which have been downloaded millions of times. SageMaker HyperPod’s purpose-built resiliency and performance capabilities have enabled our open science team to focus on innovating and publishing important improvements to the ways foundation models are built, rather than managing infrastructure. We especially liked how SageMaker HyperPod is able to detect ML hardware failure and quickly replace the faulty hardware without disrupting ongoing model training. Because our teams need to innovate quickly, this automated job recovery feature helped us minimize disruption during the foundation model training process, helping us save hundreds of hours of training time in just a year.

Jeff Boudier, head of Product at Hugging Face


Perplexity AI

We were looking for the right ML infrastructure to increase productivity and reduce costs in order to build high-performing large language models. After running a few successful experiments, we switched to AWS from other cloud providers in order to use Amazon SageMaker HyperPod. We have been using HyperPod for the last four months to build and fine-tune the LLMs to power the Perplexity conversational answer engine that answers questions along with references provided in the form of citations. Because SageMaker HyperPod automatically monitors cluster health and remediates GPU failures, our developers are able to focus on model building instead of spending time on managing and optimizing the underlying infrastructure. SageMaker HyperPod’s built-in data and model parallel libraries helped us optimize training time on GPUs and double the training throughput. As a result, our training experiments can now run twice as fast, which means our developers can iterate more quickly, accelerating the development of new generative AI experiences for our customers.

Aravind Srinivas, co-founder and CEO at Perplexity AI


Articul8 AI

Amazon SageMaker HyperPod has helped us tremendously in managing and operating our computational resources more efficiently with minimum downtime. We were early adopters of the Slurm-based HyperPod service and have benefitted from its ease-of-use and resiliency features, resulting in up to 35% productivity improvement and rapid scale-up of our GenAI operations. As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us, as it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. It also helps our end customers, as we are now able to package and productize this capability into our GenAI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.

Arun Subramaniyan, Founder and CEO of Articul8 AI


Thomson Reuters

Thomson Reuters has been at the forefront of AI development for over 30 years, and we are committed to providing meaningful solutions that help our customers deliver results faster, with better access to trusted information. To accelerate our innovation in generative AI, in addition to partnering with LLM providers, we are also exploring training custom models more efficiently with our unique and proprietary content and human expertise. SageMaker HyperPod's distributed training libraries help us improve large-scale model training performance. And its resiliency features save us time on monitoring and managing infrastructure. Training our foundation models on SageMaker HyperPod will increase our speed to market and help us provide quality solutions for our customers at pace.

Joel Hron, Head of AI and Labs, Thomson Reuters and John Duprey, Distinguished Engineer, Thomson Reuters Labs


Stability AI

As the leading open-source generative AI company, our goal is to maximize the accessibility of modern AI. We are building foundation models with tens of billions of parameters, which require infrastructure that can scale optimized training performance. With SageMaker HyperPod's managed infrastructure and optimization libraries, we can reduce training time and costs by over 50%. It makes our model training more resilient and performant, so we can build state-of-the-art models faster.

Emad Mostaque, Founder and CEO, Stability AI


Recursal AI

The whole process was streamlined. Using SageMaker HyperPod, we can take advantage of cluster resiliency features that identify and automatically recover training jobs from the last saved checkpoint in the event of a hardware failure. We run very diverse workloads - applications, inference, and training - with Kubernetes as the common thread. For us, Amazon EKS with SageMaker HyperPod just works: the nodes just drop into our cluster.

Nathan Wilce, Infrastructure/data lead, Recursal


Hippocratic AI

Hippocratic AI is an AI company that develops the first safety-focused large language model (LLM) for healthcare. To train its primary LLM and supervisor models, Hippocratic AI required powerful compute resources, which were in high demand and difficult to obtain. Amazon SageMaker HyperPod flexible training plans made it easier for the company to gain access to Amazon Elastic Compute Cloud (Amazon EC2) P5 instances. Hippocratic AI also leverages AWS services such as Grafana to track important GPU utilization metrics. Using Amazon EC2 P5 instances, Hippocratic AI has increased model training speed by four times and scaled its solution to accommodate hundreds of use cases, helping the company secure the required compute resources and train models quickly.


NinjaTech

NinjaTech AI, a generative AI company that provides an all-in-one SuperAgent for unlimited productivity, used Amazon SageMaker HyperPod flexible training plans to accelerate and automate fine-tuning of various internal models, including the Llama 3.1 405B model, and to reduce model training costs. The company aims to provide a seamless experience to users who want access to the various AI agents powering its SuperAgent technology. To achieve this, it needed a model that could automatically predict user intention and determine which AI agent would be a good fit. This mechanism required frequent model updates that incorporate customer feedback and new features iteratively, involving 10M-100M tokens in each round of LoRA fine-tuning. As a startup, acquiring and operating high-performance compute resources is challenging due to steep costs and bandwidth constraints, especially for multi-node clusters that require fast networking and fast storage in addition to accelerated computing. The training process is also time-consuming, involving steps like model downloading, distributed training, checkpointing, monitoring, auto-remediation, merging, and quantization. HyperPod flexible training plans provided the company with reliable and affordable compute reserved in advance of the training run, matching its specific compute and timeline requirements while ensuring efficient model training.


OpenBabylon

Developers and data scientists at OpenBabylon, an AI company that customizes large language models for underrepresented languages, have been using SageMaker HyperPod flexible training plans for a few months to streamline their access to GPU resources for large-scale experiments. Using SageMaker HyperPod's multi-node distributed training capabilities, they conducted 100 large-scale model training experiments, achieving state-of-the-art results in English-to-Ukrainian translation. This breakthrough was achieved on time and cost-effectively, demonstrating SageMaker HyperPod's ability to deliver complex projects on time and on budget.


Salesforce

Researchers at Salesforce were looking for ways to quickly get started with foundation model training and fine-tuning, without having to worry about infrastructure or spend weeks optimizing their training stack for each new model. With Amazon SageMaker HyperPod recipes, researchers at Salesforce can prototype rapidly when customizing FMs. Now, Salesforce's AI Research teams are able to get started in minutes with a variety of pre-training and fine-tuning recipes, and can operationalize frontier models with high performance.


H.AI

" With Amazon SageMaker HyperPod, we built and deployed the foundation models behind our agentic AI platform using the same high-performance compute. This seamless transition from training to inference streamlined our workflow, reduced time to production, and ensured consistent performance in live environments. HyperPod helped us go from experimentation to real-world impact with greater speed and efficiency. "

Laurent Sifre, Co-founder & CTO, H.AI


Datology AI

" We are excited to use Amazon SageMaker HyperPod’s one-click observability solution. Our senior staff members needed insights into how we’re utilizing expensive GPU resources. The pre-built Grafana dashboards will give us exactly what we needed, with immediate visibility into critical metrics - from task-specific GPU utilization to file system (FSx for Lustre) performance - without requiring us to maintain any monitoring infrastructure. As someone who appreciates the power of the Prometheus Query Language, I like the fact that I can write my own queries and analyze custom metrics without worrying about infrastructure problems. "

Josh Wills, Member of Technical Staff, Datology AI


Amazon SageMaker HyperPod partners

Drive innovation and unlock greater business value with AWS partners that have deep technical knowledge and proven customer success

Accenture

" We are extending our partnership with AWS as a launch partner for Amazon SageMaker HyperPod task governance. Our collaboration with AWS will allow us to guide customers towards the latest technological breakthroughs while helping to reduce generative AI application costs. By bringing together centralized governance capabilities in SageMaker HyperPod, and our experience in generative AI projects, we can help companies realize the value of generative AI even faster, improving customer experience and increasing return on investment. "

Jennifer Jackson, Global Lead for Accenture AWS Business Group & Senior Managing Director


Slalom

" We are thrilled to collaborate with AWS as a launch partner for Amazon SageMaker HyperPod task governance. Working with AWS, we can now help our customers rapidly adopt the latest technological advancements and reduce the costs of their generative AI applications. By bringing together centralized governance capabilities in SageMaker HyperPod, with Slalom’s extensive AI and cloud experience, we can deliver exceptional customer experiences alongside increased return on investment. "

Jeff Kempiners, Managing Director of Slalom’s Amazon Center of Excellence (CoE)


Rackspace Technology

" We are excited to collaborate with AWS as a launch partner for SageMaker HyperPod task governance. Together, we can help our customers reduce the costs of generative AI applications, while keeping up with the latest technological advancements. By combining SageMaker HyperPod’s centralized governance capabilities with Rackspace’s deep AI and cloud expertise, we can transform customer experiences and improve their return on investment simultaneously. "

Srini Koushik, President, AI, Technology and Sustainability at Rackspace Technology
