Amazon SageMaker HyperPod features
Scale and accelerate generative AI model development across thousands of AI accelerators
Task governance
Flexible training plans
Optimized recipes to customize models
SageMaker HyperPod recipes help data scientists and developers of all skill levels benefit from state-of-the-art performance while quickly getting started training and fine-tuning publicly available generative AI models, including Llama, Mixtral, Mistral, and DeepSeek models. In addition, you can customize Amazon Nova foundation models (FMs), including Nova Micro, Nova Lite, and Nova Pro, using a suite of techniques including Supervised Fine-Tuning (SFT), Knowledge Distillation, Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Continued Pre-Training, with support for both parameter-efficient and full-model training options across SFT, Distillation, and DPO. Each recipe includes a training stack tested by AWS, removing weeks of tedious work evaluating different model configurations. You can switch between GPU-based and AWS Trainium–based instances with a one-line recipe change, enable automated model checkpointing for improved training resiliency, and run workloads in production on SageMaker HyperPod.
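The one-line accelerator switch can be pictured as overriding a single field in a recipe configuration. The sketch below is illustrative only; the field names and structure are hypothetical, not the actual SageMaker HyperPod recipe schema:

```python
# Illustrative sketch only: field names and values are hypothetical,
# not the actual SageMaker HyperPod recipe schema.
llama_recipe = {
    "model": "llama-3-8b",
    "instance_type": "ml.p5.48xlarge",  # GPU-based instances
    "checkpointing": {"enabled": True, "interval_steps": 500},
}

# Switching to AWS Trainium is a one-line change of the instance type;
# the tested training stack behind the recipe handles the rest.
llama_recipe["instance_type"] = "ml.trn1.32xlarge"  # Trainium-based instances

print(llama_recipe["instance_type"])  # → ml.trn1.32xlarge
```

Because the recipe bundles the full, AWS-tested training stack, the rest of the configuration stays unchanged across accelerator types.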
High-performing distributed training
Advanced observability and experimentation tools
SageMaker HyperPod observability provides a unified dashboard preconfigured in Amazon Managed Grafana, with monitoring data automatically published to an Amazon Managed Service for Prometheus workspace. You can see real-time performance metrics, resource utilization, and cluster health in a single view, so teams can quickly spot bottlenecks, prevent costly delays, and optimize compute resources. SageMaker HyperPod also integrates with Amazon CloudWatch Container Insights, providing deeper insights into cluster performance, health, and usage. Managed TensorBoard in SageMaker saves development time by visualizing model architecture to help identify and remediate convergence issues. Managed MLflow in SageMaker helps you efficiently manage experiments at scale.
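The kind of experiment tracking that managed MLflow handles at scale can be pictured with a minimal stand-in logger. The class and method names below are hypothetical, for illustration only; they are not the MLflow API:

```python
# Minimal stand-in illustrating the experiment-tracking pattern that
# managed MLflow in SageMaker provides at scale. Names are hypothetical,
# not the MLflow API.
class ExperimentRun:
    def __init__(self, name: str):
        self.name = name
        self.metrics: dict[str, list[tuple[int, float]]] = {}

    def log_metric(self, key: str, value: float, step: int) -> None:
        """Record one (step, value) observation for a metric."""
        self.metrics.setdefault(key, []).append((step, value))

    def is_converging(self, key: str, window: int = 3) -> bool:
        """Crude check: has the metric decreased over the last `window` steps?"""
        history = [value for _, value in self.metrics.get(key, [])]
        if len(history) < window + 1:
            return True  # not enough data yet to flag a stall
        recent = history[-(window + 1):]
        return recent[-1] < recent[0]

run = ExperimentRun("llama-finetune-trial-1")
for step, loss in enumerate([2.1, 1.7, 1.5, 1.49, 1.48, 1.48]):
    run.log_metric("train_loss", loss, step)
print(run.is_converging("train_loss"))  # → True
```

A tracking service adds what this toy omits: persistence, comparison across many runs, and dashboards over the logged metrics.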

Workload scheduling and orchestration
Automatic cluster health check and repair
Accelerate open-weights model deployments from SageMaker JumpStart
SageMaker HyperPod streamlines the deployment of open-weights FMs from SageMaker JumpStart, as well as fine-tuned models from Amazon S3 and Amazon FSx. SageMaker HyperPod automatically provisions the required infrastructure and configures endpoints, eliminating manual setup. With SageMaker HyperPod task governance, endpoint traffic is continuously monitored and compute resources are dynamically adjusted, while comprehensive performance metrics are published to the observability dashboard for real-time monitoring and optimization.
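The traffic-driven adjustment that task governance automates can be sketched as a simple sizing decision. The thresholds, names, and logic below are hypothetical and purely illustrative, not the actual task governance implementation:

```python
import math

# Toy sketch of a traffic-driven scaling decision of the kind that
# SageMaker HyperPod task governance automates for endpoints.
# Capacity figures, bounds, and logic are hypothetical.
def desired_instance_count(requests_per_second: float,
                           capacity_per_instance: float = 100.0,
                           min_instances: int = 1,
                           max_instances: int = 8) -> int:
    """Return an instance count sized to the current request load,
    clamped to the allowed range."""
    needed = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))

print(desired_instance_count(250))  # → 3 (scale out under load)
print(desired_instance_count(10))   # → 1 (scale in when traffic drops)
```

In the managed service, the same monitoring signals also feed the observability dashboard, so the scaling behavior is visible alongside the performance metrics that drive it.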
