What is SageMaker Model Training?
Amazon SageMaker Model Training reduces the time and cost to train and tune machine learning (ML) models at scale without the need to manage infrastructure. You can take advantage of the highest-performing ML compute infrastructure currently available, and Amazon SageMaker AI can automatically scale infrastructure up or down, from one to thousands of GPUs. To train deep learning models faster, SageMaker AI helps you select and refine datasets in real time. SageMaker distributed training libraries can automatically split large models and training datasets across AWS GPU instances, or you can use third-party libraries, such as DeepSpeed, Horovod, or Megatron. Train foundation models (FMs) for weeks and months without disruption by automatically monitoring and repairing training clusters.
Benefits of cost-effective training
Train models at scale
Fully managed training jobs
SageMaker training jobs offer a fully managed experience for large distributed FM training, removing the undifferentiated heavy lifting of infrastructure management. A training job automatically spins up a resilient distributed training cluster, monitors the infrastructure, and auto-recovers from faults to ensure a smooth training experience. Once training is complete, SageMaker spins down the cluster and you are billed only for the net training time. You also have the flexibility to choose the instance type that best fits an individual workload (for example, pretraining a large language model (LLM) on a P5 cluster or fine-tuning an open-source LLM on p4d instances) to further optimize your training budget. In addition, SageMaker training jobs offer a consistent experience across ML teams with varying levels of technical expertise and different workload types.
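The "choose the right instance type for the workload" idea above can be sketched in code. This is a minimal, hedged illustration of assembling the keyword arguments one would pass to the SageMaker Python SDK's Estimator class; the image URI, IAM role, and S3 paths are placeholder assumptions, not values from this page.

```python
def build_training_job_config(instance_type: str, instance_count: int) -> dict:
    """Assemble keyword arguments for a SageMaker training job.

    The dict mirrors common sagemaker.estimator.Estimator parameters;
    image_uri and role below are hypothetical placeholders.
    """
    return {
        "image_uri": "<training-image-uri>",          # placeholder
        "role": "arn:aws:iam::<account-id>:role/<sagemaker-role>",  # placeholder
        # Pick the instance type that fits the workload, e.g. ml.p5.48xlarge
        # for LLM pretraining or ml.p4d.24xlarge for fine-tuning.
        "instance_type": instance_type,
        "instance_count": instance_count,
        # Billing stops when the cluster spins down; cap runtime defensively.
        "max_run": 24 * 60 * 60,  # seconds
    }

config = build_training_job_config("ml.p4d.24xlarge", instance_count=2)
# With the SageMaker Python SDK installed, one would then run:
#   from sagemaker.estimator import Estimator
#   Estimator(**config).fit({"train": "s3://<bucket>/<prefix>"})
```

Keeping the configuration in a plain dict like this makes it easy to swap instance types per workload without touching the launch code.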
SageMaker HyperPod
Amazon SageMaker HyperPod is purpose-built infrastructure for efficiently managing compute clusters to scale foundation model (FM) development. It enables advanced model training techniques, infrastructure control, performance optimization, and enhanced model observability. SageMaker HyperPod is preconfigured with SageMaker distributed training libraries, allowing you to automatically split models and training datasets across AWS cluster instances to help efficiently utilize the cluster’s compute and network infrastructure. It provides a more resilient environment by automatically detecting, diagnosing, and recovering from hardware faults, allowing you to train FMs continually for months without disruption and reducing training time by up to 40%.
High-performance distributed training
SageMaker AI makes it faster to perform distributed training by automatically splitting your models and training datasets across AWS accelerators. It helps you optimize your training job for AWS network infrastructure and cluster topology. It also streamlines model checkpointing through training recipes that tune how frequently checkpoints are saved, minimizing checkpointing overhead during training.
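The checkpoint-frequency trade-off mentioned above is a classic one: checkpointing too often adds write overhead, too rarely increases recomputation after a fault. A minimal sketch of that balance using Young's approximation, a well-known generic heuristic and not necessarily the policy SageMaker applies internally:

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation for the checkpoint interval that balances
    checkpoint write overhead against expected recomputation after a
    failure: interval ~= sqrt(2 * checkpoint_cost * mean_time_between_failures).
    A generic heuristic for illustration, not SageMaker's internal policy.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# e.g. a 60 s checkpoint write and a 24 h mean time between failures
# suggests checkpointing roughly every ~54 minutes (~3220 s).
interval = optimal_checkpoint_interval(60.0, 24 * 3600.0)
```

The point of automating this in recipes is that the "right" interval shifts as model size (checkpoint cost) and cluster size (failure rate) change.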
Customize generative AI and ML models efficiently
Amazon SageMaker AI enables customization of both Amazon proprietary and publicly available foundation models using custom datasets, eliminating the need to train them from scratch. Data scientists and developers of all skill levels can quickly get started with training and fine-tuning of public as well as proprietary generative AI models using optimized recipes. Each recipe is tested by AWS, removing weeks of tedious work testing different model configurations to achieve state-of-the-art performance. With recipes, you can fine-tune popular publicly available model families including Llama, Mixtral, and Mistral. In addition, you can customize Amazon Nova foundation models, including Nova Micro, Nova Lite, and Nova Pro, for your business-specific use cases on Amazon SageMaker AI using a suite of techniques across all stages of model training. Available as ready-to-use SageMaker recipes, these capabilities allow customers to adapt Nova models across the entire model lifecycle, including supervised fine-tuning, alignment, and pre-training.
Built-in tools for interactivity and monitoring
Amazon SageMaker with MLflow
Use MLflow with SageMaker training to capture input parameters, configurations, and results, helping you quickly identify the best-performing models for your use case. The MLflow UI lets you analyze model training attempts and register candidate models for production in a single step.
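The "identify the best-performing model" step boils down to ranking tracked runs by a target metric. A stdlib sketch of that selection logic, where the run records are hypothetical stand-ins for what MLflow tracking would return:

```python
def best_run(runs: list, metric: str = "val_loss") -> dict:
    """Return the run whose logged metric is lowest (assumes lower is better).
    Runs missing the metric (e.g. failed attempts) are skipped."""
    tracked = [r for r in runs if metric in r["metrics"]]
    return min(tracked, key=lambda r: r["metrics"][metric])

# Hypothetical run records shaped loosely like MLflow run data.
runs = [
    {"run_id": "a1", "params": {"lr": 1e-3}, "metrics": {"val_loss": 0.42}},
    {"run_id": "b2", "params": {"lr": 3e-4}, "metrics": {"val_loss": 0.31}},
]
winner = best_run(runs)  # → the "b2" run
```

With MLflow itself, the equivalent query can be expressed through the tracking client's run search with an `order_by` on the metric, so the ranking happens server-side rather than in your own code.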

Amazon SageMaker with TensorBoard
Amazon SageMaker with TensorBoard helps you save development time by visualizing the model architecture to identify and remediate convergence issues, such as validation loss not converging or vanishing gradients.
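Vanishing gradients, mentioned above, show up in TensorBoard as per-layer gradient histograms collapsing toward zero. A minimal stdlib sketch of the underlying detection idea; in a real training loop you would log these norms with TensorBoard's scalar/histogram writers rather than check them by hand, and the layer names here are illustrative assumptions:

```python
def flag_vanishing_gradients(grad_norms: dict, threshold: float = 1e-6) -> list:
    """Return names of layers whose gradient norm fell below the threshold,
    i.e. the signal that TensorBoard gradient histograms make visible."""
    return [name for name, norm in grad_norms.items() if norm < threshold]

# Hypothetical per-layer gradient norms from one training step.
norms = {"embed": 3.2e-2, "block_0": 1.1e-7, "head": 5.0e-3}
suspect_layers = flag_vanishing_gradients(norms)  # → ["block_0"]
```

Catching this early, before loss curves flatline, is exactly the development time the visualization is meant to save.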
