AWS Cloud Financial Management

Navigating GPU Challenges: Cost Optimizing AI Workloads on AWS

Introduction

The insatiable demand for artificial intelligence (AI), machine learning (ML), and generative AI (GenAI) workloads is straining GPU resources globally due to supply chain imbalances and chip shortages. This scarcity causes procurement delays and creates potential roadblocks for deploying GenAI workloads. This blog explores strategies to optimize AI workloads on AWS amid GPU constraints. It covers:

  • GPU instance procurement strategies
  • Managed services like Amazon SageMaker
  • AWS purpose-built AI accelerators
  • Alternative compute options
  • Strategies to maximize GPU utilization through sharing
  • Cost monitoring and optimization practices

By implementing these approaches, organizations can efficiently execute AI workloads even during periods of GPU resource constraints.

Implementing GPU instance procurement strategies

Managing accelerated computing capacity for AI and ML workloads

Today’s AI and machine learning workloads demand high-performance computing power. Amazon EC2 Accelerated Computing instances meet this need with powerful GPUs and custom-designed chips such as AWS Inferentia and AWS Trainium. These purpose-built instances significantly outperform traditional CPU-based options, enabling faster model training, improved inference speed, and efficient processing of large datasets. For example, Amazon EC2 UltraClusters provide up to 512 NVIDIA H100 GPUs, delivering massive parallel processing power for demanding tasks like training large language models.

To ensure reliable access to these resources, AWS offers two key capacity reservation options. On-Demand Capacity Reservations (ODCR) allow teams to reserve compute capacity in specific Availability Zones, mitigating capacity constraints for mission-critical workloads. Additionally, Amazon EC2 Capacity Blocks for ML enable short-term reservations of high-performance GPU clusters for 1-14 days, perfect for intensive training runs or burst inference demands.
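For teams automating procurement, here is a minimal boto3 sketch of both options; the instance type, Availability Zone, dates, and counts are illustrative placeholders and should be adapted to your workload:

```python
import boto3
from datetime import datetime, timedelta

ec2 = boto3.client("ec2", region_name="us-east-1")

# Reserve On-Demand capacity in a specific Availability Zone for a
# mission-critical GPU workload (instance type and AZ are illustrative).
odcr = ec2.create_capacity_reservation(
    InstanceType="p5.48xlarge",
    InstancePlatform="Linux/UNIX",
    AvailabilityZone="us-east-1a",
    InstanceCount=2,
    EndDateType="unlimited",
)

# Find and purchase a short-term EC2 Capacity Block for ML, for example a
# 4-instance cluster for a 96-hour training run sometime in the next month.
offerings = ec2.describe_capacity_block_offerings(
    InstanceType="p5.48xlarge",
    InstanceCount=4,
    StartDateRange=datetime.utcnow() + timedelta(days=7),
    EndDateRange=datetime.utcnow() + timedelta(days=37),
    CapacityDurationHours=96,
)
if offerings["CapacityBlockOfferings"]:
    ec2.purchase_capacity_block(
        CapacityBlockOfferingId=offerings["CapacityBlockOfferings"][0]["CapacityBlockOfferingId"],
        InstancePlatform="Linux/UNIX",
    )
```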

Leveraging Savings Plans and Reserved Instances for AI and ML workloads

AWS offers discounts for long-term commitments through Savings Plans and Reserved Instances, which can deliver significant savings compared to On-Demand pricing for Amazon EC2 Accelerated Computing instances. These commitments involve a one- or three-year agreement for a specified amount of compute usage or specific instances. There are four savings options: 1) Compute Savings Plans, 2) EC2 Instance Savings Plans, 3) Standard Reserved Instances, and 4) Convertible Reserved Instances. This cost-saving approach is beneficial for AI and ML workloads that require sustained compute resources.
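If you are unsure how much to commit, Cost Explorer can generate purchase recommendations from your recent usage. The sketch below asks for a Compute Savings Plans recommendation; the term, payment option, and lookback window are illustrative:

```python
import boto3

ce = boto3.client("ce")  # AWS Cost Explorer

# Request a Compute Savings Plans recommendation based on the last 30 days of usage.
recommendation = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="THIRTY_DAYS",
)

summary = recommendation["SavingsPlansPurchaseRecommendation"].get(
    "SavingsPlansPurchaseRecommendationSummary", {}
)
print("Estimated monthly savings:", summary.get("EstimatedMonthlySavingsAmount"))
print("Recommended hourly commitment:", summary.get("HourlyCommitmentToPurchase"))
```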

Using Amazon EC2 Spot Instances

Amazon EC2 Spot Instances provide access to unused EC2 capacity at discounts of up to 90% compared to On-Demand pricing, making them an attractive option for cost-sensitive AI workloads. When seeking cost-effective compute for AI workloads on the Spot market, consider expanding your options beyond traditional GPU instances to include AWS purpose-built accelerators such as AWS Trainium and AWS Inferentia (explored in detail later). These accelerators are also available as Spot Instances and can provide significant cost savings for machine learning training and inference workloads. To learn more about using Spot Instances with Inferentia and Trainium, see the AWS re:Post article “Inferentia and Trainium Service Quotas”. This diversified approach to Spot Instance usage can help reduce costs while maintaining access to specialized AI computing resources for your projects.
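As a simple illustration, the boto3 sketch below launches an Inferentia2 instance on the Spot market; the AMI ID is a placeholder for a Deep Learning AMI with the Neuron SDK in your Region:

```python
import boto3

ec2 = boto3.client("ec2")

# Launch an Inferentia2 instance as a Spot Instance.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: Neuron-enabled Deep Learning AMI
    InstanceType="inf2.xlarge",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```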

Optimizing through consolidated purchasing

To maximize resource utilization and capitalize on economies of scale, consider aggregating GPU demand across multiple teams, accounts, or hybrid cloud environments. By consolidating GPU requirements and procuring Amazon EC2 Accelerated Computing instances collectively, organizations can share volume discounts and maximize utilization of the acquired GPU resources. AWS Organizations facilitates consolidated billing, cost optimization, and resource sharing: purchase Savings Plans and Reserved Instances in the management account or member accounts, enable discount sharing, and the savings are applied automatically across the organization.

Additionally, organizations can optimize costs by consolidating GPU demand across on-premises environments and other cloud providers. This holistic approach offers a comprehensive view of total GPU consumption, enabling better forecasting and resource utilization. Steps to consider for consolidation:

  • Assessment: Evaluate GPU usage across all environments to identify patterns and redundancies (see the sample Cost Explorer query after this list).
  • Cost modeling: Create a cost model that considers total GPU consumption across all environments to make informed resource allocation decisions.
  • Unified procurement: Use aggregated data to obtain optimized pricing from cloud providers, utilizing their savings offers.
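To support the assessment step, a query like the following breaks down accelerated-computing spend by linked account using Cost Explorer; the dates and instance families are illustrative:

```python
import boto3

ce = boto3.client("ce")

# Break down one month's accelerated-computing spend by linked account to
# identify consolidation opportunities.
usage = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-01-01", "End": "2025-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "LINKED_ACCOUNT"}],
    Filter={
        "Dimensions": {
            "Key": "INSTANCE_TYPE_FAMILY",
            "Values": ["p4d", "p5", "g5", "g6", "trn1", "inf2"],
        }
    },
)

for group in usage["ResultsByTime"][0]["Groups"]:
    account = group["Keys"][0]
    cost = group["Metrics"]["UnblendedCost"]["Amount"]
    print(account, cost)
```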

Using Amazon SageMaker for managed machine learning

Amazon SageMaker is a fully managed service for GPU-accelerated AI training and inference, with built-in algorithms and managed infrastructure. It supports cost optimization through model optimization and Managed Spot Training. SageMaker also draws from a separate ML instance pool, which can offer availability when Amazon EC2 GPU instances are constrained.

Utilizing Amazon SageMaker HyperPod

Amazon SageMaker HyperPod offers a flexible approach to GPU resource management for large machine learning workloads. Rather than requiring large, monolithic GPU instances, HyperPod effectively utilizes clusters of smaller GPUs working together. This architecture provides enhanced resiliency through features like checkpointing, failure recovery, distributed model training, and auto-remediation. It efficiently splits large models and datasets across multiple GPUs for parallel training, significantly improving processing speed. Pre-configured for distributed training across thousands of accelerators, HyperPod can enhance performance by up to 20%. Its built-in resiliency and hardware remediation capabilities keep expensive accelerators productive, ultimately reducing overall training time while maximizing resource efficiency.
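At the API level, a HyperPod cluster is created with the SageMaker CreateCluster operation. The sketch below is illustrative only; the cluster name, instance group, lifecycle-script location, and role ARN are placeholders that must exist in your account:

```python
import boto3

sm = boto3.client("sagemaker")

# Create a small HyperPod cluster with a single Trainium instance group.
sm.create_cluster(
    ClusterName="llm-training-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "worker-group",
            "InstanceType": "ml.trn1.32xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/hyperpod-lifecycle/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
        }
    ],
)
```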

Implementing Amazon SageMaker’s Managed Spot Training

Amazon SageMaker’s Managed Spot Training simplifies using Spot Instances for optimizing and scaling machine learning jobs. It automatically handles interruptions, supports model tuning and checkpointing for reliable execution, and provides flexible configuration options, offering a cost-optimized solution for training on AWS. For example, Cinnamon AI, a Japan-based startup specializing in AI-powered document analysis, achieved a 70% reduction in training costs and 40% increase in daily training jobs using Managed Spot Training. They successfully integrated this feature with popular frameworks like TensorFlow and PyTorch, while eliminating the complexity of managing Spot Instance interruptions. See Cinnamon AI saves 70% on ML model training costs with Amazon SageMaker Managed Spot Training for implementation details.
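With the SageMaker Python SDK, enabling Managed Spot Training is largely a matter of a few estimator parameters. In this sketch, the training script, role ARN, S3 locations, and framework versions are illustrative placeholders:

```python
from sagemaker.pytorch import PyTorch

# Managed Spot Training: SageMaker handles Spot interruptions and resumes
# from checkpoints written to checkpoint_s3_uri.
estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.g5.2xlarge",
    framework_version="2.1",
    py_version="py310",
    use_spot_instances=True,      # request Spot capacity
    max_run=3600,                 # maximum training time, in seconds
    max_wait=7200,                # maximum time including waiting for Spot capacity
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",
)
estimator.fit({"training": "s3://my-bucket/training-data/"})
```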

Using AWS purpose-built AI accelerated computing instances for optimized performance and cost

AWS offers specialized accelerated computing instances powered by custom silicon chips, AWS Trainium and AWS Inferentia, which are purpose-built AI accelerators for machine learning workloads. These instances optimize performance, cost savings, energy efficiency, and seamless integration with AWS services. Depending on workload requirements and resource constraints, Trainium and Inferentia instances may provide additional compute options complementing traditional GPU instances.

Leveraging AWS Trainium: optimized for training large deep learning models

Trainium instances are optimized for high-performance and cost-effective training of large deep learning models used in NLP, computer vision, and recommender systems. For models with over 100 billion parameters, Trainium can offer up to 50% lower training costs compared to GPUs. Recent examples of cost-effective fine-tuning on Trainium include the Llama and GPT-NeoX language models. For more details, you can check out the blog on scaling distributed training with AWS Trainium and Amazon EKS and another on training large language models using Hugging Face and AWS Trainium.
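On a Trainium (trn1) instance with the Neuron SDK installed, training code targets the NeuronCores through PyTorch/XLA. The minimal sketch below uses a toy model and random data purely for illustration:

```python
# Minimal training step on a Trainium (trn1) instance with the AWS Neuron SDK's
# PyTorch/XLA integration installed; the model and data are placeholders.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()                 # Trainium NeuronCores appear as an XLA device
model = nn.Linear(512, 2).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 512).to(device)
labels = torch.randint(0, 2, (32,)).to(device)

optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)
loss.backward()
xm.optimizer_step(optimizer)             # applies the update and marks the XLA step
print(loss.item())
```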

Deploying AWS Inferentia: high-performance inference at scale

Inferentia accelerators deliver industry-leading performance and cost-efficiency for deep learning and generative AI inference on Amazon EC2. Inferentia2 provides up to 4x higher throughput and 10x lower latency than the previous generation. Inferentia2 EC2 instances are optimized for deploying complex models like LLMs and diffusion models at scale, with ultra-high-speed interconnects enabling distributed inference. For more information, you can read about the advancements in AWS Inferentia2 and how it is used in scaling Rufus, the Amazon generative AI-powered conversational shopping assistant.
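A common deployment path is to host a Neuron-compiled model on a SageMaker real-time endpoint backed by inf2 instances. In this sketch, the container image URI, model artifact, and role ARN are placeholders:

```python
import sagemaker
from sagemaker.model import Model

sess = sagemaker.Session()

# Deploy a Neuron-compiled model artifact to an Inferentia2 real-time endpoint.
# The image URI placeholder stands in for an AWS Deep Learning Container with
# the Neuron (torch-neuronx) runtime for your Region.
model = Model(
    image_uri="<neuron-pytorch-inference-dlc-uri>",   # placeholder
    model_data="s3://my-bucket/models/neuron-compiled-model.tar.gz",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    sagemaker_session=sess,
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",   # Inferentia2-backed SageMaker hosting instance
)
```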

Combining AWS AI chips and GPU infrastructure for flexible AI acceleration

Organizations increasingly deploy AI workloads across multiple environments, from public clouds to private data centers. While AWS AI chips (Trainium and Inferentia) operate exclusively on AWS, organizations can still benefit from hybrid approaches by strategically combining these specialized accelerators with other compute resources across environments. Organizations can either use AWS Trainium for cost-effective training (up to 50% savings) and run inference elsewhere, or train models on existing NVIDIA GPU infrastructure and leverage AWS Inferentia for efficient, cost-effective inference (up to 70% savings). Financial services firms, for example, can perform initial model training on AWS Trainium, then deploy the optimized models to their private data centers where data residency requirements apply. Alternatively, they might train models on NVIDIA GPUs in their data centers and run inference on AWS Inferentia for high-throughput, cost-effective inference.

The AWS Neuron SDK facilitates this flexibility by enabling model portability. Teams can develop models using familiar frameworks like PyTorch and TensorFlow, and move them between environments as needed. Models trained on NVIDIA GPUs can be optimized for AWS Inferentia, while those trained on AWS Trainium can be exported for inference across different environments.
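For example, a model trained on GPUs or CPUs can be compiled for Inferentia2 with the Neuron SDK's torch-neuronx tracing API. The sketch below assumes an inf2 or trn1 instance with torch-neuronx installed and uses a small Hugging Face model as a stand-in:

```python
# Compile a PyTorch model trained elsewhere (e.g., on NVIDIA GPUs) for
# Inferentia2 using the Neuron SDK, then save the compiled artifact.
import torch
import torch_neuronx
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True)
model.eval()

example = tokenizer("GPU capacity is tight this quarter.", return_tensors="pt")
example_inputs = (example["input_ids"], example["attention_mask"])

# Trace and compile the model for NeuronCores; the resulting module runs on Inferentia.
neuron_model = torch_neuronx.trace(model, example_inputs)
torch.jit.save(neuron_model, "model_neuron.pt")
```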

To implement this flexible approach, organizations should consider:

  • Using AWS Neuron SDK to develop models that can be exported for cross-platform inference
  • Leveraging AWS Trainium for cost-effective training when possible
  • Considering AWS Inferentia for efficient inference of models trained elsewhere
  • Deploying models flexibly across different environments according to business requirements and constraints

Exploring alternative compute options to GPU

For AI workloads that don’t require the processing capabilities of GPU instances, alternative compute options like CPUs or AWS Graviton instances can be more cost-effective, delivering the performance the workload needs at a lower infrastructure cost. Regularly evaluating the suitability of different compute choices based on workload characteristics, performance needs, and cost considerations can help optimize overall infrastructure costs.

Deploying CPUs for inference workloads

When choosing between CPUs and GPUs for inference workloads, consider these guidelines:

Choose CPU instances for:

  • Smaller to mid-sized models (typically models with fewer than 1 billion parameters, though this threshold varies by architecture and optimization)
  • Batch processing without strict latency requirements
  • Cost-sensitive operations
  • Sequential processing tasks

Choose GPU instances for:

  • Larger, more complex models (often exceeding 1 billion parameters)
  • Low latency inference
  • Complex deep learning with heavy matrix operations
  • High-throughput parallel processing
  • GPU-optimized computer vision and NLP models

AWS offers a range of CPU instances, including the latest AMD and Intel processors, as well as AWS Graviton processors (explored in the next section), optimized for various workloads. CPU instances are well-suited for batch inference tasks or real-time inference with relatively small models. For example, ThirdAI’s BOLT engine has been benchmarked against TensorFlow and PyTorch on NVIDIA T4 GPUs, Intel Ice Lake CPUs, and AWS Graviton processors, demonstrating that properly optimized CPU implementations can efficiently handle neural network training tasks that traditionally required GPUs.
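As a rough illustration, the sketch below runs batch sentiment inference for a small model entirely on CPU; the model, thread count, and batch size are illustrative and should be tuned to the instance:

```python
# Batch inference on a CPU instance (e.g., a c6i or c7i Amazon EC2 instance).
import torch
from transformers import pipeline

torch.set_num_threads(8)   # roughly match the instance's vCPU count for throughput

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=-1,              # -1 = run on CPU
)

reviews = ["Great product", "Arrived late and damaged", "Works as expected"]
# Larger batch sizes generally improve CPU throughput when latency is not critical.
results = classifier(reviews, batch_size=32)
print(results)
```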

Running AWS Graviton instances for machine learning inference

AWS Graviton processors, which are Arm-based CPUs, are engineered for high performance and energy efficiency, making them ideal for a variety of workloads, including AI and machine learning. By using Graviton instances, such as the M6g family, users can achieve significant cost savings and improved power efficiency compared to traditional x86-based instances. This makes Graviton instances particularly well-suited for AI inference workloads, including natural language processing, recommendation systems, and fraud detection, where specific performance and cost requirements are critical. Integrating Graviton instances with Amazon SageMaker allows developers and data scientists to build, train, and deploy machine learning models efficiently, benefiting from lower inference latency and reduced costs. Additionally, the example of accelerated PyTorch inference with Torch Compile highlights how AWS optimized the PyTorch feature for Graviton3 processors. This optimization results in up to 2x better performance for Hugging Face model inference and up to 1.35x better performance for TorchBench model inference compared to the default eager mode.
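A minimal sketch of this pattern, assuming a Graviton-based instance (such as c7g or m7g) with PyTorch installed; the model is an illustrative placeholder:

```python
# Accelerated PyTorch inference with torch.compile on a Graviton instance.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# torch.compile generates optimized kernels; on Graviton3 this benefits from the
# Arm-optimized backends referenced in the blog above.
compiled_model = torch.compile(model)

inputs = tokenizer("Graviton handles this inference on CPU.", return_tensors="pt")
with torch.no_grad():
    logits = compiled_model(**inputs).logits
print(logits.argmax(dim=-1))
```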

Maximizing GPU utilization through sharing

AWS Batch, Amazon EKS, and Amazon ECS enable running multiple GPU-accelerated containers on one instance, with dynamic provisioning and auto scaling. For more details on this approach, see the AWS blog “Maximizing GPU utilization with NVIDIA’s Multi-Instance GPU (MIG) on Amazon EKS: Running more pods per GPU for enhanced performance”. In addition, AWS supports NVIDIA GPU time-slicing on Bottlerocket, allowing multiple tasks to share a single GPU concurrently by dividing GPU processing time into slices. This enables running multiple AI and ML models on a single GPU for improved utilization and scalability.
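As an illustration of the MIG approach, the sketch below uses the Kubernetes Python client to request a single MIG slice for a pod on an EKS cluster where the NVIDIA device plugin exposes MIG profiles; the container image and profile name are illustrative:

```python
# Request one MIG slice (rather than a whole GPU) for an inference pod.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="mig-inference"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="inference",
                image="my-registry/my-inference-image:latest",   # placeholder image
                resources=client.V1ResourceRequirements(
                    # Example MIG profile exposed by the NVIDIA device plugin on A100s.
                    limits={"nvidia.com/mig-1g.5gb": "1"}
                ),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```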

Implementing cost monitoring and optimization

Effective monitoring of GPU utilization, performance metrics, and costs is crucial. Use Amazon CloudWatch for data-driven optimization. Implement cost governance with AWS Budgets, AWS Cost Explorer, and AWS Cost Anomaly Detection to set budgets, track trends, and receive overspending alerts for proactive cost management.
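As a starting point, the sketch below pulls average GPU utilization for one instance, assuming the CloudWatch agent is configured to collect NVIDIA GPU metrics (the metric and dimension names follow that setup; the instance ID is a placeholder):

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Hourly average GPU utilization over the last 24 hours, as published by the
# CloudWatch agent's NVIDIA GPU metrics collection.
stats = cloudwatch.get_metric_statistics(
    Namespace="CWAgent",
    MetricName="nvidia_smi_utilization_gpu",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```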

Conclusion

Navigating GPU resource constraints requires a multi-faceted approach spanning procurement strategies, leveraging AWS AI accelerators, exploring alternative compute options, utilizing managed services like SageMaker, and implementing best practices for GPU sharing, containerization, monitoring, and cost governance. By adopting these techniques holistically, organizations can efficiently and cost-effectively execute AI, ML, and GenAI workloads on AWS, even amidst GPU scarcity. Importantly, these optimization strategies will remain valuable long after GPU supply chains recover, as they establish foundational practices for sustainable AI infrastructure that maximizes performance while controlling costs—an enduring priority for organizations scaling their AI initiatives into the future.

James Yang

James is a Principal Solutions Architect at AWS, supporting strategic customers like HPE. He brings 30 years of experience developing mission-critical systems for Fortune 500 clients. James offers deep expertise in Hybrid Cloud, SaaS, Generative AI, and application modernization, enabling AWS customers to build secure and scalable solutions.

Michelle Martin

Michelle is a Senior Customer Solutions Manager at AWS. She has an analytical background with fluency in cloud, process improvement, project management (Agile & SDLC), and application architecture. In her current role, Michelle enables AWS customers in their cloud adoption and digital transformation journey.

Phillip Johnston

Phil is an Enterprise Account Manager at AWS. He joined AWS in 2018, initially working with digitally native businesses in the Bay Area. In his current role, Phil works with global payments customers to help them set the pace for innovation and delight their customers on their cloud adoption journey.

Sammy Amirghodsi

Siamak (Sammy) Amirghodsi is a Principal Architect at AWS. His experience includes multi-cloud architecture, high performance computing (HPC), regulatory data management, capital markets, and quantum computing. In his current role, Sammy helps AWS customers and scientists with HPC, cryptography, and GenAI implementation needs in their cloud adoption journey.