Amazon SageMaker AI

Amazon SageMaker Inference

Easily deploy and manage machine learning (ML) models for inference

What is Amazon SageMaker Inference?

Amazon SageMaker AI simplifies deploying foundation and machine learning models to deliver optimal price performance for any use case. When capacity is constrained, SageMaker Inference automatically provisions from a prioritized instance pool. It also recommends optimal inference configurations, shrinking manual optimization and benchmarking cycles from weeks to hours. You define your cost, throughput, and latency requirements and let SageMaker AI do the rest, so your team can focus on building better models instead of managing infrastructure and performance.

Easily deploy and manage machine learning (ML) models for inference

Deploy models in production for inference for any use case

SageMaker AI caters to a wide range of inference requirements, from low latency (a few milliseconds) and high throughput (millions of transactions per second) scenarios to long-running inference for use cases such as multilingual text processing, text-image processing, multi-modal understanding, natural language processing, and computer vision. SageMaker AI provides a robust and scalable solution for all your inference needs.

Achieve optimal inference performance and cost

Amazon SageMaker AI offers more than 100 instance types with varying levels of compute and memory to suit different performance needs. To better utilize the underlying accelerators and reduce deployment costs, you can deploy multiple models to the same instance, and inference recommendations apply goal-aligned optimizations. For further cost optimization, autoscaling automatically adjusts the number of instances based on traffic, shutting down instances when there is no usage to minimize inference costs.

Reduce operational burden using SageMaker MLOps capabilities

As a fully managed service, Amazon SageMaker AI takes care of setting up and managing instances, software version compatibilities, and patching versions. With built-in integration with MLOps features, it helps off-load the operational overhead of deploying, scaling, and managing ML models while getting them to production faster.

Scalable and cost-effective inference options

Single-model endpoints

One model in a container, hosted on dedicated instances or on serverless infrastructure, for low latency and high throughput.
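As a minimal sketch of what a single-model endpoint involves, the following builds the three request payloads that the boto3 SageMaker client accepts for `create_model`, `create_endpoint_config`, and `create_endpoint`. Every name, ARN, image URI, S3 path, and instance type here is a placeholder assumption, not a value from this page, and nothing is sent to AWS.

```python
# Sketch only: builds request payloads; it does not call AWS.
# All names, ARNs, URIs, and the instance type are placeholder assumptions.

def single_model_endpoint_requests(name, image_uri, model_data_url, role_arn,
                                   instance_type="ml.m5.xlarge", instance_count=1):
    """Return the payloads for create_model, create_endpoint_config, create_endpoint."""
    create_model = {
        "ModelName": name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
    }
    create_endpoint_config = {
        "EndpointConfigName": f"{name}-config",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
        }],
    }
    create_endpoint = {
        "EndpointName": f"{name}-endpoint",
        "EndpointConfigName": f"{name}-config",
    }
    return create_model, create_endpoint_config, create_endpoint


model_req, config_req, endpoint_req = single_model_endpoint_requests(
    name="demo-model",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/demo-image:latest",
    model_data_url="s3://example-bucket/demo-model/model.tar.gz",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)

# With boto3 installed and credentials configured, the payloads could be sent as:
#   sm = boto3.client("sagemaker")
#   sm.create_model(**model_req)
#   sm.create_endpoint_config(**config_req)
#   sm.create_endpoint(**endpoint_req)
```

Keeping the payloads as plain dictionaries makes them easy to review or version before any resources are created.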

Multiple models on a single endpoint

Host multiple models on the same instance to better utilize the underlying accelerators, reducing deployment costs by up to 50%. You can control scaling policies for each FM separately, making it easier to adapt to model usage patterns while optimizing infrastructure costs.
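One way to host several models behind one endpoint is a multi-model endpoint, where the container is created in `MultiModel` mode and each request names the artifact it wants through `TargetModel`. The sketch below only builds the payloads; the endpoint name, image URI, S3 prefix, and artifact names are placeholder assumptions, and this is not necessarily the mechanism behind the per-FM scaling described above.

```python
# Sketch only: payloads for a multi-model endpoint; nothing is sent to AWS.
# Endpoint name, image URI, S3 prefix, and artifact names are placeholders.

def multi_model_container(image_uri, model_prefix_s3):
    """Container definition that loads model artifacts on demand from an S3 prefix."""
    return {
        "Image": image_uri,
        "ModelDataUrl": model_prefix_s3,  # prefix holding many model archives
        "Mode": "MultiModel",
    }


def invoke_request(endpoint_name, target_model, payload, content_type="application/json"):
    """Request for sagemaker-runtime invoke_endpoint that selects one hosted model."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,  # path of the artifact relative to the S3 prefix
        "ContentType": content_type,
        "Body": payload,
    }


container = multi_model_container(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/demo-image:latest",
    "s3://example-bucket/models/",
)
request_a = invoke_request("demo-mme", "model-a.tar.gz", b'{"inputs": [1, 2, 3]}')
request_b = invoke_request("demo-mme", "model-b.tar.gz", b'{"inputs": [4, 5, 6]}')

# With boto3 available:
#   runtime = boto3.client("sagemaker-runtime")
#   response = runtime.invoke_endpoint(**request_a)
```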

Serial inference pipelines

Multiple containers sharing dedicated instances and executing in a sequence. You can use an inference pipeline to combine preprocessing, predictions, and post-processing data science tasks.
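A serial pipeline can be sketched as a single model definition whose `Containers` list is ordered: preprocessing, then prediction, then post-processing. The payload below follows the shape accepted by the boto3 `create_model` call; the model name, role ARN, and image URIs are placeholder assumptions.

```python
# Sketch only: builds a create_model payload for a serial inference pipeline.
# Image URIs, the role ARN, and the model name are placeholder assumptions.

def pipeline_model_request(name, role_arn, stages):
    """stages: ordered list of (image_uri, model_data_url or None) tuples."""
    if len(stages) < 2:
        raise ValueError("a pipeline needs at least two containers")
    containers = []
    for image_uri, model_data_url in stages:
        container = {"Image": image_uri}
        if model_data_url is not None:
            container["ModelDataUrl"] = model_data_url
        containers.append(container)
    return {
        "ModelName": name,
        "ExecutionRoleArn": role_arn,
        "Containers": containers,  # invoked in list order
        "InferenceExecutionConfig": {"Mode": "Serial"},
    }


registry = "123456789012.dkr.ecr.us-east-1.amazonaws.com"
pipeline_req = pipeline_model_request(
    name="demo-pipeline",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    stages=[
        (f"{registry}/preprocess:latest", None),
        (f"{registry}/predict:latest", "s3://example-bucket/predict/model.tar.gz"),
        (f"{registry}/postprocess:latest", None),
    ],
)

# With boto3 available: boto3.client("sagemaker").create_model(**pipeline_req)
```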

Serverless Inference

Deploy SageMaker models to Amazon Bedrock for serverless inference and eliminate infrastructure management completely. Bedrock's next-generation inference engine automatically optimizes for cost and latency and enforces Zero Operator Access security so no one—not even AWS operators—can access your data in transit. All without requiring you to manage a single GPU.

Support for most machine learning frameworks and model servers

Amazon SageMaker Inference supports built-in algorithms and prebuilt Docker images for some of the most common machine learning frameworks, such as TensorFlow, PyTorch, ONNX, and XGBoost. If none of the prebuilt Docker images serve your needs, you can build your own container for use with CPU-backed multi-model endpoints. SageMaker Inference supports the most popular model servers, such as TensorFlow Serving, TorchServe, NVIDIA Triton, and AWS Multi Model Server.

Amazon SageMaker AI offers specialized deep learning containers (DLCs), libraries, and tooling for model parallelism and large model inference (LMI) to help you improve the performance of foundation models (FMs). With these options, you can deploy models, including FMs, quickly for virtually any use case.

Achieve high inference performance at low cost

SageMaker inference recommendations eliminate manual benchmarking and optimization to deliver optimal inference performance. Inference recommendations analyze your model architecture, apply goal-aligned optimizations such as speculative decoding and kernel tuning, and benchmark on real GPU infrastructure using NVIDIA AIPerf. The result is a set of deployment-ready configurations with validated performance metrics, which can reduce the time to deploy models in production from weeks to hours.

Deploy models on the highest-performing infrastructure or go serverless

Amazon SageMaker AI offers more than 70 instance types with varying levels of compute and memory, including Amazon EC2 Inf1 instances based on AWS Inferentia, high-performance ML inference chips designed and built by AWS, and GPU instances such as Amazon EC2 G4dn. Or, choose Amazon SageMaker Serverless Inference to easily scale to thousands of models per endpoint, millions of transactions per second (TPS) of throughput, and sub-10-millisecond overhead latencies.
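To make the instance-based versus serverless choice concrete, here is a hedged sketch of a serverless production variant. In an endpoint configuration, a serverless variant carries a `ServerlessConfig` block in place of an instance type and count. The model name and the default sizes are placeholder assumptions.

```python
# Sketch only: a serverless production variant for create_endpoint_config.
# The model name and default sizes are placeholder assumptions.

ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def serverless_variant(model_name, memory_mb=2048, max_concurrency=5):
    """Variant that uses ServerlessConfig instead of InstanceType/InitialInstanceCount."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}")
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }


serverless_config_req = {
    "EndpointConfigName": "demo-serverless-config",
    "ProductionVariants": [serverless_variant("demo-model", memory_mb=4096)],
}

# With boto3 available:
#   boto3.client("sagemaker").create_endpoint_config(**serverless_config_req)
```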

Shadow test to validate performance of ML models

Amazon SageMaker AI helps you evaluate a new model by shadow testing its performance against the model currently deployed on SageMaker, using live inference requests. Shadow testing can help you catch potential configuration errors and performance issues before they impact end users. With SageMaker AI, you don’t need to invest weeks of time building your own shadow testing infrastructure. Just select a production model that you want to test against, and SageMaker AI automatically deploys the new model in shadow mode and routes a copy of the inference requests received by the production model to the new model in real time.
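For teams that configure endpoints through the API, a shadow deployment can be sketched as an endpoint configuration with one production variant and one entry under `ShadowProductionVariants`. This is a minimal illustration under assumptions: the configuration name, model names, and instance type are placeholders, and it is not a description of how the console workflow above is implemented.

```python
# Sketch only: endpoint config payload with one production and one shadow variant.
# Config name, model names, and instance type are placeholder assumptions.

def shadow_endpoint_config(config_name, production_model, shadow_model,
                           instance_type="ml.m5.xlarge"):
    def variant(variant_name, model_name):
        return {
            "VariantName": variant_name,
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }

    return {
        "EndpointConfigName": config_name,
        # Serves responses to callers.
        "ProductionVariants": [variant("production", production_model)],
        # Receives copies of production requests for comparison.
        "ShadowProductionVariants": [variant("shadow", shadow_model)],
    }


shadow_req = shadow_endpoint_config("demo-shadow-config", "model-v1", "model-v2")

# With boto3 available:
#   boto3.client("sagemaker").create_endpoint_config(**shadow_req)
```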

Autoscaling for elasticity

You can use scaling policies to automatically scale the underlying compute resources to accommodate fluctuations in inference requests. You can control scaling policies for each ML model separately to handle the changes in model usage easily, while also optimizing infrastructure costs.
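A scaling policy for an endpoint variant is typically expressed through Application Auto Scaling. The sketch below builds the two payloads for `register_scalable_target` and `put_scaling_policy` with a target-tracking policy on invocations per instance. The endpoint name, capacity bounds, target value, and cooldowns are placeholder assumptions to tune for your own traffic.

```python
# Sketch only: Application Auto Scaling payloads for one endpoint variant.
# Endpoint name, capacity bounds, target value, and cooldowns are placeholders.

def autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                         min_instances=1, max_instances=4,
                         target_invocations_per_instance=100.0):
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    register = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # seconds to wait before removing capacity
            "ScaleOutCooldown": 60,   # seconds to wait before adding more capacity
        },
    }
    return register, policy


register_req, policy_req = autoscaling_requests("demo-endpoint")

# With boto3 available:
#   aas = boto3.client("application-autoscaling")
#   aas.register_scalable_target(**register_req)
#   aas.put_scaling_policy(**policy_req)
```

Because the resource ID includes the variant name, each variant on an endpoint can carry its own policy.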

Latency improvement and intelligent routing

You can reduce inference latency for ML models by routing new inference requests to instances that are available, instead of randomly routing them to instances that are already busy serving inference requests. This intelligent routing allows you to achieve 20% lower inference latency on average.
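The idea can be sketched in two parts. First, a production variant can request the `LEAST_OUTSTANDING_REQUESTS` routing strategy through its `RoutingConfig`. Second, a small conceptual function shows what "pick the least busy instance" means. The function `pick_least_busy` and the instance labels are hypothetical illustrations, not SageMaker internals, and the model name and instance type are placeholders.

```python
# Sketch only: routing configuration plus a conceptual illustration.
# pick_least_busy and the instance labels are hypothetical, not SageMaker internals.

ROUTING_STRATEGIES = ("LEAST_OUTSTANDING_REQUESTS", "RANDOM")


def variant_with_routing(model_name, strategy="LEAST_OUTSTANDING_REQUESTS",
                         instance_type="ml.m5.xlarge", instance_count=2):
    """Production variant payload that sets the endpoint routing strategy."""
    if strategy not in ROUTING_STRATEGIES:
        raise ValueError(f"strategy must be one of {ROUTING_STRATEGIES}")
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": instance_count,
        "RoutingConfig": {"RoutingStrategy": strategy},
    }


def pick_least_busy(outstanding):
    """Conceptual: choose the instance with the fewest in-flight requests."""
    return min(outstanding, key=outstanding.get)


routed_variant = variant_with_routing("demo-model")
chosen = pick_least_busy({"instance-a": 3, "instance-b": 0, "instance-c": 5})
```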

Reduce operational burden and accelerate time to value

Fully managed model hosting and management

As a fully managed service, Amazon SageMaker AI takes care of setting up and managing instances, software version compatibilities, and patching versions. It also provides built-in metrics and logs for endpoints that you can use to monitor and receive alerts.

Built-in integration with MLOps features

Amazon SageMaker AI model deployment features are natively integrated with MLOps capabilities, including SageMaker Pipelines (workflow automation and orchestration), SageMaker Projects (CI/CD for ML), SageMaker Feature Store (feature management), SageMaker Model Registry (model and artifact catalog to track lineage and support automated approval workflows), SageMaker Clarify (bias detection), and SageMaker Model Monitor (model and concept drift detection). As a result, whether you deploy one model or tens of thousands, SageMaker AI helps off-load the operational overhead of deploying, scaling, and managing ML models while getting them to production faster.

Capacity-Aware Inference

SageMaker AI eliminates manual retries when instance capacity is unavailable, accelerating your time to deploy models in production. It automatically falls back to your next preferred instance and delivers endpoints in minutes, for both new and existing endpoints. Priority is honored across scale-up and scale-down.
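The fallback behavior can be pictured as walking a priority-ordered list of instance types and taking the first one with capacity. The sketch below is purely conceptual: `provision_with_fallback` and the `has_capacity` callback are hypothetical names introduced for illustration, and the instance types are examples. It shows the kind of retry loop the service is described as handling for you, not its API.

```python
# Conceptual sketch only: provision_with_fallback and has_capacity are
# hypothetical helpers, not SageMaker APIs. Instance types are examples.

def provision_with_fallback(preferred_instance_types, has_capacity):
    """Return the first instance type, in priority order, that has capacity."""
    for instance_type in preferred_instance_types:
        if has_capacity(instance_type):
            return instance_type
    raise RuntimeError("no capacity available for any preferred instance type")


# Simulated capacity: the first choice is unavailable, the second is available.
available = {"ml.g5.2xlarge", "ml.g4dn.2xlarge"}
priority = ["ml.p4d.24xlarge", "ml.g5.2xlarge", "ml.g4dn.2xlarge"]

selected = provision_with_fallback(priority, lambda t: t in available)
```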

Resources for SageMaker Inference

Video

Package and deploy classical ML and LLMs easily with Amazon SageMaker AI, part 1: PySDK Improvements

Blog

Package and deploy classical ML and LLMs easily with Amazon SageMaker AI, part 2: Interactive User Experiences in SageMaker Studio

Blog

Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker AI

Blog

Boost inference performance for LLMs with new Amazon SageMaker AI containers

What's new

Announcing price reductions for Amazon SageMaker AI GPU-accelerated instances

Amazon SageMaker Catalog adds AI recommendations for descriptions of custom assets

Amazon SageMaker contributes a custom transport to OpenLineage community and offers additional lineage capabilities

Amazon SageMaker now supports automatic synchronization from Git to S3

Amazon SageMaker AI now supports M7i, C7i, and R7i for SageMaker Model Training and SageMaker Processing

How to get started

Documentation

Get started with Amazon SageMaker AI developer guide

Tutorial

Follow this step-by-step tutorial to deploy a model for inference using Amazon SageMaker AI
