Overview
The Modular Stack
A unified platform for AI development and deployment, including MAX and Mojo.
Write once, deploy everywhere
MAX Deployment Instructions
The Modular Platform is an open and fully-integrated suite of AI libraries and tools that accelerates model serving and scales GenAI deployments. It abstracts away hardware complexity so you can run the most popular open models with industry-leading GPU and CPU performance without any code changes.
Our ready-to-deploy Docker container removes the complexity of deploying your own GenAI endpoint. And unlike other serving solutions, Modular enables customization across the entire stack. You can customize everything from the serving pipeline and model architecture all the way down to the metal by writing custom ops and GPU kernels in Mojo. Most importantly, Modular is hardware-agnostic and free from vendor lock-in (no CUDA required), so your code runs seamlessly across diverse systems.
MAX is a high-performance AI serving framework tailored for GenAI workloads. It provides low-latency, high-throughput inference via advanced model serving optimizations like prefix caching and speculative decoding. An OpenAI-compatible serving endpoint executes native MAX and PyTorch models across GPUs and CPUs, and can be customized at the model and kernel level.
The MAX Container (max-nvidia-full) is a Docker image that packages the MAX Platform, pre-configured to serve hundreds of popular GenAI models on NVIDIA GPUs. This container is ideal for users seeking a fully optimized, out-of-the-box solution for deploying AI models.
Key capabilities include:
- High-performance serving: Serve 500+ AI models from Hugging Face with industry-leading performance across NVIDIA GPUs
- Flexible, portable serving: Deploy with a single Docker container across various GPUs (B200, H200, H100, A100, A10, L40 and L4) and compute services (EC2, EKS, AWS Batch, etc.) without compatibility issues.
- OpenAI API Compatibility: Seamlessly integrate with applications adhering to the OpenAI API specification.
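As a quick illustration of the OpenAI API compatibility described above, the sketch below points the standard openai Python client at a locally running MAX endpoint. The base URL, placeholder API key, and model ID are illustrative assumptions; substitute the host, port, and model you actually deploy.

```python
from openai import OpenAI

# Sketch only: assumes the MAX container is serving on localhost:8000 and that the
# local endpoint does not validate the API key, so a placeholder string is fine.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    # Placeholder model ID; replace with the model your container was launched with.
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {"role": "user", "content": "Summarize what prefix caching does in one sentence."},
    ],
)
print(response.choices[0].message.content)
```

Because the endpoint follows the OpenAI API specification, existing applications built on the openai client typically only need the base_url changed.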
For detailed information on container contents and instance compatibility, refer to the MAX Containers Documentation (https://docs.modular.com/max/container).
To access our full Modular platform, check out https://www.modular.com/
Highlights
- 500+ Pre-Optimized Models: Deploy popular models like Llama 3.3, DeepSeek, Qwen2.5, and Mistral with individual optimizations for maximum performance
- OpenAI API Compatible: Drop-in replacement for OpenAI API with full compatibility for existing applications and tools
- Advanced GPU Acceleration: Optimized performance across NVIDIA B200, H200, H100, A100, A10, L40 and L4 GPUs with intelligent batching and memory management
Details
Pricing
Vendor refund policy
Please refer to our licensing agreement for more details.
Legal
Vendor terms and conditions
Content disclaimer
Delivery details
max-nvidia
- Amazon ECS
- Amazon EKS
- Amazon ECS Anywhere
- Amazon EKS Anywhere
Container image
Containers are lightweight, portable execution environments that wrap server application software in a filesystem that includes everything it needs to run. Container applications run on supported container runtimes and orchestration services, such as Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS). Both eliminate the need for you to install and operate your own container orchestration software by managing and scheduling containers on a scalable cluster of virtual machines.
Version release notes
What's new
Documentation
- Added instructions on profiling MAX kernels (see max/kernels/README.md).
MAX models
- GGUF quantized Llamas (q4_0, q4_k, and q6_k) are now supported with the paged KVCache strategy.
MAX framework
Serving & inference engine
- In-flight batching no longer requires chunked prefill.
- The naive KVCache has been deleted.
- Removed support for TorchScript and Torch MLIR models.
- The continuous KVCache strategy is deprecated. Please use the paged KVCache strategy instead.
max CLI
- Added a --use-subgraphs flag to max generate to allow the use of subgraphs in the model.
Python APIs
- Added an add_subgraph method to the Graph class. This method allows a subgraph to be added to a graph.
- Added the call operation, which allows a subgraph to be executed.
- Added a fold op for combining sliding blocks into a larger tensor.
- Removed the server setting from the llm.py entrypoint for offline inference. The server is now automatically configured in the background without consuming an HTTP port.
- Added a strict parameter to the load_state_dict method in max.nn.Module. When strict=True (the default), an error is raised if the state_dict contains unused keys. When strict=False, extra keys are ignored. This helps model developers identify missing implementations in their models.
- Added the new max.torch module for using custom Mojo kernels from PyTorch. This module replaces the previously deprecated max.torch module.
Additional details
Usage instructions
Follow these steps within your EC2 instance to launch an LLM using the MAX container on an NVIDIA GPU.
- Install Docker. If Docker is not already installed, go to https://docs.docker.com/get-started/get-docker/ and follow the instructions for your operating system.
- Start the MAX container. Use the docker run command to launch the container. For example commands and configuration options, see https://docs.modular.com/max/container/#get-started
- Test the endpoint. After the container starts, it serves an OpenAI-compatible endpoint on port 8000. You can send a request to the /v1/chat/completions endpoint using a tool like cURL. Be sure to replace any placeholder values such as the model ID and message content.
- Next steps. For more configuration options, troubleshooting help, or performance tuning tips, see the full documentation at https://docs.modular.com/max/container
Note: The MAX container is not currently compatible with macOS.
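For the endpoint test in the steps above, here is a minimal Python sketch that sends a chat completion request using the requests library rather than cURL. The address and model ID are placeholders; the model must match whatever the container is serving.

```python
import requests

# Smoke test for the OpenAI-compatible endpoint the container exposes on port 8000.
payload = {
    "model": "<your-model-id>",  # placeholder: must match the model the container was launched with
    "messages": [{"role": "user", "content": "Hello! Are you up and running?"}],
}

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

A successful response returns a generated message; if the request fails, the model may still be loading or the model ID may not match the one the container was started with.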
Support
Vendor support
Standard Support (Included): Professional support for deployment, configuration, and optimization questions through our dedicated AWS Marketplace support channel at aws-marketplace@modular.com. Our support team includes AI infrastructure specialists with deep expertise in production deployments.
Enterprise Premium Support: Our expert services team provides end-to-end assistance for large-scale AI infrastructure projects including architecture design, performance optimization, and integration with existing enterprise systems.
To access Enterprise Premium Support or Professional Services, book a call with us: https://modul.ar/talk-to-us. Our enterprise team will design a custom support package tailored to your organization's specific requirements and scale.
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.