
    Modular Platform: High-Performance GenAI Serving

    Sold by: Modular
    Deployed on AWS
    Serve the latest GenAI models with the MAX Container - a GPU-accelerated serving environment with support for 500+ optimized models (https://builds.modular.com/), OpenAI API compatibility (https://docs.modular.com/max/api/serve), and enterprise-grade performance across diverse hardware and compute services.

    Overview


    The Modular Platform is an open, fully integrated suite of AI libraries and tools that accelerates model serving and scales GenAI deployments. It abstracts away hardware complexity so you can run the most popular open models with industry-leading GPU and CPU performance without any code changes.

    Our ready-to-deploy Docker container removes the complexity of deploying your own GenAI endpoint. And unlike other serving solutions, Modular enables customization across the entire stack: you can customize everything from the serving pipeline and model architecture all the way down to the metal by writing custom ops and GPU kernels in Mojo. Most importantly, Modular is hardware-agnostic and free from vendor lock-in, with no CUDA required, so your code runs seamlessly across diverse systems.

    MAX is a high-performance AI serving framework tailored for GenAI workloads. It provides low-latency, high-throughput inference via advanced model serving optimizations like prefix caching and speculative decoding. An OpenAI-compatible serving endpoint executes native MAX and PyTorch models across GPUs and CPUs, and can be customized at the model and kernel level.

    The MAX Container (max-nvidia-full) is a Docker image that packages the MAX Platform, pre-configured to serve hundreds of popular GenAI models on NVIDIA GPUs. This container is ideal for users seeking a fully optimized, out-of-the-box solution for deploying AI models.

    Key capabilities include:

    • High-performance serving: Serve 500+ AI models from Hugging Face with industry-leading performance across NVIDIA GPUs
    • Flexible, portable serving: Deploy with a single Docker container across various GPUs (B200, H200, H100, A100, A10, L40 and L4) and compute services (EC2, EKS, AWS Batch, etc.) without compatibility issues.
    • OpenAI API Compatibility: Seamlessly integrate with applications adhering to the OpenAI API specification.
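
    Because the endpoint follows the OpenAI API specification, existing OpenAI-client applications can typically switch over with only a base-URL change. Below is a minimal sketch using the official openai Python package; it assumes the container's default OpenAI-compatible endpoint on local port 8000 (see the usage instructions below), and the model ID is a placeholder for whatever model your container serves.

        from openai import OpenAI  # pip install openai

        # Point an existing OpenAI-client application at the local MAX endpoint.
        # The API key is a dummy value: a local endpoint does not check
        # OpenAI credentials, but the client requires the argument.
        client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

        response = client.chat.completions.create(
            model="<your-model-id>",  # placeholder: the model the container serves
            messages=[{"role": "user", "content": "Hello from MAX!"}],
        )
        print(response.choices[0].message.content)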

    For detailed information on container contents and instance compatibility, refer to the MAX Containers Documentation (https://docs.modular.com/max/container).

    To access the full Modular Platform, check out https://www.modular.com/

    Highlights

    • 500+ Pre-Optimized Models: Deploy popular models like Llama 3.3, DeepSeek, Qwen2.5, and Mistral, each individually optimized for maximum performance
    • OpenAI API Compatible: Drop-in replacement for OpenAI API with full compatibility for existing applications and tools
    • Advanced GPU Acceleration: Optimized performance across NVIDIA B200, H200, H100, A100, A10, L40 and L4 GPUs with intelligent batching and memory management

    Details

    Delivery method: Container image
    Supported services: Amazon ECS, Amazon EKS, Amazon ECS Anywhere, Amazon EKS Anywhere
    Delivery option: max-nvidia
    Operating system: Linux

    Deployed on AWS



    Pricing

    Modular Platform: High-Performance GenAI Serving

    This product is available free of charge. Free subscriptions have no end date and may be canceled at any time.
    Additional AWS infrastructure costs may apply. Use the AWS Pricing Calculator to estimate your infrastructure costs.

    Vendor refund policy

    Please refer to our licensing agreement for more details.


    Legal

    Vendor terms and conditions

    Upon subscribing to this product, you must acknowledge and agree to the terms and conditions outlined in the vendor's End User License Agreement (EULA).

    Content disclaimer

    Vendors are responsible for their product descriptions and other product content. AWS does not warrant that vendors' product descriptions or other product content are accurate, complete, reliable, current, or error-free.

    Usage information


    Delivery details

    max-nvidia

    Supported services:
    • Amazon ECS
    • Amazon EKS
    • Amazon ECS Anywhere
    • Amazon EKS Anywhere

    Container image

    Containers are lightweight, portable execution environments that wrap server application software in a filesystem that includes everything it needs to run. Container applications run on supported container runtimes and orchestration services, such as Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS). Both eliminate the need for you to install and operate your own container orchestration software by managing and scheduling containers on a scalable cluster of virtual machines.

    Version release notes

    What's new

    Documentation
    • Added instructions on profiling MAX kernels (see max/kernels/README.md).

    MAX models
    • GGUF-quantized Llamas (q4_0, q4_k, and q6_k) are now supported with the paged KVCache strategy.

    MAX framework: serving & inference engine
    • In-flight batching no longer requires chunked prefill.

    • The naive KVCache has been deleted.
    • Removed support for TorchScript and Torch-MLIR models.
    • The continuous KVCache strategy is deprecated. Please use the paged KVCache strategy instead.

    max CLI
    • Added a --use-subgraphs flag to max generate to allow the use of subgraphs in the model.

    Python APIs
    • Added an add_subgraph method to the Graph class, which allows a subgraph to be added to a graph.

    • Added the call operation, which allows a subgraph to be executed.
    • Added a fold op for combining sliding blocks into a larger tensor.
    • Removed the server setting from the llm.py entrypoint for offline inference. The server is now configured automatically in the background without consuming an HTTP port.

    • Added a strict parameter to the load_state_dict method in max.nn.Module. When strict=True (the default), an error is raised if the state_dict contains unused keys; when strict=False, extra keys are ignored. This helps model developers identify missing implementations in their models.

    • Added the new max.torch module for using custom Mojo kernels from PyTorch, replacing the previously deprecated implementation of the module.
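
    As a rough illustration of the strict-key contract described above, here is a plain-Python sketch. This is not MAX's implementation; the function and the parameter dictionaries are illustrative only.

        def load_state_dict(params: dict, state_dict: dict, strict: bool = True) -> dict:
            """Sketch of the strict-key behavior: `params` stands in for a
            module's declared weights, `state_dict` for the loaded weights."""
            unused = set(state_dict) - set(params)
            if strict and unused:
                # strict=True (the default): unused keys raise an error,
                # flagging weights the model implementation never consumed.
                raise ValueError(f"state_dict contains unused keys: {sorted(unused)}")
            # strict=False: extra keys are silently ignored.
            for name in params.keys() & state_dict.keys():
                params[name] = state_dict[name]
            return params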

    Additional details

    Usage instructions

    Follow these steps within your EC2 instance to launch an LLM using the MAX container on an NVIDIA GPU.

    1. Install Docker. If Docker is not already installed, go to https://docs.docker.com/get-started/get-docker/ and follow the instructions for your operating system.

    2. Start the MAX container. Use the docker run command to launch the container. For example commands and configuration options, see https://docs.modular.com/max/container/#get-started 

    3. Test the endpoint. After the container starts, it serves an OpenAI-compatible endpoint on port 8000. You can send a request to the /v1/chat/completions endpoint using a tool like cURL. Be sure to replace any placeholder values such as the model ID and message content.
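
    As an alternative to cURL, the same test can be sent from Python. A minimal sketch, assuming the container's default port 8000; the model ID and prompt are placeholders.

        import requests  # pip install requests

        # Send a test chat completion to the local MAX endpoint.
        resp = requests.post(
            "http://localhost:8000/v1/chat/completions",
            json={
                "model": "<your-model-id>",  # placeholder: the model being served
                "messages": [{"role": "user", "content": "Hello!"}],
                "max_tokens": 64,
            },
            timeout=120,
        )
        resp.raise_for_status()
        print(resp.json()["choices"][0]["message"]["content"])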

    Next steps. For more configuration options, troubleshooting help, or performance tuning tips, see the full documentation at https://docs.modular.com/max/container 

    Note: The MAX container is not currently compatible with macOS.

    Support

    Vendor support

    Standard Support (Included): Professional support for deployment, configuration, and optimization questions through our dedicated AWS Marketplace support channel at aws-marketplace@modular.com. Our support team includes AI infrastructure specialists with deep expertise in production deployments.

    Enterprise Premium Support: Our expert services team provides end-to-end assistance for large-scale AI infrastructure projects including architecture design, performance optimization, and integration with existing enterprise systems.

    To access Enterprise Premium Support or Professional Services, book a call with us: https://modul.ar/talk-to-us. Our enterprise team will design a custom support package tailored to your organization's specific requirements and scale.

    AWS infrastructure support

    AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.


    Customer reviews

    Ratings and reviews

    0 ratings
    0 AWS reviews
    No customer reviews yet
    Be the first to review this product. We've partnered with PeerSpot to gather customer feedback. You can share your experience by writing or recording a review, or by scheduling a call with a PeerSpot analyst.