AWS Big Data Blog

Amazon EMR Serverless eliminates local storage provisioning, reducing data processing costs by up to 20%

At AWS re:Invent 2025, Amazon Web Services (AWS) announced serverless storage for Amazon EMR Serverless, a new capability that eliminates the need to configure local disks for Apache Spark workloads. This reduces data processing costs by up to 20% and prevents job failures caused by disk capacity constraints.

With serverless storage, Amazon EMR Serverless automatically handles intermediate data operations, such as shuffle, on your behalf. You pay only for compute and memory—no storage charges. By decoupling storage from compute, Spark can release idle workers immediately, reducing costs throughout the job lifecycle.

The challenge: Sizing local disk storage

Running Apache Spark workloads requires sizing local disk storage for shuffle operations, in which Spark redistributes data across executors during joins, aggregations, and sorts. Sizing requires analyzing job histories to estimate disk requirements, which leads to two common problems: overprovisioning wastes money on unused capacity, and underprovisioning causes job failures when disk space runs out. Most customers overprovision local storage to ensure jobs complete successfully in production.

Data skew compounds the problem. When one executor handles a disproportionately large partition, that executor takes significantly longer to complete while other workers sit idle. If you didn’t provision enough disk for that skewed executor, the job fails entirely, making data skew one of the top causes of Spark job failures. The problem extends beyond capacity planning, however. Because shuffle data is tightly coupled to local disks, Spark executors remain pinned to worker nodes even when compute requirements drop between job stages. This prevents Spark from releasing workers and scaling down, inflating compute costs throughout the job lifecycle. And when a worker node fails, Spark must recompute the shuffle data stored on that node, causing delays and inefficient resource usage.
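To make this concrete, the following is a minimal PySpark sketch of a shuffle-heavy job; the table names, columns, and S3 paths are hypothetical. The join and aggregation each trigger a shuffle, and if a few hot customer_id values dominate the data, the executor that receives them can exhaust its local disk and fail the entire job.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skewed-shuffle-example").getOrCreate()

# Hypothetical inputs: a few "hot" customer_id values dominate the events table.
events = spark.read.parquet("s3://<bucket>/events/")
orders = spark.read.parquet("s3://<bucket>/orders/")

# Both the join and the aggregation shuffle data by customer_id. With local
# disks, the partition holding the hot keys lands on one executor, which can
# run out of disk while the other workers sit idle.
daily_totals = (
    events.join(orders, "customer_id")
    .groupBy("customer_id", "event_date")
    .agg(F.sum("amount").alias("total_amount"))
)

daily_totals.write.mode("overwrite").parquet("s3://<bucket>/daily-totals/")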

How it works

Serverless storage for Amazon EMR Serverless addresses these challenges by offloading shuffle operations from individual compute workers onto a separate, elastic storage layer. Instead of storing critical data on local disks attached to Spark executors, serverless storage automatically provisions and scales high-performance remote storage as your job runs.

The architecture provides several key benefits. First, compute and storage scale independently—Spark can acquire and release workers as needed across job stages without worrying about preserving locally stored data. Second, shuffle data is evenly distributed across the serverless storage layer, eliminating data skew bottlenecks that occur when some executors handle disproportionately large shuffle partitions. Third, if a worker node fails, your job continues processing without delays or reruns because data is reliably stored outside individual compute workers.

Serverless storage is provided at no additional charge, and it eliminates the cost associated with local storage. Instead of paying for fixed disk capacity sized for maximum potential I/O load—capacity that often sits idle during lighter workloads—you can use serverless storage without incurring storage costs. You can focus your budget on compute resources that directly process your data, not on managing and overprovisioning disk storage.

Technical innovation brings three breakthroughs

Serverless storage introduces three fundamental innovations that solve Spark’s shuffle bottlenecks: multi-tier aggregation architecture, purpose-built networking, and true storage-compute decoupling. Apache Spark’s shuffle mechanism has a core constraint: each mapper independently writes output as small files, and each reducer must fetch data from potentially thousands of workers. In a large-scale job with 10,000 mappers and 1,000 reducers, this creates 10 million individual data exchanges. Serverless storage aggregates early and intelligently—mappers stream data to an aggregation layer that consolidates shuffle data in memory before committing to storage. Whereas individual shuffle write and fetch operations might show slightly higher latency due to network round-trips compared to local disk I/O, the overall job performance improves by transforming millions of tiny I/O operations into a smaller number of large, sequential operations.
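As a back-of-the-envelope illustration of why early aggregation helps, the following Python sketch compares exchange counts using the figures from the paragraph above; the aggregator count is a made-up figure for illustration, because AWS hasn’t published the layer’s sizing.

# Direct shuffle: every reducer fetches output from every mapper.
mappers = 10_000
reducers = 1_000
direct_exchanges = mappers * reducers
print(f"direct mesh:      {direct_exchanges:,} exchanges")  # 10,000,000

# With an aggregation layer (hypothetically 100 nodes), each mapper streams to
# the layer once and each reducer reads consolidated data from each aggregator.
aggregators = 100
aggregated_exchanges = mappers + reducers * aggregators
print(f"with aggregation: {aggregated_exchanges:,} exchanges")  # 110,000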

Traditional Spark shuffle creates a mesh network where each worker maintains connections to potentially hundreds of other workers, spending significant CPU on connection management rather than data processing. We built a custom networking stack where each mapper opens a single persistent remote procedure call (RPC) connection to our aggregator layer, eliminating the mesh complexity. This improves overall job performance through better resource utilization and elastic scaling. Workers no longer run a shuffle service—they focus entirely on processing your data.

Traditional Amazon EMR Serverless jobs store shuffle data on local disks, coupling the data lifecycle to the worker lifecycle—idle workers can’t terminate without losing shuffle data. Serverless storage decouples these entirely by storing shuffle data in AWS managed storage with opaque handles tracked by the driver. Workers can terminate immediately after completing tasks without data loss, enabling elastic scaling. In funnel-shaped queries where early stages require massive parallelism that narrows as data aggregates, we’re seeing up to 80% compute cost reduction in benchmarks by releasing idle workers instantly.
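A funnel-shaped query might look like the following PySpark sketch (paths and columns are hypothetical). The initial scan needs maximum parallelism, but after the first aggregation most of those workers have nothing left to do; with shuffle data held in serverless storage rather than on local disks, Spark can release them immediately.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("funnel-query-example").getOrCreate()

# Stage 1: wide scan of raw clickstream data, which needs many workers.
clicks = spark.read.parquet("s3://<bucket>/clickstream/")

# Stage 2: filtering and a per-user aggregation shrink the data dramatically.
purchases = clicks.where(F.col("event_type") == "purchase")
per_user = purchases.groupBy("user_id").agg(F.count("*").alias("purchase_count"))

# Stage 3: the final rollup touches a tiny fraction of the original data, so
# most stage 1 workers are now idle and can be released without losing
# shuffle data.
summary = per_user.agg(
    F.avg("purchase_count").alias("avg_purchases"),
    F.max("purchase_count").alias("max_purchases"),
)
summary.show()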

Our aggregator layer integrates directly with AWS Identity and Access Management (IAM), AWS Lake Formation, and fine-grained access control systems, providing job-level data isolation with access controls that match source data permissions.

Getting started

Serverless storage is available in multiple AWS Regions. For the current list of supported Regions, refer to the Amazon EMR User Guide.

New applications

Serverless storage can be enabled for new applications starting with Amazon EMR release 7.12. Follow these steps:

  1. Create an Amazon EMR Serverless application with Amazon EMR 7.12 or later:
aws emr-serverless create-application \
  --type "SPARK" \
  --name my-application \
  --release-label emr-7.12.0 \
  --runtime-configuration '[{
    "classification": "spark-defaults",
    "properties": {
      "spark.aws.serverlessStorage.enabled": "true"
    }
  }]' \
  --region us-east-1
  2. Submit your Spark job:
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <execution-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<bucket>/<your_script.py>",
      "sparkSubmitParameters": "--conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=10"
    }
  }'
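The entryPoint above is a placeholder for your own script. The following hypothetical PySpark script, uploaded to the S3 path you reference, forces a shuffle so you can verify the setup end to end.

# your_script.py: a minimal hypothetical job that triggers a shuffle.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("serverless-storage-smoke-test").getOrCreate()

# Generate 10 million rows, then aggregate by a derived key to force a shuffle.
df = spark.range(10_000_000).withColumn("key", F.col("id") % 1000)
result = df.groupBy("key").agg(F.sum("id").alias("total"))
print(f"groups after shuffle: {result.count()}")

spark.stop()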

Existing applications

You can enable serverless storage for existing applications on Amazon EMR 7.12 or later by updating your application settings.

To enable serverless storage using the AWS Command Line Interface (AWS CLI), enter the following command:

aws emr-serverless update-application \
  --application-id <application-id> \
  --runtime-configuration '[{
    "classification": "spark-defaults",
    "properties": {
      "spark.aws.serverlessStorage.enabled": "true"
    }
  }]'

To enable serverless storage using the Amazon EMR Studio UI, navigate to your application in Amazon EMR Studio, choose Configuration, and add the Spark property spark.aws.serverlessStorage.enabled=true in the spark-defaults classification.

Job-level configuration

You can also enable serverless storage for specific jobs, even when it’s not enabled at the application level:

aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <execution-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<bucket>/<your_script.py>",
      "sparkSubmitParameters": "--conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.aws.serverlessStorage.enabled=true"
    }
  }'
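To confirm the flag actually reached a given job, you can read it back from the runtime configuration inside your Spark application. This is a small sketch, assuming the property is passed through sparkSubmitParameters as shown above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Returns the submitted value, or the supplied default if it was never set.
enabled = spark.conf.get("spark.aws.serverlessStorage.enabled", "false")
print(f"serverless storage enabled: {enabled}")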

(Optional) Disabling serverless storage

If you prefer to continue using local disks, you can disable serverless storage by omitting the spark.aws.serverlessStorage.enabled configuration or setting it to false at either the application or job level:

spark.aws.serverlessStorage.enabled=false

To use traditional local disk provisioning, configure the appropriate disk type and size for your application workers.

Monitoring and cost tracking

You can monitor elastic shuffle usage through standard Spark UI metrics and track costs at the application level in AWS Cost Explorer and AWS Cost and Usage Reports. The service automatically handles performance optimization and scaling, so you don’t need to tune configuration parameters.
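If you want to quantify how much a job shuffles, for example to check it against the thresholds in the next section, Spark’s monitoring REST API reports per-stage shuffle bytes. The following sketch assumes you can reach the Spark UI or history server for your job through the URL Amazon EMR Serverless provides; the host below is a placeholder.

import requests

# Placeholder: substitute the Spark UI or history server URL for your job.
BASE = "https://<spark-ui-host>/api/v1"

app_id = requests.get(f"{BASE}/applications").json()[0]["id"]
stages = requests.get(f"{BASE}/applications/{app_id}/stages").json()

# Each stage object in Spark's REST API includes shuffle read/write byte counts.
shuffle_write = sum(s.get("shuffleWriteBytes", 0) for s in stages)
shuffle_read = sum(s.get("shuffleReadBytes", 0) for s in stages)
print(f"total shuffle write: {shuffle_write / 1e9:.2f} GB")
print(f"total shuffle read:  {shuffle_read / 1e9:.2f} GB")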

When to use serverless storage

Serverless storage delivers the most value for workloads with substantial shuffle operations—typically jobs that shuffle more than 10 GB of data (and less than 200 GB per job, the limit as of this writing). These include:

  • Large-scale data processing with heavy aggregations and joins
  • Sort-heavy analytics workloads
  • Iterative algorithms that repeatedly access the same datasets

Jobs with unpredictable shuffle sizes benefit particularly, because serverless storage automatically scales capacity up and down based on real-time demand. For workloads with minimal shuffle activity or very short durations (under 2–3 minutes), the benefits might be limited. In these cases, the overhead of remote storage access might outweigh the advantages of elastic scaling.

Security and data lifecycle

Your data is stored in serverless storage only while your job is running and is automatically deleted when the job completes. Because Amazon EMR Serverless batch jobs can run for up to 24 hours, your data is stored for no longer than this maximum duration. Serverless storage encrypts your data both in transit between your Amazon EMR Serverless application and the serverless storage layer and at rest while temporarily stored, using AWS managed encryption keys. The service uses an IAM-based security model with job-level data isolation, which means that one job can’t access the shuffle data of another job. Serverless storage maintains the same security standards as Amazon EMR Serverless, with enterprise-grade security controls throughout the processing lifecycle.

Conclusion

Serverless storage represents a fundamental shift in how we approach data processing infrastructure: it eliminates manual configuration, aligns costs to actual usage, and improves reliability for I/O-intensive workloads. By offloading shuffle operations to a managed service, data engineers can focus on building analytics rather than managing storage infrastructure.

To learn more about serverless storage and get started, visit the Amazon EMR Serverless documentation.


About the authors

Karthik Prabhakar

Karthik is a Data Processing Engines Architect for Amazon EMR at AWS. He specializes in distributed systems architecture and query optimization, working with customers to solve complex performance challenges in large-scale data processing workloads. His focus spans engine internals, cost optimization strategies, and architectural patterns that enable customers to run petabyte-scale analytics efficiently.

Ravi Kumar

Ravi is a Senior Product Manager Technical at Amazon Web Services, specializing in exabyte-scale data infrastructure and analytics platforms. He helps customers unlock insights from structured and unstructured data using open-source technologies and cloud computing. Outside of work, Ravi enjoys exploring emerging trends in data science and machine learning.

Matt Tolton

Matt is a Senior Principal Engineer at Amazon Web Services.

Neil Mukerje

Neil is a Principal Product Manager at Amazon Web Services.