

Apache Spark on Amazon EMR

Why Apache Spark on EMR?

Amazon EMR enables you to build open, transactional data lakes with Apache Spark and Apache Iceberg. Our performance-optimized runtime is 100% API-compatible with open-source Spark, executing up to 4.5x faster than open-source equivalents while delivering 2.7x faster Iceberg write performance.

EMR supports Apache Iceberg v3 and Spark 4.0 (preview), giving you ACID transactions and schema evolution, the VARIANT data type for semi-structured data at scale, and ANSI SQL compliance for data integrity. Whether you require the granular control of EC2, the containerized scale of EKS, or the simplicity of EMR Serverless, Amazon EMR provides speed, reliability, and data integrity.

Features and benefits

    Amazon EMR's performance-optimized Apache Spark runtime accelerates data lake workloads with up to 4.5x faster execution than open-source equivalents while maintaining 100% API compatibility. This optimization extends to Apache Iceberg operations, delivering 2.7x faster write performance for transactional data lakes that demand both speed and reliability.

    With support for Apache Iceberg v3 and Spark 4.0 (preview), EMR enables advanced capabilities including ACID transactions, schema evolution, the VARIANT data type for semi-structured data processing, and ANSI SQL compliance.
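As a rough illustration of the Iceberg capabilities listed above, the following sketch builds the Spark settings that register an Iceberg catalog and lists SQL statements for table creation, schema evolution, and an atomic upsert. The configuration keys and class names come from the open-source Apache Iceberg Spark integration. The catalog name, warehouse path, table name, and the `updates` view are hypothetical, and the sketch assumes the Iceberg runtime is available on the cluster.

```python
from typing import Dict, List


def iceberg_catalog_conf(catalog: str, warehouse: str) -> Dict[str, str]:
    """Return Spark settings that register an Iceberg catalog backed by AWS Glue."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,  # hypothetical S3 location
        "spark.sql.ansi.enabled": "true",  # ANSI SQL behavior
    }


def iceberg_statements(table: str) -> List[str]:
    """SQL for table creation, schema evolution, and an atomic upsert.

    Assumes a view named `updates` (hypothetical) with the same columns as `table`.
    """
    return [
        f"CREATE TABLE IF NOT EXISTS {table} (id BIGINT, status STRING) USING iceberg",
        f"ALTER TABLE {table} ADD COLUMN updated_at TIMESTAMP",
        f"MERGE INTO {table} t USING updates u ON t.id = u.id "
        "WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *",
    ]


def build_session(conf: Dict[str, str]):
    """Create a SparkSession with the given settings (requires PySpark)."""
    from pyspark.sql import SparkSession  # imported lazily so the helpers above work without Spark

    builder = SparkSession.builder.appName("iceberg-sketch")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

On a cluster, you would pass the result of `iceberg_catalog_conf` to `build_session` and run each statement with `spark.sql(...)`. Each statement commits as a single Iceberg snapshot, which is where the transactional guarantee comes from.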

    Amazon EMR runtime for Spark optimizes your query plans to run entirely in-memory, maximizing the utilization of your hardware. By streamlining how intermediate data is handled, EMR reduces the time-to-result for your most resource-intensive machine learning workloads, allowing you to iterate faster.

    Modernize your workflow with SageMaker Unified Studio and EMR Studio, which provide integrated environments for SQL, Python, and Scala. Use Amazon Q Developer to generate optimized PySpark code and troubleshoot complex execution plans (DAGs) in real time. Unlike standard Spark, EMR provides a Persistent Spark UI, allowing you to analyze and debug job logs even after your serverless applications or ephemeral clusters have terminated. This persistence is critical for auditing and continuous performance tuning in production environments.

    EMR Serverless removes operational friction by providing an instant-on notebook experience. You no longer need to provision, scale, or manage clusters. You attach your preferred development environment, like Amazon SageMaker Unified Studio or JupyterLab, to an EMR Serverless application and start querying. The EMR runtime for Spark ensures that your interactive code performs with the same enterprise-grade speed as your production pipelines. Whether you are performing ad-hoc data discovery on petabytes of S3 data or running complex feature engineering tasks, Amazon EMR provides the seamless, high-performance environment required to accelerate your most critical data science workflows.

    The Apache Spark upgrade agent automatically identifies API changes and behavioral modifications across PySpark and Scala applications. Engineers can initiate upgrades directly from SageMaker Unified Studio or the IDE of their choice through MCP (Model Context Protocol) compatibility. During the upgrade process, the agent analyzes existing code and suggests specific changes, which engineers can review and approve before they are applied. The agent validates functional correctness through data quality validations. The agent currently supports upgrades from Spark 2.4 to 3.5 and maintains data processing accuracy throughout the upgrade process.

Use cases

    Consume and process real-time data from Amazon Kinesis, Apache Kafka, or other data streams with Spark Streaming on EMR. Perform streaming analytics in a fault-tolerant way and write results to S3 or on-cluster HDFS.
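The streaming pattern described above might look like the following sketch. It uses Spark's Structured Streaming API with the built-in Kafka source and assumes the Spark Kafka connector package is on the classpath. The broker address, topic, and S3 paths passed in are placeholders chosen by the caller, not values from this page.

```python
from typing import Dict


def kafka_source_options(bootstrap_servers: str, topic: str) -> Dict[str, str]:
    """Options understood by Spark's built-in Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_kafka_to_s3(spark, options: Dict[str, str], output_path: str, checkpoint_path: str):
    """Start a streaming query that copies Kafka records to Parquet files.

    `spark` is an existing SparkSession. The checkpoint location lets the
    query resume from its last committed offsets after a failure.
    """
    events = (
        spark.readStream.format("kafka")
        .options(**options)
        .load()
        # Kafka delivers key and value as bytes, so cast them to strings.
        .selectExpr(
            "CAST(key AS STRING) AS key",
            "CAST(value AS STRING) AS value",
            "timestamp",
        )
    )
    return (
        events.writeStream.format("parquet")
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .start()
    )
```

The returned object is a streaming query handle. Calling `awaitTermination()` on it keeps the driver alive while records flow.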

    Apache Spark on EMR includes MLlib for a variety of scalable machine learning algorithms, or you can use your own libraries. By keeping datasets in memory during a job, Spark performs well on the iterative queries common in machine learning workloads. You can extend Amazon SageMaker by connecting a notebook instance to an Apache Spark cluster running on Amazon EMR and using Amazon SageMaker Spark to train and host models.
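To make the in-memory point concrete, here is a minimal sketch of an MLlib training pipeline that caches its input before fitting an iterative model. The feature and label column names are hypothetical, and logistic regression is only one possible choice of algorithm.

```python
# Hypothetical numeric feature columns expected in the training DataFrame.
FEATURE_COLUMNS = ("impressions", "clicks_last_7d", "session_length")


def train_click_model(training_df, feature_columns=FEATURE_COLUMNS, label_column="clicked"):
    """Fit a logistic regression pipeline on a Spark DataFrame (requires PySpark)."""
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Cache the input so each optimizer iteration reads from memory
    # instead of rescanning the source files.
    training_df = training_df.cache()

    assembler = VectorAssembler(inputCols=list(feature_columns), outputCol="features")
    classifier = LogisticRegression(
        featuresCol="features", labelCol=label_column, maxIter=20
    )
    return Pipeline(stages=[assembler, classifier]).fit(training_df)
```

The fitted pipeline model can then score new data with `model.transform(df)`.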

    Use Spark SQL for low-latency, interactive queries with SQL or HiveQL. Spark on EMR can use EMRFS, so you have ad hoc access to your datasets in S3. You can also use EMR Studio, EMR Notebooks, Zeppelin notebooks, or BI tools via ODBC and JDBC connections.
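An ad hoc Spark SQL query over data in S3 could be sketched as follows. The `page` column, the `clicks` view name, and the Parquet layout are assumptions made for illustration.

```python
def top_pages_query(view: str, limit: int = 10) -> str:
    """Build a query that ranks pages by view count."""
    return (
        f"SELECT page, COUNT(*) AS views FROM {view} "
        f"GROUP BY page ORDER BY views DESC LIMIT {int(limit)}"
    )


def top_pages(spark, s3_path: str, limit: int = 10):
    """Register Parquet data at `s3_path` as a temporary view and query it.

    `spark` is an existing SparkSession. On EMR, an `s3://` path is read
    through EMRFS.
    """
    spark.read.parquet(s3_path).createOrReplaceTempView("clicks")
    return spark.sql(top_pages_query("clicks", limit))
```

Calling `.show()` on the result prints the ranked pages in a notebook. The same SQL text can be submitted from a BI tool over JDBC or ODBC.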

Customer success

Yelp

Yelp’s advertising targeting team makes prediction models to determine the likelihood of a user interacting with an advertisement. By using Apache Spark on Amazon EMR to process large amounts of data to train machine learning models, Yelp increased revenue and advertising click-through rate.

The Washington Post

The Washington Post uses Apache Spark on Amazon EMR to build models powering its website’s recommendation engine to boost reader engagement and satisfaction. They leverage Amazon EMR's performant connectivity with Amazon S3 to update models in near real-time.

Krux

As part of its Data Management Platform for customer insights, Krux runs many machine learning and general processing workloads using Apache Spark. Krux utilizes ephemeral Amazon EMR clusters with Amazon EC2 Spot Capacity to save costs and uses Amazon S3 with EMRFS as a data layer for Apache Spark.

GumGum

GumGum, an in-image and in-screen advertising platform, uses Spark on Amazon EMR for inventory forecasting, processing of clickstream logs, and ad hoc analysis of unstructured data in Amazon S3. Spark’s performance enhancements saved GumGum time and money for these workflows.

Hearst Corporation

Hearst Corporation, a large diversified media and information company, has customers viewing content on over 200 web properties. Using Apache Spark Streaming on Amazon EMR, Hearst’s editorial staff can keep a real-time pulse on which articles are performing well and which themes are trending.

CrowdStrike

CrowdStrike provides endpoint protection to stop breaches. They use Amazon EMR with Spark to process hundreds of terabytes of event data and roll it up into higher-level behavioral descriptions on the hosts. From that data, CrowdStrike can pull event data together and identify the presence of malicious activity.