AWS Big Data Blog
Introducing AWS Glue 5.1 for Apache Spark
AWS Glue is a serverless, scalable data integration service that makes it simple to discover, prepare, move, and integrate data from multiple sources. AWS recently announced AWS Glue 5.1, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue 5.1 upgrades the Spark engine to Apache Spark 3.5.6, giving you a newer Spark release along with newer dependent libraries so you can develop, run, and scale your data integration workloads and get insights faster.
In this post, we describe what’s new in AWS Glue 5.1, key highlights on Spark and related libraries, and how to get started on AWS Glue 5.1.
What’s new in AWS Glue 5.1
The following updates are in AWS Glue 5.1:
Runtime and library upgrades
AWS Glue 5.1 upgrades the runtime to Spark 3.5.6, Python 3.11, and Scala 2.12.18 with new improvements from the open source version. AWS Glue 5.1 also updates support for open table format libraries to Apache Hudi 1.0.2, Apache Iceberg 1.10.0, and Delta Lake 3.3.2 so you can solve advanced use cases around performance, cost, governance, and privacy in your data lakes.
Support for new Apache Iceberg features
AWS Glue 5.1 adds support for Apache Iceberg materialized views and Apache Iceberg format version 3.0. AWS Glue 5.1 also adds support for data writes to Iceberg and Hive tables with Spark-native fine-grained access control through AWS Lake Formation.
Apache Iceberg materialized views are especially useful when you need to accelerate frequently run queries on large datasets by pre-computing expensive aggregations. To learn more about Apache Iceberg materialized views, refer to Introducing Apache Iceberg materialized views in AWS Glue Data Catalog.
Apache Iceberg format version 3.0 is the latest Iceberg format version defined in the Iceberg Table Spec. The following features are supported:
- New data types: nanosecond timestamp (tz), unknown, geometry, geography
- Default value support for columns
- Multi-argument transforms for partitioning and sorting
- Row Lineage tracking
- Binary deletion vectors (Learn more in the Unlock the power of Apache Iceberg v3 deletion vectors on Amazon EMR blog post)
- Table encryption keys
Create an Iceberg V3 format table
To create an Iceberg V3 format table, set the format-version table property to 3 when creating the table. The following is a sample PySpark script (replace amzn-s3-demo-bucket with your S3 bucket name; the catalog, database, and table names are illustrative):
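```python
# Minimal sketch: create an Iceberg V3 format table.
# Assumes the Spark session is configured with the AWS Glue Data Catalog as an
# Iceberg catalog named "glue_catalog"; database and table names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://amzn-s3-demo-bucket/warehouse/")
    .getOrCreate()
)

# Setting format-version to 3 creates the table in the Iceberg V3 format
spark.sql("""
    CREATE TABLE glue_catalog.iceberg_db.sample_v3_table (
        id   bigint,
        name string
    )
    USING iceberg
    TBLPROPERTIES ('format-version' = '3')
""")
```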
To migrate a table from the V2 format to V3, use ALTER TABLE ... SET TBLPROPERTIES to update the format-version property. The following is a sample PySpark script:
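```python
# Minimal sketch: upgrade an existing Iceberg table from format V2 to V3.
# Assumes the same Spark session and catalog configuration as the previous
# example; the database and table names are illustrative.
spark.sql("""
    ALTER TABLE glue_catalog.iceberg_db.sample_v2_table
    SET TBLPROPERTIES ('format-version' = '3')
""")
```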
You cannot roll back from V3 to V2, so before upgrading, verify that all your Iceberg clients support the V3 format version. Iceberg table format versions are not forward compatible: once a table is upgraded, clients on older format versions can no longer read it correctly.
Create a table with Row Lineage tracking enabled
To create a table with Row Lineage tracking enabled, set the table property row-lineage to true. The following is a sample PySpark script:
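```python
# Minimal sketch: create a V3 table with Row Lineage tracking enabled by
# setting the row-lineage table property named in this post. Assumes the same
# Spark session and catalog configuration as the earlier examples.
spark.sql("""
    CREATE TABLE glue_catalog.iceberg_db.lineage_table (
        id   bigint,
        name string
    )
    USING iceberg
    TBLPROPERTIES (
        'format-version' = '3',
        'row-lineage'    = 'true'
    )
""")
```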
In tables with Row Lineage tracking enabled, row IDs are managed at the metadata level for tracking row modifications over time and auditing.
Extended support for AWS Lake Formation permissions
Fine-grained access control with Lake Formation through native Spark DataFrames and Spark SQL was supported for read operations in AWS Glue 5.0. AWS Glue 5.1 extends fine-grained access control to write operations.
Full-Table Access (FTA) control in Apache Spark was introduced for Apache Hive and Iceberg tables in AWS Glue 5.0. AWS Glue 5.1 extends FTA support to Apache Hudi and Delta Lake tables, as shown in the sketch below.
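As an illustration, the following minimal sketch creates an AWS Glue 5.1 job with fine-grained access control enabled through the --enable-lakeformation-fine-grained-access job parameter introduced with AWS Glue 5.0. The job name, role ARN, and script location are placeholders:

```python
# Minimal sketch (boto3): create an AWS Glue 5.1 job with Lake Formation
# fine-grained access control enabled. The job name, role ARN, and script
# location are placeholders, not real resources.
import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="lf-fgac-demo-job",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/lf_fgac_demo.py",
        "PythonVersion": "3",
    },
    GlueVersion="5.1",
    DefaultArguments={
        # Enables Spark-native fine-grained access control with Lake Formation
        "--enable-lakeformation-fine-grained-access": "true",
    },
)
```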
S3A by default
AWS Glue 5.1 uses S3A as the default S3 connector. This change aligns with the recent Amazon EMR adoption of S3A as the default connector and brings enhanced performance and advanced features to Glue workloads. For more details about the S3A connector’s capabilities and optimizations, see Optimize Amazon EMR runtime for Apache Spark with EMR S3A.
Note that when migrating from AWS Glue 5.0 to AWS Glue 5.1, if neither spark.hadoop.fs.s3a.endpoint nor spark.hadoop.fs.s3a.endpoint.region is set, S3A defaults to the us-east-2 Region, which can cause request failures for buckets in other Regions. To avoid this, set the spark.hadoop.fs.s3a.endpoint.region Spark configuration when using the S3A file system in AWS Glue 5.1.
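For example, a minimal sketch that pins the Region in the script (the Region value is illustrative; use the Region of your S3 buckets):

```python
# Minimal sketch: pin the S3A Region explicitly to avoid the us-east-2 default.
# The Region value here is illustrative; use the Region of your S3 buckets.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint.region", "us-west-2")
    .getOrCreate()
)
```

You can also set the same key through the job's --conf job parameter instead of in the script.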
Dependent library upgrades
AWS Glue 5.1 upgrades the runtime to Spark 3.5.6, Python 3.11, and Scala 2.12.18 with upgraded dependent libraries.
The following table lists dependency upgrades:
| Dependency | Version in AWS Glue 5.0 | Version in AWS Glue 5.1 |
| --- | --- | --- |
| Spark | 3.5.4 | 3.5.6 |
| EMRFS | 2.69.0 | 2.73.0 |
| Iceberg | 1.7.1 | 1.10.0 |
| Python | 3.11 | 3.11 |
| Hudi | 0.15.0 | 1.0.2 |
| Delta Lake | 3.3.0 | 3.3.2 |
| boto3 | 1.34.131 | 1.40.61 |
| AWS SDK for Java | 2.29.52 | 2.35.5 |
| AWS Glue Data Catalog Client | 4.5.0 | 4.9.0 |
| EMR DynamoDB Connector | 5.6.0 | 5.7.0 |
The following are Spark connector upgrades:
| Driver | Connector version in AWS Glue 5.0 | Connector version in AWS Glue 5.1 |
| --- | --- | --- |
| Amazon Redshift | 6.4.0 | 6.4.2 |
| Snowflake | 3.0.0 | 3.1.1 |
Get started with AWS Glue 5.1
You can start using AWS Glue 5.1 through AWS Glue Studio, the AWS Glue console, the latest AWS SDK, and the AWS Command Line Interface (AWS CLI).
To start using AWS Glue 5.1 jobs in AWS Glue Studio, open the AWS Glue job and on the Job Details tab, choose the version Glue 5.1 – Supports Spark 3.5, Scala 2, Python 3.
To start using AWS Glue 5.1 on an AWS Glue Studio notebook or an interactive session through a Jupyter notebook, set 5.1 in the %glue_version magic:
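```
%glue_version 5.1
```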
The following output shows that the session is set to use AWS Glue 5.1:
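```
Setting Glue version to: 5.1
```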
Spark troubleshooting with AWS Glue 5.1
To accelerate Apache Spark troubleshooting and job performance optimization for your AWS Glue 5.1 ETL jobs, you can use the newly introduced Apache Spark troubleshooting agent. Traditional Spark troubleshooting requires extensive manual analysis of logs, performance metrics, and error patterns to identify root causes and optimization opportunities. The agent simplifies this process through natural language prompts, automated workload analysis, and intelligent code recommendations. The agent has three main components:
- An MCP-compatible AI assistant in your development environment for interaction
- The MCP proxy for AWS, which handles secure communication between your client and the MCP server
- An Amazon SageMaker Unified Studio managed MCP Server (preview) that provides specialized Spark troubleshooting and upgrade tools for AWS Glue 5.1 jobs
To set up the agent, follow the instructions to set up the resources and MCP configuration in Setup for Apache Spark Troubleshooting agent. Then launch your preferred MCP client and interact with the troubleshooting tools through natural language conversation.
The following demonstrates how you can use the Apache Spark troubleshooting agent with the Kiro CLI to debug an AWS Glue 5.1 job run.
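For example, you might start the conversation with a prompt along these lines (the job name and run ID are hypothetical placeholders):

```
My AWS Glue 5.1 job "daily-sales-etl" run jr_0123456789abcdef failed.
Can you analyze the job run, identify the root cause, and suggest fixes?
```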
For more information and video walkthroughs on how to use the Apache Spark troubleshooting agent, refer to Apache Spark Troubleshooting agent for Amazon EMR.
Conclusion
In this post, we discussed the key features and benefits of AWS Glue 5.1. You can create new AWS Glue jobs on AWS Glue 5.1 or migrate your existing AWS Glue jobs to benefit from the improvements.
We would like to thank the numerous engineers and leaders who helped build AWS Glue 5.1 to support customers with a performance-optimized Spark runtime and deliver new capabilities.