AWS Big Data Blog

Enforce table level access control on data lake tables using AWS Glue 5.0 with AWS Lake Formation

AWS Glue 5.0 now supports Full-Table Access (FTA) control in Apache Spark based on your policies defined in AWS Lake Formation. This new feature enables read and write operations from your AWS Glue 5.0 Spark jobs on Lake Formation registered tables when the job role has full table access. This level of control is ideal for use cases that need to comply with security regulations at the table level. In addition, you can now use Spark capabilities including Resilient Distributed Datasets (RDDs), custom libraries, and user-defined functions (UDFs) with Lake Formation tables. This capability enables Data Definition Language (DDL) and Data Manipulation Language (DML) operations, including CREATE, ALTER, DELETE, UPDATE, and MERGE INTO statements, on Apache Hive and Iceberg tables from within the same Apache Spark application. Data teams can run complex, interactive Spark applications through Amazon SageMaker Unified Studio in compatibility mode while maintaining the table-level security boundaries provided by Lake Formation. This simplifies security and governance of your data lakes.

In this post, we show you how to enforce FTA control on AWS Glue 5.0 through Lake Formation permissions.

How access control works on AWS Glue

AWS Glue 5.0 supports two features that achieve access control through Lake Formation:

  • Full-Table Access (FTA) control
  • Fine-grained access control (FGAC)

At a high level, FTA supports access control at the table level, whereas FGAC supports access control at the table, row, column, and cell levels. To support this more granular access control, FGAC uses a tight security model based on user/system space isolation; to maintain that extra level of security, only a subset of Spark core classes is allowlisted. Additionally, FGAC requires extra setup, such as passing the --enable-lakeformation-fine-grained-access parameter to the job. For more information about FGAC, see Enforce fine-grained access control on data lake tables using AWS Glue 5.0 integrated with AWS Lake Formation.
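For reference, the FGAC flag is passed as a regular job parameter. The following is a minimal boto3 sketch of creating a job with that parameter enabled; the job name, role ARN, and script location are placeholders rather than values used in this post:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# FGAC is enabled per job through a default argument; FTA (described next) needs no such flag.
glue.create_job(
    Name="fgac-demo-job",                                # placeholder job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",   # placeholder role ARN
    GlueVersion="5.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/fgac_demo.py",  # placeholder path
        "PythonVersion": "3",
    },
    WorkerType="G.1X",
    NumberOfWorkers=2,
    DefaultArguments={
        "--enable-lakeformation-fine-grained-access": "true",
    },
)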

While this level of granular control is essential for organizations that need to comply with data governance, security regulations, or deal with sensitive data, it’s excessive for organizations that only need table level access control. To provide customers with a way to enforce table level access without the performance, cost, and setup overhead introduced by the tighter security model in FGAC, AWS Glue introduced FTA. Let’s dive into FTA, the main topic of this post.

How Full-Table Access (FTA) works in AWS Glue

Until AWS Glue 4.0, Lake Formation-based data access worked through the GlueContext class, the utility class provided by AWS Glue. With the launch of AWS Glue 5.0, Lake Formation-based data access is available through native Spark SQL and Spark DataFrames.

With this launch, when you have full table access to your tables through Lake Formation permissions, you don’t need to enable fine-grained access mode for your AWS Glue jobs or sessions. This eliminates the need to spin up a system driver and system executors, because they’re designed to allow fine-grained access, resulting in lower performance overhead and lower cost. In addition, whereas Lake Formation fine-grained access mode supports only read operations, FTA supports both read and write operations through CREATE, ALTER, DELETE, UPDATE, and MERGE INTO commands.

To use FTA mode, you must allow third-party query engines to access data without the AWS Identity and Access Management (IAM) session tag validation in Lake Formation. To do this, follow the steps in Application integration for full table access.
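If you prefer to script this setting, the following boto3 sketch shows one way to enable the account-level flag that allows full-table external data access; the console steps in Application integration for full table access remain the authoritative reference:

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Fetch the current data lake settings so existing admins and allow lists are preserved,
# then enable full-table access for external (third-party) query engines.
settings = lf.get_data_lake_settings()["DataLakeSettings"]
settings["AllowFullTableExternalDataAccess"] = True

lf.put_data_lake_settings(DataLakeSettings=settings)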

Migrate an AWS Glue 4.0 GlueContext FTA job to AWS Glue 5.0 native Spark FTA

The high-level steps to enable the Spark native FTA feature are documented in Using AWS Glue with AWS Lake Formation for Full Table Access. However, in this section, we will go through an end-to-end example of how to migrate an AWS Glue 4.0 job that uses FTA through GlueContext to read an Iceberg table to an AWS Glue 5.0 job that uses Spark native FTA.

Prerequisites

Before you get started, make sure that you have the following prerequisites:

For this post, we use the us-east-1 AWS Region, but you can use your preferred Region as long as the AWS services included in the architecture are available there.

You will walk through setting up test data and an example AWS Glue 4.0 job using GlueContext, but if you already have these and are only interested in how to migrate, proceed to Migrate an AWS Glue 4.0 GlueContext FTA job to AWS Glue 5.0 native Spark FTA. With the prerequisites in place, you’re ready to start the implementation steps.

Create an S3 bucket and upload a sample data file

To create an S3 bucket for the raw input datasets and Iceberg table, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. Enter the bucket name (for example, glue5-fta-demo-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE}), and leave the remaining fields as default.
  4. Choose Create bucket.
  5. On the bucket details page, choose Create folder.
  6. Create two subfolders: raw-csv-input and iceberg-datalake.
  7. Upload the LOAD00000001.csv file into the raw-csv-input folder of the bucket.
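If you prefer to script this setup, the following boto3 sketch creates the bucket and the two prefixes and uploads the sample file; the bucket name is a placeholder, and the code assumes the us-east-1 Region used in this post:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")
bucket = "glue5-fta-demo-123456789012-use1"  # placeholder; use your own unique bucket name

# In us-east-1 the bucket is created without a LocationConstraint;
# in other Regions pass CreateBucketConfiguration={"LocationConstraint": "<region>"}.
s3.create_bucket(Bucket=bucket)

# S3 "folders" are zero-byte objects whose keys end with a slash.
for prefix in ("raw-csv-input/", "iceberg-datalake/"):
    s3.put_object(Bucket=bucket, Key=prefix)

# Upload the sample CSV into the raw input prefix.
s3.upload_file("LOAD00000001.csv", bucket, "raw-csv-input/LOAD00000001.csv")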

Create an AWS Glue database and AWS Glue tables

To create input and output sample tables in the Data Catalog, complete the following steps:

  1. On the Athena console, navigate to the query editor.
  2. Run the following queries in sequence (provide your S3 bucket name):
-- Create database for the demo
CREATE DATABASE glue5_fta_demo;

-- Create external table on the input CSV files. Replace the S3 path with your bucket name
CREATE EXTERNAL TABLE glue5_fta_demo.raw_csv_input(
 op string, 
 product_id bigint, 
 category string, 
 product_name string, 
 quantity_available bigint, 
 last_update_time string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3:///raw-csv-input/'
TBLPROPERTIES (
  'areColumnsQuoted'='false', 
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='none', 
  'delimiter'=',', 
  'typeOfData'='file');
 
-- Create output Iceberg table with partitioning. Replace the S3 bucket name with your bucket name
CREATE TABLE glue5_fta_demo.iceberg_datalake WITH (
  table_type='ICEBERG',
  format='parquet',
  write_compression = 'SNAPPY',
  is_external = false,
  partitioning=ARRAY['category', 'bucket(product_id, 16)'],
  location='s3:///iceberg-datalake/'
) AS SELECT * FROM glue5_fta_demo.raw_csv_input;
  3. Run the following query to validate the raw CSV input data:

SELECT * FROM glue5_fta_demo.raw_csv_input;

The following screenshot shows the query result:

  4. Run the following query to validate the Iceberg table data:

SELECT * FROM glue5_fta_demo.iceberg_datalake;

The following screenshot shows the query result:

This step used DDL to create table definitions. Alternatively, you can use a Data Catalog API, the AWS Glue console, the Lake Formation console, or an AWS Glue crawler.
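For example, the following boto3 sketch shows how the raw_csv_input definition could be created through the Data Catalog API instead of Athena DDL; the S3 location is a placeholder, and the Iceberg CTAS from the previous step is still most convenient in Athena:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Register the CSV table definition directly in the Data Catalog.
glue.create_table(
    DatabaseName="glue5_fta_demo",
    TableInput={
        "Name": "raw_csv_input",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv", "delimiter": ","},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "op", "Type": "string"},
                {"Name": "product_id", "Type": "bigint"},
                {"Name": "category", "Type": "string"},
                {"Name": "product_name", "Type": "string"},
                {"Name": "quantity_available", "Type": "bigint"},
                {"Name": "last_update_time", "Type": "string"},
            ],
            "Location": "s3://<your-bucket>/raw-csv-input/",  # placeholder bucket
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)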

The next step is to configure Lake Formation permissions on the iceberg_datalake table.

Configure Lake Formation permissions

To validate the capability, you need to define FTA permissions for the iceberg_datalake Data Catalog table you created. To start, enable read access to iceberg_datalake.

To configure Lake Formation permissions for the iceberg_datalake table, complete the following steps:

  1. On the Lake Formation console, choose Data lake locations under Administration in the navigation pane.
  2. Choose Register location.
  3. For Amazon S3 path, enter the path of your S3 bucket to register the location.
  4. For IAM role, choose your Lake Formation data access IAM role, which is not a service-linked role.
  5. For Permission mode, select Lake Formation.
  6. Choose Register location.
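The equivalent registration can also be scripted. The following boto3 sketch registers the bucket with a customer-managed data access role in Lake Formation permission mode; both ARNs are placeholders:

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

lf.register_resource(
    ResourceArn="arn:aws:s3:::glue5-fta-demo-123456789012-use1",           # placeholder bucket ARN
    RoleArn="arn:aws:iam::123456789012:role/LakeFormationDataAccessRole",  # placeholder data access role (not service-linked)
    UseServiceLinkedRole=False,
    HybridAccessEnabled=False,  # Lake Formation permission mode rather than hybrid access mode
)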

Grant permissions on the Iceberg table

The next step is to grant table permissions on the iceberg_datalake table to the AWS Glue job role.

  1. On the Lake Formation console, choose Data permissions under Permissions in the navigation pane.
  2. Choose Grant.
  3. For Principals, choose IAM users and roles.
  4. For IAM users and roles, choose the IAM role that will be used by the AWS Glue job.
  5. For LF-Tags or catalog resources, choose Named Data Catalog resources.
  6. For Catalogs, choose your account ID (the default catalog).
  7. For Databases, choose glue5_fta_demo.
  8. For Tables, choose iceberg_datalake.
  9. For Table permissions, choose Select and Describe.
  10. For Data permissions, choose All data access.
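As a reference, the same grant can be issued programmatically. The following boto3 sketch grants SELECT and DESCRIBE on iceberg_datalake to the job role; the account ID and role ARN are placeholders:

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/GlueJobRole"},  # placeholder job role
    Resource={
        "Table": {
            "CatalogId": "123456789012",      # default catalog = your account ID
            "DatabaseName": "glue5_fta_demo",
            "Name": "iceberg_datalake",
        }
    },
    # No data cells filter is attached, which corresponds to "All data access" in the console.
    Permissions=["SELECT", "DESCRIBE"],
)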

Next, create the AWS Glue PySpark job to process the input data.

Query the Iceberg table through an AWS Glue 4.0 job using GlueContext and DataFrames

Next, create a sample AWS Glue 4.0 job to load data from the iceberg_datalake table. You will use this sample job as a source of migration. Complete the following steps:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. For Create job, choose Script Editor.
  3. For Engine, choose Spark.
  4. For Options, choose Start fresh.
  5. Choose Create script.
  6. For Script, replace the following parameters:
    1. Replace aws_region with your Region.
    2. Replace aws_account_id with your AWS account ID.
    3. Replace warehouse_path with your Amazon S3 warehouse path for the Iceberg table.

For more information about how to use Iceberg in AWS Glue 4.0 jobs, see Using the Iceberg framework in AWS Glue.

from awsglue.context import GlueContext
from pyspark.sql import SparkSession

catalog_name = "glue_catalog"
aws_region = "us-east-1"
aws_account_id = "123456789012"
warehouse_path = "s3:///warehouse/"

# Initialize Spark and Glue contexts
spark = SparkSession.builder \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", f"{warehouse_path}") \
    .config(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config(f"spark.sql.catalog.{catalog_name}.glue.lakeformation-enabled","true") \
    .config(f"spark.sql.catalog.{catalog_name}.client.region",f"{aws_region}") \
    .config(f"spark.sql.catalog.{catalog_name}.glue.id",f"{aws_account_id}") \
    .getOrCreate()
glueContext = GlueContext(spark.sparkContext)

database_name = "glue5_fta_demo"
table_name = "iceberg_datalake"

# Read the Iceberg table
df = glueContext.create_data_frame.from_catalog(
    database=database_name,
    table_name=table_name,
)
df.show()
  7. On the Job details tab, for Name, enter glue-fta-demo-iceberg.
  8. For IAM Role, assign an IAM role that has the required permissions to run an AWS Glue job and read and write to the S3 bucket.
  9. For Glue version, choose Glue 4.0 – Supports Spark 3.3, Scala 2, Python 3.
  10. For Job parameters, add the following parameters:
    1. Key: --conf
    2. Value: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
    3. Key: --datalake-formats
    4. Value: iceberg
  11. Choose Save and then Run.
  12. When the job is complete, on the Run details tab, choose Output logs.

You’re redirected to the Amazon CloudWatch console to validate the output.

The output table is shown in the following screenshot. You see the same output that you saw in Athena when you verified that the Iceberg table was populated. This is because the AWS Glue job execution role has full table access from the Lake Formation permissions that you granted:

If you were to run this same AWS Glue job with another IAM role that wasn’t granted access to the table in Lake Formation, you would see the error Insufficient Lake Formation permission(s) on iceberg_datalake. Use the following steps to replicate this behavior:

  1. Create a new IAM role that’s identical to the AWS Glue job execution role you already used, but don’t grant permissions to this clone in Lake Formation.
  2. Change the role in the AWS Glue console for glue-fta-demo-iceberg to the new cloned role.
  3. Rerun the job. You should see the error.
  4. For the purposes of this post, change the role back to the original job execution role that’s registered in Lake Formation so you can use it in the next steps.

You now have an FTA setup in AWS Glue 4.0 that uses GlueContext DataFrames for an Iceberg table. You saw how roles that are granted permission in Lake Formation can read, and how roles that aren’t granted permission in Lake Formation cannot read. In the next section, we show you how to migrate from AWS Glue 4.0 GlueContext FTA to AWS Glue 5.0 native Spark FTA.
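Before migrating, here is the same Glue 4.0 job definition expressed programmatically for reference. This is a hedged boto3 sketch; the role ARN and script location are placeholders, and the console steps above remain the source of truth:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_job(
    Name="glue-fta-demo-iceberg",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder; the role granted in Lake Formation
    GlueVersion="4.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<your-bucket>/scripts/glue-fta-demo-iceberg.py",  # placeholder path
        "PythonVersion": "3",
    },
    WorkerType="G.1X",
    NumberOfWorkers=2,
    DefaultArguments={
        "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "--datalake-formats": "iceberg",
    },
)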

Migrate an AWS Glue 4.0 GlueContext FTA job to AWS Glue 5.0 native Spark FTA

The Lake Formation permission granting experience is identical regardless of the AWS Glue version and Spark data structures used. Therefore, assuming you have a working Lake Formation setup for your AWS Glue 4.0 job, you don’t need to modify those permissions during migration. Here are the migration steps using the AWS Glue 4.0 example from the previous sections:

  1. Allow third-party query engines to access data without the IAM session tag validation in Lake Formation. Follow the step-by-step guide in Application integration for full table access.
  2. You shouldn’t need to change the job runtime role if you have AWS Glue 4.0 FTA working (see the example permissions in the prerequisites). The main IAM permission to verify is that the AWS Glue job execution role has lakeformation:GetDataAccess.
  3. Modify the Spark session configurations in the script. Verify that the following Spark configurations are present:
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.warehouse=s3:///warehouse/
--conf spark.sql.catalog.spark_catalog.client.region=REGION
--conf spark.sql.catalog.spark_catalog.glue.account-id=ACCOUNT_ID
--conf spark.sql.catalog.spark_catalog.glue.lakeformation-enabled=true
--conf spark.sql.catalog.dropDirectoryBeforeTable.enabled=true

For more information about the preceding three steps, see Using AWS Glue with AWS Lake Formation for Full Table Access.

  4. Update the script so that GlueContext DataFrames are changed to native Spark DataFrames. For example, the updated script for the previous AWS Glue 4.0 job would now look like the following:
from pyspark.sql import SparkSession

catalog_name = "spark_catalog"
aws_region = "us-east-1"
aws_account_id = ""
warehouse_path = "s3:///warehouse/"

spark = SparkSession.builder \
    .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.defaultCatalog", f"{catalog_name}") \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkSessionCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", f"{warehouse_path}") \
    .config(f"spark.sql.catalog.{catalog_name}.client.region", f"{aws_region}") \
    .config(f"spark.sql.catalog.{catalog_name}.glue.account-id", f"{aws_account_id}") \
    .config(f"spark.sql.catalog.{catalog_name}.glue.lakeformation-enabled", "true") \
    .config(f"spark.sql.catalog.dropDirectoryBeforeTable.enabled", "true") \
    .config(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .getOrCreate()

database_name = "glue5_fta_demo"
table_name = "iceberg_datalake"

df = spark.sql(f"select * from {database_name}.{table_name}")
df.show()
  • You can remove the --conf job argument that was added in the AWS Glue 4.0 job because it’s now set in the script itself.
  5. For Glue version, choose Glue 5.0 – Supports Spark 3.5, Scala 2, Python 3.
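If you manage the job programmatically, the same migration can be expressed as an update of the existing job definition. The following boto3 sketch bumps the Glue version and drops the now-unneeded --conf argument; it assumes the job was created as shown earlier and keeps the other settings unchanged:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

job_name = "glue-fta-demo-iceberg"
job = glue.get_job(JobName=job_name)["Job"]

# Build a JobUpdate from the existing definition: bump the Glue version and
# remove the --conf argument that is now set inside the script itself.
default_args = dict(job.get("DefaultArguments", {}))
default_args.pop("--conf", None)

glue.update_job(
    JobName=job_name,
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "GlueVersion": "5.0",
        "DefaultArguments": default_args,
        "WorkerType": job.get("WorkerType", "G.1X"),
        "NumberOfWorkers": job.get("NumberOfWorkers", 2),
    },
)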

To verify that roles that don’t have Lake Formation permissions granted to them can’t access the Iceberg table, you can repeat the same exercise you did in AWS Glue 4.0 and reuse the cloned job execution role to rerun the job. You should see the error message: AnalysisException: Insufficient Lake Formation permission(s) on glue5_fta_demo.

You’ve completed the migration and now have an FTA setup in AWS Glue 5.0 that uses native Spark and reads from an Iceberg table. You saw that roles that are granted permission in Lake Formation can read and that roles that aren’t granted permission in Lake Formation cannot read.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the AWS Glue job glue-fta-demo-iceberg.
  2. Delete the Lake Formation permissions.
  3. Delete the bucket that you created for the input datasets, which might have a name similar to glue5-fta-demo-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE}.

Conclusion

This post explained how you can enable Spark native FTA in AWS Glue 5.0 jobs to enforce access control defined through Lake Formation grants. In previous AWS Glue versions, you needed to use GlueContext DataFrames to enforce FTA in AWS Glue jobs, or migrate to AWS Glue 5.0 FGAC, which supports a more limited set of Spark functionality. With this release, if you don’t need fine-grained control, you can enforce FTA through Spark DataFrames or Spark SQL for more flexibility and performance. This capability is currently supported for Iceberg and Hive tables.

This feature can save you effort and improve portability when migrating Spark scripts across serverless environments such as AWS Glue and Amazon EMR.


About the authors

Layth Yassin is a Software Development Engineer on the AWS Glue team. He’s passionate about tackling challenging problems at a large scale, and building products that push the limits of the field. Outside of work, he enjoys playing/watching basketball, and spending time with friends and family.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is also the author of the book Serverless ETL and Analytics with AWS Glue. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Kartik Panjabi is a Software Development Manager on the AWS Glue team. His team builds generative AI features for data integration and distributed systems for data integration.

Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers uncover insights and make better decisions using their data with AWS Analytics services. In his spare time, he enjoys skiing and gardening.