AWS Storage Blog

Analyzing Amazon S3 Metadata with Amazon Athena and Amazon QuickSight

UPDATE (1/27/2025): Amazon S3 Metadata is generally available.

UPDATE (7/15/2025): Amazon S3 Metadata releases live inventory tables. This blog post has been edited to reflect the current process.


Object storage provides virtually unlimited scalability, but managing billions, or even trillions, of objects can pose significant challenges. How do you know what data you have? How can you find the right datasets at the right time?

By implementing a robust metadata management strategy, you can answer these questions, gain better control over massive data lakes, improve data discoverability, accessibility, and extract more value from your data. AWS introduced Amazon S3 Metadata at re:Invent 2024, which delivers queryable object metadata in near real-time to organize your data and accelerate data discovery. S3 Metadata helps you to curate, identify, and use your S3 data for business analytics, real-time inference applications, and more.

In this post, we walk through how to set up and use S3 Metadata, how to derive actionable insights using simple SQL queries from Amazon Athena, and then how to visualize your data from Amazon QuickSight. We show you some example queries and dashboards that you can create to gain insights into and create visualizations of your S3 Metadata. You can use these queries, or create your own, to understand what data you have and to find the right datasets when you need them.

Amazon S3 Metadata

S3 Metadata delivers queryable object metadata in near real-time to organize your data and accelerate data discovery. S3 Metadata includes new fields which aren’t available today from other sources such as S3 Inventory reports or ListObjectsV2 API. For a detailed comparison, refer to the following table.

Metadata Available in S3 Inventory Available in ListObjectsV2 Available in S3 Metadata
bucket Y Y Y
key Y Y Y
sequence_number N N Y
record_type N N Y journal table only
record_timestamp N N Y journal table only
version_id Y Y Y
is_delete_marker Y N Y
size Y Y Y
last_modified_date Y Y Y
e_tag Y Y Y
storage_class Y Y Y
is_multipart Y N Y
encryption_status Y Y Y
is_bucket_key_enabled Y N Y
kms_key_arn N N Y
checksum_algorithm Y Y Y
object_tags N N Y
user_metadata N N Y
requester N N Y journal table only
source_ip_address N N Y journal table only
request_id N N Y journal table only
object_lock_retain_until_date Y N N
object_lock_mode Y N N
object_lock_legal_hold_status Y N N
intelligent_tiering_access_tier Y N N
bucket_key_status Y N N
object_access_control_list Y N N
object_owner N Y N
replication_status Y N N
restore_status N Y N

S3 Metadata helps you track and find information about your objects in S3, making it easier to manage data for analytics, content retrieval, and AI/ML use cases. When you set up S3 Metadata for your bucket, you can work with two types of metadata tables:

1. A journal table (required):

  • Provides queryable metadata for new objects and changes to existing objects
  • Refreshed in near real time
  • Required to enable live inventory tables
  • Cannot be removed from your configuration

2. A live inventory table (optional):

  • Provides queryable metadata of all objects in your bucket
  • Updated on an hourly basis
  • Backfills metadata for all existing objects

These metadata tables are stored in AWS managed S3 table buckets. By integrating with AWS analytics services, you can easily query your metadata using common tools like Athena, Amazon EMR, Amazon Redshift, Apache Spark, Trino, or any application compatible with the Apache Iceberg REST Catalog standard.

When you want to track information about your S3 objects, you need to create a metadata table configuration for your general purpose S3 bucket. Your new journal and live inventory tables will be created in an AWS managed S3 table bucket called aws-s3. The namespace for this table bucket will be b_<bucket name>, where bucket name is the general purpose S3 bucket that you have configured S3 Metadata for. After that, S3 automatically keeps your metadata tables up to date with any changes to your objects. You don’t need to manually update anything– it’s all managed for you. S3 continuously works in the background to optimize your metadata tables, keeping storage costs down, and helping your queries run faster. This automatic maintenance ensures your metadata stays current and efficient without requiring additional work from you.

Prerequisites

There are a few prerequisites before you get started:

1. Have or create an IAM role to use with S3, Glue, Lake Formation, Athena, and QuickSight. For more information, see IAM roles.

2. Ensure that you have access to Athena from your account.

3. Have or create an S3 general purpose bucket that has traffic. Since your S3 Metadata journal table collects mutations, you will need an S3 bucket that has PUT and/or DELETE traffic.

Walkthrough

From a high level, we:

1. Enable S3 Metadata on your general purpose S3 bucket

2. Run SQL queries on Athena to gain insights on S3 Metadata

3. Configure QuickSight to visualize S3 Metadata

1. Enable S3 Metadata on your general purpose S3 bucket

In this section, we walk through the following:

1.1. Verify that S3 Tables have the AWS analytics integration enabled in your Region.

1.2. Configure Key Management Service (AWS KMS) key for S3 Metadata (optional).

1.3. Configure S3 Metadata to collect data on a general purpose S3 bucket.

1.1. Enable integration with AWS analytics services

To get started, navigate to S3 in the AWS Management Console. On the left navigation menu, select Table buckets, and then verify that the Integration with AWS analytics services is Enabled. If not, select the Enable integration button.

S3 Tables console showing integration with AWS analytics services

1.2. Configure KMS key for S3 Metadata (optional)

Table buckets have a default encryption configuration that automatically encrypts tables by using server-side encryption with S3 managed keys (SSE-S3). If you need more control over your encryption keys, such as managing key rotation and access policy grants, you can configure your table buckets to use server-side encryption with KMS keys (SSE-KMS). If you do not want to use KMS, you may skip to section 1.3. When you use SSE-KMS with S3 Metadata, you need to grant permissions for different identities in your account. In this section, we walk through the following:

1.2.1 Create a new KMS key.

Navigate to the KMS in the AWS Management Console. On the left menu, select Customer managed key, and then Create key.

KMS console showing steps for creating a new customer managed key

At Step 1 – Configure key, leave the defaults values, and then select Next.

At Step 2 – Add labels, enter “S3MetadataBlog” as Alias, and then select Next.

KMS console with boxes around the key Alias and the Next button

In Step 3 – Define key administrative permissions, select the IAM user/role you are logged in with, and then select Next.

In Step 4 – Define key usage permissions, leave the default values and then select Next.

In Step 5 – Edit key policy, leave the default values and then select Next.

In Step 6 – Review, finally, review the configuration and select Next to create the KMS key.

1.2.2 Modify the KMS key policy.

When you use server-side encryption with AWS Key Management Service (AWS KMS) keys (SSE-KMS) for tables in S3 table buckets you need to grant permissions for different identities in your account:

    • The S3 Tables maintenance principal (maintenance.s3tables.amazonaws.com)
    • The S3 Metadata service principal (metadata.s3.amazonaws.com)
    • Your service role for AWS analytics services (S3TablesRoleForLakeFormation)

Select the KMS key you just created (alias S3MetadataBlog), and then select Switch to policy view.

KMS console Key policy screen with box around the Switch to policy view button

Select Edit, add the following two additional statements, and then select Save changes. Make sure to replace the placeholder values with the correct ones:

{
  "Sid": "EnableKeyUsage",
  "Effect": "Allow",
  "Principal": {
    "Service": [
      "maintenance.s3tables.amazonaws.com",
      "metadata.s3.amazonaws.com"
    ]
  },
  "Action": [
    "kms:GenerateDataKey",
    "kms:Decrypt"
  ],
  "Resource": "<KMS_KEY_ARN>",
  "Condition": {
    "StringLike": {
      "kms:EncryptionContext:aws:s3:arn": "arn:aws:s3tables:<AWS_REGION>:<AWS_ACCOUNT_ID>:bucket/aws-s3/*"
    }
  }
},
{
  "Sid": "AllowTableRoleAccess",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::<AWS_ACCOUNT_ID>:role/service-role/S3TablesRoleForLakeFormation"
  },
  "Action": [
    "kms:GenerateDataKey",
    "kms:Decrypt"
  ],
  "Resource": "<KMS_KEY_ARN>"
}

1.3. Configure S3 Metadata to collect data on a general purpose S3 bucket

You can now navigate to a general purpose bucket to configure S3 Metadata to start producing a metadata table. You can now navigate to a general purpose S3 bucket to configure S3 Metadata to start producing a metadata table. This is the bucket that you are interested in having S3 Metadata automatically surface object metadata. Once you have navigated to the bucket in the S3 console, select the Metadata tab. Then, select Create metadata configuration.

S3 console with boxes around the Metadata tab and the Create metadata configuration button

In the Journal table section, select Specify encryption type, select Server-side encryption with AWS Key Management Service keys (SSE-KMS), and then choose the key with alias S3MetadataBlog you created in the previous step. In the Record expiration – new section, enable the record expirations, set the expiration after 7 days, and acknowledge the warning about record expiration. By default, the records in the journal table don’t expire. To help minimize the storage costs for your journal table, you can enable journal table record expiration.

Steps for configuring the S3 Metadata journal table

In the Live inventory table – new section, select Enabled for the Configuration status, and then follow the same steps to configure the server-side encryption. Finally, choose Create metadata configuration.

Steps for configuring the S3 Metadata Live inventory table

Initially, both Journal table and Live Inventory table will be in Creating status.

S3 Console Metadata tab showing Journal and Live inventory tables in Creating status

After a few minutes, the Journal table status should change to Active and the Live inventory table status should change to Backfilling. In this status, S3 scans your general purpose bucket to retrieve the initial metadata for all objects that exist in the bucket. Depending on the number of objects in your bucket, this process can take several hours.

S3 Console Metadata tab showing Journal table in Active status and Live inventory table in Backfilling status.

When the backfilling process is finished, the status of your Live inventory table changes from Backfilling to Active.

Console Metadata tab showing Journal and Live inventory tables in Active status

2. Run SQL queries on Amazon Athena to gain insights on S3 Metadata

In this section, we walk through the following:

2.1. Verify access to S3 Metadata in Athena.

2.2. Explore your S3 Metadata with Athena. You will try some sample queries to better understand what kinds of insights are possible.

2.1. Verify access to S3 Metadata in Athena

After you created the metadata configuration, from the Metadata tab, choose Query table with Athena, and then select one of the pre-configured queries.

S3 Console Metadata tab with a box around the Query table with Athena button

On the left side of the editor, in the Data section, review the configuration of the Catalog and Database, and then select the Run button to run the query to verify that everything is working correctly.

Athena Query editor with box around Catalog and Database and box around the Run button

2.2. Explore your S3 Metadata with Athena

Athena is an interactive query service that makes it easy to analyze data in S3 using standard SQL. Here is a series of sample queries you can use to analyze your S3 Metadata from Athena. First, we perform queries on the journal table to analyze how things are changing in the S3 bucket.

Find objects that match a pattern.

SELECT key from "journal" where key LIKE ‘%.jpg’

What objects were deleted in the last 24 hours?

SELECT bucket, key, version_id, last_modified_date, record_timestamp, requester
from "journal"
where requester = 's3.amazonaws.com'
and record_type = 'DELETE'
and record_timestamp > (current_date - interval '1' day)

How many objects has S3 Lifecycle expired in the last 7 days?

SELECT bucket, key, version_id, last_modified_date, record_timestamp from "journal" where requester = 's3.amazonaws.com' and record_type = 'DELETE' and record_timestamp > (current_date - interval '7' day);

What IP address(es) have recent requests come from?

SELECT source_ip_address, count(source_ip_address) from "journal" group by source_ip_address;

What KMS keys have been used for PUTs in the last 7 days?

SELECT DISTINCT kms_key_arn FROM s3_metadata_primary where record_timestamp > (current_date - interval '7' day);

Now, let’s perform some queries on the live inventory table.

What objects are not encrypted with KMS?

SELECT key, encryption_status FROM "inventory" WHERE encryption_status != 'SSE-KMS';

Create a distribution of object tags.

SELECT object_tags, count(object_tags)
FROM "inventory"
group by object_tags;

Are there any objects without tags?

SELECT key, last_modified_date, object_tags FROM "inventory" WHERE cardinality(object_tags) = 0 ORDER BY last_modified_date DESC;

Find objects in a given prefix in the S3 Glacier Deep Archive storage class and create a CSV file that could be used for a batch job to restore those objects.

SELECT bucket, key, version_id FROM "inventory" where key LIKE ‘legal/%’ and storage_class = ‘DEEP_ARCHIVE’;

What is the storage distribution among my S3 prefixes?

SELECT
    CASE
        WHEN STRPOS(key, '/') > 0 THEN SUBSTR(key, 1, STRPOS(key, '/') - 1)
        ELSE key
    END AS first_level_prefix,
    SUM(CAST(size AS BIGINT)) AS total_size_bytes,
    SUM(CAST(size AS BIGINT))/1024/1024 AS total_size_mb,
    SUM(CAST(size AS BIGINT))/1024/1024/1024 AS total_size_gb,
    COUNT(*) AS object_count
FROM 
    inventory
GROUP BY
    CASE
        WHEN STRPOS(key, '/') > 0 THEN SUBSTR(key, 1, STRPOS(key, '/') - 1)
        ELSE key
    END
ORDER BY 
    total_size_bytes DESC;

3. Visualize your S3 Metadata in QuickSight

Amazon QuickSight is a powerful cloud-based business intelligence (BI) service that empowers organizations of all sizes to unlock the value of their data. You can use QuickSight to create dashboards and visualizations using data stored in S3 Metadata tables. The following are some examples of dashboards you can create:

  • What percentage of your storage data is small objects compared to large object?
  • What is the percentage of object uploads compared to deletions?
  • What percentage of images files are in my bucket?

In this section, we walk through the following:

3.1. Configure permissions for the QuickSight service role. You need to give the QuickSight service role the Glue:GetCatalog permission.

3.2. Configure Lake Formation permissions to the QuickSight admin user. You will grant your QuickSight user with permissions to your database and the tables within your database.

3.3. Create a QuickSight dashboard. You will use custom SQL to load your S3 Metadata table into QuickSight. This will allow you visualize your S3 Metadata in customer Amazon QuickSight dashboards.

3.1. Configure permissions for the QuickSight service role

1. Navigate to Identity and Access Management (IAM) from the AWS Management Console.

2. Under Roles, identify the QuickSight service role, i.e., aws-quicksight-service-role-v0.

3. Under the Permissions tab, add an inline policy for the Glue GetCatalog permission to access the S3 table catalog from QuickSight. Use the following JSON to add the inline policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "glue:GetCatalog",
            "Resource": "*"
        }
    ]
}

3.2. Configure permissions for the QuickSight admin user

Identify the QuickSight admin user and its Amazon Resource Name (ARN) to grant Lake Formation permissions to access the S3 table catalog. Run the following CLI command from AWS CloudShell (or any terminal you choose) and note down the QuickSight admin user’s ARN.

$aws quicksight list-users --aws-account-id <AWS Account ID> --namespace default --region <AWS Region>

AWS CloudShell CLI output to capture the Amazon QuickSight user ARN

Navigate to the Lake Formation service from the AWS Management Console. Select Data permissions > Grant.

AWS Lake Formation data permissions

In the Principals section, select SAML users and groups and add the QuickSight admin user’s ARN. Then, select the Named Data Catalog Resources and grant the database permission and table permissions separately, as shown in the following pictures.

First, we will add permissions at the database level.

Grant database permissions to SAML users and groups

Select Named Data Catalog resources and choose the S3 Tables catalog and database.

Lake Formation console with catalog and database selected

Select the database permissions.

Lake Formation console with catalog, database, and tables selected

We now need to do the same thing, but this time at the table level. From the Lake Formation Console, select Data Permissions > Grant again.

Grant table permissions to SAML users and groups

Select Named Data Catalog resources and choose the S3 Tables catalog, database, and table.

Select Named Data Catalog resources and choose the S3 Tables catalogs, databases, and tables

Select the table permissions.

Select and Describe check boxes ticked for Table and Grantable permissions. All data access radio box selected for Data permissions

Navigate to the AWS Lake Formation service from the AWS Management Console. Select Data Permissions and verify that you have the permissions on the database, table, and column resources.

Lake Formation console showing the created permissions with boxes around the resource type, permissions, and Grantable

3.3. Create a QuickSight dashboard

Use the following steps to create a dashboard:

1. Sign in to the QuickSight console. From the user profile on the top right of the QuickSight console, choose the same AWS Region where you have created the S3 table bucket. For example, if you created the S3 table bucket in the Ohio (us-east-2) Region, select the same Region in QuickSight.

Selecting the Region in Amazon QuickSight

2. In the QuickSight dashboard, click on New analysis. On the New analysis page, click on New dataset.
3. In the Create a Dataset modal, select Athena as the data source, enter a name for your dataset, and click Create data source.

Create Athena data source

4. On the Choose your table session, select Use custom SQL. Please note, you will not be able to see the table in the dropdown list.

Choose your table

5. Enter a custom SQL query based on the columns you would like to visualize, and then select Confirm Query. In this example, you have selected all the columns by providing the following query. Replace <bucket-name> with your general purpose S3 bucket where you configured S3 Metadata.

SELECT * from "s3tablescatalog/aws-s3"."b_<bucket-name>"."inventory";
SELECT * from "s3tablescatalog/aws-s3"."b_<bucket-name>"."journal";
Screenshot_Quicksight_custom_SQL

6. Click Visualize to analyze data and creating dashboards. Please note, you won’t see the ‘s3tablecatalog’ and ‘aws_s3_metadata’ as databases. The data source will be the same name that you have given in step 3.

Selecting 'Visualize' to start creating the QuickSight dashboard

7. Select Create in the QuickSight to get started. To analyze and create dashboards, refer to the Visualizing data and Exploring dashboards sections in the QuickSight User Guide.

Selecting 'Create' to start work on the first Amazon QuickSight chart

8. Select the columns you would like to analyze and create dashboards.

Selecting columns to create charts for analysis

Examples of QuickSight dashboards:

Examples of Amazon QuickSight charts created using S3 Metadata

Cleaning up

When your work is complete in Athena and QuickSight, make sure you disable the S3 Metadata configuration and delete the table to ensure you do not continue to incur costs. To delete the S3 Metadata configuration, go to the S3 Metadata table for the bucket you created the configuration for and click Delete.

Screenshot of S3 Metadata with Delete button highlighted with orange box

To delete the tables, go to the Athena console as you did previous and run the follow queries to delete the tables:

DROP "journal";
DROP "inventory";

Conclusion

In this post, we walked through a step-by-step process to configure S3 Metadata and enable integration with Athena and QuickSight. We provided examples of SQL queries that you can use to find data insights into your S3 data.

Amazon S3 gives you the ability to store virtually unlimited objects in a single bucket for use cases like archive, analytics and, data lakes. While this makes it a powerful tool for storing data, there are situations where you want to manage individual files intelligently for reporting and automation to achieve business goals. Typical business goals are to understand usage, verify security, increase efficiency, drive down costs, or migrate data to other environments selectively. By bringing in AWS Analytics services to help analyze your object metadata, you can manage your data and achieve your business objectives.

To learn more about S3 Metadata, read the AWS News Blog.

Thanks for taking the time to read this post. If you have any feedback or questions, please don’t hesitate to leave them in the comments section.

Lokesh AP

Lokesh AP

Lokesh AP is a Sr. Solutions Architect specializing in AWS Storage services. Prior to joining AWS, he worked as a technologist for companies such as NetApp and Western Digital. Based in California, he enjoys playing cricket.

Tom Bailey

Tom Bailey

Tom Bailey is a Senior Technical Account Manager at AWS, where he supports Financial Services customers. He has over 15 years experience in IT consultancy, presales, and architecture roles. Outside of work, he enjoys spending time with his family, walking his Labrador, running, and cycling.

Huey Han

Huey Han

Huey Han is a Senior Product Manager for Amazon S3. He focuses on data lake, analytics, and data governance at S3. He is based in New York City. In his spare time, Huey enjoys martial arts.

Lee Kear

Lee Kear

Lee Kear has been working in IT since she received her Master’s Degree in Computer Science from the Georgia Institute of Technology in 1999. She started working at AWS in 2012 as a systems engineer on the Amazon S3 team. She became the first Storage Specialist Solutions Architect specializing in S3 in 2016 and she still enjoys this role today. She loves to help customers use S3 in the most efficient, performant, and cost effective way possible for their use case. Outside of work, she enjoys traveling with her wife.

Fabio Lattanzi

Fabio Lattanzi

Fabio is a Sr. Solution Architect focused on AWS Transfer Family and Amazon S3. He enjoys helping customers build the most durable, scalable, performant, and cost-effective storage solutions for their use case. He is based in Utah, loves traveling with his wife, cuddling with his two dogs, and playing drums.

Roohi Sood

Roohi Sood

Roohi is a Senior Product Manager at Amazon S3. She is passionate about driving innovation in cloud object storage, helping shape the future of data storage and retrieval at scale. Beyond her professional pursuits, she finds joy in exploring music and cherishing quality time with her family.