AWS Partner Network (APN) Blog
Build real-time data lakes with Snowflake and Amazon S3 Tables
By Nidhi Gupta, Sr Partner Solutions Architect – AWS
By Andries Engelbrecht, Principal Partner Solutions Architect – Snowflake
In today’s data-driven landscape, real-time analysis of massive data streams is crucial for business competitiveness. Organizations need robust architectures that can efficiently ingest, process, and analyze data in real-time. This facilitates quick decision-making in areas like fraud detection, predictive maintenance, personalized marketing, and real-time analytics.
While Apache Parquet and Apache Iceberg help manage large-scale data workloads, frequent updates create many small files, which become a challenge at scale. Efficient file compaction is necessary but complex to scale. Advanced solutions are needed that can handle high-volume, continuous data streams while integrating seamlessly with existing environments. In this post, you will learn how to build scalable, real-time data lakes by combining Amazon S3 Tables with Snowflake.
Amazon S3 Tables and Snowflake
Amazon Simple Storage Service (Amazon S3) Tables are purpose-built for storing and managing tabular data at scale, with built-in Apache Iceberg support. They automatically perform maintenance tasks including compaction, snapshot management, and unreferenced file removal. With optimizations specific to Iceberg workloads, S3 Tables continuously optimizes storage for improved query performance and reduced costs while providing streamlined security controls. Organizations can stream and query S3 tables through AWS analytics services like Amazon Data Firehose by integrating table buckets with AWS Glue Data Catalog and AWS Lake Formation.
S3 Tables now offers table management APIs that are compatible with the Iceberg REST Catalog standard, enabling any Iceberg-compatible application to query data in-place in an S3 table bucket. You can also integrate S3 Tables with Amazon SageMaker Lakehouse, making it quick to query and join S3 Tables with data in S3 data lakes, Amazon Redshift data warehouses, and third-party data sources. The next generation of Amazon SageMaker is built on an open lakehouse architecture, fully compatible with Apache Iceberg. This gives you the flexibility to access and query data in-place with all Iceberg-compatible tools and engines. You can secure data in the lakehouse by defining fine-grained permissions that are enforced across all analytics and machine learning (ML) tools and engines.
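As a rough illustration of what "any Iceberg-compatible application" can look like in practice, the following minimal sketch builds the connection properties an open source client such as PyIceberg might use against the S3 Tables Iceberg REST endpoint. The Region, account ID, and bucket name are placeholders, the helper names are hypothetical, and the endpoint and property names are assumptions that should be verified against the current S3 Tables and PyIceberg documentation.

```python
from typing import Dict


def s3tables_rest_properties(region: str, account_id: str, table_bucket: str) -> Dict[str, str]:
    """Build Iceberg REST catalog properties for an S3 table bucket.

    Assumption: the endpoint, warehouse ARN format, and SigV4 property names
    follow the S3 Tables and PyIceberg documentation at the time of writing.
    """
    return {
        "type": "rest",
        "uri": f"https://s3tables.{region}.amazonaws.com/iceberg",
        # The warehouse identifies the table bucket by its ARN.
        "warehouse": f"arn:aws:s3tables:{region}:{account_id}:bucket/{table_bucket}",
        # Requests are signed with AWS Signature Version 4.
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": region,
    }


def open_catalog(properties: Dict[str, str]):
    """Open the catalog with PyIceberg (third-party; needs AWS credentials)."""
    # Imported lazily so the property builder works without PyIceberg installed.
    from pyiceberg.catalog import load_catalog

    return load_catalog("s3tables", **properties)


if __name__ == "__main__":
    # Placeholder values, not real resources.
    props = s3tables_rest_properties("us-east-1", "111122223333", "iot-sensor-bucket")
    for key, value in props.items():
        print(f"{key} = {value}")
```

Calling `open_catalog(props)` would then return a catalog object for listing namespaces and loading tables, provided the caller's AWS credentials have the required permissions.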
Snowflake makes enterprise AI easy, connected and trusted, so enterprises can innovate faster and get more value from data. Thousands of companies around the globe use Snowflake’s AI Data Cloud to unify data across teams, regions, formats and catalogs to build, use and share data, applications and AI. The integration between Snowflake and Amazon S3 Tables enables efficient data lake management through open standards. It lets you use Snowflake's strengths, including its zero-tuning engine and its fully managed, native support for open standards. Together, Snowflake and Amazon S3 Tables provide a foundation for high-performance data architectures for analytics and AI workloads.
Use case
The growing need for real-time data processing has created high demand for scalable data lake architectures that can effectively manage streaming data, helping businesses maintain their competitive edge in today’s market. Businesses can use scalable data lake solutions with Amazon S3 Tables and Snowflake to:
- Build data lakes by ingesting streaming data from diverse sources like Internet of Things (IoT) devices, system telemetry, electric vehicles, geolocation data, utility service usage, and more
- Deliver up to 10x higher transactions per second, compared with self-managed tables in general purpose S3 buckets, for streaming and other high-frequency use cases
- Analyze your entire Iceberg data estate through a single, connected view in Snowflake
- Build a managed transactional data lake with built-in table optimization to reduce overhead on the ingesting applications
- Access third-party Iceberg-compatible analytics engines
We’ll illustrate how manufacturing, energy, and other industries can accelerate and improve real-time analytics of their IoT data. By combining Amazon S3 Tables, Snowflake, and SageMaker Lakehouse with an Iceberg REST endpoint, you can build an efficient data lake architecture for streaming analytics.
Solution overview
Let’s explore how a manufacturing company can transform its operations by implementing a cloud-based industrial data solution. The architecture in Figure 1 below streams sensor data from global manufacturing facilities directly into Amazon S3 Tables. This enables near real-time monitoring and analytics without the need to manage and maintain the Iceberg tables. Through S3 Tables integration with Snowflake, and SageMaker Lakehouse with an Iceberg REST endpoint, manufacturing companies can combine operational insights with predictive maintenance capabilities.
Together, these capabilities can improve industrial efficiency through data analytics and machine learning. Additionally, Snowflake’s collaboration and data sharing capabilities can help reduce latency in supply chain operations.
Figure 1 – Real-time data lake using Snowflake and S3 Tables
Let’s dive deep into the different parts of the overall solution.
Ingest streaming data from IoT devices directly into S3 table buckets by using Amazon Data Firehose:
- Create an S3 table bucket and integrate it with AWS Glue Data Catalog and AWS Lake Formation.
- Stream payloads from IoT devices to the AWS IoT Core message broker, using a device-specific MQTT topic such as device/data/DEVICE_ID.
- An AWS IoT rule is triggered when a payload arrives on its topic. In this use case, the rule is configured with an Amazon Data Firehose action.
- Amazon Data Firehose buffers the device payloads and delivers them to the data store when either the buffer size or the buffer interval is reached, whichever happens first. Amazon Data Firehose delivers near real-time streaming data to Amazon S3 Tables as a destination for storing or processing.
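The ingestion steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the configuration used in this solution: the stream name, role ARN, and helper names are hypothetical, the rule payload follows the shape accepted by the AWS IoT `CreateTopicRule` API, and `BufferModel` is a toy model of the "size or time, whichever happens first" behavior rather than how Amazon Data Firehose is implemented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

TOPIC_PREFIX = "device/data"  # matches the device/data/DEVICE_ID topic layout


def topic_for(device_id: str) -> str:
    """Return the MQTT topic a given device publishes to."""
    return f"{TOPIC_PREFIX}/{device_id}"


def build_topic_rule_payload(stream_name: str, role_arn: str) -> dict:
    """Build a topic rule payload that forwards device messages to Firehose.

    Assumption: stream_name and role_arn are placeholders for resources
    that already exist in your account.
    """
    return {
        # '+' matches exactly one topic level; topic(3) is the DEVICE_ID segment.
        "sql": f"SELECT *, topic(3) AS device_id FROM '{TOPIC_PREFIX}/+'",
        "awsIotSqlVersion": "2016-03-23",
        "ruleDisabled": False,
        "actions": [
            {"firehose": {"deliveryStreamName": stream_name, "roleArn": role_arn}}
        ],
    }


@dataclass
class BufferModel:
    """Toy model: flush when buffered bytes or elapsed seconds hit a limit."""

    max_bytes: int
    max_seconds: float
    records: List[bytes] = field(default_factory=list)
    size: int = 0
    opened_at: Optional[float] = None

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Buffer a record; return the flushed batch if a limit was reached."""
        if self.opened_at is None:
            self.opened_at = now  # the interval starts with the first record
        self.records.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_seconds:
            batch = self.records
            self.records, self.size, self.opened_at = [], 0, None
            return batch
        return None


if __name__ == "__main__":
    print(topic_for("sensor-42"))
    print(build_topic_rule_payload("iot-to-s3tables", "arn:aws:iam::111122223333:role/iot-firehose")["sql"])
```

With boto3, the payload could be passed to `boto3.client("iot").create_topic_rule(ruleName=..., topicRulePayload=payload)`. Unlike this model, which only checks its limits when a record arrives, a real delivery stream also flushes on a timer.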
Set up Snowflake to access S3 Tables
Set up an Apache Iceberg REST catalog integration to access Amazon S3 Tables through SageMaker Lakehouse with an Iceberg REST endpoint. Use AWS Lake Formation to manage storage credential vending and AWS Signature Version 4 (SigV4) authentication in Snowflake. For detailed setup steps on accessing S3 Tables from Snowflake, refer to Connect Snowflake to S3 Tables using the SageMaker Lakehouse Iceberg REST endpoint.
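To give a feel for what such a catalog integration might look like, the sketch below renders a Snowflake `CREATE CATALOG INTEGRATION` statement from a few inputs. It is an assumption-laden illustration: the integration name, namespace, account ID, bucket, and role ARN are placeholders, the helper function is hypothetical, and the parameter names (in particular `CATALOG_NAME` and `ACCESS_DELEGATION_MODE`) should be checked against the current Snowflake documentation and the setup guide referenced above before use.

```python
def catalog_integration_sql(
    name: str,
    namespace: str,
    region: str,
    account_id: str,
    table_bucket: str,
    role_arn: str,
) -> str:
    """Render a CREATE CATALOG INTEGRATION statement for an Iceberg REST catalog.

    Assumption: parameter names follow Snowflake's Iceberg REST catalog
    integration syntax; inputs are trusted values, not user-supplied strings.
    """
    return f"""CREATE CATALOG INTEGRATION {name}
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = '{namespace}'
  REST_CONFIG = (
    CATALOG_URI = 'https://glue.{region}.amazonaws.com/iceberg'
    CATALOG_API_TYPE = AWS_GLUE
    CATALOG_NAME = '{account_id}:s3tablescatalog/{table_bucket}'
    ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS
  )
  REST_AUTHENTICATION = (
    TYPE = SIGV4
    SIGV4_IAM_ROLE = '{role_arn}'
    SIGV4_SIGNING_REGION = '{region}'
  )
  ENABLED = TRUE;"""


if __name__ == "__main__":
    # Placeholder values, not real resources.
    print(
        catalog_integration_sql(
            name="s3tables_rest_int",
            namespace="iot",
            region="us-east-1",
            account_id="111122223333",
            table_bucket="iot-sensor-bucket",
            role_arn="arn:aws:iam::111122223333:role/snowflake-s3tables-access",
        )
    )
```

Here `VENDED_CREDENTIALS` corresponds to the storage credential vending managed by AWS Lake Formation, and the `SIGV4` block corresponds to the SigV4 authentication mentioned above.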
Analyze data in S3 Tables using Snowflake analytics
Snowflake can directly analyze the Iceberg-format data in S3 Tables in near real time, with managed and improved query performance. Additionally, the S3 Tables data can be used in Snowflake for near real-time ML use cases with Amazon SageMaker, Snowflake Notebooks, and Snowpark Python, as well as for generative AI use cases with Snowflake Cortex AI and Amazon Bedrock.
Conclusion
Organizations today struggle with managing high-volume data streams, ensuring low latency processing, and maintaining performance at scale. This solution addresses these challenges by combining Amazon S3 Tables with Snowflake to create a robust architecture that efficiently ingests, processes, and analyzes data in near real-time.
We discussed how you can build scalable data lakes with Amazon S3 Tables and Snowflake. We demonstrated how manufacturing and other industries can overcome challenges with analyzing real-time streaming data from sensors, transforming their operations by implementing a cloud-based industrial data solution.
Organizations benefit from streamlined data governance with robust security controls, allowing teams to focus on extracting business insights rather than managing complex infrastructure. This comprehensive solution transforms data lakes from unwieldy repositories into streamlined, high-performance analytics platforms that drive better business decisions.
Connect with Snowflake and AWS to learn how we can help accelerate your business.
Further reading
- Get started with AWS
- Explore AWS Analytics services
- Learn about S3 Tables by reading the user guide
- Get Started with Amazon SageMaker Lakehouse
- Get started with Snowflake in AWS Marketplace
Snowflake – AWS Partner Spotlight
Snowflake is an AWS Advanced Technology Partner and AWS Competency Partner that provides the AI Data Cloud, which organizations use to unify data across teams, regions, formats, and catalogs to build, use, and share data, applications, and AI.