AWS Big Data Blog
Category: Intermediate (200)
Incremental refresh for Amazon Redshift materialized views on data lake tables
Amazon Redshift now provides the ability to incrementally refresh your materialized views on data lake tables, including open file and table formats such as Apache Iceberg. In this post, we show you step by step which operations are supported on both open file formats and transactional data lake tables to enable incremental refresh of the materialized view.
Fine-grained access control in Amazon EMR Serverless with AWS Lake Formation
In this post, we discuss how to implement fine-grained access control in EMR Serverless using Lake Formation. With this integration, organizations can achieve better scalability, flexibility, and cost-efficiency in their data operations, ultimately driving more value from their data assets.
How Volkswagen Autoeuropa built a data mesh to accelerate digital transformation using Amazon DataZone
In this post, we discuss how Volkswagen Autoeuropa used Amazon DataZone to build a data marketplace based on data mesh architecture to accelerate their digital transformation. The data mesh, built on Amazon DataZone, simplified data access, improved data quality, and established governance at scale to power analytics, reporting, AI, and machine learning (ML) use cases. As a result, the data solution offers benefits such as faster access to data, expeditious decision making, accelerated time to value for use cases, and enhanced data governance.
Expanding data analysis and visualization options: Amazon DataZone now integrates with Tableau, Power BI, and more
Amazon DataZone has launched authentication support through the Amazon Athena JDBC driver, allowing data users to seamlessly query their subscribed data lake assets via popular business intelligence (BI) and analytics tools like Tableau, Power BI, Excel, SQL Workbench, DBeaver, and more. This integration empowers data users to access and analyze governed data within Amazon DataZone using familiar tools, boosting both productivity and flexibility.
Control your AWS Glue Studio development interface with AWS Glue job mode API property
The AWS Glue Jobs API is a robust interface that allows data engineers and developers to programmatically manage and run ETL jobs. To improve customer experience with the AWS Glue Jobs API, we added a new property describing the job mode corresponding to script, visual, or notebook. In this post, we explore in depth how the updated AWS Glue Jobs API works and demonstrate the new experience it provides.
Achieve the best price-performance in Amazon Redshift with elastic histograms for selectivity estimation
Amazon Redshift now offers enhanced query performance with optimizations such as Enhanced Histograms for Selectivity Estimation, which rely on metadata statistics gathered during ingestion when fresh statistics are not available. In this post, we cover new performance optimizations in Redshift data warehouse query processing and how elastic histogram statistics help enhance selectivity estimation and the overall quality of query plans in the absence of fresh table statistics.
Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job
The adoption of data lakes and the data mesh framework has emerged as a powerful approach. By decentralizing data ownership and distribution, enterprises can break down silos and enable seamless data sharing. In this post, we discuss how to choose the right tool for building an enterprise data platform and enabling data sharing, collaboration, and access within your organization and with third-party providers. We address three business use cases using AWS Glue, AWS Data Exchange, AWS Clean Rooms, and Amazon DataZone.
Single sign-on (SSO) for Amazon OpenSearch Service using SAML and Keycloak
In this post, we walk you through how to configure service provider-initiated authentication for OpenSearch Dashboards by using OpenSearch Service and Keycloak. We also discuss how to set up users, groups, and roles in Keycloak and configure their access to OpenSearch Dashboards.
Enriching metadata for accurate text-to-SQL generation for Amazon Athena
In this post, we demonstrate the critical role of metadata in text-to-SQL generation through an example implemented for Amazon Athena using Amazon Bedrock. We discuss the challenges in maintaining the metadata as well as ways to overcome those challenges and enrich the metadata.
Enhance Amazon EMR scaling capabilities with Application Master Placement
Starting with the Amazon EMR 7.2 release, Amazon EMR on EC2 introduced a new feature called Application Master (AM) label awareness, which allows users to enable YARN node labels to allocate the AM containers on On-Demand nodes only. In this post, we explore the key features and use cases where this new functionality can provide significant benefits, enabling cluster administrators to achieve optimal resource utilization, improved application reliability, and cost-efficiency in their EMR on EC2 clusters.