AWS Big Data Blog

Category: Intermediate (200)

A side-by-side comparison of Apache Spark and Apache Flink for common streaming use cases

Apache Flink and Apache Spark are both open-source, distributed data processing frameworks used widely for big data processing and analytics. Spark is known for its ease of use, high-level APIs, and the ability to process large amounts of data. Flink shines in its ability to handle processing of data streams in real-time and low-latency stateful […]

Simplify external object access in Amazon Redshift using automatic mounting of the AWS Glue Data Catalog

November 2024: This post was reviewed and updated for accuracy. Amazon Redshift is a petabyte-scale, enterprise-grade cloud data warehouse service delivering the best price-performance. Today, tens of thousands of customers run business-critical workloads on Amazon Redshift to cost-effectively and quickly analyze their data using standard SQL and existing business intelligence (BI) tools. Amazon Redshift now […]

Five actionable steps to GDPR compliance (Right to be forgotten) with Amazon Redshift

The GDPR (General Data Protection Regulation) right to be forgotten, also known as the right to erasure, gives individuals the right to request the deletion of their personally identifiable information (PII) data held by organizations. This means that individuals can ask companies to erase their personal data from their systems and any third parties with […]

Near-real-time analytics using Amazon Redshift streaming ingestion with Amazon Kinesis Data Streams and Amazon DynamoDB

Amazon Redshift is a fully managed, scalable cloud data warehouse that accelerates your time to insights with fast, easy, and secure analytics at scale. Tens of thousands of customers rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries, making it the widely used cloud data warehouse. You can run and […]

Improved scalability and resiliency for Amazon EMR on EC2 clusters

Amazon EMR is the cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning using open-source frameworks such as Apache Spark, Apache Hive, and Presto. Customers asked us for features that would further improve the resiliency and scalability of their Amazon EMR on EC2 clusters, including their large, long-running clusters. We have […]

End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue

Data is a key enabler for your business. Many AWS customers have integrated their data across multiple data sources using AWS Glue, a serverless data integration service, in order to make data-driven business decisions. To grow the power of data at scale for the long term, it’s highly recommended to design an end-to-end development lifecycle […]

Build data integration jobs with AI companion on AWS Glue Studio notebook powered by Amazon CodeWhisperer

Data is essential for businesses to make informed decisions, improve operations, and innovate. Integrating data from different sources can be a complex and time-consuming process. AWS offers AWS Glue to help you integrate your data from multiple sources on serverless infrastructure for analysis, machine learning (ML), and application development. AWS Glue provides different authoring experiences […]

Configure monitoring, limits, and alarms in Amazon Redshift Serverless to keep costs predictable

Amazon Redshift Serverless makes it simple to run and scale analytics in seconds. It automatically provisions and intelligently scales data warehouse compute capacity to deliver fast performance, and you pay only for what you use. Just load your data and start querying right away in the Amazon Redshift Query Editor or in your favorite business […]

Query your Apache Hive metastore with AWS Lake Formation permissions

Apache Hive is a SQL-based data warehouse system for processing highly distributed datasets on the Apache Hadoop platform. There are two key components to Apache Hive: the Hive SQL query engine and the Hive metastore (HMS). The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, […]

Dimensional modeling in Amazon Redshift

Amazon Redshift is a fully managed and petabyte-scale cloud data warehouse that is used by tens of thousands of customers to process exabytes of data every day to power their analytics workload. You can structure your data, measure business processes, and get valuable insights quickly can be done by using a dimensional model. Amazon Redshift […]