AWS Big Data Blog

Category: Intermediate (200)

Introducing AWS Glue Auto Scaling: Automatically resize serverless computing resources for lower cost with optimized Apache Spark

October 2024: This post has been updated to cover Interactive Sessions support for AWS Glue Auto Scaling. June 2023: This post was reviewed and updated for accuracy. Data created in the cloud has been growing rapidly in recent years, so scalability is a key factor in distributed data processing. Many customers benefit from the scalability of […]

Best practices to optimize data access performance from Amazon EMR and AWS Glue to Amazon S3

June 2024: This post was reviewed for accuracy and updated to cover Apache Iceberg. June 2023: This post was reviewed and updated for accuracy. Customers are increasingly building data lakes to store data at massive scale in the cloud. It’s common to use distributed computing engines, cloud-native databases, and data warehouses when you want to […]

Use unsupervised training with K-means clustering in Amazon Redshift ML

June 2023: This post was reviewed and updated for accuracy. Amazon Redshift is a fast, petabyte-scale cloud data warehouse delivering the best price–performance. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Data analysts and database developers want to use this data to train […]

Simplify data integration pipeline development using AWS Glue custom blueprints

June 2023: This post was reviewed and updated for accuracy. August 2021: AWS Glue custom blueprints are now generally available. Please visit https://docs.aws.amazon.com/glue/latest/dg/blueprints-overview.html to learn more. Organizations spend significant time developing and maintaining data integration pipelines that hydrate data warehouses, data lakes, and lake houses. As data volume increases, data engineering teams struggle to keep up with […]

Get started with the Amazon Redshift Data API

June 2023: This post was reviewed and updated for accuracy. The GitHub repository mentioned in this post is now updated with examples for serverless. Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that enables you to analyze your data at scale. Tens of thousands of customers use Amazon Redshift to […]

Using the Amazon Redshift Data API to interact with Amazon Redshift clusters

June 2023: This post was reviewed and updated for accuracy. July 2021: This post was reviewed and updated to include multi-statement and parameterization support. Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL […]