AWS Big Data Blog
Category: Analytics
Submitting User Applications with spark-submit
Francisco Oliveira is a consultant with AWS Professional Services. Customers starting their big data journey often ask for guidelines on how to submit user applications to Spark running on Amazon EMR. For example, customers ask for guidelines on how to size memory and compute resources available to their applications and the best resource allocation model […]
Turning Amazon EMR into a Massive Amazon S3 Processing Engine with Campanile
Michael Wallman is a senior consultant with AWS ProServ. Have you ever had to copy a huge Amazon S3 bucket to another account or region? Or create a list based on object name or size? How about mapping a function over millions of objects? Amazon EMR to the rescue! EMR allows you to deploy large […]
Agile Analytics with Amazon Redshift
Nick Corbett is a Big Data Consultant for AWS Professional Services. What makes outstanding business intelligence (BI)? It needs to be accurate and up-to-date, but this alone won’t differentiate a solution. Perhaps a better measure is to consider the reaction you get when your latest report or metric is released to the business. Good BI […]
Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming
Amo Abeyaratne is a Big Data consultant with AWS Professional Services. What if you could use your SQL knowledge to discover patterns directly from an incoming stream of data? Streaming analytics is a very popular topic of conversation around big data use cases. These use cases can vary from just accumulating simple web transaction […]
Running an External Zeppelin Instance using S3 Backed Notebooks with Spark on Amazon EMR
Dominic Murphy is an Enterprise Solution Architect with Amazon Web Services. Apache Zeppelin is an open source GUI which creates interactive and collaborative notebooks for data exploration using Spark. You can use Scala, Python, SQL (using Spark SQL), or HiveQL to manipulate data and quickly visualize results. Zeppelin notebooks can be shared among several users, […]
Query Routing and Rewrite: Introducing pgbouncer-rr for Amazon Redshift and PostgreSQL
This post was last reviewed and updated in August 2022 with a section on Deploying pgbouncer in Elastic Kubernetes Service (EKS). NOTE: You can now use federated queries in Amazon Redshift to query and analyze data across operational databases, data warehouses, and data lakes. For more information, please review the Amazon Redshift documentation article, “Querying Data […]
Securely Access Web Interfaces on Amazon EMR Launched in a Private Subnet
Ben Snively is a Solutions Architect with AWS. Private subnets allow you to limit access to deployed components, and to control security and routing of the system. You can also use a private subnet to connect an on-premises local network to AWS through a VPN or AWS Direct Connect. Amazon EMR allows customers to launch […]
Migrating Metadata when Encrypting an Amazon Redshift Cluster
NOTE: Amazon Redshift now supports enabling and disabling encryption with 1-click. For more information, please review this “What’s New” post. John Loughlin is a Solutions Architect with Amazon Web Services. A customer came to us asking for help expanding and modifying their Amazon Redshift cluster. In the course of responding to their request, we […]
Building a Near Real-Time Discovery Platform with AWS
September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details. February 9, 2024: Amazon Kinesis Data Firehose has been renamed to Amazon Data Firehose. Read the AWS What’s New post to learn more. Assaf Mentzer is a Senior Consultant for AWS Professional Services. In the spirit of the U.S. presidential […]
Integrating Splunk with Amazon Kinesis Streams
Prahlad Rao is a Solutions Architect with AWS. It is important to not only be able to stream and ingest terabytes of data at scale, but to quickly get insights and visualize data using available tools and technologies. The Amazon Kinesis platform of managed services enables continuous capture and stores terabytes of data per hour from […]


