AWS Big Data Blog

Category: Learning Levels

Optimize Amazon EMR runtime for Apache Spark with EMR S3A

With the Amazon EMR 7.10 runtime, Amazon EMR has introduced EMR S3A, an improved implementation of the open source S3A file system connector. In this post, we showcase the read and write performance advantages of using the Amazon EMR 7.10.0 runtime for Apache Spark with EMR S3A, compared to EMRFS and the open source S3A file system connector.

Amazon OpenSearch Serverless monitoring: A CloudWatch setup guide

In this post, we explore commonly used Amazon CloudWatch metrics and alarms for OpenSearch Serverless, walking through the process of selecting relevant metrics, setting appropriate thresholds, and configuring alerts. This guide provides a monitoring strategy that complements the serverless nature of your OpenSearch deployment while maintaining full operational visibility.

Use Apache Airflow workflows to orchestrate data processing on Amazon SageMaker Unified Studio

Orchestrating machine learning pipelines is complex, especially when data processing, training, and deployment span multiple services and tools. In this post, we walk through a hands-on, end-to-end example of developing, testing, and running a machine learning (ML) pipeline using workflow capabilities in Amazon SageMaker, accessed through the Amazon SageMaker Unified Studio experience. These workflows are powered by Amazon Managed Workflows for Apache Airflow.

Trellix achieved 35% cost savings and enhanced security with Amazon OpenSearch Service

Trellix, a global leader in cybersecurity solutions, emerged in 2022 from the merger of McAfee Enterprise and FireEye. To address exponential log growth across their multi-tenant, multi-Region infrastructure, Trellix used Amazon OpenSearch Service, Amazon OpenSearch Ingestion, and Amazon S3 to modernize their log infrastructure. In this post, we share how, by adopting these AWS solutions, Trellix enhanced their system’s performance, availability, and scalability while reducing operational overhead.

Announcing cross-account ingestion for Amazon OpenSearch Service

Amazon OpenSearch Ingestion is a powerful data ingestion pipeline that AWS customers use for many different purposes, such as observability, analytics, and zero-ETL search. Many customers today push logs, traces, and metrics from their applications to OpenSearch Ingestion to store and analyze this data. Today, we are happy to announce that OpenSearch Ingestion pipelines now […]

Integrate Tableau and PingFederate with Amazon Redshift using AWS IAM Identity Center

In this post, we provide a guide for setting up single sign-on from Tableau Desktop to Amazon Redshift using IAM Identity Center integration, with PingFederate as the identity provider (IdP) and an LDAP-based data store, AWS Directory Service for Microsoft Active Directory.

Tailor Amazon SageMaker Unified Studio project environments to your needs using custom blueprints

Amazon SageMaker Unified Studio is a single data and AI development environment that brings together data preparation, analytics, machine learning (ML), and generative AI development in one place. By unifying these workflows, it saves teams from managing multiple tools and makes it straightforward for data scientists, analysts, and developers to build, train, and deploy ML […]

Unlock the power of Apache Iceberg v3 deletion vectors on Amazon EMR

As modern data architectures expand, Apache Iceberg has become a widely popular open table format, providing ACID transactions, time travel, and schema evolution. In table format v2, Iceberg introduced merge-on-read, improving delete and update handling through positional delete files. These files improve write performance but can slow down reads when not compacted, since Iceberg must […]

Get started with Amazon OpenSearch Service: T-shirt size your domain for log analytics

When you’re spinning up your Amazon OpenSearch Service domain, you need to figure out the storage, instance types, and instance count; decide the sharding strategies and whether to use a cluster manager; and enable zone awareness. Generally, we consider storage as a guideline for determining instance count, but not other parameters. In this post, we […]