AWS Big Data Blog
Zero-copy, Coordination-free approach to OpenSearch Snapshots
In this blog post, we tell you how we enhanced the snapshot efficiency in Amazon OpenSearch Service while carefully maintaining these critical operational aspects. These snapshot optimizations are enabled for all OpenSearch optimized instance family (OR1, OR2, OM2) domains from version 2.17 onwards.
Enhance governance with asset type usage policies in Amazon SageMaker
In this post, we introduce authorization policies for custom asset types—a new governance capability in Amazon SageMaker that gives organizations fine-grained control over who can create and manage assets using specific templates. This feature enhances data governance by allowing teams to enforce usage policies that align with business and security requirements across the organization.
Petabyte-scale data migration made simple: AppsFlyer’s best practice journey with Amazon EMR Serverless
In this post, we share how AppsFlyer successfully migrated their massive data infrastructure from self-managed Hadoop clusters to Amazon EMR Serverless, detailing their best practices, challenges to overcome, and lessons learned that can help guide other organizations in similar transformations.
Configure cross-account access of Amazon SageMaker Lakehouse multi-catalog tables using AWS Glue 5.0 Spark
In this post, we show you how to share an Amazon Redshift table and Amazon S3 based Iceberg table from the account that owns the data to another account that consumes the data. In the recipient account, we run a join query on the shared data lake and data warehouse tables using Spark in AWS Glue 5.0. We walk you through the complete cross-account setup and provide the Spark configuration in a Python notebook.
Introducing Amazon Q Developer in Amazon OpenSearch Service
today we introduced Amazon Q Developer support in OpenSearch Service. With this AI-assisted analysis, both new and experienced users can navigate complex operational data without training, analyze issues, and gain insights in a fraction of the time. In this post, we share how to get started using Amazon Q Developer in OpenSearch Service and explore some of its key capabilities.
Accelerate lightweight analytics using PyIceberg with AWS Lambda and an AWS Glue Iceberg REST endpoint
In this post, we demonstrate how PyIceberg, integrated with the AWS Glue Data Catalog and AWS Lambda, provides a lightweight approach to harness Iceberg’s powerful features through intuitive Python interfaces. We show how this integration enables teams to start working with Iceberg tables with minimal setup and infrastructure dependencies.
Save big on OpenSearch: Unleashing Intel AVX-512 for binary vector performance
With OpenSearch version 2.19, Amazon OpenSearch Service now supports hardware-accelerated enhanced latency and throughput for binary vectors. In this post, we discuss the improvements these advanced processors provide to your OpenSearch workloads, and how it can help you lower your total cost of ownership (TCO).
Automate replication of row-level security from AWS Lake Formation to Amazon QuickSight
This post outlines a solution to automatically replicate the entitlements for readers from the source (AWS Lake Formation) to Amazon QuickSight. This solution can be used even when the authentication method in Amazon QuickSight is not using IAM Identity Center and can work with both direct query and SPICE datasets in Amazon QuickSight.
Amazon OpenSearch Service launches flow builder to empower rapid AI search innovation
The AI search flow builder is available in all AWS Regions that support OpenSearch 2.19+ on OpenSearch Service. In this post, we walk through a couple of scenarios to demonstrate the flow builder. First, we’ll enable semantic search on your old keyword-based OpenSearch application without client-side code changes. Next, we’ll create a multi-modal RAG flow, to showcase how you can redefine image discovery within your applications.
Build end-to-end Apache Spark pipelines with Amazon MWAA, Batch Processing Gateway, and Amazon EMR on EKS clusters
This post shows how to enhance the multi-cluster solution by integrating Amazon Managed Workflows for Apache Airflow (Amazon MWAA) with BPG. By using Amazon MWAA, we add job scheduling and orchestration capabilities, enabling you to build a comprehensive end-to-end Spark-based data processing pipeline.