Guidance for Enhanced Document Search Using Content and Metadata Enrichment on AWS

Overview

This Guidance demonstrates how to use the Custom Document Enrichment feature of Amazon Kendra to improve search experiences. Documents with precise content and rich metadata are more searchable and yield more accurate results. You can make large repositories of raw documents more searchable by modifying their content or adding metadata before indexing.

How it works

Enhance search experiences by adding metadata to your documents with Custom Document Enrichment in Amazon Kendra.
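As a rough illustration of how enrichment might be attached to a data source, the sketch below builds a `CustomDocumentEnrichmentConfiguration` with a pre-extraction and a post-extraction Lambda hook and passes it to the Kendra `update_data_source` call. The function names, ARNs, bucket name, and IDs are hypothetical placeholders and are not taken from this Guidance; the client is assumed to be created elsewhere, for example with `boto3.client("kendra")`.

```python
from typing import Any, Dict


def build_cde_configuration(
    role_arn: str,
    bucket: str,
    pre_extraction_lambda_arn: str,
    post_extraction_lambda_arn: str,
) -> Dict[str, Any]:
    """Build a CustomDocumentEnrichmentConfiguration dict with two Lambda hooks."""
    return {
        # Role that Amazon Kendra assumes to invoke the hooks and use the bucket.
        "RoleArn": role_arn,
        # Hook that runs on the raw document, before text extraction.
        "PreExtractionHookConfiguration": {
            "LambdaArn": pre_extraction_lambda_arn,
            "S3Bucket": bucket,
        },
        # Hook that runs on the extracted text, before indexing.
        "PostExtractionHookConfiguration": {
            "LambdaArn": post_extraction_lambda_arn,
            "S3Bucket": bucket,
        },
    }


def attach_to_data_source(
    kendra_client: Any, index_id: str, data_source_id: str, config: Dict[str, Any]
) -> None:
    """Apply the enrichment configuration to an existing Kendra data source."""
    kendra_client.update_data_source(
        Id=data_source_id,
        IndexId=index_id,
        CustomDocumentEnrichmentConfiguration=config,
    )
```

Keeping the configuration builder separate from the API call makes the structure easy to inspect and test without AWS credentials.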

Well-Architected Pillars

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

Amazon Kendra uses Amazon CloudWatch Logs to provide insight into the operation of data sources. It logs process details for documents as they are indexed, along with errors from the data source that occur during indexing. You can use CloudWatch Logs to monitor, store, and access the log files. With minimal user intervention, CloudWatch can continuously analyze metrics of systems and applications, determine normal baselines, and surface anomalies. The AWS CloudFormation template can be easily modified and extended to integrate changes.

Read the Operational Excellence whitepaper 

The CloudFormation infrastructure as code (IaC) automation deploys resources to the AWS Cloud securely. This reduces the risk of human error associated with manual configuration or management.

Lambda functions are configured through AWS Identity and Access Management (IAM) with least-privilege access, limiting access to just the required Amazon S3 data buckets.
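A least-privilege policy of this kind could look roughly like the following sketch, which generates an IAM policy document scoped to objects in a named list of buckets. The bucket names, the statement ID, and the choice of `s3:GetObject` and `s3:PutObject` are assumptions for illustration; the actions the Guidance's functions actually require may differ.

```python
import json
from typing import Any, Dict, List


def least_privilege_s3_policy(bucket_names: List[str]) -> Dict[str, Any]:
    """Return an IAM policy document allowing object reads and writes
    only in the listed buckets (the action list is an assumption)."""
    object_arns = [f"arn:aws:s3:::{name}/*" for name in bucket_names]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "EnrichmentObjectAccess",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": object_arns,
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket name, used only to show the rendered JSON.
    print(json.dumps(least_privilege_s3_policy(["example-data-bucket"]), indent=2))
```

Listing explicit object ARNs rather than `*` keeps the function from reaching buckets it does not need.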

Read the Security whitepaper 

The Enterprise Edition of Amazon Kendra is highly available by default within a Region and can tolerate Availability Zone failures. Lambda runs in multiple Availability Zones so that it remains available to process events if a service interruption occurs in a single zone.

The pre-extraction Lambda function is configured to run for a maximum of 5 minutes, so text extraction from each audio or video file must complete within that time. The post-extraction Lambda function is configured to run for a maximum of 1 minute, so Amazon Comprehend must detect entities in the text within that time.
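One way to stay inside these limits is to check the remaining execution time before starting each unit of work and stop cleanly when the budget runs low. The sketch below is a generic pattern, not the Guidance's implementation; `process_within_budget`, `handle_chunk`, and the 5-second reserve are hypothetical. Inside a Lambda handler, `remaining_ms` could be the context object's `get_remaining_time_in_millis` method.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_within_budget(
    chunks: Iterable[T],
    handle_chunk: Callable[[T], R],
    remaining_ms: Callable[[], int],
    reserve_ms: int = 5_000,
) -> Tuple[List[R], List[T]]:
    """Process chunks in order until the time budget is nearly spent.

    Returns the results produced so far and the chunks left unprocessed,
    so the caller can decide how to handle the remainder.
    """
    results: List[R] = []
    pending = list(chunks)
    # Stop while at least reserve_ms remains, leaving time to write output.
    while pending and remaining_ms() > reserve_ms:
        results.append(handle_chunk(pending.pop(0)))
    return results, pending
```

Returning the unprocessed chunks, rather than letting the function time out, gives the caller a chance to record partial results.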

Amazon Kendra is integrated with AWS CloudTrail, a service that provides a record of actions taken by a user, role, or an AWS service in Amazon Kendra. CloudTrail captures all API calls from Amazon Kendra as events, including calls from the Amazon Kendra console and from code calls to the Amazon Kendra APIs.

Read the Reliability whitepaper 

The services used in this Guidance are purpose-built for their tasks. Amazon Transcribe creates transcriptions of audio and video files. Amazon Textract extracts text from scanned image documents. Amazon Comprehend detects entities within the text.
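To show how detected entities might become searchable metadata, the sketch below converts a list of entities, in the shape returned under the `Entities` key by the Amazon Comprehend `detect_entities` call, into the `metadataUpdates` list that a Custom Document Enrichment Lambda hook returns to Amazon Kendra. Using the entity type as the attribute name and a 0.8 score threshold are assumptions for illustration; matching custom attributes would need to exist in the index.

```python
from collections import defaultdict
from typing import Any, Dict, List


def entities_to_metadata_updates(
    entities: List[Dict[str, Any]], min_score: float = 0.8
) -> List[Dict[str, Any]]:
    """Group entity texts by type into string-list metadata attributes."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        # Skip low-confidence detections (the threshold is an assumption).
        if entity["Score"] < min_score:
            continue
        values = grouped[entity["Type"]]
        # Keep each distinct text once, in first-seen order.
        if entity["Text"] not in values:
            values.append(entity["Text"])
    # One attribute per entity type, sorted by name for stable output.
    return [
        {"name": entity_type, "value": {"stringListValue": texts}}
        for entity_type, texts in sorted(grouped.items())
    ]
```

In a post-extraction hook, the returned list would be placed under the `metadataUpdates` key of the Lambda response.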

With CloudFormation IaC, this Guidance can be deployed to any supported AWS Region close to the user base to decrease latency and improve performance. 

The code runs in Lambda functions, which provide serverless compute without infrastructure to manage. The functions automatically scale in and out to meet changes in demand.

Read the Performance Efficiency whitepaper 

Serverless architectures and services such as Lambda, Amazon Textract, Amazon Comprehend, and Amazon Transcribe provide a pay-as-you-go pricing model that is based on usage. And because they're serverless, these services scale based on demand. 

AWS Budgets can help to plan budgets for cost and usage. Lambda can be used with Compute Savings Plans to reduce cost.

Read the Cost Optimization whitepaper 

Lambda functions consume compute resources only while the application logic runs, and the execution environment is shut down afterward. This reduces infrastructure use and cost.

Read the Sustainability whitepaper 

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.