
Guidance for Generating Rule Recommendations for Entity Resolution on AWS

Overview

This Guidance demonstrates an automated approach for generating rule recommendations to match, link, and enhance related records using AWS Entity Resolution rule-based matching. It provides an AWS Glue notebook that streamlines the creation of effective matching rules. The notebook reads input data from Amazon S3, performs data quality analysis, and uses a large language model (LLM) on Amazon Bedrock to produce customized rule recommendations. Each recommendation includes the reasoning behind the suggested rule. The Guidance also implements a sampling approach to test the generated rules and resolve entities.
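As a rough illustration of the data quality analysis step, the following Python sketch computes per-column completeness and uniqueness and turns the statistics into a prompt for an LLM. It is a minimal sketch under stated assumptions, not the notebook's actual code: the function names, the choice of statistics, the prompt wording, and the sample records are all hypothetical, and the call to a model on Amazon Bedrock is deliberately left out.

```python
def profile_columns(records):
    """Compute completeness and uniqueness per column for a list of dict records.

    Hypothetical helper: the statistics chosen here are an assumption,
    not the Guidance's actual data quality checks.
    """
    columns = sorted({key for row in records for key in row})
    total = len(records)
    profile = {}
    for col in columns:
        values = [row.get(col) for row in records]
        non_empty = [v for v in values if v not in (None, "")]
        profile[col] = {
            # Share of records that have a value in this column.
            "completeness": len(non_empty) / total if total else 0.0,
            # Share of distinct values among the non-empty values.
            "uniqueness": len(set(non_empty)) / len(non_empty) if non_empty else 0.0,
        }
    return profile


def build_prompt(profile):
    """Render the profile as plain text for an LLM (prompt wording is an assumption)."""
    lines = ["Column statistics (completeness, uniqueness):"]
    for col, stats in profile.items():
        lines.append(f"- {col}: {stats['completeness']:.2f}, {stats['uniqueness']:.2f}")
    lines.append("Recommend matching rules as lists of columns, with reasoning for each rule.")
    return "\n".join(lines)


# Hypothetical sample records standing in for data read from Amazon S3.
SAMPLE_RECORDS = [
    {"id": "1", "name": "Ann Lee", "email": "ann@example.com"},
    {"id": "2", "name": "Ann Lee", "email": "ann@example.com"},
    {"id": "3", "name": "Bob Ray", "email": ""},
    {"id": "4", "name": "Cy Poe", "email": "cy@example.com"},
]
```

Columns that are mostly complete and reasonably distinctive are plausible candidates for match keys, which is why these two statistics are a natural input for a rule recommendation prompt.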

How it works

Overview

This architecture diagram shows an overview of how to generate rule recommendations using an LLM hosted on Amazon Bedrock and an AWS Glue notebook and how to use these rules in a rule-based matching workflow in AWS Entity Resolution.
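To make the idea of testing recommended rules on a sample more concrete, here is a simplified local simulation in Python. It assumes each rule is a list of columns that must all be non-empty and equal after light normalization, and that records linked by any rule belong to the same group. This is an approximation for experimentation only; AWS Entity Resolution's own matching behavior may differ, and the function names, rules, and sample data are hypothetical.

```python
import random
from collections import defaultdict


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(str(value or "").lower().split())


def sample_records(records, k, seed=0):
    """Draw a reproducible random sample of at most k records."""
    return random.Random(seed).sample(records, min(k, len(records)))


def _find(parent, i):
    # Union-find lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def resolve(records, rules):
    """Group record indices that are linked by at least one rule.

    Each rule is a list of column names; two records match on a rule
    when every listed column is non-empty and equal after normalization.
    """
    parent = list(range(len(records)))
    for rule in rules:
        if not rule:
            continue  # An empty rule would match everything, so skip it.
        first_seen = {}
        for idx, row in enumerate(records):
            key = tuple(normalize(row.get(col)) for col in rule)
            if not all(key):
                continue  # Missing values never match.
            if key in first_seen:
                parent[_find(parent, idx)] = _find(parent, first_seen[key])
            else:
                first_seen[key] = idx
    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[_find(parent, idx)].append(idx)
    return sorted(groups.values())


# Hypothetical rules and sample data for illustration.
RULES = [["email"], ["name", "phone"]]
SAMPLE = [
    {"name": "Ann Lee", "email": "ann@example.com", "phone": "555-0100"},
    {"name": "ann  lee", "email": "ANN@example.com", "phone": ""},
    {"name": "Ann Lee", "email": "", "phone": "555-0100"},
    {"name": "Bob Ray", "email": "bob@example.com", "phone": "555-0199"},
]
```

Calling `resolve(SAMPLE, RULES)` groups the first three records together (two are linked by email, two by name and phone) and leaves the fourth on its own. Running the same check on `sample_records(...)` output gives a quick, inexpensive way to see whether a recommended rule over-matches or under-matches before running a full workflow.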

Diagram of an AWS cloud workflow for entity resolution, showing data flow from Amazon S3 through AWS Glue, Amazon Bedrock, and AWS Step Functions for rule-based matching.

Incremental rule-based workflow

This architecture diagram shows how to run an incremental rule-based matching workflow in AWS Entity Resolution using an AWS Step Functions workflow.

Diagram of an AWS data processing workflow using EventBridge, AWS Glue, Lambda functions, and S3 buckets for pre-processing, rule-based matching, and post-processing, with outputs stored in S3 tables.

Deploy with confidence

Ready to deploy? Review the sample code on GitHub for detailed deployment instructions, then deploy it as-is or customize it to fit your needs.


Well-Architected Pillars

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

AWS Glue is a managed service that runs workloads and provides monitoring metrics for jobs. It offers fault tolerance with support for retries in case of failures. AWS Glue crawlers automate the discovery of data schemas. Together, these features create a scalable, fault-tolerant system that provides insight into job runtime metrics.

Read the Operational Excellence whitepaper 

AWS Identity and Access Management (IAM) policies are scoped down to the minimum permissions required for services to function properly. Data stored in Amazon S3 uses encryption at rest. These measures limit unauthorized access to resources and protect data integrity. By implementing tight access controls and encrypting data at rest, the Guidance enhances overall security posture and helps meet compliance requirements.
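As an illustration of what a scoped-down policy can look like, the Python sketch below builds an IAM policy document that grants read access to a single prefix in one S3 bucket. The bucket name, prefix, statement IDs, and helper function are hypothetical, and the policies shipped with the sample code may be structured differently.

```python
import json


def build_s3_read_policy(bucket, prefix):
    """Return an IAM policy document limited to reading one S3 prefix.

    Hypothetical helper for illustration; not the Guidance's actual policy.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is restricted to keys under the given prefix.
                "Sid": "ListInputPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Object reads are restricted to the same prefix.
                "Sid": "ReadInputObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


# Example with a hypothetical bucket and prefix.
EXAMPLE_POLICY_JSON = json.dumps(
    build_s3_read_policy("example-input-bucket", "entity-resolution/input"), indent=2
)
```

Granting only the actions and resources a job needs, rather than a wildcard such as `s3:*` on all buckets, is the practical meaning of minimum permissions here.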

Read the Security whitepaper 

As managed services, AWS Glue, AWS Entity Resolution, Amazon Bedrock, and Step Functions reduce the operational burden of maintaining reliability, allowing the system to recover from failures automatically. These services support retries for recovery from failures and integrate with Amazon CloudWatch to provide operational insights.

Read the Reliability whitepaper 

AWS Glue offers a serverless architecture that scales compute resources up or down based on workload demands. It provides different worker types that users can choose from based on their specific workload requirements. AWS Glue connects with other AWS services through AWS networking services and can run within a virtual private cloud (VPC). This flexibility in resource selection, combined with automatic scaling, helps the system handle varying workload intensities efficiently.

Read the Performance Efficiency whitepaper 

This Guidance uses managed services that follow a pay-as-you-go pricing model, meaning you only pay for the resources you use. AWS Glue is serverless, providing scaling capabilities that help optimize costs. AWS Entity Resolution charges based on the volume of ingested data. Amazon S3 costs depend on data storage and access patterns. Step Functions charges based on the number of state transitions. This usage-based pricing across services helps ensure that costs align closely with actual resource consumption.

Read the Cost Optimization whitepaper 

As a serverless service, AWS Glue only consumes resources when actively processing data. It offers features like data partitioning and compression, which reduce storage and compute resource requirements for data processing pipelines. Its automatic scaling based on workload helps optimize resource utilization and reduce energy consumption.

Read the Sustainability whitepaper 

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.