Amazon Comprehend features
Why Amazon Comprehend?
Custom entity recognition
Custom entity recognition lets you customize Amazon Comprehend to identify terms that are specific to your domain. Using AutoML, Amazon Comprehend learns from a small set of examples (for example, a list of policy numbers, claim numbers, or Social Security numbers) and then trains a private, custom model to recognize these terms in any other block of text within PDFs, plain text, or Microsoft Word documents, with no ML experience required. Refer to this documentation page for more details.
Example: In this example, an insurance company would like to analyze text documents for entities specific to its business, such as policy numbers.
Sample Text: Hi, my name is Sam Ford and I am filing a claim for a car accident. My policy code is 456-YQT.
Example
Entity | Category | Count | Confidence |
---|---|---|---|
456-YQT | Policy_ID | 1 | 0.95 |
Custom classification
The custom classification API enables you to easily build custom text classification models using your business-specific labels without learning ML. For example, your customer support organization can use custom classification to automatically categorize inbound requests by problem type based on how the customer has described the issue. With your custom model, it is easy to moderate website comments, triage customer feedback, and organize workgroup documents. Refer to this documentation page for more details.
Example: Let’s say you want to organize your customer support feedback at an airline company. You want to organize each piece of feedback into Account Questions, Ticket Refunds and Flight Complaints. To train the service, you create a CSV file that contains example text from each issue, and label each sample with one of the three labels that applies. The service will automatically train a custom model on your behalf. To use your model to analyze all of the calls the next day, you submit each text file to the service and receive the labeled results along with a confidence of the label match.
Example
Text | Label | Confidence score |
---|---|---|
Line 0 | Account Question | 0.92 |
Line 1 | Ticket Refund | 1 |
Line 2 | Flight Complaint | 1 |
Line 3 | Flight Complaint | 0.91 |
Doc5.csv | Ticket Refund | 1 |
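The training file described above can be sketched as follows. This is a minimal illustration, assuming the headerless two-column layout (label first, then text) used for multi-class training data; the sample sentences and the `build_training_csv` helper are hypothetical and not part of the service.

```python
import csv
import io

def build_training_csv(samples):
    """Serialize (label, text) pairs as headerless CSV rows: label, then text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, text in samples:
        # Collapse whitespace and newlines so each sample stays on one row.
        writer.writerow([label, " ".join(text.split())])
    return buffer.getvalue()

# Hypothetical feedback samples, one per label from the airline example.
samples = [
    ("Account Questions", "How do I reset the password on my frequent flyer account?"),
    ("Ticket Refunds", "My flight was cancelled and I would like my money back."),
    ("Flight Complaints", "The plane sat on the runway for two hours, with no updates."),
]
training_csv = build_training_csv(samples)
print(training_csv)
```

The `csv` module quotes any text that contains a comma, so free-form feedback does not break the two-column layout.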
Entity recognition
The entity recognition API returns the named entities ("People," "Places," "Locations," etc.) that are automatically categorized based on the provided text. Refer to this documentation page for more details.
Example: In this example, we are looking at the description of a company. The API identifies entities like Organization, Date, Location, and returns a confidence score.
Sample Text: Amazon.com, Inc. is located in Seattle, WA and was founded July 5th, 1994 by Jeff Bezos, allowing customers to buy everything from books to blenders. Seattle is north of Portland and south of Vancouver, BC. Other notable Seattle-based companies are Starbucks and Boeing.
Example
Entity | Category | Confidence |
---|---|---|
Amazon.com, Inc. | Organization | 0.96 |
Seattle, WA | Location | 0.96 |
July 5th, 1994 | Date | 0.99 |
Jeff Bezos | Person | 0.99 |
Seattle | Location | 0.98 |
Portland | Location | 0.99 |
Vancouver, BC | Location | 0.97 |
Starbucks | Organization | 0.91 |
Boeing | Organization | 0.99 |
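A call that produces rows like the ones above might look like this sketch. It assumes a boto3 Comprehend client is passed in and uses its `detect_entities` operation; the helper name `summarize_entities` and the `min_score` cutoff are illustrative choices, not part of the API.

```python
def summarize_entities(client, text, language_code="en", min_score=0.9):
    """Return (text, type, score) for each entity at or above min_score."""
    response = client.detect_entities(Text=text, LanguageCode=language_code)
    return [
        (entity["Text"], entity["Type"], round(entity["Score"], 2))
        for entity in response["Entities"]
        if entity["Score"] >= min_score  # min_score is a hypothetical cutoff
    ]

# Real usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("comprehend")
#   summarize_entities(client, "Amazon.com, Inc. is located in Seattle, WA ...")
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object before pointing it at the live service.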
Sentiment analysis
The sentiment analysis API returns the overall sentiment of a text (Positive, Negative, Neutral, or Mixed). Refer to this documentation page for more details.
Example: In this example, a customer is posting his feedback on a pair of shoes. The API identifies the sentiment expressed by the customer along with a confidence score.
Sample Text: I ordered a small and expected it to fit just right but it was a little bit more like a medium-large. It was great quality. It's a lighter brown than pictured but fairly close. Would be ten times better if it was lined with cotton or wool on the inside.
Example
Sentiment | Score |
---|---|
Mixed | 0.89 |
Positive | 0.09 |
Negative | 0.01 |
Neutral | 0.00 |
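As a rough sketch of reading this result in code, the helper below returns the overall label together with its score. It assumes a boto3 Comprehend client and the `detect_sentiment` response shape (`Sentiment` plus a `SentimentScore` map); `dominant_sentiment` is a hypothetical helper name.

```python
def dominant_sentiment(client, text, language_code="en"):
    """Return the overall sentiment label and the score reported for it."""
    response = client.detect_sentiment(Text=text, LanguageCode=language_code)
    label = response["Sentiment"]        # e.g. "MIXED"
    scores = response["SentimentScore"]  # keys: Positive, Negative, Neutral, Mixed
    return label, scores[label.capitalize()]

# Real usage (requires boto3 and AWS credentials):
#   import boto3
#   dominant_sentiment(boto3.client("comprehend"), "It was great quality.")
```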
Targeted sentiment
Targeted sentiment provides more granular sentiment insights by identifying the sentiment (positive, negative, neutral, or mixed) towards entities within text. Refer to this documentation page for more details.
Example: In this example, a restaurant is reviewing a customer review to understand where they can improve their business.
Sample Text: I loved the burger, but the service was slow.
Example
Text | Entity Type | Entity Confidence Score | Sentiment | Score |
---|---|---|---|---|
I | Person | 0.99 | Neutral | 0.99 |
Burger | Other | 0.99 | Positive | 0.99 |
Service | Attribute | 0.99 | Negative | 0.99 |
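The targeted sentiment response nests mentions inside entities, so a small flattening step helps when building a table like the one above. This sketch assumes the response shape of the boto3 `detect_targeted_sentiment` operation; `flatten_targeted_sentiment` is a hypothetical helper.

```python
def flatten_targeted_sentiment(response):
    """Turn the nested Entities/Mentions structure into flat (text, type, sentiment) rows."""
    rows = []
    for entity in response["Entities"]:
        for mention in entity["Mentions"]:
            rows.append((
                mention["Text"],
                mention["Type"],
                mention["MentionSentiment"]["Sentiment"],
            ))
    return rows

# Real usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("comprehend")
#   response = client.detect_targeted_sentiment(
#       Text="I loved the burger, but the service was slow.", LanguageCode="en")
#   flatten_targeted_sentiment(response)
```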
PII identification and redaction
Use Amazon Comprehend ML capabilities to detect and redact personally identifiable information (PII) in customer emails, support tickets, product reviews, social media, and more. No ML experience required. For example, you can analyze support tickets and knowledge articles to detect PII entities and redact the text before you index the documents in the search solution. After that, search solutions are free of PII entities in documents. Redacting PII entities helps you protect privacy and comply with local laws and regulations. Refer to this documentation page for more details.
Example: In this example, a customer wants to redact personal and financial data from a bank statement. The PII redaction API identifies and redacts PII and returns a confidence score for each entity.
Sample Text: Hello John Smith. Your AnyCompany Financial Services, LLC credit card account 1111-0000-1111-0008 has a minimum payment of $24.53 that is due by July 31st. Based on your autopay settings, we will withdraw your payment on the due date from your bank account number XXXXXX1111 with the routing number XXXXX0000.
Example
Entity | Type | Score |
---|---|---|
John Smith | Name | 0.99+ |
1111-0000-1111-0008 | Credit debit number | 0.99+ |
July 31st | Date time | 0.99+ |
XXXXXX1111 | Bank account number | 0.99+ |
XXXXX0000 | Bank routing number | 0.99+ |
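One way to apply the detected entities yourself is to replace each span by its type using the offsets in the response. The sketch below assumes the entity dictionaries returned by the boto3 `detect_pii_entities` operation (`Type`, `Score`, `BeginOffset`, `EndOffset`); the `redact` helper and its `min_score` cutoff are hypothetical.

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected PII span with its entity type in brackets."""
    # Work from the end of the text backwards so that earlier offsets
    # stay valid after each replacement changes the string length.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity["Score"] >= min_score:  # min_score is a hypothetical cutoff
            text = (
                text[: entity["BeginOffset"]]
                + f"[{entity['Type']}]"
                + text[entity["EndOffset"]:]
            )
    return text

# Real usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("comprehend")
#   entities = client.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
#   redact(text, entities)
```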
Toxicity detection
Comprehend toxicity detection provides a simple, NLP-based solution for toxic content detection in text-based documents. The capability is available out of the box to moderate peer-to-peer conversation in online platforms and generative AI inputs and outputs. Refer to this documentation page for more details.
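A moderation step built on this capability could be sketched as below. It assumes a boto3 Comprehend client and the `detect_toxic_content` operation, which returns one result per input segment with an overall `Toxicity` score; the `flag_toxic_segments` helper and the 0.5 threshold are illustrative assumptions to tune for your own content.

```python
def flag_toxic_segments(client, segments, threshold=0.5):
    """Return the input segments whose overall toxicity score meets the threshold."""
    response = client.detect_toxic_content(
        TextSegments=[{"Text": segment} for segment in segments],
        LanguageCode="en",
    )
    # ResultList is assumed to be in the same order as the input segments.
    return [
        segment
        for segment, result in zip(segments, response["ResultList"])
        if result["Toxicity"] >= threshold  # threshold is a hypothetical cutoff
    ]

# Real usage (requires boto3 and AWS credentials):
#   import boto3
#   flag_toxic_segments(boto3.client("comprehend"), ["first message", "second message"])
```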
Prompt safety classification
Comprehend provides a pre-trained binary classifier that classifies an input prompt as harmful or not. You can integrate it so that LLMs respond only to harmless content. Refer to this documentation page for more details.
Keyphrase extraction
The keyphrase extraction API returns the key phrases or talking points and a confidence score to support that this is a key phrase. Refer to this documentation page for more details.
Example: In this example, a customer is comparing a DSLR camera to an instant film camera. The API extracts key phrases and returns a confidence score about the results.
Sample text: I'm an avid photographer, and I'm primarily found shooting with my DSLR or my instant film camera that I carry around for casual use. While nothing beats my DSLR in power and convenience, there's something magical about my instant film camera. Perhaps it's that you're shooting on actual film, or maybe it's that every shot you take is a unique physical artifact (which is special in today's world of Instagram and Facebook, where photos are a dime a dozen). All I know for sure is that they are incredibly fun to use and peoples' eyes light up when you pull one of these out at a party.
Example
Keyphrase | Confidence |
---|---|
an avid photographer | 0.99 |
my DSLR | 0.97 |
my instant film camera | 0.99 |
casual use | 0.99 |
power and convenience | 0.94 |
actual film | 0.99 |
every shot | 0.92 |
a unique physical artifact | 0.99 |
today | 0.91 |
world | 0.99 |
Instagram and Facebook | 0.99 |
Events detection
Comprehend events lets you extract the event structure from a document, distilling pages of text down to easily processed data for consumption by your AI applications or graph visualization tools. This API allows you to answer who-what-when-where questions over large document sets, at scale and without prior NLP experience. Use Comprehend events to extract granular details about real-world events and associated entities expressed in unstructured text. Refer to this documentation page for more details.
Language detection
The language detection API automatically identifies text written in over 100 languages and returns the dominant language with a confidence score to support that a language is dominant. Refer to this documentation page for more details.
Example: In this example, the API parses the text and is able to identify the dominant language in the text as Italian along with a confidence score.
Sample Text: Amazon Elastic Compute Cloud (Amazon EC2) è un servizio Web che fornisce capacità di elaborazione sicura e scalabile nel cloud. È concepito per rendere più semplice il cloud computing su scala Web per gli sviluppatori.
Example
ISO-639-1 Language Code | Language | Confidence |
---|---|---|
it | Italian | 1.0 |
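In code, picking the dominant language from the response might look like this sketch. It assumes a boto3 Comprehend client and the `detect_dominant_language` operation, which returns a list of candidate languages with scores; `dominant_language` is a hypothetical helper name.

```python
def dominant_language(client, text):
    """Return the (language code, score) pair with the highest score."""
    languages = client.detect_dominant_language(Text=text)["Languages"]
    best = max(languages, key=lambda language: language["Score"])
    return best["LanguageCode"], best["Score"]

# Real usage (requires boto3 and AWS credentials):
#   import boto3
#   dominant_language(boto3.client("comprehend"), "È concepito per rendere più semplice ...")
```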
Syntax analysis
The Amazon Comprehend syntax API enables customers to analyze text using tokenization and Parts of Speech (PoS) and identify word boundaries and labels like nouns and adjectives within the text. Refer to this documentation page for more details.
Example: In this example, we analyze a short document using the Comprehend Syntax API. The Syntax API tokenizes the text (defines word boundaries) and labels each word with its associated part of speech, such as noun or verb. In addition to the beginning and ending offsets (so you know where the word is within the text), the API also returns a confidence score.
Sample Text: I love my fast, new Kindle Fire!
Example
Text | Tag |
---|---|
I | Pronoun |
Love | Verb |
My | Pronoun |
Fast | Adjective |
, | Punctuation |
New | Adjective |
Kindle | Proper noun |
Fire | Proper noun |
! | Punctuation |
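The tokens, tags, and offsets can be pulled out of the response with a sketch like the following. It assumes a boto3 Comprehend client and the `detect_syntax` response shape (`SyntaxTokens`, each with a `PartOfSpeech` tag); `tag_tokens` is a hypothetical helper, and the service returns abbreviated tags rather than the spelled-out names in the table.

```python
def tag_tokens(client, text, language_code="en"):
    """Return (token, part-of-speech tag, begin offset, end offset) per token."""
    response = client.detect_syntax(Text=text, LanguageCode=language_code)
    return [
        (
            token["Text"],
            token["PartOfSpeech"]["Tag"],
            token["BeginOffset"],
            token["EndOffset"],
        )
        for token in response["SyntaxTokens"]
    ]

# Real usage (requires boto3 and AWS credentials):
#   import boto3
#   tag_tokens(boto3.client("comprehend"), "I love my fast, new Kindle Fire!")
```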
Topic modeling
Topic modeling identifies relevant terms or topics from a collection of documents stored in Amazon S3. It will identify the most common topics in the collection and organize them in groups and then map which documents belong to which topic. Refer to this documentation page for more details.
Example: If your documents (Doc1.txt, Doc2.txt, Doc3.txt, and Doc4.txt) are stored in Amazon S3, and you point Amazon Comprehend to their location, Comprehend will analyze the documents and return two views:
1. Grouping of keywords that are topics
Each group of keywords is associated with a topic group. Weight refers to the prevalence of that keyword within the group. Keywords with the weight closest to 1 are most indicative of the topic group’s context.
Example 1
Topic Group | Keywords | Weight |
---|---|---|
1 | Amazon | 0.87 |
1 | Seattle | 0.65 |
2 | Holidays | 0.78 |
2 | Shopping | 0.67 |
2. Grouping of documents by topics
Each document is mapped to a topic group based on the proportion of the topic group’s weighted keywords that are present in the document.
Example
Document Name | Topic Group | Proportion |
---|---|---|
Doc1.txt | 1 | 0.87 |
Doc2.txt | | 0.65 |
Doc3.txt | | 0.78 |
Doc4.txt | 2 | 0.67 |
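Once a topic modeling job finishes, the document-to-topic view can be regrouped in a few lines. This sketch assumes the view is available as CSV text with the columns `docname`, `topic`, and `proportion`; those column names and the `group_documents_by_topic` helper are assumptions for illustration, so check them against your job output.

```python
import csv
import io
from collections import defaultdict

def group_documents_by_topic(doc_topics_csv):
    """Map each topic group to the (document, proportion) pairs assigned to it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(doc_topics_csv)):
        # Column names are assumed, not confirmed by the text above.
        groups[int(row["topic"])].append((row["docname"], float(row["proportion"])))
    return dict(groups)

# Two rows taken from the example table above.
sample = "docname,topic,proportion\nDoc1.txt,1,0.87\nDoc4.txt,2,0.67\n"
print(group_documents_by_topic(sample))
```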
Multiple language support
Amazon Comprehend can perform text analysis on German, English, Spanish, Italian, Portuguese, French, Japanese, Korean, Hindi, Arabic, Chinese (simplified), and Chinese (traditional) text. To build applications in other languages, customers can use Amazon Translate to convert the text into a language supported by Amazon Comprehend and then use Amazon Comprehend to perform text analysis. For more details on language support, see the documentation page.
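The translate-then-analyze pattern can be sketched as follows. It assumes boto3 clients for Amazon Comprehend and Amazon Translate are passed in, and uses their `detect_sentiment` and `translate_text` operations; the `SUPPORTED` set mirrors the languages listed above, and `sentiment_any_language` is a hypothetical helper.

```python
# Language codes for the languages listed above (zh = simplified, zh-TW = traditional).
SUPPORTED = {"de", "en", "es", "it", "pt", "fr", "ja", "ko", "hi", "ar", "zh", "zh-TW"}

def sentiment_any_language(comprehend, translate, text, source_language):
    """Detect sentiment, translating to English first when the language is unsupported."""
    if source_language not in SUPPORTED:
        translated = translate.translate_text(
            Text=text,
            SourceLanguageCode=source_language,
            TargetLanguageCode="en",
        )
        text, source_language = translated["TranslatedText"], "en"
    response = comprehend.detect_sentiment(Text=text, LanguageCode=source_language)
    return response["Sentiment"]

# Real usage (requires boto3 and AWS credentials):
#   import boto3
#   sentiment_any_language(boto3.client("comprehend"), boto3.client("translate"),
#                          "Jag älskar den här kameran.", "sv")
```

Translating to English is only one option; any language in the supported list would work as the target.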