
Overview
Data Inspector is an AI tool that automatically identifies entries in any tabular dataset (CSV file) that are likely incorrect.
Simply provide a data table (columns may be text, numeric, or categorical), and ML models will be trained to flag any entry (cell value) that is likely erroneous. Data Inspector returns three CSV files with quality assessments for each entry (cell value) in your dataset, stating: whether the value appears corrupted, how likely the entry is to be correct versus erroneous/corrupted, and an alternative predicted/imputed value expected for that entry.
The Data Inspector audit is especially useful for catching errors in applications involving data entry, measurement error (surveys, sensor noise, etc.), or a Quality Assurance team that spends time reviewing data. AI can inspect your data more systematically, with consistent coverage -- all in a fully automated way!
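The exact schemas of the three returned CSV files are documented in the linked repository; as an illustrative sketch only (the file layout, "True"/score encoding, and column names here are assumptions, not the actual output format), the per-cell corruption flags and correctness scores might be combined like this to surface the most suspect entries:

```python
import csv
import io

def suspect_cells(flags_csv, scores_csv, threshold=0.5):
    """Return (row_index, column) pairs whose value is flagged as corrupted,
    or whose probability of being correct falls below `threshold`.
    The encodings used here are assumptions for illustration."""
    flags = list(csv.DictReader(io.StringIO(flags_csv)))
    scores = list(csv.DictReader(io.StringIO(scores_csv)))
    out = []
    for i, (frow, srow) in enumerate(zip(flags, scores)):
        for col in frow:
            if frow[col] == "True" or float(srow[col]) < threshold:
                out.append((i, col))
    return out

# Toy stand-ins for two of the three returned files (one row per dataset row,
# one column per inspected column).
flags_csv = "age,income\nFalse,True\nFalse,False\n"
scores_csv = "age,income\n0.95,0.10\n0.40,0.90\n"

print(suspect_cells(flags_csv, scores_csv))
# -> [(0, 'income'), (1, 'age')]
```

Cells flagged this way can then be reviewed manually or replaced with the imputed values from the third output file.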
Documentation and examples: https://github.com/cleanlab/aws-marketplace/
Highlights
- Data Inspector works for any standard tabular dataset (including columns that are: text, numeric, or categorical — with missing values allowed). It trains state-of-the-art ML models to automatically detect any erroneous values in the dataset.
- Documentation and example usage notebooks for the latest version are available here: https://github.com/cleanlab/aws-marketplace/
- Cleanlab invents novel solutions to assess and improve data quality for applications with messy real-world data. For transparency, many of our algorithms are published in top-tier venues: https://cleanlab.ai/research/ We have also created the most popular library for Data-Centric AI: https://github.com/cleanlab/cleanlab
Details
Pricing
| Dimension | Description | Cost/host/hour |
|---|---|---|
| ml.m5.xlarge Inference (Batch), Recommended | Model inference on the ml.m5.xlarge instance type, batch mode | $5.00 |
| ml.m5.xlarge Inference (Real-Time), Recommended | Model inference on the ml.m5.xlarge instance type, real-time mode | $5.00 |
| ml.m5.xlarge Training, Recommended | Algorithm training on the ml.m5.xlarge instance type | $5.00 |
| ml.p3.2xlarge Inference (Batch) | Model inference on the ml.p3.2xlarge instance type, batch mode | $5.00 |
| ml.p3.16xlarge Inference (Batch) | Model inference on the ml.p3.16xlarge instance type, batch mode | $5.00 |
| ml.m5.24xlarge Inference (Batch) | Model inference on the ml.m5.24xlarge instance type, batch mode | $5.00 |
| ml.p3.2xlarge Inference (Real-Time) | Model inference on the ml.p3.2xlarge instance type, real-time mode | $5.00 |
| ml.p3.16xlarge Inference (Real-Time) | Model inference on the ml.p3.16xlarge instance type, real-time mode | $5.00 |
| ml.m5.24xlarge Inference (Real-Time) | Model inference on the ml.m5.24xlarge instance type, real-time mode | $5.00 |
| ml.p3.2xlarge Training | Algorithm training on the ml.p3.2xlarge instance type | $5.00 |
Vendor refund policy
We do not currently support refunds, but you can cancel your subscription to the service at any time.
Legal
Vendor terms and conditions
Content disclaimer
Delivery details
Amazon SageMaker algorithm
An Amazon SageMaker algorithm is a machine learning model that requires your training data to make predictions. Use the included training algorithm to generate your unique model artifact. Then deploy the model on Amazon SageMaker for real-time inference or batch processing. Amazon SageMaker is a fully managed platform for building, training, and deploying machine learning models at scale.
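The train-then-deploy flow above can be sketched with boto3. This is an assumed outline, not the vendor's official snippet: the algorithm ARN, IAM role, and S3 URIs are placeholders you would replace with your own, and only the request-building helper runs without AWS credentials.

```python
def build_training_job_request(job_name, algorithm_arn, role_arn,
                               train_s3_uri, output_s3_uri,
                               instance_type="ml.m5.xlarge"):
    """Assemble a CreateTrainingJob request for a Marketplace algorithm.
    All ARNs and S3 URIs are caller-supplied placeholders."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "AlgorithmName": algorithm_arn,  # Marketplace algorithm ARN
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "training",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
            }},
            "ContentType": "text/csv",  # matches the listing's input MIME type
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,  # ml.m5.xlarge is the recommended type
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

def run(request):
    # Requires AWS credentials and a subscription to the algorithm; not executed here.
    import boto3
    sagemaker = boto3.client("sagemaker")
    sagemaker.create_training_job(**request)
    # Once training completes, create a model from the resulting artifact and
    # run a batch transform job or deploy a real-time endpoint.
```

The same flow can be driven more concisely with the SageMaker Python SDK's `AlgorithmEstimator`, as shown in the example notebooks linked above.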
Version release notes
Automatically detect potential errors in any column of your tabular dataset.
Additional details
Inputs
- Summary
Your data should be in a CSV file with a header row containing column names. If your data contains an index column, specify it via the index_col hyperparameter; otherwise it is assumed there is no index column.
By default, all categorical and numeric columns are inspected for issues. To inspect only specific columns, pass them as a list to the columns_to_inspect hyperparameter. Text columns that cannot be inspected are skipped automatically.
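SageMaker hyperparameters are passed as a dict of strings, so a list value such as columns_to_inspect presumably needs to be serialized. A hedged sketch (the exact encoding expected by the algorithm is an assumption; check the linked example notebooks):

```python
import json

def build_hyperparameters(index_col=None, columns_to_inspect=None):
    """Encode Data Inspector hyperparameters as the string-valued dict
    SageMaker expects. JSON-encoding the column list is an assumption;
    see the vendor's example notebooks for the exact format."""
    hp = {}
    if index_col is not None:
        hp["index_col"] = str(index_col)
    if columns_to_inspect is not None:
        hp["columns_to_inspect"] = json.dumps(columns_to_inspect)
    return hp

print(build_hyperparameters(index_col=0, columns_to_inspect=["age", "income"]))
# -> {'index_col': '0', 'columns_to_inspect': '["age", "income"]'}
```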
- Input MIME type
- text/csv
Input data descriptions
The following table describes supported input data fields for real-time inference and batch transform.
| Field name | Description | Constraints | Required |
|---|---|---|---|
| Dataset | Each row in the input data must represent a single example. Columns may contain numeric, categorical, or text (arbitrary string) values; however, data errors can only be detected in numeric or categorical columns. Data with multiple text columns and with missing values is supported. | Type: FreeText | Yes |
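To illustrate the constraints above, here is a small stdlib-only sketch of writing a conforming input CSV: a header row, mixed numeric/categorical/text columns, and a missing value left as an empty field (the column names are invented for the example):

```python
import csv
import io

# One row per example; the header names the columns.
# Missing values are simply left empty.
rows = [
    {"age": "34", "state": "CA", "notes": "called twice, asked for refund"},
    {"age": "",   "state": "NY", "notes": "new customer"},  # missing age
    {"age": "29", "state": "TX", "notes": ""},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["age", "state", "notes"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Note that the `csv` writer automatically quotes text fields containing commas, keeping free-text columns safe to include.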
Support
Vendor support
For questions/support, please email support@cleanlab.ai. Free trials and subscription plans are available! Email us for more details.
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.
Customer reviews
- "A decent Large Language Model, if you are keen on tracking your responses with scores."
- "Cleanlab: Best ML Modules Optimizer"
- "Powerful label-cleaning with a slight learning curve"
  - Pros: Seamless pandas integration (working directly on DataFrames makes it trivial to plug Cleanlab into existing preprocessing pipelines); clear, example-driven docs (the step-by-step tutorials helped me get up and running in under an hour).
  - Cons: Performance on very large datasets; label-error detection can be slow without additional tuning or sampling.
  - Benefits: catch mistakes early, before they poison training, so models learn from clean, reliable data; streamline data audits, turning hours of manual review into minutes of focused corrections; boost final performance, since models trained on higher-quality labels consistently deliver better accuracy and robustness.
  - Overall, Cleanlab empowers me to maintain a trustworthy, production-ready dataset with far less effort, and to iterate on models faster and with greater confidence.