
Overview
Data Inspector is an AI tool that automatically identifies entries in any tabular dataset (CSV file) that are likely incorrect.
Simply provide a data table (columns may be text, numeric, or categorical), and ML models will be trained to flag any entry (cell value) that is likely erroneous. Data Inspector returns three CSV files with quality assessments for each entry (cell value) in your dataset, stating: whether the value appears corrupted, how likely the entry is to be correct versus erroneous/corrupted, and an alternative predicted/imputed value expected for that entry.
The Data Inspector audit is especially useful for catching errors in applications involving data entry, measurement error (surveys, sensor noise, etc.), or a Quality Assurance team that spends time reviewing data. AI can inspect your data more systematically, with consistent coverage -- all in a fully automated way!
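The exact schemas of the three returned CSV files are documented in the linked repository; as an illustrative sketch only (the file layout, "True"/score encoding, and column names here are assumptions, not the actual output format), the per-cell corruption flags and correctness scores might be combined like this to surface the most suspect entries:

```python
import csv
import io

def suspect_cells(flags_csv, scores_csv, threshold=0.5):
    """Return (row_index, column) pairs whose value is flagged as corrupted,
    or whose probability of being correct falls below `threshold`.
    The encodings used here are assumptions for illustration."""
    flags = list(csv.DictReader(io.StringIO(flags_csv)))
    scores = list(csv.DictReader(io.StringIO(scores_csv)))
    out = []
    for i, (frow, srow) in enumerate(zip(flags, scores)):
        for col in frow:
            if frow[col] == "True" or float(srow[col]) < threshold:
                out.append((i, col))
    return out

# Toy stand-ins for two of the three returned files (one row per dataset row,
# one column per inspected column).
flags_csv = "age,income\nFalse,True\nFalse,False\n"
scores_csv = "age,income\n0.95,0.10\n0.40,0.90\n"

print(suspect_cells(flags_csv, scores_csv))
# -> [(0, 'income'), (1, 'age')]
```

Cells flagged this way can then be reviewed manually or replaced with the imputed values from the third output file.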
Documentation and examples: https://github.com/cleanlab/aws-marketplace/
Highlights
- Data Inspector works for any standard tabular dataset (including columns that are: text, numeric, or categorical — with missing values allowed). It trains state-of-the-art ML models to automatically detect any erroneous values in the dataset.
- Documentation and example usage notebooks for the latest version are available here: https://github.com/cleanlab/aws-marketplace/
- Cleanlab invents novel solutions to assess and improve data quality for applications with messy real-world data. For transparency, many of our algorithms are published in top-tier venues: https://cleanlab.ai/research/ We have also created the most popular library for Data-Centric AI: https://github.com/cleanlab/cleanlab
Details
Pricing
| Dimension | Description | Cost/host/hour |
|---|---|---|
| ml.m5.xlarge Inference (Batch), Recommended | Model inference on the ml.m5.xlarge instance type, batch mode | $5.00 |
| ml.m5.xlarge Inference (Real-Time), Recommended | Model inference on the ml.m5.xlarge instance type, real-time mode | $5.00 |
| ml.m5.xlarge Training, Recommended | Algorithm training on the ml.m5.xlarge instance type | $5.00 |
| ml.p3.2xlarge Inference (Batch) | Model inference on the ml.p3.2xlarge instance type, batch mode | $5.00 |
| ml.p3.16xlarge Inference (Batch) | Model inference on the ml.p3.16xlarge instance type, batch mode | $5.00 |
| ml.m5.24xlarge Inference (Batch) | Model inference on the ml.m5.24xlarge instance type, batch mode | $5.00 |
| ml.p3.2xlarge Inference (Real-Time) | Model inference on the ml.p3.2xlarge instance type, real-time mode | $5.00 |
| ml.p3.16xlarge Inference (Real-Time) | Model inference on the ml.p3.16xlarge instance type, real-time mode | $5.00 |
| ml.m5.24xlarge Inference (Real-Time) | Model inference on the ml.m5.24xlarge instance type, real-time mode | $5.00 |
| ml.p3.2xlarge Training | Algorithm training on the ml.p3.2xlarge instance type | $5.00 |
Vendor refund policy
We do not currently support refunds, but you can cancel your subscription to the service at any time.
Legal
Vendor terms and conditions
Content disclaimer
Delivery details
Amazon SageMaker algorithm
An Amazon SageMaker algorithm is a machine learning model that requires your training data to make predictions. Use the included training algorithm to generate your unique model artifact. Then deploy the model on Amazon SageMaker for real-time inference or batch processing. Amazon SageMaker is a fully managed platform for building, training, and deploying machine learning models at scale.
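The train-then-deploy flow above can be sketched with boto3. This is an assumed outline, not the vendor's official snippet: the algorithm ARN, IAM role, and S3 URIs are placeholders you would replace with your own, and only the request-building helper runs without AWS credentials.

```python
def build_training_job_request(job_name, algorithm_arn, role_arn,
                               train_s3_uri, output_s3_uri,
                               instance_type="ml.m5.xlarge"):
    """Assemble a CreateTrainingJob request for a Marketplace algorithm.
    All ARNs and S3 URIs are caller-supplied placeholders."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "AlgorithmName": algorithm_arn,  # Marketplace algorithm ARN
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "training",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
            }},
            "ContentType": "text/csv",  # matches the listing's input MIME type
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,  # ml.m5.xlarge is the recommended type
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

def run(request):
    # Requires AWS credentials and a subscription to the algorithm; not executed here.
    import boto3
    sagemaker = boto3.client("sagemaker")
    sagemaker.create_training_job(**request)
    # Once training completes, create a model from the resulting artifact and
    # run a batch transform job or deploy a real-time endpoint.
```

The same flow can be driven more concisely with the SageMaker Python SDK's `AlgorithmEstimator`, as shown in the example notebooks linked above.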
Version release notes
Automatically detect potential errors in any column of your tabular dataset.
Additional details
Inputs
- Summary
Your data should be in a CSV file with a header row containing column names. If your data contains an index column, specify it via the index_col hyperparameter; otherwise it is assumed there is no index column.
By default, all categorical and numeric columns are inspected for issues. To inspect only specific columns, pass them as a list to the columns_to_inspect hyperparameter. Text columns that cannot be inspected are skipped automatically.
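SageMaker hyperparameters are passed as a dict of strings, so a list value such as columns_to_inspect presumably needs to be serialized. A hedged sketch (the exact encoding expected by the algorithm is an assumption; check the linked example notebooks):

```python
import json

def build_hyperparameters(index_col=None, columns_to_inspect=None):
    """Encode Data Inspector hyperparameters as the string-valued dict
    SageMaker expects. JSON-encoding the column list is an assumption;
    see the vendor's example notebooks for the exact format."""
    hp = {}
    if index_col is not None:
        hp["index_col"] = str(index_col)
    if columns_to_inspect is not None:
        hp["columns_to_inspect"] = json.dumps(columns_to_inspect)
    return hp

print(build_hyperparameters(index_col=0, columns_to_inspect=["age", "income"]))
# -> {'index_col': '0', 'columns_to_inspect': '["age", "income"]'}
```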
- Input MIME type
- text/csv
Input data descriptions
The following table describes supported input data fields for real-time inference and batch transform.
| Field name | Description | Constraints | Required |
|---|---|---|---|
| Dataset | Each row in the input data must represent a single example. Columns may contain numeric, categorical, or text (arbitrary string) values; however, data errors can only be detected in numeric or categorical columns. Data with multiple text columns and with missing values is supported. | Type: FreeText | Yes |
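To illustrate the constraints above, here is a small stdlib-only sketch of writing a conforming input CSV: a header row, mixed numeric/categorical/text columns, and a missing value left as an empty field (the column names are invented for the example):

```python
import csv
import io

# One row per example; the header names the columns.
# Missing values are simply left empty.
rows = [
    {"age": "34", "state": "CA", "notes": "called twice, asked for refund"},
    {"age": "",   "state": "NY", "notes": "new customer"},  # missing age
    {"age": "29", "state": "TX", "notes": ""},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["age", "state", "notes"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Note that the `csv` writer automatically quotes text fields containing commas, keeping free-text columns safe to include.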
Support
Vendor support
For questions/support, please email support@cleanlab.ai. Free trials and subscription plans are available! Email us for more details.
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.
Customer reviews
- "A decent Large Language Model, if you are keen on tracking your responses with scores."
- "Cleanlab: Best ML Modules Optimizer"
- "Powerful label-cleaning with a slight learning curve"
  - Pros: Seamless pandas integration (working directly on DataFrames makes it trivial to plug Cleanlab into existing preprocessing pipelines); clear, example-driven docs (the step-by-step tutorials helped me get up and running in under an hour).
  - Cons: Performance on very large datasets; label-error detection can be slow without additional tuning or sampling.
  - Benefits: catch mistakes early, before they poison training, so models learn from clean, reliable data; streamline data audits, turning hours of manual review into minutes of focused corrections; boost final performance, since models trained on higher-quality labels consistently deliver better accuracy and robustness.
  - Overall, Cleanlab empowers me to maintain a trustworthy, production-ready dataset with far less effort, and to iterate on models faster and with greater confidence.