Data Cleansing AI Agent

H2 jump links will appear on live website

H3 jump links will appear on live website

Overview

The Data Cleansing Agent is an AI-powered assistant that helps users profile, clean, and standardize their database data without writing code. Users describe what they want in plain English -- such as "find and fix postal code issues in the customers table" or "standardize date formats in the orders table to ISO" -- and the agent autonomously discovers issues, creates safe working copies of the data, applies the appropriate fixes, and verifies the results. It handles common supply chain data problems including mixed date formats, inconsistent country codes, Excel-corrupted postal codes, missing values, outliers, and messy text fields. It expects a connected database with one or more tables as input. The output is a set of cleaned copies of their tables in the database which users can immediately use for Cosmic Frog model building, reporting, or further analysis, while the original data is preserved untouched for comparison or rollback.

This documentation describes how this specific agent works and can be configured, including walking through multiple examples. Please see the “AI Agents: Architecture and Components” Help Center article if you are interested in understanding how the Optilogic AI Agents work at a detailed level.

Why It's Useful

Cleaning and standardizing data for supply chain modeling typically requires significant manual effort -- writing SQL queries, inspecting column values, fixing formatting issues one at a time, and verifying results. The Data Cleansing Agent streamlines this process by turning a single natural language prompt into a full profiling, cleaning, and verification workflow.

Enhances productivity by automating repetitive data cleaning tasks that would otherwise require custom SQL or manual inspection.
Reduces errors by using purpose-built tools that handle edge cases (e.g., Excel scientific notation corruption, leading-zero preservation in postal codes) that are easy to miss in manual cleanup.
Preserves data integrity by always working on copies -- original tables are never modified, so there's no risk of data loss.
Adapts to any database schema -- no fixed column name requirements. The agent discovers the structure automatically and applies the right tools.

What's Included

Key Capabilities:

Data Profiling -- Discovers tables, schemas, and column types. Detects date format inconsistencies, missing/null values, statistical outliers, location data problems, and postal code corruption.
Date Standardization -- Detects mixed date formats (ISO, US, European, and many variants) and converts them to a consistent target format.
Location Standardization -- Standardizes country names/codes to ISO 3166 (alpha-2, alpha-3, or full name), cleans city names, and applies smart title casing to addresses.
Postal Code Repair -- Repairs Excel-corrupted postal codes (scientific notation), restores leading zeros, removes placeholder values (N/A, XXXXX, etc.), cleans formatting, extracts postal codes from address strings, and validates against real postal code databases.
String Normalization -- Trims whitespace, standardizes casing, removes extra spaces, and handles punctuation and unicode cleanup.
Unit Conversion -- Converts measurement units (e.g., pounds to kilograms, inches to centimeters) using a reference table.
Missing Data Handling -- Fills missing values using strategies such as constant, mean, median, mode, forward fill, or backward fill. Supports grouped fills (e.g., median by customer segment).
Outlier Detection -- Detects outliers using IQR, Z-Score, or Modified Z-Score methods. Can handle outliers by nullifying, capping, replacing with statistics, or removing rows.
Type Casting -- Converts column data types (INTEGER, NUMERIC, TEXT, BOOLEAN, DATE, TIMESTAMP) with configurable error handling.
Custom SQL -- Executes any SQL statement for operations not covered by the specialized tools, such as creating views, computed columns, or complex joins.

Skills:

How To Use It

The agent can be accessed through the Run Utility task in DataStar, see also the screenshots below. The key inputs are:

Database -- Select the database to perform data cleansing on. All database types are supported (Cosmic Frog, DataStar, or Postgres).
Task Description -- Describe what you want the agent to do. This is a free-text field (up to 10,000 characters) where you write your request in natural language. Be as specific as possible about tables, columns, and desired outcomes.

The Task Description field includes placeholder examples to help you get started:

List all tables and analyze their data quality
Standardize date formats in the orders table to ISO
Find and fix postal code issues in the customers table
Normalize country codes to ISO alpha-2 format

Optionally, users can:

Enable Verbose Output (toggle to "Detailed") to see the full list of available tools and detailed execution information. Default is "Concise".

A DataStar macro with a Run Utility task added to it

In the Select Utility part of the Configuration tab of the Utility task, users can select which Agent/Utility they want to use. Here we use the Data Cleansing Agent.

After selecting the Utility, it needs to be configured in the Configure Utility part of the Configuration tab. Select the database to run the Agent on, and enter a task description. Optionally toggle the Verbose Output to Detailed.

Suggested workflow:

Start with a profiling prompt to understand your data quality issues (e.g., "List all tables and analyze their data quality").
Review the agent's findings to understand what needs to be fixed. This information can be found in the Job Log of the task run in the Run Manager application, see the next section "Reading the Job Log" for more details.
Run a transformation prompt targeting the specific issues found (e.g., "Standardize date formats in the orders table to ISO and fix postal code issues in the customers table").
Review the summary the agent provides, which is the Agent Response section in the Job Log (see next section for details) -- it reports what was changed, how many rows were affected, and any remaining issues.
Query the clean_* tables in the database to verify the results. The original tables remain untouched.

After the run, the agent produces a structured summary of everything it did, including metrics on rows affected, issues found, and issues fixed; see the next section where this Job Log is described in more detail. The cleaned data is persisted as clean_* tables in the database (e.g., clean_customers, clean_shipments).

Reading the Job Log

After a run completes, the Job Log in Run Manager provides a detailed trace of every step the agent took. Understanding the log structure helps users verify what happened and troubleshoot if needed. The log follows a consistent structure from start to finish.

The Run Manager application is opened and the Data Cleansing Agent run is selected in the list. On the right-hand side, the 4th icon can be clicked on to show the Job Log of the selected run. Use the hamburger icon at the right top to download the log to a .txt file.

Header

Every log begins with a banner showing the database name and the exact prompt that was submitted.

Connection & Setup

The agent validates the database connection and initializes itself with its full set of tools. If Verbose Output is set to "Detailed", the log also prints the system prompt and tool list at this stage.

Planning Phase

For non-trivial tasks, the agent creates a strategic execution plan before taking action. This appears as a PlanningSkill tool call, followed by an AI Response box containing a structured plan with numbered steps, an objective, approach, and skill mapping. The plan gives users visibility into the agent's intended approach before it begins working.

Tool Calls and Thinking

The bulk of the log shows the agent calling its specialized tools one at a time. Each tool call appears in a bordered box showing the tool name. Between tool calls, the agent's reasoning is shown in Thinking boxes -- explaining what it learned from the previous tool, what it plans to do next, and why. These thinking sections are among the most useful parts of the log for understanding the agent's decision-making.

The agent may call many tools in sequence depending on the complexity of the task. Profiling-only prompts typically involve discovery tools (schema, missing data, date issues, location issues, outliers). Cleanup prompts add transformation tools (ensure_clean_table, standardize_country_codes, standardize_date_column, etc.).

Occasionally a Memory Action Applied entry appears between steps -- this is the agent recording context for its own use and can be ignored.

Error Recovery

If the agent encounters a validation error on a tool call (e.g., a column stored as TEXT when a numeric type was expected, or a missing parameter), the log shows the error and the agent's automatic adjustment. The agent reasons about the failure in a Thinking block and retries with corrected parameters. Users do not need to intervene.

Agent Response

At the end of the run, the agent produces a structured summary of everything it discovered or changed. This is the most important section of the log for understanding outcomes:

For profiling prompts, this section reports what was found across all tables -- schema details, missing data percentages, date format inconsistencies, location quality issues, numeric anomalies, and recommendations for next steps. For cleanup prompts, it reports which tables were modified, what transformations were applied, how many rows were affected, and confirmation that originals are preserved.

Execution Summary

The log ends with runtime statistics and the full list of skills that were available to the agent:

Input Requirements

What the agent expects in your database:

The agent works with any tables in the selected database. There are no fixed column name requirements -- the agent discovers the schema automatically. However, for best results:

Tables should be loaded in the database before running the agent.
If you need unit conversion, include a uom (units of measure) reference table with columns for unit name, symbol, type, and conversion ratio.
Postal code tools work best when postal code columns are TEXT type. The agent will cast INT/BIGINT columns to TEXT automatically if needed.

Output Description

Tips & Notes

Start with profiling. Before asking the agent to fix anything, run a profiling prompt like "List all tables and analyze their data quality" to understand what's there. The agent will not auto-fix issues unless you explicitly ask it to.
Be specific. "Fix postal codes" is good; "Fix postal codes in the customers and shipments tables -- repair Excel corruption, remove placeholders, and clean formatting" is better. The more specific the prompt, the more precise the results.
Original data is safe. The agent always creates clean_* copies before making changes. Your source tables are never modified. You can re-run the agent as many times as needed.
Order matters for postal codes. The agent handles this automatically, but for reference: Excel corruption must be repaired before formatting cleanup, and formatting must be cleaned before placeholder removal.
Grouped fills are supported. For missing data, you can ask for strategies like "fill missing credit_limit using the median grouped by customer_segment" and the agent will handle the grouped calculation.
Custom SQL is available. If you need something the specialized tools don't cover -- like creating views, adding computed columns, or complex joins -- the agent can execute arbitrary SQL as a fallback.
Runtime varies based on the number of tables, columns, and the complexity of the prompt. Simple profiling tasks complete quickly; full database-wide cleanup across many tables will be longer. Expect at least a few minutes for multi-table operations.
Just like many other DataStar tasks, it is possible to run multiple tasks in parallel with the Data Cleansing Agent.
Additional info on the run can be found in Run Manager > Job Log after the run finishes. This includes steps that the agent takes, tools it calls, as well as a summary of work.
The Run Utility task also offers the ability for users to set Run Configuration (optional): Tags for easy filtering in Run Manager, Timeout for maximum run duration, and Resource Size for different memory/CPU allocations.

Examples

Example 1: Exploration (Simple)

A user wants to understand what data is in their database before deciding what to clean.

Database: Supply Chain Dataset

Task Description: List all tables in the database and show their schemas

What happens: The agent calls get_database_schema for all tables and exits with a structured report.

Output:

Requested: List all tables and show schemas.

Discovered (schema 'starburst'):

customers: 25 rows, 11 columns (customer_id, customer_name, email, country, city, ...)
inventory: 40 rows, 9 columns
orders: 35 rows, 9 columns
products: 30 rows, 11 columns
shipments: 37 rows, 13 columns
suppliers: 21 rows, 10 columns
warehouses: 25 rows, 10 columns
uom: 34 rows, 5 columns

...

Total: 12 tables, 405 rows, 112 columns

Example 2: Comprehensive Cleanup (Complex)

A user needs to clean up customer location data before using it in a Cosmic Frog network optimization model.

Database: Supply Chain Dataset

Task Description: Clean the customers table completely: standardize dates to ISO, fix postal codes (Excel corruption + placeholders), standardize country codes to alpha-2, clean city names, and normalize emails to lowercase

What the agent does:

Discovers the customers table schema (11 columns, 30 rows).
Profiles the data and finds:
- 6 different date formats in registration_date
- 11 different country representations (USA, US, United States, usa, etc.)
- 4 Excel-corrupted postal codes in scientific notation (e.g., 9.0021E+04)
- 5 placeholder postal codes (N/A, XXXXX, ?????, -----)
- Inconsistent city casing and whitespace (NEW YORK, los angeles, sydney )
- Mixed-case emails (GLOBAL@EXAMPLE.COM)

Creates clean_customers as a safe working copy.
Applies fixes in the correct order:
- Standardizes all dates to ISO format (YYYY-MM-DD)
- Repairs corrupted postal codes (9.0021E+04 becomes 90021)
- Removes placeholder postal codes (set to NULL)
- Standardizes countries to ISO alpha-2 (USA becomes US, Germany becomes DE)
- Cleans city names (los angeles becomes Los Angeles, sydney becomes Sydney)
- Normalizes emails to lowercase

Verifies all transformations were applied correctly.

Output:

Completed data cleansing of clean_customers table:

Standardized 30 date values in registration_date to ISO format (YYYY-MM-DD)
Fixed 4 Excel-corrupted postal codes (scientific notation)
Removed 5 placeholder postal codes (set to NULL)
Standardized 30 country codes to ISO alpha-2 format
Cleaned 30 city names (casing, whitespace)
Normalized 30 email addresses to lowercase

All changes applied to clean_customers (original customers table preserved).

The cleaned data is available in the clean_customers table in the database. The original customers table remains untouched.

Example 3: Enterprise Multi-Table Cleanup (End-to-End)

A user with a 14-table enterprise supply chain database needs to clean and standardize all data before building Cosmic Frog models for network optimization and simulation.

Database: Enterprise Supply Chain

Task Description: Perform a complete data cleanup across all tables: standardize all dates to ISO, standardize all country codes to alpha-2, clean all city names, fix all postal codes, and normalize all email addresses to lowercase. Work systematically through each table.

What the agent does: The agent works systematically through all tables -- standardizing dates across 12+ tables, fixing country codes, cleaning city names, repairing postal codes, normalizing emails and status fields, detecting and handling negative values, converting mixed units to metric, validating calculated fields like order totals, and reporting any remaining referential integrity issues. This is the most comprehensive operation the agent can perform.

Output: A detailed summary covering every table touched, every transformation applied, and a final quality scorecard showing the before/after improvement.

Example Prompts

Below are example prompts users can try, organized by category.

Exploration & Profiling

"List all tables in the database and show their schemas"
"Check for date format issues across all tables that have date columns"
"Detect all postal code data quality issues in customers, shipments, suppliers, and warehouses tables"
"Detect outliers in product prices, inventory quantities, and order amounts. Use IQR method."

Simple Fixes

"Standardize all dates in the orders table to ISO format"
"Standardize all country codes in the customers table to ISO alpha-2 format (US, DE, etc)"
"Fill missing processor_fee values in transactions with 0.0"
"Standardize all email addresses to lowercase across employees, customers, suppliers. Also trim whitespace."

Complex Multi-Step Tasks

"Clean all postal codes in the customers table: fix Excel corruption, remove placeholders, clean formatting. Follow the proper postal code cleaning workflow."
"Standardize ALL date columns across orders, transactions, inventory, products, and customers to ISO format"
"Convert all product weights to kilograms and all lengths to centimeters. Some products already have metric, some have imperial, some have both - handle this correctly."
"Fill missing credit_limit values in customers using the median credit_limit grouped by customer_segment"

End-to-End Scenarios

"Clean the customers table completely: standardize dates to ISO, fix postal codes (Excel corruption + placeholders), standardize country codes to alpha-2, clean city names, and normalize emails to lowercase"
"Perform a complete data cleanup across all tables: standardize all dates to ISO, standardize all country codes to alpha-2, clean all city names, fix all postal codes, and normalize all email addresses to lowercase. Work systematically through each table."
"Prepare the entire database for data warehouse ETL: clean all tables, ensure referential integrity, standardize all formats, validate all calculations. Generate final quality report."

Custom SQL

"Create a view called customer_summary that shows customer_id, total number of orders, and total order amount from the orders table"
"Update the shipments table to add a new column called 'is_international' (boolean) that is TRUE when ship_from_country differs from ship_to_country"

Other Helpful Resources

Learn more about Cosmic Frog
Learn more about DataStar
An overview of Optilogic's AI Agents and Utilities
The Model Output Insights AI Agent, also available from within DataStar

‍