Overview

The Data Cleansing Agent is one of Ada’s AI-powered assistants. It helps users profile, clean, and standardize their database data without writing code. Users describe what they want in plain English -- such as "find and fix postal code issues in the customers table" or "standardize date formats in the orders table to ISO" -- and the agent autonomously discovers issues, creates safe working copies of the data, applies the appropriate fixes, and verifies the results. The agent handles common supply chain data problems including mixed date formats, inconsistent country codes, Excel-corrupted postal codes, missing values, outliers, and messy text fields. It expects a connected database with one or more tables as input. The output is a set of cleaned copies of their tables in the database which users can immediately use for Cosmic Frog model building, reporting, or further analysis, while the original data is preserved untouched for comparison or rollback.

This documentation describes how this specific agent works and can be configured, including walking through multiple examples. Please see the “AI Agents: Architecture and Components” Help Center article if you are interested in understanding how the Optilogic AI Agents work at a detailed level.

Why It's Useful

Cleaning and standardizing data for supply chain modeling typically requires significant manual effort -- writing SQL queries, inspecting column values, fixing formatting issues one at a time, and verifying results. The Data Cleansing Agent streamlines this process by turning a single natural language prompt into a full profiling, cleaning, and verification workflow.

  1. Enhances productivity by automating repetitive data cleaning tasks that would otherwise require custom SQL or manual inspection.
  2. Reduces errors by using purpose-built tools that handle edge cases (e.g., Excel scientific notation corruption, leading-zero preservation in postal codes) that are easy to miss in manual cleanup.
  3. Preserves data integrity by always working on copies -- original tables are never modified, so there's no risk of data loss.
  4. Adapts to any database schema -- no fixed column name requirements. The agent discovers the structure automatically and applies the right tools.

What's Included

Key Capabilities:

  1. Database Exploration -- Discovers tables and schemas, inspects sample values and value frequencies, and searches for specific values across columns.
  2. Date Standardization -- Detects mixed date formats (including named months and partial dates) and converts them to ISO, US, European, or other custom formats. Handles ambiguous dates like '01/09/2024' (Jan 9 vs. Sep 1) via a configurable DMY/MDY interpretation policy and reports the inferred order back to the user.
  3. Location Standardization -- Standardizes country names/codes to ISO 3166 (alpha-2, alpha-3, or full name), cleans city names, and applies smart title casing to addresses.
  4. Postal Code Repair -- Repairs Excel-corrupted postal codes (scientific notation), restores leading zeros, removes placeholder values (N/A, XXXXX, etc.), cleans formatting, extracts postal codes from address strings, and validates against real postal code databases.
  5. String Normalization -- Trims whitespace, standardizes casing, removes extra spaces, and handles punctuation and unicode cleanup.
  6. Unit Conversion -- Converts measurement units (e.g., pounds to kilograms, inches to centimeters) using a reference table.
  7. Missing Data Handling -- Fills missing values using strategies such as constant, mean, median, mode, forward fill, or backward fill. Supports grouped fills (e.g., median by customer segment).
  8. Outlier Detection -- Detects outliers using IQR, Z-Score, or Modified Z-Score methods. Can handle outliers by nullifying, capping, replacing with statistics, or removing rows.
  9. Type Casting -- Converts column data types (INTEGER, NUMERIC, TEXT, BOOLEAN, DATE, TIMESTAMP, TIMESTAMPTZ) with configurable error handling (skip / null / strict).
  10. Custom SQL -- Executes any SQL statement for operations not covered by the specialized tools, such as creating views, computed columns, or complex joins.
  11. Deduplication (via specialized sub-agent) -- Finds and resolves duplicate records. Supports exact matching, fuzzy matching, cross-table matching, and transaction-row deduplication. Creates golden-key mappings on a master table and propagates them to all referencing transaction tables.
  12. Referential Integrity -- Surfaces primary-key duplicates/nulls and foreign-key orphan rows, presents realistic repair options (null the reference, delete orphans, insert stub parents, etc.), and applies the chosen fix on clean_* copies only after the user confirms. Reuses the Data Profiler's findings when a recent profiler run exists; otherwise inspects the schema directly.
  13. Report Generation -- Produces a markdown or HTML report summarizing analysis or cleanup work, dispatched to a dedicated report-writer sub-agent that chooses the right format based on the user's framing.
  14. Value Replacement -- Replaces specific values in a column (e.g., 'Cancelled' -> 'Canceled', 'N/A' -> NULL) for one or many columns in a batch. Useful for collapsing inconsistent boolean encodings (Y/N vs. Yes/No vs. 1/0) and bulk sentinel-to-NULL conversion.

Tools:

How To Use It

The agent can be accessed on the next generation Optilogic platform by chatting with Ada and through the Run AI Agent task in DataStar. Both ways will be explained, via Ada first, then the DataStar workflow, followed by an overview of the main differences between the 2 methods.

Using the Data Cleansing Agent through Chat with Ada

It is recommended to be somewhat familiar with Ada before diving into this content. Please see the Getting Started with Ada & Agentic AI article, and in particular its How to Use Ada section.

Once logged into the next-generation Optilogic platform at https://ai.optilogic.app, you can start chatting with Ada leveraging the Data Cleansing Agent right away from the central part of the Home page.

  1. If unsure how to start, or if you are just exploring, you can use one of the example prompts at the top to get going.
  2. Underneath the prompt textbox, we can (from left to right):
    1. Select the interaction Style to use. These behave differently in how verbose the agent's answers will be and how often user confirmation/feedback will be sought before proceeding.
    2. Connect to Cosmic Frog models, DataStar projects, and Postgres databases. The prompt can refer to these; the agent will create clean_* copies of any tables it modifies.
    3. Choose the Data Cleanser Agent from the agent selector.
  3. Type your question/task into the prompt textbox.
  4. Click on the submit button at the right bottom of hte prompt box.

Regarding how to write good prompts, please note that the general Best Practices, Tips & Tricks, and Current Limitations and Known Behaviors included in the Getting Started with Ada documentation also apply to the Data Cleansing Agent.

After submitting a prompt, the Data Cleansing Agent will start processing and formulating a response:

  1. The full prompt is shown at the top of the conversation.
  2. Underneath the prompt, a status line shows what Ada is doing currently. To see each status update, expand this section by clicking on the caret icon to the left. Clicking on the Status Updates, Tool Calls, or Knowledge tabs shows progressively more detail.
  3. If you want to stop Ada while she is still working on a response (say you realized your prompt is incorrect/incomplete or you have not connected the right database(s)), you can click the Stop button at the right bottom in the prompt box.

The Data Cleansing Agent may ask for feedback before proceeding — for example, when:

  • A date column contains ambiguous values like '01/09/2024' (Jan 9 vs. Sep 1) and the agent wants to confirm DMY vs MDY interpretation.
  • Sentinel values are detected and the agent wants confirmation before nulling them.
  • Duplicate records are found and the agent wants confirmation on the merge strategy.
  • Foreign-key orphans are found and the agent wants to confirm the repair option (null the reference, delete orphans, etc.).
  1. First, Ada explains why she needs some feedback and/or confirmation to continue.
  2. Next, some things Ada noticed while working on the prompt are noted in the lighter texbox. It gives users additional information to help make their decisions on how to proceed.
  3. The feedback required to continue is then listed. Often this takes the form of multiple options: the user can just click on the option they want to go with.
  4. In case none of the options suggested by Ada are suitable, the user can type how they want to proceed for that item in the textbox.
  5. Once all feedback items have been addressed, click on the Send Response button for Ada to continue.
  6. Click on the Stop Agent option in case continuing does not make sense.

When Ada finishes, the final response is presented:

  1. The original prompt is listed at the top of the conversation.
  2. The status line now shows how long the response took. To see full details, expand it.
  3. Any clarifying questions Ada asked along the way and the answers given are shown — click the caret to expand them.
  4. This is the final response where Ada summarizes what she did — which tables were modified, how many rows were affected, and where the cleaned data lives.
  5. The user can continue the conversation by typing any follow-up questions/tasks and submitting them. It is recommended to start a new conversation if switching to an unrelated task, a different agent, different connected databases, or setting a different interaction style.

For completeness, the cleaned data shows up in the connected database as clean_* table copies — for example, clean_customers, clean_orders — with the originals preserved untouched for comparison or rollback:

Using the Data Cleansing Agent within DataStar

In DataStar, the Data Cleansing Agent is accessed by using a Run AI Agent task, see also the screenshots below. The key inputs are:

  • Database -- Select the database to perform data cleansing on. All database types are supported (Cosmic Frog, DataStar, or Postgres).
  • Task Description -- Describe what you want the agent to do. This is a free-text field (up to 10,000 characters) where you write your request in natural language. Be as specific as possible about tables, columns, and desired outcomes.

The Task Description field includes placeholder examples to help you get started:

  • List all tables and analyze their data quality
  • Standardize date formats in the orders table to ISO
  • Find and fix postal code issues in the customers table
  • Normalize country codes to ISO alpha-2 format

Optionally, users can:

  • Enable Verbose Output (toggle to "Detailed") to see the full list of available tools and detailed execution information. Default is "Concise".
  • Point the Agent to a Knowledge Folder. Any .md and .txt files placed here will be read by the agent and used as additional context to perform the task. Think of column descriptions, known data conversions, etc.
A DataStar macro with a Run AI Agent task added to it
In the Select Utility part of the Configuration tab of a Run AI Agent task, users can select which Agent they want to use. Here we use the Data Cleansing Agent.
After selecting the Agent, it needs to be configured in the Configure Utility part of the Configuration tab. Select the database to run the Agent on, and enter a task description. Optionally toggle the Verbose Output to Detailed and/or set a Knowledge Folder.

Not shown in the screenshots above, there is also a Run Configuration section, where users can add Tags to facilitate finding job runs, set a Timeout for the task, and set the Resource Size to use. Note that for most Run AI Agent tasks, the Resource Size will need to be set to XS or higher.

Suggested workflow:

  1. Start with a profiling prompt to understand your data quality issues (e.g., "List all tables and analyze their data quality").
  2. Review the agent's findings to understand what needs to be fixed. This information can be found in the Task Log, see the section "Reading the Task Log" below for more details.
  3. Run a transformation prompt targeting the specific issues found (e.g., "Standardize date formats in the orders table to ISO and fix postal code issues in the customers table").
  4. Review the summary the agent provides, which is the Agent Response section in the Task Log -- it reports what was changed, how many rows were affected, and any remaining issues.
  5. Query the clean_* tables in the database to verify the results. The original tables remain untouched.

After the run, the agent produces a structured summary of everything it did, including metrics on rows affected, issues found, and issues fixed; see the next section where this Job Log is described in more detail. The cleaned data is persisted as clean_* tables in the database (e.g., clean_customers, clean_shipments).

Data Cleansing Agent Usage: Chat UI vs DataStar

There are a few differences to keep in mind when running the Data Cleansing Agent either through chatting with Ada or from within DataStar:

  1. Chatting with Ada is conversational and interactive. It supports clarification questions, mid-run feedback, follow-up prompts in the same conversation, and on-the-fly database connection changes. Best for exploration, one-off cleanup, and tasks where the cleansing strategy isn't fully decided up front.
  2. DataStar is batch and repeatable. It runs as a Run AI Agent task inside a Macro, ideal for cleanup that needs to be re-run on a schedule or as part of a larger pipeline. Mid-run feedback is not available — the prompt has to be self-contained, and any clarifications surface in the Task Log only after the task finishes.
  3. Both produce the same artifacts: clean_* table copies in the connected database, and optionally a markdown or HTML report if requested.

Recommendation: Use the chat UI to develop and refine a prompt, then transfer the working prompt into a DataStar Run AI Agent task once you want the workflow to become repeatable.

Reading the Task Log

After a run completes, the Task Log provides a detailed trace of every step the agent took. Understanding the log structure helps users verify what happened and troubleshoot if needed. The log follows a consistent structure from start to finish.

The Task Log for the Run AI Agent task is expanded and we are viewing thelog for a run that completed on 6/2. Use the copy icon at the right top to download the log to a .txt file.

Header

Every log begins with a banner showing the database name and the exact prompt that was submitted.

Connection & Setup

The agent validates the database connection and initializes itself with its full set of tools. If Verbose Output is set to "Detailed", the log also prints the system prompt and tool list at this stage.

Planning Phase

For non-trivial tasks, the agent creates a strategic execution plan before taking action. This appears as a PlanningSkill tool call, followed by an AI Response box containing a structured plan with numbered steps, an objective, approach, and skill mapping. The plan gives users visibility into the agent's intended approach before it begins working.

Tool Calls and Thinking

The bulk of the log shows the agent calling its specialized tools one at a time. Each tool call appears in a bordered box showing the tool name. Between tool calls, the agent's reasoning is shown in Thinking boxes -- explaining what it learned from the previous tool, what it plans to do next, and why. These thinking sections are among the most useful parts of the log for understanding the agent's decision-making.

The agent may call many tools in sequence depending on the complexity of the task. Profiling-only prompts typically involve discovery tools (schema, missing data, date issues, location issues, outliers). Cleanup prompts add transformation tools (ensure_clean_table, standardize_country_codes, standardize_date_column, etc.).

Occasionally a Memory Action Applied entry appears between steps -- this is the agent recording context for its own use and can be ignored.

Error Recovery

If the agent encounters a validation error on a tool call (e.g., a column stored as TEXT when a numeric type was expected, or a missing parameter), the log shows the error and the agent's automatic adjustment. The agent reasons about the failure in a Thinking block and retries with corrected parameters. Users do not need to intervene.

Agent Response

At the end of the run, the agent produces a structured summary of everything it discovered or changed. This is the most important section of the log for understanding outcomes:

For profiling prompts, this section reports what was found across all tables -- schema details, missing data percentages, date format inconsistencies, location quality issues, numeric anomalies, and recommendations for next steps. For cleanup prompts, it reports which tables were modified, what transformations were applied, how many rows were affected, and confirmation that originals are preserved.

Execution Summary

The log ends with runtime statistics and the full list of skills that were available to the agent:

Input Requirements

What the agent expects in your database:

The agent works with any tables in the selected database. There are no fixed column name requirements -- the agent discovers the schema automatically. However, for best results:

  • Tables should be loaded in the database before running the agent.
  • If you need unit conversion, include a uom (units of measure) reference table with columns for unit name, symbol, type, and conversion ratio.
  • Postal code tools work best when postal code columns are TEXT type. The agent will cast INT/BIGINT columns to TEXT automatically if needed.

Output Description

Tips & Notes

  • Start with profiling. Before asking the agent to fix anything, run a profiling prompt like "List all tables and analyze their data quality" to understand what's there. The agent will not auto-fix issues unless you explicitly ask it to.
  • Be specific. "Fix postal codes" is good; "Fix postal codes in the customers and shipments tables -- repair Excel corruption, remove placeholders, and clean formatting" is better. The more specific the prompt, the more precise the results.
  • Original data is safe. The agent always creates clean_* copies before making changes. Your source tables are never modified. You can re-run the agent as many times as needed.
  • Order matters for postal codes. The agent handles this automatically, but for reference: Excel corruption must be repaired before formatting cleanup, and formatting must be cleaned before placeholder removal.
  • Grouped fills are supported. For missing data, you can ask for strategies like "fill missing credit_limit using the median grouped by customer_segment" and the agent will handle the grouped calculation.
  • Custom SQL is available. If you need something the specialized tools don't cover -- like creating views, adding computed columns, or complex joins -- the agent can execute arbitrary SQL as a fallback.
  • Runtime varies based on the number of tables, columns, and the complexity of the prompt. Simple profiling tasks complete quickly; full database-wide cleanup across many tables will be longer. Expect at least a few minutes for multi-table operations.
  • Just like many other DataStar tasks, it is possible to run multiple tasks in parallel with the Data Cleansing Agent.
  • Additional info on the run can be found in Run Manager > Job Log after the run finishes. This includes steps that the agent takes, tools it calls, as well as a summary of work.
  • The Run Utility task also offers the ability for users to set Run Configuration (optional): Tags for easy filtering in Run Manager, Timeout for maximum run duration, and Resource Size for different memory/CPU allocations.
  • Duplicates are handled by a dedicated sub-agent. Any prompt that mentions finding, detecting, or merging duplicate records ("find duplicate customers", "deduplicate the suppliers table") is automatically routed to the Deduplication sub-agent. You don't need to invoke it explicitly -- the parent agent decides.
  • Use the Data Profiler pipeline for deep analysis. To run a full database profile, ask for it by name (e.g., "run the data profiler pipeline"). It's a separate, longer-running workflow that persists its findings; a follow-up cleanup prompt like "fix the CRITICAL issues the Data Profiler flagged" reuses those without re-profiling. See the Data Profiler AI Agent help article for what it captures.
  • Append "and write a report" to get a deliverable. Adding phrases like "write a report", "save a report", or "create a report" to any analysis or cleanup prompt produces a markdown or HTML report file alongside the database changes.
  • Comprehensive cleanup is a full sweep. A prompt like "clean the whole database" or "audit the database" triggers an eight-category scan covering schema and freshness, missing values and sentinels, date issues, string and identifier formatting, type mismatches, cross-column logical inconsistencies, referential integrity, and statistical outliers. The agent profiles every category first, then applies fixes -- so partial-completion exits don't happen.

Examples

Example 1: Exploration (Simple)

A user wants to understand what data is in their database before deciding what to clean.

Database: Supply Chain Dataset

Task Description: List all tables in the database and show their schemas

What happens: The agent calls get_database_schema for all tables and exits with a structured report.

Output:

Requested: List all tables and show schemas.

Discovered (schema 'starburst'):

  • customers: 25 rows, 11 columns (customer_id, customer_name, email, country, city, ...)
  • inventory: 40 rows, 9 columns
  • orders: 35 rows, 9 columns
  • products: 30 rows, 11 columns
  • shipments: 37 rows, 13 columns
  • suppliers: 21 rows, 10 columns
  • warehouses: 25 rows, 10 columns
  • uom: 34 rows, 5 columns

...

Total: 12 tables, 405 rows, 112 columns

Example 2: Comprehensive Cleanup (Complex)

A user needs to clean up customer location data before using it in a Cosmic Frog network optimization model.

Database: Supply Chain Dataset

Task Description: Clean the customers table completely: standardize dates to ISO, fix postal codes (Excel corruption + placeholders), standardize country codes to alpha-2, clean city names, and normalize emails to lowercase

What the agent does:

  1. Discovers the customers table schema (11 columns, 30 rows).
  2. Profiles the data and finds:
    • 6 different date formats in registration_date
    • 11 different country representations (USA, US, United States, usa, etc.)
    • 4 Excel-corrupted postal codes in scientific notation (e.g., 9.0021E+04)
    • 5 placeholder postal codes (N/A, XXXXX, ?????, -----)
    • Inconsistent city casing and whitespace (NEW YORK, los angeles, sydney )
    • Mixed-case emails (GLOBAL@EXAMPLE.COM)
  1. Creates clean_customers as a safe working copy.
  2. Applies fixes in the correct order:
    • Standardizes all dates to ISO format (YYYY-MM-DD)
    • Repairs corrupted postal codes (9.0021E+04 becomes 90021)
    • Removes placeholder postal codes (set to NULL)
    • Standardizes countries to ISO alpha-2 (USA becomes US, Germany becomes DE)
    • Cleans city names (los angeles becomes Los Angeles, sydney becomes Sydney)
    • Normalizes emails to lowercase
  1. Verifies all transformations were applied correctly.

Output:

Completed data cleansing of clean_customers table:

  • Standardized 30 date values in registration_date to ISO format (YYYY-MM-DD)
  • Fixed 4 Excel-corrupted postal codes (scientific notation)
  • Removed 5 placeholder postal codes (set to NULL)
  • Standardized 30 country codes to ISO alpha-2 format
  • Cleaned 30 city names (casing, whitespace)
  • Normalized 30 email addresses to lowercase

All changes applied to clean_customers (original customers table preserved).

The cleaned data is available in the clean_customers table in the database. The original customers table remains untouched.

Example 3: Enterprise Multi-Table Cleanup (End-to-End)

A user with a 14-table enterprise supply chain database needs to clean and standardize all data before building Cosmic Frog models for network optimization and simulation.

Database: Enterprise Supply Chain

Task Description: Perform a complete data cleanup across all tables: standardize all dates to ISO, standardize all country codes to alpha-2, clean all city names, fix all postal codes, and normalize all email addresses to lowercase. Work systematically through each table.

What the agent does: The agent works systematically through all tables -- standardizing dates across 12+ tables, fixing country codes, cleaning city names, repairing postal codes, normalizing emails and status fields, detecting and handling negative values, converting mixed units to metric, validating calculated fields like order totals, and reporting any remaining referential integrity issues. This is the most comprehensive operation the agent can perform.

Output: A detailed summary covering every table touched, every transformation applied, and a final quality scorecard showing the before/after improvement.

Example 4: Deduplication (Sub-agent Delegated)

A user has multiple records for the same customer in the customers table and wants golden keys created and propagated to the orders table.

Database: Supply Chain Dataset

Task Description: Find duplicate customer records in the customers table, create golden key mappings, and propagate them to the orders table.

What the agent does: Delegates the task to the Deduplication sub-agent, which detects exact and fuzzy duplicate groups, picks a canonical record for each group, and updates the orders table so every order points to the canonical customer.

Output: A summary listing how many duplicate groups were detected, how many golden keys were created, and how many rows in orders were updated to point to the canonical master record.

Example Prompts

Below are example prompts users can try, organized by category.

Exploration & Profiling

  1. "List all tables in the database and show their schemas"
  2. "Check for date format issues across all tables that have date columns"
  3. "Detect all postal code data quality issues in customers, shipments, suppliers, and warehouses tables"
  4. "Detect outliers in product prices, inventory quantities, and order amounts. Use IQR method."

Simple Fixes

  1. "Standardize all dates in the orders table to ISO format"
  2. "Standardize all country codes in the customers table to ISO alpha-2 format (US, DE, etc)"
  3. "Fill missing processor_fee values in transactions with 0.0"
  4. "Standardize all email addresses to lowercase across employees, customers, suppliers. Also trim whitespace."

Complex Multi-Step Tasks

  1. "Clean all postal codes in the customers table: fix Excel corruption, remove placeholders, clean formatting. Follow the proper postal code cleaning workflow."
  2. "Standardize ALL date columns across orders, transactions, inventory, products, and customers to ISO format"
  3. "Convert all product weights to kilograms and all lengths to centimeters. Some products already have metric, some have imperial, some have both - handle this correctly."
  4. "Fill missing credit_limit values in customers using the median credit_limit grouped by customer_segment"
  5. "Find cross-column inconsistencies in orders: delivery dates before ship dates, totals that don't equal price * quantity, and rows marked status=shipped but missing ship_date -- report only, don't fix"

End-to-End Scenarios

  1. "Clean the customers table completely: standardize dates to ISO, fix postal codes (Excel corruption + placeholders), standardize country codes to alpha-2, clean city names, and normalize emails to lowercase"
  2. "Perform a complete data cleanup across all tables: standardize all dates to ISO, standardize all country codes to alpha-2, clean all city names, fix all postal codes, and normalize all email addresses to lowercase. Work systematically through each table."
  3. "Prepare the entire database for data warehouse ETL: clean all tables, ensure referential integrity, standardize all formats, validate all calculations. Generate final quality report."

Custom SQL

  1. "Create a view called customer_summary that shows customer_id, total number of orders, and total order amount from the orders table"
  2. "Update the shipments table to add a new column called 'is_international' (boolean) that is TRUE when ship_from_country differs from ship_to_country"

Deduplication & Referential Integrity

  1. "Find duplicate customer records in the customers table and report them, but don’t change anything"
  2. "Deduplicate the suppliers table end-to-end: detect duplicates, create golden keys, and propagate them to all referencing transaction tables"
  3. "Validate referential integrity across the database -- show orphan records and recommend fixes, but wait for me to approve before changing anything"

Data Profiler Pipeline

  1. "Run the data profiler pipeline on the full database"
  2. "Run the data profiler on only the customers and orders tables"
  3. "Based on the latest data profiler run, fix the CRITICAL alerts in the clean_* tables"

Reports

  1. "Profile every table for data quality issues and write a markdown report summarizing what you found"
  1. "Clean the customers table and produce an HTML report of every transformation you applied"

Other Helpful Resources

Questions or feedback? Please contact the Optilogic Support team on support@optilogic.com.

Overview

The Data Cleansing Agent is one of Ada’s AI-powered assistants. It helps users profile, clean, and standardize their database data without writing code. Users describe what they want in plain English -- such as "find and fix postal code issues in the customers table" or "standardize date formats in the orders table to ISO" -- and the agent autonomously discovers issues, creates safe working copies of the data, applies the appropriate fixes, and verifies the results. The agent handles common supply chain data problems including mixed date formats, inconsistent country codes, Excel-corrupted postal codes, missing values, outliers, and messy text fields. It expects a connected database with one or more tables as input. The output is a set of cleaned copies of their tables in the database which users can immediately use for Cosmic Frog model building, reporting, or further analysis, while the original data is preserved untouched for comparison or rollback.

This documentation describes how this specific agent works and can be configured, including walking through multiple examples. Please see the “AI Agents: Architecture and Components” Help Center article if you are interested in understanding how the Optilogic AI Agents work at a detailed level.

Why It's Useful

Cleaning and standardizing data for supply chain modeling typically requires significant manual effort -- writing SQL queries, inspecting column values, fixing formatting issues one at a time, and verifying results. The Data Cleansing Agent streamlines this process by turning a single natural language prompt into a full profiling, cleaning, and verification workflow.

  1. Enhances productivity by automating repetitive data cleaning tasks that would otherwise require custom SQL or manual inspection.
  2. Reduces errors by using purpose-built tools that handle edge cases (e.g., Excel scientific notation corruption, leading-zero preservation in postal codes) that are easy to miss in manual cleanup.
  3. Preserves data integrity by always working on copies -- original tables are never modified, so there's no risk of data loss.
  4. Adapts to any database schema -- no fixed column name requirements. The agent discovers the structure automatically and applies the right tools.

What's Included

Key Capabilities:

  1. Database Exploration -- Discovers tables and schemas, inspects sample values and value frequencies, and searches for specific values across columns.
  2. Date Standardization -- Detects mixed date formats (including named months and partial dates) and converts them to ISO, US, European, or other custom formats. Handles ambiguous dates like '01/09/2024' (Jan 9 vs. Sep 1) via a configurable DMY/MDY interpretation policy and reports the inferred order back to the user.
  3. Location Standardization -- Standardizes country names/codes to ISO 3166 (alpha-2, alpha-3, or full name), cleans city names, and applies smart title casing to addresses.
  4. Postal Code Repair -- Repairs Excel-corrupted postal codes (scientific notation), restores leading zeros, removes placeholder values (N/A, XXXXX, etc.), cleans formatting, extracts postal codes from address strings, and validates against real postal code databases.
  5. String Normalization -- Trims whitespace, standardizes casing, removes extra spaces, and handles punctuation and unicode cleanup.
  6. Unit Conversion -- Converts measurement units (e.g., pounds to kilograms, inches to centimeters) using a reference table.
  7. Missing Data Handling -- Fills missing values using strategies such as constant, mean, median, mode, forward fill, or backward fill. Supports grouped fills (e.g., median by customer segment).
  8. Outlier Detection -- Detects outliers using IQR, Z-Score, or Modified Z-Score methods. Can handle outliers by nullifying, capping, replacing with statistics, or removing rows.
  9. Type Casting -- Converts column data types (INTEGER, NUMERIC, TEXT, BOOLEAN, DATE, TIMESTAMP, TIMESTAMPTZ) with configurable error handling (skip / null / strict).
  10. Custom SQL -- Executes any SQL statement for operations not covered by the specialized tools, such as creating views, computed columns, or complex joins.
  11. Deduplication (via specialized sub-agent) -- Finds and resolves duplicate records. Supports exact matching, fuzzy matching, cross-table matching, and transaction-row deduplication. Creates golden-key mappings on a master table and propagates them to all referencing transaction tables.
  12. Referential Integrity -- Surfaces primary-key duplicates/nulls and foreign-key orphan rows, presents realistic repair options (null the reference, delete orphans, insert stub parents, etc.), and applies the chosen fix on clean_* copies only after the user confirms. Reuses the Data Profiler's findings when a recent profiler run exists; otherwise inspects the schema directly.
  13. Report Generation -- Produces a markdown or HTML report summarizing analysis or cleanup work, dispatched to a dedicated report-writer sub-agent that chooses the right format based on the user's framing.
  14. Value Replacement -- Replaces specific values in a column (e.g., 'Cancelled' -> 'Canceled', 'N/A' -> NULL) for one or many columns in a batch. Useful for collapsing inconsistent boolean encodings (Y/N vs. Yes/No vs. 1/0) and bulk sentinel-to-NULL conversion.

Tools:

How To Use It

The agent can be accessed on the next generation Optilogic platform by chatting with Ada and through the Run AI Agent task in DataStar. Both ways will be explained, via Ada first, then the DataStar workflow, followed by an overview of the main differences between the 2 methods.

Using the Data Cleansing Agent through Chat with Ada

It is recommended to be somewhat familiar with Ada before diving into this content. Please see the Getting Started with Ada & Agentic AI article, and in particular its How to Use Ada section.

Once logged into the next-generation Optilogic platform at https://ai.optilogic.app, you can start chatting with Ada leveraging the Data Cleansing Agent right away from the central part of the Home page.

  1. If unsure how to start, or if you are just exploring, you can use one of the example prompts at the top to get going.
  2. Underneath the prompt textbox, we can (from left to right):
    1. Select the interaction Style to use. These behave differently in how verbose the agent's answers will be and how often user confirmation/feedback will be sought before proceeding.
    2. Connect to Cosmic Frog models, DataStar projects, and Postgres databases. The prompt can refer to these; the agent will create clean_* copies of any tables it modifies.
    3. Choose the Data Cleanser Agent from the agent selector.
  3. Type your question/task into the prompt textbox.
  4. Click on the submit button at the right bottom of hte prompt box.

Regarding how to write good prompts, please note that the general Best Practices, Tips & Tricks, and Current Limitations and Known Behaviors included in the Getting Started with Ada documentation also apply to the Data Cleansing Agent.

After submitting a prompt, the Data Cleansing Agent will start processing and formulating a response:

  1. The full prompt is shown at the top of the conversation.
  2. Underneath the prompt, a status line shows what Ada is doing currently. To see each status update, expand this section by clicking on the caret icon to the left. Clicking on the Status Updates, Tool Calls, or Knowledge tabs shows progressively more detail.
  3. If you want to stop Ada while she is still working on a response (say you realized your prompt is incorrect/incomplete or you have not connected the right database(s)), you can click the Stop button at the right bottom in the prompt box.

The Data Cleansing Agent may ask for feedback before proceeding — for example, when:

  • A date column contains ambiguous values like '01/09/2024' (Jan 9 vs. Sep 1) and the agent wants to confirm DMY vs MDY interpretation.
  • Sentinel values are detected and the agent wants confirmation before nulling them.
  • Duplicate records are found and the agent wants confirmation on the merge strategy.
  • Foreign-key orphans are found and the agent wants to confirm the repair option (null the reference, delete orphans, etc.).
  1. First, Ada explains why she needs some feedback and/or confirmation to continue.
  2. Next, some things Ada noticed while working on the prompt are noted in the lighter texbox. It gives users additional information to help make their decisions on how to proceed.
  3. The feedback required to continue is then listed. Often this takes the form of multiple options: the user can just click on the option they want to go with.
  4. In case none of the options suggested by Ada are suitable, the user can type how they want to proceed for that item in the textbox.
  5. Once all feedback items have been addressed, click on the Send Response button for Ada to continue.
  6. Click on the Stop Agent option in case continuing does not make sense.

When Ada finishes, the final response is presented:

  1. The original prompt is listed at the top of the conversation.
  2. The status line now shows how long the response took. To see full details, expand it.
  3. Any clarifying questions Ada asked along the way and the answers given are shown — click the caret to expand them.
  4. This is the final response where Ada summarizes what she did — which tables were modified, how many rows were affected, and where the cleaned data lives.
  5. The user can continue the conversation by typing any follow-up questions/tasks and submitting them. It is recommended to start a new conversation if switching to an unrelated task, a different agent, different connected databases, or setting a different interaction style.

For completeness, the cleaned data shows up in the connected database as clean_* table copies — for example, clean_customers, clean_orders — with the originals preserved untouched for comparison or rollback:

Using the Data Cleansing Agent within DataStar

In DataStar, the Data Cleansing Agent is accessed by using a Run AI Agent task, see also the screenshots below. The key inputs are:

  • Database -- Select the database to perform data cleansing on. All database types are supported (Cosmic Frog, DataStar, or Postgres).
  • Task Description -- Describe what you want the agent to do. This is a free-text field (up to 10,000 characters) where you write your request in natural language. Be as specific as possible about tables, columns, and desired outcomes.

The Task Description field includes placeholder examples to help you get started:

  • List all tables and analyze their data quality
  • Standardize date formats in the orders table to ISO
  • Find and fix postal code issues in the customers table
  • Normalize country codes to ISO alpha-2 format

Optionally, users can:

  • Enable Verbose Output (toggle to "Detailed") to see the full list of available tools and detailed execution information. Default is "Concise".
  • Point the Agent to a Knowledge Folder. Any .md and .txt files placed here will be read by the agent and used as additional context to perform the task. Think of column descriptions, known data conversions, etc.
A DataStar macro with a Run AI Agent task added to it
In the Select Utility part of the Configuration tab of a Run AI Agent task, users can select which Agent they want to use. Here we use the Data Cleansing Agent.
After selecting the Agent, it needs to be configured in the Configure Utility part of the Configuration tab. Select the database to run the Agent on, and enter a task description. Optionally toggle the Verbose Output to Detailed and/or set a Knowledge Folder.

Not shown in the screenshots above, there is also a Run Configuration section, where users can add Tags to facilitate finding job runs, set a Timeout for the task, and set the Resource Size to use. Note that for most Run AI Agent tasks, the Resource Size will need to be set to XS or higher.

Suggested workflow:

  1. Start with a profiling prompt to understand your data quality issues (e.g., "List all tables and analyze their data quality").
  2. Review the agent's findings to understand what needs to be fixed. This information can be found in the Task Log, see the section "Reading the Task Log" below for more details.
  3. Run a transformation prompt targeting the specific issues found (e.g., "Standardize date formats in the orders table to ISO and fix postal code issues in the customers table").
  4. Review the summary the agent provides, which is the Agent Response section in the Task Log -- it reports what was changed, how many rows were affected, and any remaining issues.
  5. Query the clean_* tables in the database to verify the results. The original tables remain untouched.

After the run, the agent produces a structured summary of everything it did, including metrics on rows affected, issues found, and issues fixed; see the next section where this Job Log is described in more detail. The cleaned data is persisted as clean_* tables in the database (e.g., clean_customers, clean_shipments).

Data Cleansing Agent Usage: Chat UI vs DataStar

There are a few differences to keep in mind when running the Data Cleansing Agent either through chatting with Ada or from within DataStar:

  1. Chatting with Ada is conversational and interactive. It supports clarification questions, mid-run feedback, follow-up prompts in the same conversation, and on-the-fly database connection changes. Best for exploration, one-off cleanup, and tasks where the cleansing strategy isn't fully decided up front.
  2. DataStar is batch and repeatable. It runs as a Run AI Agent task inside a Macro, ideal for cleanup that needs to be re-run on a schedule or as part of a larger pipeline. Mid-run feedback is not available — the prompt has to be self-contained, and any clarifications surface in the Task Log only after the task finishes.
  3. Both produce the same artifacts: clean_* table copies in the connected database, and optionally a markdown or HTML report if requested.

Recommendation: Use the chat UI to develop and refine a prompt, then transfer the working prompt into a DataStar Run AI Agent task once you want the workflow to become repeatable.

Reading the Task Log

After a run completes, the Task Log provides a detailed trace of every step the agent took. Understanding the log structure helps users verify what happened and troubleshoot if needed. The log follows a consistent structure from start to finish.

The Task Log for the Run AI Agent task is expanded and we are viewing thelog for a run that completed on 6/2. Use the copy icon at the right top to download the log to a .txt file.

Header

Every log begins with a banner showing the database name and the exact prompt that was submitted.

Connection & Setup

The agent validates the database connection and initializes itself with its full set of tools. If Verbose Output is set to "Detailed", the log also prints the system prompt and tool list at this stage.

Planning Phase

For non-trivial tasks, the agent creates a strategic execution plan before taking action. This appears as a PlanningSkill tool call, followed by an AI Response box containing a structured plan with numbered steps, an objective, approach, and skill mapping. The plan gives users visibility into the agent's intended approach before it begins working.

Tool Calls and Thinking

The bulk of the log shows the agent calling its specialized tools one at a time. Each tool call appears in a bordered box showing the tool name. Between tool calls, the agent's reasoning is shown in Thinking boxes -- explaining what it learned from the previous tool, what it plans to do next, and why. These thinking sections are among the most useful parts of the log for understanding the agent's decision-making.

The agent may call many tools in sequence depending on the complexity of the task. Profiling-only prompts typically involve discovery tools (schema, missing data, date issues, location issues, outliers). Cleanup prompts add transformation tools (ensure_clean_table, standardize_country_codes, standardize_date_column, etc.).

Occasionally a Memory Action Applied entry appears between steps -- this is the agent recording context for its own use and can be ignored.

Error Recovery

If the agent encounters a validation error on a tool call (e.g., a column stored as TEXT when a numeric type was expected, or a missing parameter), the log shows the error and the agent's automatic adjustment. The agent reasons about the failure in a Thinking block and retries with corrected parameters. Users do not need to intervene.

Agent Response

At the end of the run, the agent produces a structured summary of everything it discovered or changed. This is the most important section of the log for understanding outcomes:

For profiling prompts, this section reports what was found across all tables -- schema details, missing data percentages, date format inconsistencies, location quality issues, numeric anomalies, and recommendations for next steps. For cleanup prompts, it reports which tables were modified, what transformations were applied, how many rows were affected, and confirmation that originals are preserved.

Execution Summary

The log ends with runtime statistics and the full list of skills that were available to the agent:

Input Requirements

What the agent expects in your database:

The agent works with any tables in the selected database. There are no fixed column name requirements -- the agent discovers the schema automatically. However, for best results:

  • Tables should be loaded in the database before running the agent.
  • If you need unit conversion, include a uom (units of measure) reference table with columns for unit name, symbol, type, and conversion ratio.
  • Postal code tools work best when postal code columns are TEXT type. The agent will cast INT/BIGINT columns to TEXT automatically if needed.

Output Description

Tips & Notes

  • Start with profiling. Before asking the agent to fix anything, run a profiling prompt like "List all tables and analyze their data quality" to understand what's there. The agent will not auto-fix issues unless you explicitly ask it to.
  • Be specific. "Fix postal codes" is good; "Fix postal codes in the customers and shipments tables -- repair Excel corruption, remove placeholders, and clean formatting" is better. The more specific the prompt, the more precise the results.
  • Original data is safe. The agent always creates clean_* copies before making changes. Your source tables are never modified. You can re-run the agent as many times as needed.
  • Order matters for postal codes. The agent handles this automatically, but for reference: Excel corruption must be repaired before formatting cleanup, and formatting must be cleaned before placeholder removal.
  • Grouped fills are supported. For missing data, you can ask for strategies like "fill missing credit_limit using the median grouped by customer_segment" and the agent will handle the grouped calculation.
  • Custom SQL is available. If you need something the specialized tools don't cover -- like creating views, adding computed columns, or complex joins -- the agent can execute arbitrary SQL as a fallback.
  • Runtime varies based on the number of tables, columns, and the complexity of the prompt. Simple profiling tasks complete quickly; full database-wide cleanup across many tables will be longer. Expect at least a few minutes for multi-table operations.
  • Just like many other DataStar tasks, it is possible to run multiple tasks in parallel with the Data Cleansing Agent.
  • Additional info on the run can be found in Run Manager > Job Log after the run finishes. This includes steps that the agent takes, tools it calls, as well as a summary of work.
  • The Run Utility task also offers the ability for users to set Run Configuration (optional): Tags for easy filtering in Run Manager, Timeout for maximum run duration, and Resource Size for different memory/CPU allocations.
  • Duplicates are handled by a dedicated sub-agent. Any prompt that mentions finding, detecting, or merging duplicate records ("find duplicate customers", "deduplicate the suppliers table") is automatically routed to the Deduplication sub-agent. You don't need to invoke it explicitly -- the parent agent decides.
  • Use the Data Profiler pipeline for deep analysis. To run a full database profile, ask for it by name (e.g., "run the data profiler pipeline"). It's a separate, longer-running workflow that persists its findings; a follow-up cleanup prompt like "fix the CRITICAL issues the Data Profiler flagged" reuses those without re-profiling. See the Data Profiler AI Agent help article for what it captures.
  • Append "and write a report" to get a deliverable. Adding phrases like "write a report", "save a report", or "create a report" to any analysis or cleanup prompt produces a markdown or HTML report file alongside the database changes.
  • Comprehensive cleanup is a full sweep. A prompt like "clean the whole database" or "audit the database" triggers an eight-category scan covering schema and freshness, missing values and sentinels, date issues, string and identifier formatting, type mismatches, cross-column logical inconsistencies, referential integrity, and statistical outliers. The agent profiles every category first, then applies fixes -- so partial-completion exits don't happen.

Examples

Example 1: Exploration (Simple)

A user wants to understand what data is in their database before deciding what to clean.

Database: Supply Chain Dataset

Task Description: List all tables in the database and show their schemas

What happens: The agent calls get_database_schema for all tables and exits with a structured report.

Output:

Requested: List all tables and show schemas.

Discovered (schema 'starburst'):

  • customers: 25 rows, 11 columns (customer_id, customer_name, email, country, city, ...)
  • inventory: 40 rows, 9 columns
  • orders: 35 rows, 9 columns
  • products: 30 rows, 11 columns
  • shipments: 37 rows, 13 columns
  • suppliers: 21 rows, 10 columns
  • warehouses: 25 rows, 10 columns
  • uom: 34 rows, 5 columns

...

Total: 12 tables, 405 rows, 112 columns

Example 2: Comprehensive Cleanup (Complex)

A user needs to clean up customer location data before using it in a Cosmic Frog network optimization model.

Database: Supply Chain Dataset

Task Description: Clean the customers table completely: standardize dates to ISO, fix postal codes (Excel corruption + placeholders), standardize country codes to alpha-2, clean city names, and normalize emails to lowercase

What the agent does:

  1. Discovers the customers table schema (11 columns, 30 rows).
  2. Profiles the data and finds:
    • 6 different date formats in registration_date
    • 11 different country representations (USA, US, United States, usa, etc.)
    • 4 Excel-corrupted postal codes in scientific notation (e.g., 9.0021E+04)
    • 5 placeholder postal codes (N/A, XXXXX, ?????, -----)
    • Inconsistent city casing and whitespace (NEW YORK, los angeles, sydney )
    • Mixed-case emails (GLOBAL@EXAMPLE.COM)
  1. Creates clean_customers as a safe working copy.
  2. Applies fixes in the correct order:
    • Standardizes all dates to ISO format (YYYY-MM-DD)
    • Repairs corrupted postal codes (9.0021E+04 becomes 90021)
    • Removes placeholder postal codes (set to NULL)
    • Standardizes countries to ISO alpha-2 (USA becomes US, Germany becomes DE)
    • Cleans city names (los angeles becomes Los Angeles, sydney becomes Sydney)
    • Normalizes emails to lowercase
  1. Verifies all transformations were applied correctly.

Output:

Completed data cleansing of clean_customers table:

  • Standardized 30 date values in registration_date to ISO format (YYYY-MM-DD)
  • Fixed 4 Excel-corrupted postal codes (scientific notation)
  • Removed 5 placeholder postal codes (set to NULL)
  • Standardized 30 country codes to ISO alpha-2 format
  • Cleaned 30 city names (casing, whitespace)
  • Normalized 30 email addresses to lowercase

All changes applied to clean_customers (original customers table preserved).

The cleaned data is available in the clean_customers table in the database. The original customers table remains untouched.

Example 3: Enterprise Multi-Table Cleanup (End-to-End)

A user with a 14-table enterprise supply chain database needs to clean and standardize all data before building Cosmic Frog models for network optimization and simulation.

Database: Enterprise Supply Chain

Task Description: Perform a complete data cleanup across all tables: standardize all dates to ISO, standardize all country codes to alpha-2, clean all city names, fix all postal codes, and normalize all email addresses to lowercase. Work systematically through each table.

What the agent does: The agent works systematically through all tables -- standardizing dates across 12+ tables, fixing country codes, cleaning city names, repairing postal codes, normalizing emails and status fields, detecting and handling negative values, converting mixed units to metric, validating calculated fields like order totals, and reporting any remaining referential integrity issues. This is the most comprehensive operation the agent can perform.

Output: A detailed summary covering every table touched, every transformation applied, and a final quality scorecard showing the before/after improvement.

Example 4: Deduplication (Sub-agent Delegated)

A user has multiple records for the same customer in the customers table and wants golden keys created and propagated to the orders table.

Database: Supply Chain Dataset

Task Description: Find duplicate customer records in the customers table, create golden key mappings, and propagate them to the orders table.

What the agent does: Delegates the task to the Deduplication sub-agent, which detects exact and fuzzy duplicate groups, picks a canonical record for each group, and updates the orders table so every order points to the canonical customer.

Output: A summary listing how many duplicate groups were detected, how many golden keys were created, and how many rows in orders were updated to point to the canonical master record.

Example Prompts

Below are example prompts users can try, organized by category.

Exploration & Profiling

  1. "List all tables in the database and show their schemas"
  2. "Check for date format issues across all tables that have date columns"
  3. "Detect all postal code data quality issues in customers, shipments, suppliers, and warehouses tables"
  4. "Detect outliers in product prices, inventory quantities, and order amounts. Use IQR method."

Simple Fixes

  1. "Standardize all dates in the orders table to ISO format"
  2. "Standardize all country codes in the customers table to ISO alpha-2 format (US, DE, etc)"
  3. "Fill missing processor_fee values in transactions with 0.0"
  4. "Standardize all email addresses to lowercase across employees, customers, suppliers. Also trim whitespace."

Complex Multi-Step Tasks

  1. "Clean all postal codes in the customers table: fix Excel corruption, remove placeholders, clean formatting. Follow the proper postal code cleaning workflow."
  2. "Standardize ALL date columns across orders, transactions, inventory, products, and customers to ISO format"
  3. "Convert all product weights to kilograms and all lengths to centimeters. Some products already have metric, some have imperial, some have both - handle this correctly."
  4. "Fill missing credit_limit values in customers using the median credit_limit grouped by customer_segment"
  5. "Find cross-column inconsistencies in orders: delivery dates before ship dates, totals that don't equal price * quantity, and rows marked status=shipped but missing ship_date -- report only, don't fix"

End-to-End Scenarios

  1. "Clean the customers table completely: standardize dates to ISO, fix postal codes (Excel corruption + placeholders), standardize country codes to alpha-2, clean city names, and normalize emails to lowercase"
  2. "Perform a complete data cleanup across all tables: standardize all dates to ISO, standardize all country codes to alpha-2, clean all city names, fix all postal codes, and normalize all email addresses to lowercase. Work systematically through each table."
  3. "Prepare the entire database for data warehouse ETL: clean all tables, ensure referential integrity, standardize all formats, validate all calculations. Generate final quality report."

Custom SQL

  1. "Create a view called customer_summary that shows customer_id, total number of orders, and total order amount from the orders table"
  2. "Update the shipments table to add a new column called 'is_international' (boolean) that is TRUE when ship_from_country differs from ship_to_country"

Deduplication & Referential Integrity

  1. "Find duplicate customer records in the customers table and report them, but don’t change anything"
  2. "Deduplicate the suppliers table end-to-end: detect duplicates, create golden keys, and propagate them to all referencing transaction tables"
  3. "Validate referential integrity across the database -- show orphan records and recommend fixes, but wait for me to approve before changing anything"

Data Profiler Pipeline

  1. "Run the data profiler pipeline on the full database"
  2. "Run the data profiler on only the customers and orders tables"
  3. "Based on the latest data profiler run, fix the CRITICAL alerts in the clean_* tables"

Reports

  1. "Profile every table for data quality issues and write a markdown report summarizing what you found"
  1. "Clean the customers table and produce an HTML report of every transformation you applied"

Other Helpful Resources

Questions or feedback? Please contact the Optilogic Support team on support@optilogic.com.

Have More Questions?

Contact Support

Get in touch

Contact Sales

Get in touch

Visit Frogger Pond Community

Visit our Community