Data Profiler AI Agent

Documentation

Overview

The Data Profiler AI Agent is one of Ada's AI-powered assistants, focused on assessing data. It automatically analyzes the quality, structure, and relationships of data stored in an Optilogic database. By profiling every table and column, the agent creates a comprehensive data-quality catalog that helps users understand their data, identify issues, discover relationships, and prioritize cleansing efforts.

The agent can be accessed by chatting with Ada in the next generation Optilogic platform and via Run AI Agent tasks in DataStar.

Why Use the Data Profiler AI Agent?

Understanding the quality and meaning of data is often one of the most time-consuming steps in any analytics, modeling, or optimization project. The Data Profiler AI Agent automates this process by:

Generating detailed statistical profiles for tables and columns
Creating business-friendly descriptions of tables and fields
Detecting data-quality issues and anomalies
Identifying potential primary and foreign key relationships
Recommending appropriate data types
Assigning quality scores to help prioritize remediation efforts
Maintaining historical profiling results for trend analysis

The result is a queryable inventory of your data assets, complete with quality assessments and relationship insights.

What the Agent Does

The Data Profiler AI Agent performs several layers of analysis.

Statistical Profiling

For every table and column, the agent calculates statistical characteristics such as:

Row counts
Column counts
Missing value percentages
Distinct value counts
Percentiles
Skewness
Zero-value frequency
Interquartile range (IQR) outlier detection
Pearson correlations between column pairs

For large datasets, the agent uses deterministic sampling to ensure consistent results across profiling runs.

Table and Column Documentation

Using LLM-assisted analysis, the agent generates:

Executive summaries for tables
Descriptions for each column
Classification of tables as either:
- Master Data
- Transactional Data
Semantic classifications for columns, such as:
- Email
- Postal code
- Currency
- Identifier
- Date
- Phone number
- Geographic coordinate

These descriptions help users quickly understand the purpose and meaning of data assets.

Cast-Type Recommendations

Based on semantic classifications, the agent recommends appropriate database data types. Examples include:

currency → NUMERIC
identifier → TEXT to preserve leading zeros
date → DATE or TIMESTAMP

These recommendations help improve data consistency and prevent issues such as loss of leading zeros in identifiers.

Semantic Data Validation

After semantic types are identified, the agent performs specialized validation checks against actual data values. Examples include:

Malformed email addresses
Deprecated currency codes
Mixed phone number formats
Ambiguous date formats
Out-of-range geographic coordinates
Placeholder/sentinel values
Hidden control characters
Excessive whitespace
Padded categories

The agent performs dozens of validation checks tailored to the detected semantic type.

Alert Generation

Each detected issue is stored as a single row structured alert. Each alert contains:

Check name
Category
Severity (CRITICAL / MEDIUM / LOW)
Affected table and column
Human-readable explanation

Notable specialized checks include:

Null-sentinel strings in text columns - values such as "N/A", "-", "None", etc. that masquerade as data
Fixed-width truncation patterns - column values silently cust at 16, 32, 255 characters
Shadow columns - near-perfect correlation suggesting a duplicate/linear transform
Row-count drift - significant row-count changes between profiling runs

Primary and Foreign Key Discovery

The agent can identify relationships even when keys are not formally defined in the schema.

The discovery process includes

‍Heuristics on column names and uniqueness stats propose candidates.
SQL probes validate uniqueness, null rates, and FK match rates against actual rows.
LLM interpretation resolves ambiguous candidates; an optional pass suggests relationships heuristics missed.‍
Composite primary keys (2–4 columns) are tested and the minimal unique set is returned.

Data Quality Scoring

The Data Profiler AI Agent assigns scores ranging from 0.0 to 1.0 across three dimensions:

Data reliability — values present, valid, in range. ‍
Integrity and completeness — structural soundness, duplicates, Primary Key/Foreign Key health. ‍
Cross-consistency — uniformity of format and encoding.

The overall score is a weighted average, which is capped if data integrity drops too low. Tables without data receive a baseline minimum score, while tables that generate errors display an error stub so users are always aware of the issue.

Data Reliability

Measures whether values are:

Present
Valid
Within expected ranges

Cross-Table Consistency Analysis

When the same column name appears in multiple tables with different semantic tags, a majority vote picks one and corrects the outliers.

Inputs and Outputs

Inputs

The only required input is to point the agent to a database. There are several optional inputs which we will cover in the Using the Data Profiler Agent section below.

Outputs

The output consists of tables written to the database that was profiled:

In addition to database outputs, the agent generates a timestamped execution log, which includes table processing times, alerts, and primary key/foreign key findings. Reviewing the log can help diagnose profiling issues and understand execution performance.

Using the Data Profiler Agent

There are two ways to access the Data Profiler Agent:

By using Run AI Agent tasks within DataStar
Through chatting with Ada in the next generation Optilogic platform

Both ways will be explained, through DataStar first, then using the chat UI.

Using the Data Profiler Agent within DataStar

Accessing the Data Profiler Agent through DataStar is done via a Run AI Agent task:

In the Select Utility section of the task's Configuration tab, choose Data Profiler

In the Configure Utility section of the Configuration tab (from top to bottom):

Database (required) - choose the database, a Cosmic Frog model, DataStar project or Postgres database, to be profiled.
Table Selection (optional) - by default All Tables in the database will be profiled. Switch to Specific Tables if profiling a subset, which will require listing the tables to be profiled inthe Tables to Profile field that will come up.
Max Rows to Sample (optional) - by default a maximum of 100,000 rows per table will be profiled. Set to a different number if required. Setting to 0 means profile all rows, which may be time-consuming for large tables.
Profile Mode (optional) - by default the full pipeline performing all the profiling steps as described in the What the Agent Does section above will be run. For a faster result switch to "Statistics + Alerts + Scoring" which omits the descriptions, semantic checks, and primary key/foreign discovery.
Max Table Workers (optional) - by default up to 16 tables can be processed concurrently. Lower this for very large databases.
Cache Descriptions (optional) - by default descriptions and semantic type are generated fresh ("Refresh") when running the data profiler. Switch to "Reuse" to use the descriptions and semantic types from the previous Data Profiler run if the table's columns are unchanged.

Note that it is recommended to change the Resource Size from 3XS to XS in the Run Configuration section, since 3XS is usually not sufficient to run the Data Profiler Agent:

While the task is running and after it has completed, the Task Logs tab contains the log file where the user can monitor progress and review key alerts and high-level output summaries:

As an example output, let us have a look at the _dq_table_profiles table:

One of the resulting tables of a Data Profiler Agent run. Not all columns are shown here.

In this table, the entire executive summary for the customer_returns table is as follows: "The customer_returns table records return events tied to individual sales order lines, supporting analysis of return volumes, reasons, and financial impact. Each record links a return reference return_id to an order_line_id, with return_date providing the time dimension for trend reporting. Operational metrics include return_qty and restock_flag, while refund_amount captures the customer reimbursement value but is stored as text and includes invalid entries. Return reasons are mostly standardized but include missing values and placeholder or corrupted categories, suggesting a need for data cleansing and validation."

Using the Data Profiler Agent within the Chat UI

It is recommended to be somewhat familiar with Ada and how to talk to her in the chat UI before diving into this content. Please see the Getting Started with Ada & Agentic AI article, and in particular its How to Use Ada section.

Once logged into the next generation Optilogic platform at https://ai.optilogic.app, you can start chatting with Ada and leveraging the Data Profiler Agent right away from the central part of the Home page. You can access it by using the Data Cleanser Agent, as this agent can call the Data Profiler Agent as a tool.

Pro tip

When the Data Profiler Agent is used by the Data Cleanser Agent, the results of the profiling are available to the Data Cleanser Agent as part of the conversation context. It can then base its next cleaning steps in a follow-up prompt on this profile.

‍

After selecting the database to profile (here a DataStar project named Dairy End-to-End Ada), set the agent to Data Cleanser, write your prompt indicating you want to profile the data (or a subset of it) contained in the connected database. It is recommended to mention the Data Profiler pipeline and wanting to write the results into the database itself:

This prompt results in running the full Data Profiler Agent's pipeline and the __pq_ tables can be found in the sandbox of the connected DataStar project.

Use DataStar or Chat UI?

The following table summarizes the most common use cases for the 2 ways of accessing the Data Profiler Agent:

Data Profiler or Data Cleanser Agent

To help you decide if you should use the Data Profiler or Data Cleanser Agent for your task, here is a quick overview of both:

Summary

You point the Data Profiler AI Agent at a database, walk away, and come back a few minutes later to a queryable catalogue of every table — what each column means, what type it should be, where the data is broken, how the tables relate, and a single quality score per table to triage what needs cleaning first.

Other Helpful Resources

Getting Started with Ada & Agentic AI‍
An overview of Optilogic's AI Agents and Utilities ‍
AI Agents: Architecture and Components‍
The Modeler AI Agent
The Model Output Insights AI Agent‍
The Data Cleansing AI Agent
Learn more about Cosmic Frog‍
Learn more about DataStar

Questions or feedback? Please contact the Optilogic Support team on support@optilogic.com.

‍

Documentation

Getting Started with Optilogic

Account Creation

Product Overview

Tutorial Model

Navigating Ada & Agentic AI

Navigating DataStar

Getting Started

Quick Start Guides

AI Agents & Utilities

Data Connections

Scripting

Tasks

Navigating Cosmic Frog

Analytics

Anura Data Schema

Dart - Risk

Dendro - Simulation-Optimization

Hopper - Transportation Optimization

Triad - Intelligent Greenfield

Navigating Atlas

Navigating the Supply Chain Design Kit (SDK)

Connecting to External Data

API Tools

Data Tools

Overview

The agent can be accessed by chatting with Ada in the next generation Optilogic platform and via Run AI Agent tasks in DataStar.

Why Use the Data Profiler AI Agent?

Understanding the quality and meaning of data is often one of the most time-consuming steps in any analytics, modeling, or optimization project. The Data Profiler AI Agent automates this process by:

Generating detailed statistical profiles for tables and columns
Creating business-friendly descriptions of tables and fields
Detecting data-quality issues and anomalies
Identifying potential primary and foreign key relationships
Recommending appropriate data types
Assigning quality scores to help prioritize remediation efforts
Maintaining historical profiling results for trend analysis

The result is a queryable inventory of your data assets, complete with quality assessments and relationship insights.