Advanced Data Enrichment and Transformation using the Run Python Task

What Is It

The Run Python task allows you to execute Python scripts within a DataStar macro. You select a script, define its inputs (arguments), and run it as part of your workflow. This is ideal when built-in tasks are insufficient or when you want to reuse existing Python logic.

Use Cases

The Run Python task is especially useful when:

  • Bulk importing or exporting data into or out of a project sandbox.
  • Reusing existing Python scripts for data transformation instead of rebuilding logic in DataStar.
  • You prefer Python over visual workflow construction.
  • Enriching data using external APIs.
  • Integrating advanced libraries (e.g., pandas, NumPy) for efficient data manipulation.

Configuration

This walkthrough uses the end state of the DataStar Quick Start: Creating a Task using Natural Language guide as a starting point. At that point, a raw_shipments table has been imported and a Run SQL task has produced a customers table with unique customers. We will use a Run Python task to transform customer names from the format CZ1, CZ10, CZ100 to Cust_0001, Cust_0010, Cust_0100 - ensuring alphabetical sort matches customer number order and aligning the prefix with other data sources. The transformation steps are:

  1. Remove the "CZ" prefix.
  2. Left pad the number with leading zeroes to 4 digits.
  3. Add the "Cust_" prefix.
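
These three steps can be sketched in plain Python, independent of DataStar (str.removeprefix requires Python 3.9+):

```python
def rename_customer(name: str) -> str:
    digits = name.removeprefix("CZ")  # 1. remove the "CZ" prefix
    padded = digits.zfill(4)          # 2. left-pad the number with zeroes to 4 digits
    return "Cust_" + padded           # 3. add the "Cust_" prefix

print(rename_customer("CZ1"))    # Cust_0001
print(rename_customer("CZ100"))  # Cust_0100
```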

The screenshot below shows the "Change customer names" Run Python task added to the macro:

Code File

Once a Python task is added to the macro canvas, its configuration tab opens on the right:

  1. Enter the task name here; in this example it is "Change customer names".
  2. In the Code File section, choose File to select an existing .py file, or Template to use a pre-built script (covered in the Templates section).
  3. Drag and drop a .py file from your local computer here to upload it to My Files/DataStar in your Optilogic account.
  4. The list of .py files in your Optilogic account is shown here. The selected file (update_customer_names.py) is pinned to the top with a darker background so no scrolling is needed to see which file is selected.
  5. The currently selected file is greyed out in the list (already selected).
  6. Hover over a truncated column value to see its full text in a tooltip.
  7. The file table is customizable:
    1. Drag column headers to reorder; click a header to sort (click again to reverse, third click removes sort; use Ctrl + Shift for multi-column sort).
    2. Click on the three-dot icon on a column header for a context menu with sort, pin, resize, and column visibility options.
    3. Drag the column divider to resize a column (the cursor changes to 2 arrows pointing away from each other).
  8. Use the Search box to filter files by name.
  9. After selecting a file, click on Use File to proceed to the next configuration step.
  10. Click Create File to start a new script from scratch (see the Create File section).

After clicking Use File with update_customer_names.py selected, the configuration updates as follows:

  1. The selected file name is shown in the File Name field.
  2. The file's path in your Optilogic account is shown in Directory Path.
  3. Click Replace File to choose a different Python script.
  4. Click Edit File to open the script in the Lightning Editor, another application on the Optilogic platform.
  5. The Configure Python Arguments section lets an AI agent auto-detect the script's arguments. For auto-detection to work, the script must use the argparse module (see the next section). Note: auto-detection may occasionally be incorrect, and running it overwrites any previously configured arguments in the Command Arguments input of the Run Configuration section (see below).
  6. Click Detect arguments to run the auto-detection.

Before configuring arguments, let us review the update_customer_names.py script: its arguments, to verify that auto-detection is correct, and its body, to understand what it does. You can copy the full script text from the Appendix.

  1. Lines 1-2 import the argparse module (for parsing arguments) and the DataStar Python library (which also makes pandas available). See the Using the DataStar Python Library Help Center article for more details.
  2. Line 4 begins the main() function - the conventional entry point for a script's logic.
  3. Lines 5-12 define and call the argument parser:
    1. The description keyword briefly describes the script's purpose.
    2. Lines 8-10 define three required arguments (--project, --original_table, --column_name). Each has a name (double-dash prefix), a required flag, and a help string which will be shown as a tooltip in the DataStar UI when auto-detecting arguments.
    3. Line 12 parses the entered arguments; they are subsequently accessible as args.project, args.original_table, and args.column_name.
    4. Troubleshooting tip: add a print statement after line 12 to log parsed argument values to the Job Log (visible in the Run Manager application when a task is run). For example, the statement print(f"project argument: {args.project}") prints the text "project argument: " followed by the value of args.project.
  4. Lines 14-21 perform the data transformation using the DataStar Python library and pandas (which can be accessed after importing the DataStar library):
    1. Lines 14-17 connect to the project (--project), access its sandbox, read the --original_table into a dataframe (df_table), and copy it to df_new_table.
    2. Lines 18-20 transform the --column_name values: strip the "CZ" prefix, left-pad the number to 4 digits, and prepend with "Cust_".
    3. Line 21 writes df_new_table to the sandbox as a new table named new_<original_table_name>.
  5. Line 23 prints a confirmation message to the Job Log once the script completes successfully.
  6. Lines 25-26 call main() when the script is executed.
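
The argparse pattern described above can be reproduced in miniature; here parse_args is given an explicit list to simulate command-line input (a real run reads sys.argv instead):

```python
import argparse

parser = argparse.ArgumentParser(description="Minimal sketch of the script's argument parsing")
parser.add_argument("--project", required=True, help="Name of the DataStar project")
parser.add_argument("--original_table", required=True, help="Original table to transform")
parser.add_argument("--column_name", required=True, help="Column containing the customer names")

# Simulate the command line for illustration; omit the list in a real script.
args = parser.parse_args(["--project", "Demo", "--original_table", "customers",
                          "--column_name", "customername"])
print(f"project argument: {args.project}")  # project argument: Demo
```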

Configure Python Arguments and Run Configuration

With the script understood, let us use Detect arguments and configure the task:

  1. Three arguments are correctly detected. Their display names are derived from the keyword arguments of the argparse add_argument method.
    1. DataStar Project Name: type the name of the DataStar project to use.
    2. Original Table Name: type the name of the table containing the column with customer names that are to be updated. Hovering over the blue question mark shows the tooltip from the script's help keyword argument (see script screenshot above, line 9).
    3. Customer Name Column: type the name of the column containing the customer name values to be updated.
  2. If the script changes, click the refresh button to re-run auto-detection of the script's arguments.
  3. The Run Configuration section contains:
    1. Command Arguments: if auto-detection was not used, enter arguments manually here. When auto-detection is used, this field is read-only and shows the generated argument string.
    2. Tags: optionally, enter tags, which will be shown in the Run Manager jobs list and can help identify the job(s) of interest quickly.
    3. Timeout: the task is stopped if not completed within this many seconds. Default is 86,400 (24 hours).
    4. Resource Size: select the CPU/RAM size for the run. Larger resources have higher billing factors - see the Resource Size Selection Guide for more information.
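
When arguments are entered manually, the Command Arguments field takes the familiar --name value command-line format. As a sketch of building such a string safely in Python, shlex.quote handles values containing spaces (the exact quoting rules DataStar applies to this field are an assumption here):

```python
import shlex

# Hypothetical values for the three arguments of the example script.
values = {"project": "My Project", "original_table": "customers", "column_name": "customername"}
cmd = " ".join(f"--{key} {shlex.quote(value)}" for key, value in values.items())
print(cmd)  # --project 'My Project' --original_table customers --column_name customername
```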

Notes

The Notes section is the final configuration area. It is especially valuable for complex tasks or collaborative projects, enabling users to quickly understand what the task does. Formatting options are available above the text box:

Running the Python Task

Before running the task, we examine the customers table in the sandbox:

  1. Click the customers table in the sandbox to open it.
  2. The customername column uses a CZ prefix and no leading zeros for the numbers, so values sort alphabetically by string rather than by customer number.
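
The sorting problem is easy to reproduce: with the CZ format, string sort and customer-number order disagree, while the padded Cust_ format keeps them aligned:

```python
original = ["CZ1", "CZ10", "CZ100", "CZ2"]
print(sorted(original))  # ['CZ1', 'CZ10', 'CZ100', 'CZ2'] - CZ2 sorts after CZ100

renamed = ["Cust_" + name.removeprefix("CZ").zfill(4) for name in original]
print(sorted(renamed))   # ['Cust_0001', 'Cust_0002', 'Cust_0010', 'Cust_0100']
```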

Now run the "Change customer names" task (hover over it on the macro canvas and click the play button) and examine the results:

  1. A new table named new_customers has been created in the project sandbox. Click it to open.
  2. Customer names now follow the Cust_XXXX format. Sorting by name now orders them by customer number.

The Run Manager logs are useful for monitoring progress and troubleshooting:

  1. Click the Run Manager icon in the left-hand application bar (scroll down or click the three-dot icon if not visible).
  2. Find the Python task run in the job list - its Job Name matches the Python file name. Click it to see available logs on the right.
  3. Click the report icon to open the Job Log.
  4. Print statements in the script appear here. Use them throughout your script to track progress of long-running tasks.
  5. If a job fails, click the hazard icon to open the Error Log which may contain helpful diagnostic information.
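
The progress-print pattern is simple: emit a line before each major step, and pass flush=True so the line reaches the log immediately instead of waiting for the output buffer to fill. A minimal sketch with hypothetical step names:

```python
steps = ["reading table", "transforming column", "writing new table"]
messages = [f"Step {i}/{len(steps)}: {step}" for i, step in enumerate(steps, start=1)]
for message in messages:
    print(message, flush=True)  # flush so progress appears in the Job Log right away
    # ... perform the actual work for this step here ...
```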

Create File

Instead of using an existing file, users can click the Create File button in the Code File area (see bullet 10 underneath the first screenshot of the Code File section) to create a new script directly on the Optilogic platform:

  1. A new file named datastar-<timestamp>.py is created in My Files/DataStar in your Optilogic account.
  2. Clicking the file opens it in the Lightning Editor application.
  3. The file initially contains only print("Hello Datastar!"). Edit it to build your own script; for example, the script covered above, copied from the Appendix, can serve as a starting point.

Templates

Templates are pre-built scripts available to all users from the Resource Library, covering common import/export patterns. To browse them, open the Resource Library application and filter for DataStar resources (button at the top right) and the Script tag. Clicking a resource shows its available documentation; the resource can be copied to your Optilogic account or downloaded.

  1. Click on the Template tab in the Code File section.
  2. Browse and select a template from the list. The same search and grid customization features apply here as in the File tab (see the Code File section above).
  3. Click on the Use Template button to proceed:
    1. A default file name based on the script name and current timestamp is used for the copy saved to your account.
    2. The file is placed in /My Files/DataStar.
    3. Click Replace File to choose a different script or Edit File to open the current file in the Lightning Editor.
  4. If the selected Template does not use argparse, as is the case here, auto-detection of arguments is unavailable. Arguments (if any) must be entered manually in the Command Arguments field of the Run Configuration area. For this script, user-configurable values are set directly in the script's "User Configuration" section, and nothing needs to be entered in the Command Arguments field:
    1. Open the Explorer application and locate the script under My Files/DataStar. Click it to open it in the Lightning Editor.
    2. The imports section shows which libraries the script uses. All of these are included when running a Python file from your Optilogic account, so no requirements.txt is needed in this case.
    3. User-configurable values are set directly in the "User Configuration" section of the script. Comments (green text) describe the expected inputs, and more information may be available in the script's documentation on the Resource Library.

Requirements.txt File

The Python base image used by the Run Python task contains the Python libraries most commonly used with DataStar. If your script uses a library that is not included in this base image, create a requirements.txt file and place it in the same location as your Python script. In this file, list the library names (without quotes), one library name per line.
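
For example, if your script imported the requests and openpyxl libraries and neither were part of the base image (assumed here purely for illustration), requirements.txt would contain:

```text
requests
openpyxl
```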

If your script uses a library that is not part of the base image, the Job Error Log of the Python task run will contain the following error: "ModuleNotFoundError: No module named '<module name>'".

Tips & Tricks

Keep the following in mind when developing scripts for Run Python tasks:

  • Consider whether to hard-code values or pass them as arguments. Hard-coded values make a script simpler but less reusable. Using arguments in our example script allows it to work across projects, tables, and columns. Think for example of needing to update the customer names in all raw data, including a "shipments" table where the "destination" field contains customer names.
  • Use print statements throughout your script for troubleshooting. They appear in the Job Log and, together with the Job Error Log, help pinpoint where a script fails or stalls.
  • Build and test incrementally - run the script after each sub-task is complete to catch issues early.
  • Test scripts on copies of your data to protect the originals.
  • When using uncommon Python libraries in your script, place a correctly populated requirements.txt in the same folder as the script. If this is not done, the Job Error Log will contain the following error: "ModuleNotFoundError: No module named '<module name>'".
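
One subtlety behind testing on copies: in pandas, binding a DataFrame to a new name does not copy it, so transformations through the new name mutate the original unless .copy() is used. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"customername": ["CZ1", "CZ10"]})
df_view = df         # same object - changes through df_view would alter df
df_copy = df.copy()  # independent copy - safe to transform

df_copy["customername"] = "Cust_" + df_copy["customername"].str.removeprefix("CZ").str.zfill(4)
print(df["customername"].tolist())       # ['CZ1', 'CZ10'] - original untouched
print(df_copy["customername"].tolist())  # ['Cust_0001', 'Cust_0010']
```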

Helpful Resources

As always, our Support team is happy to help with any questions; they can be reached at support@optilogic.com.

Appendix - Script Text

Copy the script below into a .py file in your Optilogic account (via the Lightning Editor) and modify it as needed:

import argparse
from datastar import *

def main():
    parser = argparse.ArgumentParser(description="Script to update names of customers using format of CZ1, CZ10, etc. "
                                                 "to Cust_0001, Cust_0010, etc.; copies the entire original table into a "
                                                 "new table (new_<original table name>) with just the customer names updated.")
    parser.add_argument("--project", required=True, help="Name of the DataStar project")
    parser.add_argument("--original_table", required=True, help="Original table with the customer name field to be updated")
    parser.add_argument("--column_name", required=True, help="Name of the field containing the name of the customer")
    
    args = parser.parse_args()

    project = Project.connect_to(args.project)
    sandbox = project.get_sandbox()
    df_table = sandbox.read_table(args.original_table)
    df_new_table = df_table.copy()
    df_new_table[args.column_name] = df_new_table[args.column_name].str.removeprefix('CZ')
    df_new_table[args.column_name] = df_new_table[args.column_name].str.zfill(4)
    df_new_table[args.column_name] = 'Cust_' + df_new_table[args.column_name]
    sandbox.write_table(df_new_table, "new_" + args.original_table)
    
    print(f"new table new_{args.original_table} created with updated customer names")

if __name__ == "__main__":
    main()

