Dataset columns incorrectly detected as null and removed during model.build() #153

@marccuello2025

Description

DatasetAnalyser incorrectly removes all columns as null even when DataFrame has valid data.

When calling model.build() with a valid pandas DataFrame, Plexe’s internal DatasetAnalyser incorrectly concludes that all dataset columns are completely null and removes them using drop_null_columns.
As a result, dataset_0 ends up with zero columns, even though the input DataFrame contains valid non-null string data.

This occurs consistently, regardless of whether the dataset is loaded from CSV, JSON, TXT, or even constructed manually in code.

This makes it impossible to train a model using any tabular data.

- Python version: 3.12
- Plexe version: 0.26.2

A minimal example like this one reproduces the problem:

import pandas as pd
import plexe

model = plexe.Model(
    intent="Map description to code.",
    input_schema={"description": str},
    output_schema={"code": str},
)

df = pd.DataFrame({
    "code": ["A", "B", "C"],
    "description": ["one", "two", "three"],
})

model.build(
    datasets=[df],
    provider="openai/gpt-4o-mini",
    max_iterations=1,
    timeout=300,
    run_timeout=120,
    verbose=True,
)

Plexe’s internal agent logs show:

Orchestrator · Step 2 · 09:57:38
💡 Executing Initial Steps for ML
💭 I began by executing the first two steps of the plan. I called the get_latest_datasets() function and retrieved dataset_0, which is the latest version of our dataset. I also selected the target metric for the task, which was determined to be Mean Absolute Error (MAE), but it returned an infinite value, indicating a potential issue.

DatasetAnalyser · Step 3 · 09:58:47
💡 Retrieving and Cleaning Datasets
💭 I needed to retrieve the latest datasets to ensure I was working with the most accurate data. I called the get_latest_datasets() function to access the available datasets, and confirmed that dataset_0 was retrieved. My next steps will involve cleaning the dataset by removing null columns and analyzing its characteristics using the provided template.

DatasetAnalyser · Step 4 · 09:59:12
💡 Cleaning Null Columns in Dataset
💭 I identified that dataset_0 was available as a raw dataset and needed cleaning by dropping completely null columns. I executed the function to drop the null columns from dataset_0. As a result, I successfully dropped 2 null columns from the dataset.

DatasetAnalyser · Step 5 · 09:59:59
💡 Retrieving Dataset Schema
💭 I dropped the null columns from dataset_0 and accessed the dataset using the specified pattern. I converted the dataset to a pandas DataFrame and retrieved its schema to understand the structure and types of columns. However, I observed that the DataFrame was empty, indicating that no data was available after the drop operation.

DatasetAnalyser · Step 6 · 10:00:48
💡 Registering EDA Report for Empty Dataset
💭 I realized that after dropping null columns, dataset_0 was empty, containing no data for analysis. I summarized my findings in an EDA report, highlighting the absence of features and relationships due to the dataset's emptiness. I recommended investigating the source of the dataset for potential data handling errors.

The same or very similar logs appear on every run. I am confident the input files are not the source of the problem: I routinely print the DataFrame before creating the model, and it always displays the correct data.
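
A plain pandas check along these lines can confirm, independently of Plexe, that the example DataFrame has no completely null columns. It is only a sketch: it does not reproduce Plexe's internal drop_null_columns, whose implementation is not shown in this report, and the variable names `all_null_columns` and `cleaned` are illustrative.

```python
import pandas as pd

# Same DataFrame as in the reproduction example above
df = pd.DataFrame({
    "code": ["A", "B", "C"],
    "description": ["one", "two", "three"],
})

# Columns in which every single value is null (NaN/None)
all_null_columns = [col for col in df.columns if df[col].isna().all()]

# What pandas itself removes when asked to drop completely null columns
cleaned = df.dropna(axis=1, how="all")

print("Completely null columns:", all_null_columns)
print("Columns after dropna:", list(cleaned.columns))
print("Shape after dropna:", cleaned.shape)
```

If pandas reports no completely null columns here while the DatasetAnalyser still drops both, the data is most likely being lost somewhere between the DataFrame passed to model.build() and the point where dataset_0 is cleaned.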
