Dataset columns incorrectly detected as null and removed during model.build()

DatasetAnalyser removes every column as null, even when the DataFrame contains valid data.

When calling model.build() with a valid pandas DataFrame, Plexe’s internal DatasetAnalyser incorrectly concludes that all dataset columns are completely null and removes them using drop_null_columns.
As a result, dataset_0 ends up with zero columns, even though the input DataFrame contains valid non-null string data.
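
For reference, this is the behaviour I would expect from a step that drops completely null columns. The sketch below uses plain pandas (`DataFrame.dropna` with `axis=1, how="all"`). It is only my assumption about what drop_null_columns is meant to do, not Plexe's actual implementation, and the `empty` column is a hypothetical addition for contrast.

```python
import pandas as pd

df = pd.DataFrame({
    "code": ["A", "B", "C"],
    "description": ["one", "two", "three"],
    "empty": [None, None, None],  # hypothetical all-null column, added for contrast
})

# Drop only the columns in which every single value is null.
cleaned = df.dropna(axis=1, how="all")

print(list(cleaned.columns))
```

Only `empty` should be removed here; `code` and `description` hold non-null strings and should survive.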

This occurs consistently, regardless of whether the dataset is loaded from CSV, JSON, TXT, or even constructed manually in code.
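
A rough way to confirm that the data source does not matter is to run the same all-null check on a DataFrame built in code and on one parsed from CSV text. The helper name `all_null_columns` is hypothetical and the CSV content is inlined for illustration; neither comes from Plexe.

```python
import io
import pandas as pd

def all_null_columns(frame: pd.DataFrame) -> list[str]:
    """Return the names of columns in which every value is null."""
    mask = frame.isna().all(axis=0)
    return [str(name) for name, is_all_null in mask.items() if is_all_null]

# Same data, two sources: built in code and parsed from CSV text.
manual = pd.DataFrame({
    "code": ["A", "B", "C"],
    "description": ["one", "two", "three"],
})
from_csv = pd.read_csv(io.StringIO("code,description\nA,one\nB,two\nC,three\n"))

for label, frame in {"manual": manual, "csv": from_csv}.items():
    print(label, all_null_columns(frame))
```

Both lists should come back empty, which means neither source produces a column that deserves to be dropped.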

This makes it impossible to train a model using any tabular data.

- Python version: 3.12
- Plexe version: 0.26.2


A minimal example that reproduces the problem:

```python
import pandas as pd
import plexe

model = plexe.Model(
    intent="Map description to code.",
    input_schema={"description": str},
    output_schema={"code": str},
)

df = pd.DataFrame({
    "code": ["A", "B", "C"],
    "description": ["one", "two", "three"],
})

model.build(
    datasets=[df],
    provider="openai/gpt-4o-mini",
    max_iterations=1,
    timeout=300,
    run_timeout=120,
    verbose=True,
)
```

Plexe’s internal agent logs show:

Orchestrator · Step 2 · 09:57:38
💡 Executing Initial Steps for ML
💭 I began by executing the first two steps of the plan. I called the `get_latest_datasets()` function and retrieved `dataset_0`, which is the latest version of our dataset. I also selected the target metric for the task, which was determined to be Mean Absolute Error (MAE), but it returned an infinite value, indicating a potential issue.

DatasetAnalyser · Step 3 · 09:58:47
💡 Retrieving and Cleaning Datasets
💭 I needed to retrieve the latest datasets to ensure I was working with the most accurate data. I called the `get_latest_datasets()` function to access the available datasets, and confirmed that `dataset_0` was retrieved. My next steps will involve cleaning the dataset by removing null columns and analyzing its characteristics using the provided template.

DatasetAnalyser · Step 4 · 09:59:12
💡 Cleaning Null Columns in Dataset
💭 I identified that `dataset_0` was available as a raw dataset and needed cleaning by dropping completely null columns. I executed the function to drop the null columns from `dataset_0`. As a result, I successfully dropped 2 null columns from the dataset.

DatasetAnalyser · Step 5 · 09:59:59
💡 Retrieving Dataset Schema
💭 I dropped the null columns from `dataset_0` and accessed the dataset using the specified pattern. I converted the dataset to a pandas DataFrame and retrieved its schema to understand the structure and types of columns. However, I observed that the DataFrame was empty, indicating that no data was available after the drop operation.

DatasetAnalyser · Step 6 · 10:00:48
💡 Registering EDA Report for Empty Dataset
💭 I realized that after dropping null columns, `dataset_0` was empty, containing no data for analysis. I summarized my findings in an EDA report, highlighting the absence of features and relationships due to the dataset's emptiness. I recommended investigating the source of the dataset for potential data handling errors.


The same or very similar logs appear on every run. I am confident the input files are not the cause: I print the DataFrame before creating the model, and it always shows the correct data.
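
Beyond printing the DataFrame, a per-column null summary taken right before model.build() makes the state of the input explicit. This is a minimal sketch in plain pandas; `describe_nulls` is a hypothetical helper of mine, not part of Plexe.

```python
import pandas as pd

def describe_nulls(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, non-null count and null count for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "null": frame.isna().sum(),
    })

df = pd.DataFrame({
    "code": ["A", "B", "C"],
    "description": ["one", "two", "three"],
})

report = describe_nulls(df)
print(report)

# Fail early if any column really is completely null before handing it to the model.
assert (report["non_null"] > 0).all(), "input contains an all-null column"
```

For the DataFrame in this report, every column has three non-null values, so nothing in the input should qualify for removal.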


Dataset columns incorrectly detected as null and removed during model.build() #153
