Description
DatasetAnalyser incorrectly removes all columns as null even when DataFrame has valid data.
When calling model.build() with a valid pandas DataFrame, Plexe’s internal DatasetAnalyser incorrectly concludes that all dataset columns are completely null and removes them using drop_null_columns.
As a result, dataset_0 ends up with zero columns, even though the input DataFrame contains valid non-null string data.
This occurs consistently, regardless of whether the dataset is loaded from CSV, JSON, TXT, or even constructed manually in code.
This makes it impossible to train a model using any tabular data.
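For comparison, here is a minimal pandas-only sketch of what dropping completely null columns would be expected to do. It is not Plexe's implementation of drop_null_columns, only an assumption about the intended behaviour; the `empty` column is a hypothetical addition for contrast, and the other two columns match the example in this report.

```python
import pandas as pd

# Two string columns as in this report, plus one genuinely empty column.
df = pd.DataFrame({
    "code": ["A", "B", "C"],
    "description": ["one", "two", "three"],
    "empty": [None, None, None],  # hypothetical all-null column, for contrast
})

# Drop only the columns in which every value is null.
cleaned = df.dropna(axis=1, how="all")

print(list(cleaned.columns))  # ['code', 'description']
```

Only the all-null column is removed; the two populated string columns survive, which is the outcome expected for dataset_0.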
- Python version: 3.12
- Plexe version: 0.26.2
A minimal example that reproduces the problem:
import pandas as pd
import plexe

model = plexe.Model(
    intent="Map description to code.",
    input_schema={"description": str},
    output_schema={"code": str},
)

df = pd.DataFrame({
    "code": ["A", "B", "C"],
    "description": ["one", "two", "three"],
})

model.build(
    datasets=[df],
    provider="openai/gpt-4o-mini",
    max_iterations=1,
    timeout=300,
    run_timeout=120,
    verbose=True,
)
Plexe’s internal agent logs show:
Orchestrator · Step 2 · 09:57:38
💡 Executing Initial Steps for ML
💭 I began by executing the first two steps of the plan. I called the get_latest_datasets() function and retrieved dataset_0, which is the latest version of our dataset. I also selected the target metric for the task, which was determined to be Mean Absolute Error (MAE), but it returned an infinite value, indicating a potential issue.
DatasetAnalyser · Step 3 · 09:58:47
💡 Retrieving and Cleaning Datasets
💭 I needed to retrieve the latest datasets to ensure I was working with the most accurate data. I called the get_latest_datasets() function to access the available datasets, and confirmed that dataset_0 was retrieved. My next steps will involve cleaning the dataset by removing null columns and analyzing its characteristics using the provided template.
DatasetAnalyser · Step 4 · 09:59:12
💡 Cleaning Null Columns in Dataset
💭 I identified that dataset_0 was available as a raw dataset and needed cleaning by dropping completely null columns. I executed the function to drop the null columns from dataset_0. As a result, I successfully dropped 2 null columns from the dataset.
DatasetAnalyser · Step 5 · 09:59:59
💡 Retrieving Dataset Schema
💭 I dropped the null columns from dataset_0 and accessed the dataset using the specified pattern. I converted the dataset to a pandas DataFrame and retrieved its schema to understand the structure and types of columns. However, I observed that the DataFrame was empty, indicating that no data was available after the drop operation.
DatasetAnalyser · Step 6 · 10:00:48
💡 Registering EDA Report for Empty Dataset
💭 I realized that after dropping null columns, dataset_0 was empty, containing no data for analysis. I summarized my findings in an EDA report, highlighting the absence of features and relationships due to the dataset's emptiness. I recommended investigating the source of the dataset for potential data handling errors.
These or very similar logs appear on every run. I am confident the input files are not the cause: I print the DataFrame before creating the model, and it always shows the correct data.
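Printing the DataFrame can be complemented by an explicit null summary taken just before model.build(). The following is a sketch using only pandas; `summarize_nulls` is a hypothetical helper name, not part of Plexe.

```python
import pandas as pd


def summarize_nulls(df: pd.DataFrame) -> dict:
    """Return the shape, per-column null counts, and any all-null columns."""
    return {
        "shape": df.shape,
        "null_counts": df.isna().sum().to_dict(),
        "all_null_columns": [c for c in df.columns if df[c].isna().all()],
    }


# Same DataFrame as in the reproduction.
df = pd.DataFrame({
    "code": ["A", "B", "C"],
    "description": ["one", "two", "three"],
})

summary = summarize_nulls(df)
print(summary)
```

For this DataFrame the summary reports a shape of (3, 2), zero nulls in each column, and no all-null columns, so nothing in the input justifies dropping either column.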