77 changes: 77 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,77 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

LLM Data Cleaner is a Python package that automates the transformation of messy text columns into structured data using various language models via the LLM API. It eliminates complex regex patterns and manual parsing by leveraging LLMs with schema validation.

## Development Commands

This project uses Poetry for dependency management:

```bash
# Setup and dependencies
poetry install # Install all dependencies
poetry install --only=main # Install only production dependencies

# Testing
poetry run pytest # Run all tests
poetry run pytest tests/test_cleaner.py # Run specific test file
poetry run pytest -v # Verbose test output
poetry run pytest -k "test_name" # Run specific test

# Code quality
poetry run black . # Format code
poetry run isort . # Sort imports
poetry run flake8 # Lint code

# Package building
poetry build # Build package
poetry version patch|minor|major # Bump version
```

## Core Architecture

The package has a clean modular structure:

- **`DataCleaner`** (`cleaner.py`): Main class handling LLM API interactions, batch processing, and data transformation
- **`utils.py`**: Contains `jsonize()` utility and type definitions (`InstructionField`, `InstructionSchema`)
- **YAML instruction loading**: Dynamic Pydantic model creation from YAML configuration files (see the sketch below)
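
The dynamic model creation pattern can be sketched as follows (the field names and spec below are hypothetical, not the package's actual YAML format):

```python
from typing import Optional

from pydantic import create_model

# Hypothetical field spec, e.g. derived from a YAML instruction file
field_spec = {"city": (str, ...), "postcode": (Optional[str], None)}

# Build a Pydantic model at runtime; "index" ties each result back to its source row
CleanedAddress = create_model("CleanedAddress", index=(int, ...), **field_spec)

row = CleanedAddress(index=0, city="Budapest", postcode="1138")
print(row.model_dump())  # {'index': 0, 'city': 'Budapest', 'postcode': '1138'}
```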

### Key Technical Patterns

- Uses the LLM API's structured output feature with Pydantic schemas for reliable JSON responses (sketched after this list)
- Implements batch processing with configurable batch sizes to respect API rate limits
- Dynamic Pydantic model creation allows flexible schema definitions
- Retry logic with a configurable delay between attempts for API resilience
- Model-agnostic design - works with any model supported by the LLM API
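
A minimal sketch of the structured-output call, assuming the `llm` Python library and a schema-capable model (the example schema is illustrative):

```python
import json

import llm
from pydantic import BaseModel


class Cleaned(BaseModel):
    city: str
    country: str


model = llm.get_model("gpt-4o-mini")
# Passing a Pydantic class as the schema constrains the model to return matching JSON
response = model.prompt("Extract city and country from: 'Budapest Váci út 1'", schema=Cleaned)
cleaned = Cleaned(**json.loads(response.text()))
```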

### Data Flow

1. Load DataFrame and cleaning instructions (programmatic or YAML)
2. Process columns in batches via LLM API with structured outputs
3. Validate responses against Pydantic schemas
4. Return an enhanced DataFrame with `cleaned_*` prefixed columns alongside the originals (illustrated below)
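
For instance, cleaning a single `address` column might yield a frame like this (values are illustrative):

```python
import pandas as pd

# Hypothetical result: the original column is preserved, cleaned_* columns are added
result = pd.DataFrame({
    "address": ["Budapest Váci út 1"],
    "cleaned_city": ["Budapest"],
    "cleaned_country": ["Hungary"],
})
print(result.columns.tolist())  # ['address', 'cleaned_city', 'cleaned_country']
```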

## Configuration Requirements

- **Python 3.9+** required
- **LLM API** must be installed and configured (`pip install llm`)
- **Model setup** via LLM plugins (e.g., `llm install llm-openai` for OpenAI models)
- **API keys** managed by LLM API (e.g., `llm keys set openai`)
- Default model: `gpt-4o-mini` (configurable to any LLM-supported model); a typical setup sequence is sketched below
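
A typical one-time setup, combining the items above (`llm models default` is assumed here as the way to set the fallback model):

```bash
pip install llm                  # install the LLM CLI/library
llm install llm-openai           # add the OpenAI plugin
llm keys set openai              # store the API key
llm models default gpt-4o-mini   # optional: set the default model
```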

## Important Files

- `pyproject.toml`: Poetry configuration and package metadata
- `instructions.yaml`: Example YAML configuration for cleaning rules
- `tests/test_data/sample.csv`: Test data fixture

## Version Management

The project uses Poetry's version management. Current version is 0.5.0 after migrating to LLM API. Maintain consistency between `pyproject.toml` and `CITATION.cff` when updating versions.

## Testing Strategy

Tests use mocked LLM API calls to avoid real API usage. The mocks patch `llm.get_model` and simulate model responses. Run tests before any API-related changes to ensure mocking remains effective. Test data fixtures are in `tests/test_data/`.
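
A minimal sketch of that mocking pattern, assuming `unittest.mock` (the response payload is illustrative, not the repo's actual fixture):

```python
from unittest.mock import MagicMock, patch

from llm_data_cleaner import DataCleaner


def test_cleaner_uses_mocked_model():
    fake_response = MagicMock()
    fake_response.text.return_value = '{"items": []}'  # shape is illustrative

    fake_model = MagicMock()
    fake_model.prompt.return_value = fake_response

    with patch("llm.get_model", return_value=fake_model):
        cleaner = DataCleaner(model="gpt-4o-mini")
        assert cleaner.model is fake_model  # no real API call is made
```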
27 changes: 22 additions & 5 deletions README.md
@@ -1,6 +1,6 @@
# LLM Data Cleaner

LLM Data Cleaner automates the transformation of messy text columns into well structured data using OpenAI models. It eliminates the need for complex regular expressions or manual parsing while ensuring the output conforms to a schema.
LLM Data Cleaner automates the transformation of messy text columns into well-structured data using language models via the LLM API. It eliminates the need for complex regular expressions or manual parsing while ensuring the output conforms to a schema.

## Why use it?

@@ -10,23 +10,40 @@ LLM Data Cleaner automates the transformation of messy text columns into well st

## Installation

Requires **Python 3.9+**.
Requires **Python 3.9+** and the LLM API.

```bash
pip install git+https://github.com/codedthinking/llm_data_cleaner.git
pip install llm
```

Or with Poetry:

```bash
poetry add git+https://github.com/codedthinking/llm_data_cleaner.git
poetry add llm
```

## Setup

The LLM API requires model configuration. For OpenAI models:

```bash
# Install the OpenAI plugin for LLM
llm install llm-openai

# Set your API key
llm keys set openai
# Enter your OpenAI API key when prompted
```

For other providers, see the [LLM documentation](https://llm.datasette.io/en/stable/other-models.html).
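
For example, a Gemini setup might look like this (the plugin and key names should be verified against the LLM docs):

```bash
llm install llm-gemini   # plugin for Google's Gemini models
llm keys set gemini      # store the Gemini API key
```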

## Step by step

1. Create Pydantic models describing the cleaned values.
2. Define a dictionary of instructions mapping column names to a prompt and schema.
3. Instantiate `DataCleaner` with your OpenAI API key.
3. Instantiate `DataCleaner` with your preferred model (no API key argument needed; keys are managed by the LLM tool).
4. Load your raw CSV file with `pandas`.
5. Call `clean_dataframe(df, instructions)`.
6. Inspect the returned DataFrame which contains new `cleaned_*` columns.
@@ -60,7 +77,7 @@ instructions = {
},
}

cleaner = DataCleaner(api_key="YOUR_OPENAI_API_KEY")
cleaner = DataCleaner(model="gpt-4o-mini") # or any LLM-supported model
raw_df = pd.DataFrame({
"address": ["Budapest Váci út 1", "1200 Vienna Mariahilfer Straße 10"],
"profession": ["dev", "data eng"]
@@ -76,7 +93,7 @@ from llm_data_cleaner import DataCleaner, load_yaml_instructions
import pandas as pd

instructions = load_yaml_instructions("instructions.yaml")
cleaner = DataCleaner(api_key="YOUR_OPENAI_API_KEY", system_prompt="{column_prompt}")
cleaner = DataCleaner(model="gpt-4o-mini", system_prompt="{column_prompt}")
raw_df = pd.read_csv("data.csv")
result = cleaner.clean_dataframe(raw_df, instructions)
```
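
Continuing from either example above, the new columns can be inspected or exported afterwards:

```python
# Show only the newly added columns, then save the full result
print(result.filter(regex=r"^cleaned_"))
result.to_csv("cleaned_data.csv", index=False)
```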
36 changes: 22 additions & 14 deletions llm_data_cleaner/cleaner.py
@@ -1,7 +1,8 @@
import os
import pandas as pd
from typing import Dict, Any, Type, List, Optional
from openai import OpenAI
import llm
import json
from pydantic import BaseModel, create_model, ConfigDict
from llm_data_cleaner.utils import InstructionField, InstructionSchema
import time
@@ -11,21 +12,20 @@

class DataCleaner:
"""
Batch DataCleaner that uses OpenAI's responses.parse method with auto-generated prompts.
Batch DataCleaner that uses LLM API with structured output for data cleaning.
"""

def __init__(
self,
api_key: str,
model: str = "gpt-4o-2024-08-06",
model: str = "gpt-4o-mini",
max_retries: int = 3,
retry_delay: int = 5,
batch_size: int = 10,
system_prompt: str = None,
temperature: float = 0.0,
):
self.client = OpenAI(api_key=api_key)
self.model = model
self.model_name = model
self.model = llm.get_model(model)
self.max_retries = max_retries
self.retry_delay = retry_delay
self.batch_size = batch_size
@@ -118,7 +118,7 @@ def _process_batch(
if item is None:
continue
index = item.index
for fname in item.model_fields:
for fname in item.__class__.model_fields:
if fname != "index":
colname = f"cleaned_{fname}"
if colname not in result_batch.columns:
@@ -138,13 +138,21 @@ def _clean_batch(
):
for attempt in range(self.max_retries):
try:
resp = self.client.responses.parse(
model=self.model,
input=messages,
text_format=pyd_model_batch,
temperature=self.temperature,
)
return resp.output_parsed
# Combine system and user messages into a single prompt
system_content = messages[0]["content"] if messages[0]["role"] == "system" else ""
user_content = messages[1]["content"] if len(messages) > 1 and messages[1]["role"] == "user" else ""

full_prompt = f"{system_content}\n\nData to process: {user_content}"

# Use LLM API with Pydantic schema for structured output
response = self.model.prompt(full_prompt, schema=pyd_model_batch, temperature=self.temperature)

# Parse the JSON response
parsed_data = json.loads(response.text())

# Convert to the expected Pydantic model instance
return pyd_model_batch(**parsed_data)

except Exception as e:
print(f"Batch cleaning error: {e} (attempt {attempt+1}/{self.max_retries})")
time.sleep(self.retry_delay)