53 changes: 44 additions & 9 deletions CLAUDE.md
@@ -2,10 +2,11 @@

## Project Overview

**LLM Data Cleaner** is a Python package that automates the transformation of messy text columns into well-structured data using LLM APIs. It supports **OpenAI, OpenRouter, Anthropic, and 100+ other providers** through LiteLLM. The package eliminates the need for complex regular expressions or manual parsing while ensuring output conforms to a schema.

### Key Features
- **Provider-agnostic** - Works with OpenAI, OpenRouter, Anthropic, and 100+ providers via LiteLLM
- **Automated data cleaning** using language models
- **Schema validation** with Pydantic models
- **Batch processing** to respect API rate limits
- **YAML-based configuration** for reusable cleaning instructions
@@ -37,24 +38,27 @@
The main class that orchestrates data cleaning operations.

**Key Responsibilities:**
- Batch processing of DataFrame columns
- Communication with LLM APIs using LiteLLM for provider-agnostic access
- Structured output parsing with JSON mode and Pydantic validation
- Retry logic for failed API calls
- Progress tracking with tqdm

**Important Methods:**
- `clean_dataframe(df, instructions)` - Main entry point for cleaning data
- `_process_batch()` - Processes a single batch of rows
- `_clean_batch()` - Makes API calls with retry logic using LiteLLM
- `_make_batch_model()` - Creates Pydantic models for batch responses (see the sketch after this list)
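
A hypothetical sketch of how `_make_batch_model()` might build such a model with `pydantic.create_model`; the field names and the `items` wrapper are assumptions, not the package's confirmed implementation:

```python
# Hypothetical sketch (not the package's actual code): wrap a per-row
# schema in a list-valued container so one JSON response can be
# validated for the whole batch with Pydantic.
from typing import List
from pydantic import BaseModel, create_model

class RowSchema(BaseModel):
    index: int  # lets responses be matched back to DataFrame rows
    city: str
    country: str

# Dynamically build the batch container model
BatchModel = create_model("BatchModel", items=(List[RowSchema], ...))

# A raw JSON string from the API could then be validated in one step:
# batch = BatchModel.model_validate_json(response_text)
```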

**Configuration Parameters:**
- `api_key`: API key for the LLM provider (optional if set via environment variables)
- `model`: Model name (default: "gpt-4o-2024-08-06"). Use provider prefixes for non-OpenAI models (e.g., "openrouter/anthropic/claude-3-opus")
- `batch_size`: Number of rows per API call (default: 10)
- `max_retries`: Retry attempts for failed calls (default: 3)
- `retry_delay`: Seconds between retries (default: 5)
- `temperature`: Model temperature (default: 0.0)
- `system_prompt`: Custom system prompt template (optional)
- `api_base`: Base URL for the API (e.g., "https://openrouter.ai/api/v1" for OpenRouter)
- `**litellm_kwargs`: Additional arguments to pass to `litellm.completion()` (see the construction sketch after this list)
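
A minimal construction sketch using the parameters above; the model name, endpoint, and the `timeout` pass-through are illustrative values, not requirements:

```python
from llm_data_cleaner import DataCleaner

# Illustrative values only; any LiteLLM-supported provider works
cleaner = DataCleaner(
    api_key="sk-or-...",                         # optional if set via env vars
    model="openrouter/anthropic/claude-3-opus",  # provider-prefixed model name
    api_base="https://openrouter.ai/api/v1",
    batch_size=10,
    max_retries=3,
    retry_delay=5,
    temperature=0.0,
    timeout=60,  # example **litellm_kwargs entry, forwarded to litellm.completion()
)
```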

### 2. Utilities (`llm_data_cleaner/utils.py`)

Expand Down Expand Up @@ -91,7 +95,7 @@ column_name:
## Dependencies

### Production
- `litellm ^1.0.0` - Unified API client for 100+ LLM providers (OpenAI, OpenRouter, Anthropic, etc.)
- `pydantic ^2.0.0` - Data validation and schema definition
- `pandas ^2.2.3` - DataFrame operations
- `pyyaml ^6.0.2` - YAML parsing
Expand Down Expand Up @@ -164,13 +168,23 @@ Test files are located in `tests/`:

### Common Issues

1. **API Key Not Set**: Set the API key via environment variables (`OPENAI_API_KEY`, `OPENROUTER_API_KEY`, `ANTHROPIC_API_KEY`) or pass it directly to the constructor
2. **Rate Limits**: Adjust `batch_size` to control API call frequency
3. **Schema Validation Errors**: Ensure Pydantic models include an `index: int` field (see the sketch after this list)
4. **Missing Columns**: DataCleaner skips columns that are not present in the DataFrame and emits a warning
5. **Provider-specific Configuration**: Use `api_base` parameter for custom endpoints (e.g., OpenRouter)
6. **Model Naming**: Use provider prefixes for non-OpenAI models (e.g., "openrouter/model-name", "anthropic/model-name")
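
For issue 3, a minimal schema sketch; only the `index: int` field is required by the cleaner, the remaining fields are illustrative:

```python
from pydantic import BaseModel

class CleanedAddress(BaseModel):
    index: int   # required so the cleaner can align responses with DataFrame rows
    city: str
    country: str
```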

## API Changes

### Version 0.5.0
- **BREAKING**: Migrated from OpenAI library to LiteLLM for provider-agnostic support (see the before/after sketch below)
- Now supports OpenAI, OpenRouter, Anthropic, and 100+ other providers
- Added `api_base` parameter for custom API endpoints
- Added `**litellm_kwargs` for provider-specific configuration
- API key is now optional in constructor (can use environment variables)
- Model parameter now supports provider prefixes (e.g., "openrouter/", "anthropic/")
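
A before/after sketch of this migration; model names and the OpenRouter endpoint are examples only:

```python
# v0.4.x: OpenAI only, api_key required
cleaner = DataCleaner(api_key="sk-...", model="gpt-4o-2024-08-06")

# v0.5.0: any LiteLLM provider; api_key optional when set in the environment
cleaner = DataCleaner(
    model="openrouter/anthropic/claude-3-opus",
    api_base="https://openrouter.ai/api/v1",
)
```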

### Version 0.4.x
- Migrated from deprecated `client.responses.parse()` to supported OpenAI methods
- Added `jsonize()` utility for consistent data serialization
@@ -184,11 +198,32 @@

1. **Define schemas** (Pydantic or YAML)
2. **Create instructions** dictionary mapping columns to prompts and schemas
3. **Initialize DataCleaner** with API key, model name, and configuration
4. **Load data** into pandas DataFrame
5. **Call `clean_dataframe()`** to process
6. **Access results** in `cleaned_*` columns (see the end-to-end sketch below)
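
A hedged end-to-end sketch of these six steps; the column name, prompt wording, and the `prompt`/`schema` instruction keys follow the README examples and are illustrative:

```python
import pandas as pd
from pydantic import BaseModel
from llm_data_cleaner import DataCleaner

class AddressItem(BaseModel):                  # step 1: define a schema
    index: int
    city: str
    country: str

instructions = {                               # step 2: map column -> prompt + schema
    "address": {
        "prompt": "Extract the city and country from this address.",
        "schema": AddressItem,
    },
}

cleaner = DataCleaner(model="gpt-4o-2024-08-06")        # step 3: key read from env
df = pd.DataFrame({"address": ["Budapest Váci út 1"]})  # step 4: load data
result = cleaner.clean_dataframe(df, instructions)      # step 5: process
print(result["cleaned_address"])                        # step 6: cleaned_* column
```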

### Provider-Specific Examples

**OpenAI (default):**
```python
cleaner = DataCleaner(api_key="sk-...", model="gpt-4o-2024-08-06")
```

**OpenRouter:**
```python
cleaner = DataCleaner(
    api_key="sk-or-...",
    model="openrouter/anthropic/claude-3-opus",
    api_base="https://openrouter.ai/api/v1"
)
```

**Anthropic:**
```python
cleaner = DataCleaner(api_key="sk-ant-...", model="claude-3-opus-20240229")
```

## Authors
- Miklós Koren (koren@codedthinking.com)
- Gergely Attila Kiss (kiss@codedthinking.com)
72 changes: 68 additions & 4 deletions README.md
@@ -1,9 +1,10 @@
# LLM Data Cleaner

LLM Data Cleaner automates the transformation of messy text columns into well-structured data using LLM APIs. It supports **OpenAI, OpenRouter, Anthropic, and 100+ other providers** through LiteLLM. The package eliminates the need for complex regular expressions or manual parsing while ensuring the output conforms to a schema.

## Why use it?

- **Provider-agnostic** – works with OpenAI, OpenRouter, Anthropic, and many other LLM providers.
- **Less manual work** – delegate repetitive cleaning tasks to a language model.
- **Consistent results** – validate responses with Pydantic models.
- **Batch processing** – send rows in chunks to respect API rate limits.
@@ -26,13 +27,24 @@
poetry add git+https://github.com/codedthinking/llm_data_cleaner.git

1. Create Pydantic models describing the cleaned values.
2. Define a dictionary of instructions mapping column names to a prompt and schema.
3. Instantiate `DataCleaner` with your API key and model name.
4. Load your raw CSV file with `pandas`.
5. Call `clean_dataframe(df, instructions)`.
6. Inspect the returned DataFrame which contains new `cleaned_*` columns.
7. Save or further process the cleaned data.

## Supported Providers

Thanks to LiteLLM, this package supports 100+ LLM providers including:

- **OpenAI** (GPT-4, GPT-3.5, etc.)
- **OpenRouter** (access to multiple models through one API)
- **Anthropic** (Claude models)
- **Cohere**, **AI21**, **Replicate**, **Hugging Face**, and many more

See [LiteLLM's provider list](https://docs.litellm.ai/docs/providers) for the complete list.

## Example: inline models with OpenAI

```python
import pandas as pd
@@ -60,6 +72,7 @@
instructions = {
    },
}

# OpenAI (default)
cleaner = DataCleaner(api_key="YOUR_OPENAI_API_KEY")
raw_df = pd.DataFrame({
"address": ["Budapest Váci út 1", "1200 Vienna Mariahilfer Straße 10"],
Expand All @@ -69,21 +82,72 @@ cleaned = cleaner.clean_dataframe(raw_df, instructions)
print(cleaned)
```

## Example: using OpenRouter

```python
from llm_data_cleaner import DataCleaner

# OpenRouter allows you to access multiple models through one API
cleaner = DataCleaner(
    api_key="YOUR_OPENROUTER_API_KEY",
    model="openrouter/anthropic/claude-3-opus",
    api_base="https://openrouter.ai/api/v1"
)

# Use the same instructions and DataFrame as above
cleaned = cleaner.clean_dataframe(raw_df, instructions)
```

## Example: using Anthropic Claude

```python
from llm_data_cleaner import DataCleaner

# Anthropic Claude models
cleaner = DataCleaner(
    api_key="YOUR_ANTHROPIC_API_KEY",
    model="claude-3-opus-20240229"
)

cleaned = cleaner.clean_dataframe(raw_df, instructions)
```

## Example: loading YAML instructions

```python
from llm_data_cleaner import DataCleaner, load_yaml_instructions
import pandas as pd

instructions = load_yaml_instructions("instructions.yaml")
cleaner = DataCleaner(api_key="YOUR_OPENAI_API_KEY", system_prompt="{column_prompt}")
cleaner = DataCleaner(api_key="YOUR_API_KEY", system_prompt="{column_prompt}")
raw_df = pd.read_csv("data.csv")
result = cleaner.clean_dataframe(raw_df, instructions)
```

`load_yaml_instructions` reads the same structure shown above from a YAML file so
cleaning rules can be shared without modifying code.

## Environment Variables

You can also set API keys via environment variables instead of passing them directly:

```bash
# For OpenAI
export OPENAI_API_KEY="sk-..."

# For OpenRouter
export OPENROUTER_API_KEY="sk-or-..."

# For Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."
```

Then initialize without the `api_key` parameter:

```python
cleaner = DataCleaner(model="gpt-4o-2024-08-06") # Uses OPENAI_API_KEY from environment
```

## Authors

- Miklós Koren
18 changes: 14 additions & 4 deletions example.py
@@ -19,12 +19,13 @@
class JobTitleItem(BaseModel):

yaml_instructions = load_yaml_instructions("instructions.yaml")

# Set your API key, reading from .secrets/OPENAI_API_KEY
# You can use OpenAI, OpenRouter, Anthropic, or other providers
with open(".secrets/OPENAI_API_KEY", "r") as f:
api_key = f.read().strip()
# Ensure the API key is set
if not api_key:
raise ValueError("API key is not set. Please provide a valid OpenAI API key.")
raise ValueError("API key is not set. Please provide a valid API key.")
# Create a sample DataFrame
data = {
    "education": [
@@ -61,12 +62,21 @@
    },
}
# Initialize the cleaner with a batch size (default is 20)
# For OpenAI (default):
cleaner = DataCleaner(
    api_key=api_key,
    batch_size=20,
    system_prompt='Follow these instructions, but return the answers in Greek. {column_prompt}.',
)

# For OpenRouter, you would use:
# cleaner = DataCleaner(
#     api_key="your-openrouter-key",
#     model="openrouter/anthropic/claude-3-opus",
#     api_base="https://openrouter.ai/api/v1",
#     batch_size=20
# )

# Clean the data

result = cleaner.clean_dataframe(df, instructions)