53 changes: 44 additions & 9 deletions CLAUDE.md
@@ -2,10 +2,11 @@

## Project Overview

**LLM Data Cleaner** is a Python package that automates the transformation of messy text columns into well-structured data using LLM APIs. It supports **OpenAI, OpenRouter, Anthropic, and 100+ other providers** through LiteLLM. The package eliminates the need for complex regular expressions or manual parsing while ensuring output conforms to a schema.

### Key Features
- **Provider-agnostic** - Works with OpenAI, OpenRouter, Anthropic, and 100+ providers via LiteLLM
- **Automated data cleaning** using language models
- **Schema validation** with Pydantic models
- **Batch processing** to respect API rate limits
- **YAML-based configuration** for reusable cleaning instructions
@@ -37,24 +38,27 @@
The main class that orchestrates data cleaning operations.

**Key Responsibilities:**
- Batch processing of DataFrame columns
- Communication with LLM APIs using LiteLLM for provider-agnostic access
- Structured output parsing with JSON mode and Pydantic validation
- Retry logic for failed API calls
- Progress tracking with tqdm

**Important Methods:**
- `clean_dataframe(df, instructions)` - Main entry point for cleaning data
- `_process_batch()` - Processes a single batch of rows
- `_clean_batch()` - Makes API calls with retry logic using LiteLLM
- `_make_batch_model()` - Creates Pydantic models for batch responses (see the sketch after this list)
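
A hypothetical sketch of how `_make_batch_model()` might build such a model with `pydantic.create_model`; the field names and the `items` wrapper are assumptions, not the package's confirmed implementation:

```python
# Hypothetical sketch (not the package's actual code): wrap a per-row
# schema in a list-valued container so one JSON response can be
# validated for the whole batch with Pydantic.
from typing import List
from pydantic import BaseModel, create_model

class RowSchema(BaseModel):
    index: int  # lets responses be matched back to DataFrame rows
    city: str
    country: str

# Dynamically build the batch container model
BatchModel = create_model("BatchModel", items=(List[RowSchema], ...))

# A raw JSON string from the API could then be validated in one step:
# batch = BatchModel.model_validate_json(response_text)
```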

**Configuration Parameters:**
- `api_key`: API key for the LLM provider (optional if set via environment variables)
- `model`: Model name (default: "gpt-4o-2024-08-06"). Use provider prefixes for non-OpenAI models (e.g., "openrouter/anthropic/claude-3-opus")
- `batch_size`: Number of rows per API call (default: 10)
- `max_retries`: Retry attempts for failed calls (default: 3)
- `retry_delay`: Seconds between retries (default: 5)
- `temperature`: Model temperature (default: 0.0)
- `system_prompt`: Custom system prompt template (optional)
- `api_base`: Base URL for the API (e.g., "https://openrouter.ai/api/v1" for OpenRouter)
- `**litellm_kwargs`: Additional arguments to pass to `litellm.completion()` (see the construction sketch after this list)
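
A minimal construction sketch using the parameters above; the model name, endpoint, and the `timeout` pass-through are illustrative values, not requirements:

```python
from llm_data_cleaner import DataCleaner

# Illustrative values only; any LiteLLM-supported provider works
cleaner = DataCleaner(
    api_key="sk-or-...",                         # optional if set via env vars
    model="openrouter/anthropic/claude-3-opus",  # provider-prefixed model name
    api_base="https://openrouter.ai/api/v1",
    batch_size=10,
    max_retries=3,
    retry_delay=5,
    temperature=0.0,
    timeout=60,  # example **litellm_kwargs entry, forwarded to litellm.completion()
)
```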

### 2. Utilities (`llm_data_cleaner/utils.py`)

Expand Down Expand Up @@ -91,7 +95,7 @@ column_name:
## Dependencies

### Production
- `litellm ^1.0.0` - Unified API client for 100+ LLM providers (OpenAI, OpenRouter, Anthropic, etc.)
- `pydantic ^2.0.0` - Data validation and schema definition
- `pandas ^2.2.3` - DataFrame operations
- `pyyaml ^6.0.2` - YAML parsing
Expand Down Expand Up @@ -164,13 +168,23 @@ Test files are located in `tests/`:

### Common Issues

1. **API Key Not Set**: Set the API key via environment variables (`OPENAI_API_KEY`, `OPENROUTER_API_KEY`, `ANTHROPIC_API_KEY`) or pass it directly to the constructor
2. **Rate Limits**: Adjust `batch_size` to control API call frequency
3. **Schema Validation Errors**: Ensure Pydantic models include an `index: int` field (see the sketch after this list)
4. **Missing Columns**: DataCleaner skips columns that are not present in the DataFrame and emits a warning
5. **Provider-specific Configuration**: Use `api_base` parameter for custom endpoints (e.g., OpenRouter)
6. **Model Naming**: Use provider prefixes for non-OpenAI models (e.g., "openrouter/model-name", "anthropic/model-name")
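
For issue 3, a minimal schema sketch; only the `index: int` field is required by the cleaner, the remaining fields are illustrative:

```python
from pydantic import BaseModel

class CleanedAddress(BaseModel):
    index: int   # required so the cleaner can align responses with DataFrame rows
    city: str
    country: str
```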

## API Changes

### Version 0.5.0
- **BREAKING**: Migrated from OpenAI library to LiteLLM for provider-agnostic support (see the before/after sketch below)
- Now supports OpenAI, OpenRouter, Anthropic, and 100+ other providers
- Added `api_base` parameter for custom API endpoints
- Added `**litellm_kwargs` for provider-specific configuration
- API key is now optional in constructor (can use environment variables)
- Model parameter now supports provider prefixes (e.g., "openrouter/", "anthropic/")
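
A before/after sketch of this migration; model names and the OpenRouter endpoint are examples only:

```python
# v0.4.x: OpenAI only, api_key required
cleaner = DataCleaner(api_key="sk-...", model="gpt-4o-2024-08-06")

# v0.5.0: any LiteLLM provider; api_key optional when set in the environment
cleaner = DataCleaner(
    model="openrouter/anthropic/claude-3-opus",
    api_base="https://openrouter.ai/api/v1",
)
```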

### Version 0.4.x
- Migrated from deprecated `client.responses.parse()` to supported OpenAI methods
- Added `jsonize()` utility for consistent data serialization
@@ -184,11 +198,32 @@

1. **Define schemas** (Pydantic or YAML)
2. **Create instructions** dictionary mapping columns to prompts and schemas
3. **Initialize DataCleaner** with API key, model name, and configuration
4. **Load data** into pandas DataFrame
5. **Call `clean_dataframe()`** to process
6. **Access results** in `cleaned_*` columns (see the end-to-end sketch below)
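
A hedged end-to-end sketch of these six steps; the column name, prompt wording, and the `prompt`/`schema` instruction keys follow the README examples and are illustrative:

```python
import pandas as pd
from pydantic import BaseModel
from llm_data_cleaner import DataCleaner

class AddressItem(BaseModel):                  # step 1: define a schema
    index: int
    city: str
    country: str

instructions = {                               # step 2: map column -> prompt + schema
    "address": {
        "prompt": "Extract the city and country from this address.",
        "schema": AddressItem,
    },
}

cleaner = DataCleaner(model="gpt-4o-2024-08-06")        # step 3: key read from env
df = pd.DataFrame({"address": ["Budapest Váci út 1"]})  # step 4: load data
result = cleaner.clean_dataframe(df, instructions)      # step 5: process
print(result["cleaned_address"])                        # step 6: cleaned_* column
```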

### Provider-Specific Examples

**OpenAI (default):**
```python
cleaner = DataCleaner(api_key="sk-...", model="gpt-4o-2024-08-06")
```

**OpenRouter:**
```python
cleaner = DataCleaner(
    api_key="sk-or-...",
    model="openrouter/anthropic/claude-3-opus",
    api_base="https://openrouter.ai/api/v1"
)
```

**Anthropic:**
```python
cleaner = DataCleaner(api_key="sk-ant-...", model="claude-3-opus-20240229")
```

## Authors
- Miklós Koren (koren@codedthinking.com)
- Gergely Attila Kiss (kiss@codedthinking.com)
72 changes: 68 additions & 4 deletions README.md
@@ -1,9 +1,10 @@
# LLM Data Cleaner

LLM Data Cleaner automates the transformation of messy text columns into well-structured data using LLM APIs. It supports **OpenAI, OpenRouter, Anthropic, and 100+ other providers** through LiteLLM. The package eliminates the need for complex regular expressions or manual parsing while ensuring the output conforms to a schema.

## Why use it?

- **Provider-agnostic** – works with OpenAI, OpenRouter, Anthropic, and many other LLM providers.
- **Less manual work** – delegate repetitive cleaning tasks to a language model.
- **Consistent results** – validate responses with Pydantic models.
- **Batch processing** – send rows in chunks to respect API rate limits.
@@ -26,13 +27,24 @@
poetry add git+https://github.com/codedthinking/llm_data_cleaner.git

1. Create Pydantic models describing the cleaned values.
2. Define a dictionary of instructions mapping column names to a prompt and schema.
3. Instantiate `DataCleaner` with your API key and model name.
4. Load your raw CSV file with `pandas`.
5. Call `clean_dataframe(df, instructions)`.
6. Inspect the returned DataFrame which contains new `cleaned_*` columns.
7. Save or further process the cleaned data.

## Supported Providers

Thanks to LiteLLM, this package supports 100+ LLM providers including:

- **OpenAI** (GPT-4, GPT-3.5, etc.)
- **OpenRouter** (access to multiple models through one API)
- **Anthropic** (Claude models)
- **Cohere**, **AI21**, **Replicate**, **Hugging Face**, and many more

See [LiteLLM's provider list](https://docs.litellm.ai/docs/providers) for the complete list.

## Example: inline models with OpenAI

```python
import pandas as pd
@@ -60,6 +72,7 @@
instructions = {
    },
}

# OpenAI (default)
cleaner = DataCleaner(api_key="YOUR_OPENAI_API_KEY")
raw_df = pd.DataFrame({
"address": ["Budapest Váci út 1", "1200 Vienna Mariahilfer Straße 10"],
Expand All @@ -69,21 +82,72 @@ cleaned = cleaner.clean_dataframe(raw_df, instructions)
print(cleaned)
```

## Example: using OpenRouter

```python
from llm_data_cleaner import DataCleaner

# OpenRouter allows you to access multiple models through one API
cleaner = DataCleaner(
    api_key="YOUR_OPENROUTER_API_KEY",
    model="openrouter/anthropic/claude-3-opus",
    api_base="https://openrouter.ai/api/v1"
)

# Use the same instructions and DataFrame as above
cleaned = cleaner.clean_dataframe(raw_df, instructions)
```

## Example: using Anthropic Claude

```python
from llm_data_cleaner import DataCleaner

# Anthropic Claude models
cleaner = DataCleaner(
    api_key="YOUR_ANTHROPIC_API_KEY",
    model="claude-3-opus-20240229"
)

cleaned = cleaner.clean_dataframe(raw_df, instructions)
```

## Example: loading YAML instructions

```python
from llm_data_cleaner import DataCleaner, load_yaml_instructions
import pandas as pd

instructions = load_yaml_instructions("instructions.yaml")
cleaner = DataCleaner(api_key="YOUR_OPENAI_API_KEY", system_prompt="{column_prompt}")
cleaner = DataCleaner(api_key="YOUR_API_KEY", system_prompt="{column_prompt}")
raw_df = pd.read_csv("data.csv")
result = cleaner.clean_dataframe(raw_df, instructions)
```

`load_yaml_instructions` reads the same structure shown above from a YAML file so
cleaning rules can be shared without modifying code.

## Environment Variables

You can also set API keys via environment variables instead of passing them directly:

```bash
# For OpenAI
export OPENAI_API_KEY="sk-..."

# For OpenRouter
export OPENROUTER_API_KEY="sk-or-..."

# For Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."
```

Then initialize without the `api_key` parameter:

```python
cleaner = DataCleaner(model="gpt-4o-2024-08-06") # Uses OPENAI_API_KEY from environment
```

## Authors

- Miklós Koren
18 changes: 14 additions & 4 deletions example.py
@@ -19,12 +19,13 @@
class JobTitleItem(BaseModel):

yaml_instructions = load_yaml_instructions("instructions.yaml")

# Set your API key, reading from .secrets/OPENAI_API_KEY
# You can use OpenAI, OpenRouter, Anthropic, or other providers
with open(".secrets/OPENAI_API_KEY", "r") as f:
api_key = f.read().strip()
# Ensure the API key is set
if not api_key:
raise ValueError("API key is not set. Please provide a valid OpenAI API key.")
raise ValueError("API key is not set. Please provide a valid API key.")
# Create a sample DataFrame
data = {
    "education": [
@@ -61,12 +62,21 @@
    },
}
# Initialize the cleaner with a batch size (default is 20)
# For OpenAI (default):
cleaner = DataCleaner(
    api_key=api_key,
    batch_size=20,
    system_prompt='Follow these instructions, but return the answers in Greek. {column_prompt}.',
)

# For OpenRouter, you would use:
# cleaner = DataCleaner(
#     api_key="your-openrouter-key",
#     model="openrouter/anthropic/claude-3-opus",
#     api_base="https://openrouter.ai/api/v1",
#     batch_size=20
# )

# Clean the data

result = cleaner.clean_dataframe(df, instructions)