77 changes: 77 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,77 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

LLM Data Cleaner is a Python package that automates the transformation of messy text columns into structured data using various language models via the LLM API. It eliminates complex regex patterns and manual parsing by leveraging LLMs with schema validation.

## Development Commands

This project uses Poetry for dependency management:

```bash
# Setup and dependencies
poetry install # Install all dependencies
poetry install --only=main # Install only production dependencies

# Testing
poetry run pytest # Run all tests
poetry run pytest tests/test_cleaner.py # Run specific test file
poetry run pytest -v # Verbose test output
poetry run pytest -k "test_name" # Run specific test

# Code quality
poetry run black . # Format code
poetry run isort . # Sort imports
poetry run flake8 # Lint code

# Package building
poetry build # Build package
poetry version patch|minor|major # Bump version
```

## Core Architecture

The package has a clean modular structure:

- **`DataCleaner`** (`cleaner.py`): Main class handling LLM API interactions, batch processing, and data transformation
- **`utils.py`**: Contains `jsonize()` utility and type definitions (`InstructionField`, `InstructionSchema`)
- **YAML instruction loading**: Dynamic Pydantic model creation from YAML configuration files (see the sketch below)
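
The dynamic model creation pattern can be sketched as follows (the field names and spec below are hypothetical, not the package's actual YAML format):

```python
from typing import Optional

from pydantic import create_model

# Hypothetical field spec, e.g. derived from a YAML instruction file
field_spec = {"city": (str, ...), "postcode": (Optional[str], None)}

# Build a Pydantic model at runtime; "index" ties each result back to its source row
CleanedAddress = create_model("CleanedAddress", index=(int, ...), **field_spec)

row = CleanedAddress(index=0, city="Budapest", postcode="1138")
print(row.model_dump())  # {'index': 0, 'city': 'Budapest', 'postcode': '1138'}
```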

### Key Technical Patterns

- Uses the LLM API's structured output feature with Pydantic schemas for reliable JSON responses (sketched after this list)
- Implements batch processing with configurable batch sizes to respect API rate limits
- Dynamic Pydantic model creation allows flexible schema definitions
- Retry logic with a configurable delay between attempts for API resilience
- Model-agnostic design - works with any model supported by the LLM API
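
A minimal sketch of the structured-output call, assuming the `llm` Python library and a schema-capable model (the example schema is illustrative):

```python
import json

import llm
from pydantic import BaseModel


class Cleaned(BaseModel):
    city: str
    country: str


model = llm.get_model("gpt-4o-mini")
# Passing a Pydantic class as the schema constrains the model to return matching JSON
response = model.prompt("Extract city and country from: 'Budapest Váci út 1'", schema=Cleaned)
cleaned = Cleaned(**json.loads(response.text()))
```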

### Data Flow

1. Load DataFrame and cleaning instructions (programmatic or YAML)
2. Process columns in batches via LLM API with structured outputs
3. Validate responses against Pydantic schemas
4. Return an enhanced DataFrame with `cleaned_*` prefixed columns alongside the originals (illustrated below)
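
For instance, cleaning a single `address` column might yield a frame like this (values are illustrative):

```python
import pandas as pd

# Hypothetical result: the original column is preserved, cleaned_* columns are added
result = pd.DataFrame({
    "address": ["Budapest Váci út 1"],
    "cleaned_city": ["Budapest"],
    "cleaned_country": ["Hungary"],
})
print(result.columns.tolist())  # ['address', 'cleaned_city', 'cleaned_country']
```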

## Configuration Requirements

- **Python 3.9+** required
- **LLM API** must be installed and configured (`pip install llm`)
- **Model setup** via LLM plugins (e.g., `llm install llm-openai` for OpenAI models)
- **API keys** managed by LLM API (e.g., `llm keys set openai`)
- Default model: `gpt-4o-mini` (configurable to any LLM-supported model); a typical setup sequence is sketched below
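
A typical one-time setup, combining the items above (`llm models default` is assumed here as the way to set the fallback model):

```bash
pip install llm                  # install the LLM CLI/library
llm install llm-openai           # add the OpenAI plugin
llm keys set openai              # store the API key
llm models default gpt-4o-mini   # optional: set the default model
```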

## Important Files

- `pyproject.toml`: Poetry configuration and package metadata
- `instructions.yaml`: Example YAML configuration for cleaning rules
- `tests/test_data/sample.csv`: Test data fixture

## Version Management

The project uses Poetry's version management. Current version is 0.5.0 after migrating to LLM API. Maintain consistency between `pyproject.toml` and `CITATION.cff` when updating versions.

## Testing Strategy

Tests use mocked LLM API calls to avoid real API usage. The mocks patch `llm.get_model` and simulate model responses. Run tests before any API-related changes to ensure mocking remains effective. Test data fixtures are in `tests/test_data/`.
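
A minimal sketch of that mocking pattern, assuming `unittest.mock` (the response payload is illustrative, not the repo's actual fixture):

```python
from unittest.mock import MagicMock, patch

from llm_data_cleaner import DataCleaner


def test_cleaner_uses_mocked_model():
    fake_response = MagicMock()
    fake_response.text.return_value = '{"items": []}'  # shape is illustrative

    fake_model = MagicMock()
    fake_model.prompt.return_value = fake_response

    with patch("llm.get_model", return_value=fake_model):
        cleaner = DataCleaner(model="gpt-4o-mini")
        assert cleaner.model is fake_model  # no real API call is made
```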
27 changes: 22 additions & 5 deletions README.md
@@ -1,6 +1,6 @@
# LLM Data Cleaner

LLM Data Cleaner automates the transformation of messy text columns into well structured data using OpenAI models. It eliminates the need for complex regular expressions or manual parsing while ensuring the output conforms to a schema.
LLM Data Cleaner automates the transformation of messy text columns into well-structured data using language models via the LLM API. It eliminates the need for complex regular expressions or manual parsing while ensuring the output conforms to a schema.

## Why use it?

@@ -10,23 +10,40 @@ LLM Data Cleaner automates the transformation of messy text columns into well st

## Installation

Requires **Python 3.9+**.
Requires **Python 3.9+** and the LLM API.

```bash
pip install git+https://github.com/codedthinking/llm_data_cleaner.git
pip install llm
```

Or with Poetry:

```bash
poetry add git+https://github.com/codedthinking/llm_data_cleaner.git
poetry add llm
```

## Setup

The LLM API requires model configuration. For OpenAI models:

```bash
# Install the OpenAI plugin for LLM
llm install llm-openai

# Set your API key
llm keys set openai
# Enter your OpenAI API key when prompted
```

For other providers, see the [LLM documentation](https://llm.datasette.io/en/stable/other-models.html).
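
For example, a Gemini setup might look like this (the plugin and key names should be verified against the LLM docs):

```bash
llm install llm-gemini   # plugin for Google's Gemini models
llm keys set gemini      # store the Gemini API key
```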

## Step by step

1. Create Pydantic models describing the cleaned values.
2. Define a dictionary of instructions mapping column names to a prompt and schema.
3. Instantiate `DataCleaner` with your OpenAI API key.
3. Instantiate `DataCleaner` with your preferred model (no API key argument needed; keys are managed by the LLM tool).
4. Load your raw CSV file with `pandas`.
5. Call `clean_dataframe(df, instructions)`.
6. Inspect the returned DataFrame which contains new `cleaned_*` columns.
@@ -60,7 +77,7 @@ instructions = {
},
}

cleaner = DataCleaner(api_key="YOUR_OPENAI_API_KEY")
cleaner = DataCleaner(model="gpt-4o-mini") # or any LLM-supported model
raw_df = pd.DataFrame({
"address": ["Budapest Váci út 1", "1200 Vienna Mariahilfer Straße 10"],
"profession": ["dev", "data eng"]
@@ -76,7 +93,7 @@ from llm_data_cleaner import DataCleaner, load_yaml_instructions
import pandas as pd

instructions = load_yaml_instructions("instructions.yaml")
cleaner = DataCleaner(api_key="YOUR_OPENAI_API_KEY", system_prompt="{column_prompt}")
cleaner = DataCleaner(model="gpt-4o-mini", system_prompt="{column_prompt}")
raw_df = pd.read_csv("data.csv")
result = cleaner.clean_dataframe(raw_df, instructions)
```
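
Continuing from either example above, the new columns can be inspected or exported afterwards:

```python
# Show only the newly added columns, then save the full result
print(result.filter(regex=r"^cleaned_"))
result.to_csv("cleaned_data.csv", index=False)
```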
36 changes: 22 additions & 14 deletions llm_data_cleaner/cleaner.py
@@ -1,7 +1,8 @@
import os
import pandas as pd
from typing import Dict, Any, Type, List, Optional
from openai import OpenAI
import llm
import json
from pydantic import BaseModel, create_model, ConfigDict
from llm_data_cleaner.utils import InstructionField, InstructionSchema
import time
@@ -11,21 +12,20 @@

class DataCleaner:
"""
Batch DataCleaner that uses OpenAI's responses.parse method with auto-generated prompts.
Batch DataCleaner that uses LLM API with structured output for data cleaning.
"""

def __init__(
self,
api_key: str,
model: str = "gpt-4o-2024-08-06",
model: str = "gpt-4o-mini",
max_retries: int = 3,
retry_delay: int = 5,
batch_size: int = 10,
system_prompt: str = None,
temperature: float = 0.0,
):
self.client = OpenAI(api_key=api_key)
self.model = model
self.model_name = model
self.model = llm.get_model(model)
self.max_retries = max_retries
self.retry_delay = retry_delay
self.batch_size = batch_size
@@ -118,7 +118,7 @@ def _process_batch(
if item is None:
continue
index = item.index
for fname in item.model_fields:
for fname in item.__class__.model_fields:
if fname != "index":
colname = f"cleaned_{fname}"
if colname not in result_batch.columns:
@@ -138,13 +138,21 @@ def _clean_batch(
):
for attempt in range(self.max_retries):
try:
resp = self.client.responses.parse(
model=self.model,
input=messages,
text_format=pyd_model_batch,
temperature=self.temperature,
)
return resp.output_parsed
# Combine system and user messages into a single prompt
system_content = messages[0]["content"] if messages[0]["role"] == "system" else ""
user_content = messages[1]["content"] if len(messages) > 1 and messages[1]["role"] == "user" else ""

full_prompt = f"{system_content}\n\nData to process: {user_content}"

# Use LLM API with Pydantic schema for structured output
response = self.model.prompt(full_prompt, schema=pyd_model_batch, temperature=self.temperature)

# Parse the JSON response
parsed_data = json.loads(response.text())

# Convert to the expected Pydantic model instance
return pyd_model_batch(**parsed_data)

except Exception as e:
print(f"Batch cleaning error: {e} (attempt {attempt+1}/{self.max_retries})")
time.sleep(self.retry_delay)