A Python-based document processing system that uses Large Language Models (LLMs) to extract structured data from various document types, including PDF and DOCX files. The system supports multiple document categories and provides both JSON and Excel outputs.
- Multi-format Support: Process PDF and DOCX documents
- Category-based Processing: Support for asset, electricity, lease, rental, and utility documents
- AI-Powered Extraction: Uses OpenAI's GPT-4 for intelligent data extraction
- Structured Output: Provides both JSON and Excel formats
- Batch Processing: Handle multiple files in a single request
- Schema Validation: Ensures extracted data conforms to predefined schemas
- RESTful API: Clean, documented API endpoints
| Category | Description | Schema Fields |
|---|---|---|
| `asset` | Asset purchase orders and related documents | 21 fields including vendor info, costs, delivery details |
| `electricity` | Electricity bills and utility statements | 45 fields including billing info, consumption data |
| `lease` | Lease agreements and rental contracts | Custom fields based on lease schema |
| `rental` | Rental invoices and payment documents | Custom fields based on rental schema |
| `util` | General utility bills and service documents | Custom fields based on utility schema |
- Python 3.9 or higher
- OpenAI API key
- Windows/Linux/macOS
```bash
git clone https://github.com/rajarshidattapy/AI_document_intelligence.git
cd AI_document_intelligence
```

```bash
# Windows
python -m venv venv
venv\Scripts\activate

# Linux/macOS
python3 -m venv venv
source venv/bin/activate
```

```bash
pip install -r requirements.txt
```

Create a `.env` file in the project root:

```
OPENAI_API_KEY=your_openai_api_key_here
```

Important: Replace `your_openai_api_key_here` with your actual OpenAI API key. You can obtain one from OpenAI's platform.

```bash
mkdir -p static/output
```

```bash
# Development (auto-reload)
uvicorn main:app --reload --host 0.0.0.0 --port 8000

# Production
uvicorn main:app --host 0.0.0.0 --port 8000
```

The API will be available at http://localhost:8000
Once the server is running, you can access:
- Interactive API Docs: http://localhost:8000/docs
- ReDoc Documentation: http://localhost:8000/redoc
The `POST /extract` endpoint extracts structured data from uploaded documents.
Parameters:
- `category` (form field): Document category (`asset`, `electricity`, `lease`, `rental`, `util`)
- `files` (file upload): One or more document files (PDF, DOCX)
```bash
curl -X POST "http://localhost:8000/extract" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "category=asset" \
  -F "files=@document1.pdf" \
  -F "files=@document2.docx" \
  --output output_files.zip
```

```python
import requests

url = "http://localhost:8000/extract"
files = [
    ('files', ('document1.pdf', open('document1.pdf', 'rb'), 'application/pdf')),
    ('files', ('document2.docx', open('document2.docx', 'rb'), 'application/vnd.openxmlformats-officedocument.wordprocessingml.document')),
]
data = {'category': 'asset'}

response = requests.post(url, files=files, data=data)

# Save the ZIP file
with open('output_files.zip', 'wb') as f:
    f.write(response.content)
```

Returns a ZIP file containing:
- JSON files with extracted structured data
- Excel files with the same data in spreadsheet format
```
document_intelligence/
├── main.py                    # FastAPI application entry point
├── requirements.txt           # Python dependencies
├── README.md                  # This file
├── .env                       # Environment variables (create this)
├── schemas/                   # JSON schemas for each document category
│   ├── categories.py          # Supported categories configuration
│   ├── asset.json             # Asset document schema
│   ├── electricity.json       # Electricity bill schema
│   ├── lease.json             # Lease agreement schema
│   ├── rental.json            # Rental invoice schema
│   └── util.json              # Utility bill schema
├── utils/                     # Core processing modules
│   ├── extract_text.py        # Document text extraction
│   ├── validate_category.py   # Category validation and prompt generation
│   ├── llm_processor.py       # OpenAI LLM integration
│   ├── structure_output.py    # Response structuring and validation
│   └── json_to_excel.py       # JSON to Excel conversion
├── static/                    # Static files and outputs
│   └── output/                # Generated output files
└── resources/                 # Sample documents and templates
    ├── Asset PO/              # Asset purchase order samples
    ├── Electricity/           # Electricity bill samples
    ├── Lease/                 # Lease agreement samples
    ├── Rental/                # Rental invoice samples
    └── Utility/               # Utility bill samples
```
- Document Upload: Files are uploaded via the API endpoint
- Text Extraction: LangChain loaders extract text from PDF/DOCX files
- Prompt Generation: Category-specific prompts are generated from JSON schemas
- LLM Processing: OpenAI GPT-4 extracts structured data
- Response Validation: Extracted data is validated against schemas
- Output Generation: Both JSON and Excel formats are created
- File Delivery: Results are packaged in a ZIP file for download
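The same flow can be sketched by chaining the helper functions documented later in this README. This is a minimal illustration only: the module paths, exact signatures, and the location of the intermediate JSON file are assumptions, and the repository's main.py may wire the steps together differently.

```python
# Sketch only: module paths, signatures, and file locations are assumptions;
# the repository's main.py may wire these steps together differently.
from utils.extract_text import extract_text_from_file
from utils.validate_category import generate_prompt_from_schema
from utils.llm_processor import call_llm
from utils.structure_output import structure_response
from utils.json_to_excel import json_to_excel

def process_document(filepath: str, category: str) -> dict:
    text = extract_text_from_file(filepath)                     # Text Extraction
    prompt = generate_prompt_from_schema(category, text)        # Prompt Generation
    llm_response = call_llm(prompt)                             # LLM Processing
    result = structure_response(category, llm_response, text)   # Validation + JSON output
    json_to_excel("static/output/result.json")                  # Excel conversion (path illustrative)
    return result
```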
The API includes comprehensive error handling for:
- Invalid document categories
- Unsupported file formats
- Missing API keys
- LLM processing errors
- File I/O errors
| Variable | Description | Required |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key for LLM processing | Yes |
To add new document categories or modify existing ones:
- Add the category name to `schemas/categories.py`
- Create a corresponding JSON schema file in `schemas/`
- The schema should define all expected fields with empty string defaults
Example schema structure:
```json
{
  "Field_Name_1": "",
  "Field_Name_2": "",
  "Field_Name_3": ""
}
```

Description: `POST /extract` (implemented in `main.py`) is the primary API endpoint for document data extraction.
Parameters:
- `category` (str): Document category for schema selection
- `files` (list[UploadFile]): List of uploaded document files
Return Value: StreamingResponse containing ZIP file with JSON and Excel outputs
Code Flow:
- Validates category against supported categories
- Processes each uploaded file through the extraction pipeline
- Extracts text content using LangChain document loaders
- Generates LLM prompts based on category-specific schemas
- Calls OpenAI API for structured data extraction
- Validates and structures the LLM response
- Converts results to Excel format
- Packages all outputs into a ZIP file for download
Exceptions: JSONResponse(400) for invalid categories, file processing errors
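For orientation, here is a condensed sketch of how such an endpoint can be wired up in FastAPI. It is not the repository's main.py: the `CATEGORIES` set and the placeholder pipeline call are assumptions standing in for the real category configuration and extraction logic.

```python
# Sketch only: not the repository's main.py. CATEGORIES and the placeholder
# pipeline call stand in for the real configuration and extraction logic.
import io
import zipfile
from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import JSONResponse, StreamingResponse

app = FastAPI()
CATEGORIES = {"asset", "electricity", "lease", "rental", "util"}  # assumed to mirror schemas/categories.py

@app.post("/extract")
async def extract(category: str = Form(...), files: list[UploadFile] = File(...)):
    if category not in CATEGORIES:
        return JSONResponse(status_code=400, content={"error": f"Invalid category: {category}"})
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for upload in files:
            # ...run the extraction pipeline here and add its JSON/Excel outputs...
            zf.writestr(f"{upload.filename}.json", "{}")  # placeholder output
    buffer.seek(0)
    return StreamingResponse(
        buffer,
        media_type="application/zip",
        headers={"Content-Disposition": "attachment; filename=output_files.zip"},
    )
```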
Description: `extract_text_from_file` (utils/extract_text.py) extracts text content from PDF and DOCX documents using LangChain
Parameters:
- `filepath` (str): Path to the document file
Return Value: str - Extracted text content from the document
Code Flow:
- Determines file type based on extension
- Selects appropriate LangChain loader (UnstructuredPDFLoader for PDFs, Docx2txtLoader for DOCX)
- Loads document with optimal strategy for text extraction
- Joins all document elements into unified text string
- Returns cleaned text content for LLM processing
Exceptions: ValueError for unsupported file types
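A minimal sketch of this loader-selection logic, assuming the LangChain community loaders named above (import paths vary between LangChain versions, so the repository's implementation may differ):

```python
# Sketch only: loader selection as described above. Import paths assume
# langchain_community; older LangChain versions expose these loaders elsewhere.
import os
from langchain_community.document_loaders import Docx2txtLoader, UnstructuredPDFLoader

def extract_text_from_file(filepath: str) -> str:
    ext = os.path.splitext(filepath)[1].lower()
    if ext == ".pdf":
        loader = UnstructuredPDFLoader(filepath)
    elif ext == ".docx":
        loader = Docx2txtLoader(filepath)
    else:
        raise ValueError(f"Unsupported file type: {ext}")
    documents = loader.load()
    # Join all document elements into a single text string for the LLM
    return "\n".join(doc.page_content for doc in documents)
```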
Example:
```python
text = extract_text_from_file("document.pdf")
print(len(text))  # Character count of extracted text
```

Description: `call_llm` (utils/llm_processor.py) communicates with OpenAI's GPT-4 model for document data extraction
Parameters:
- `prompt` (str): Formatted prompt containing document text and extraction instructions
Return Value: str - LLM response containing extracted structured data
Code Flow:
- Loads OpenAI API key from environment variables
- Configures OpenAI client with authentication
- Sends structured prompt to GPT-4.1-mini-2025-04-14 model
- Uses low temperature (0.2) for consistent, factual responses
- Extracts and cleans response content
- Returns structured JSON data
Exceptions: EnvironmentError for missing API key, OpenAI API errors
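A minimal sketch of such a call, assuming the openai>=1.x client and python-dotenv; the repository's prompt and error handling may be more involved:

```python
# Sketch only: assumes the openai>=1.x client interface and python-dotenv.
import os
from dotenv import load_dotenv
from openai import OpenAI

def call_llm(prompt: str) -> str:
    load_dotenv()  # read OPENAI_API_KEY from .env
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise EnvironmentError("OPENAI_API_KEY is not set")
    client = OpenAI(api_key=api_key)
    response = client.chat.completions.create(
        model="gpt-4.1-mini-2025-04-14",
        temperature=0.2,  # low temperature for consistent, factual responses
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```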
Example:
```python
response = call_llm("Extract vendor information from this document...")
print(response)  # JSON formatted extracted data
```

Description: `generate_prompt_from_schema` (utils/validate_category.py) creates structured prompts for LLM processing based on JSON schemas
Parameters:
- `category` (str): Document category for schema selection
- `text` (str): Extracted document text content
Return Value: str - Formatted prompt string for LLM processing
Code Flow:
- Constructs path to category-specific JSON schema file
- Validates schema file existence
- Loads JSON schema defining expected output structure
- Converts field names from snake_case to Title Case
- Creates JSON example showing expected output format
- Builds comprehensive prompt with extraction guidelines
- Returns formatted prompt with clear instructions
Exceptions: FileNotFoundError for missing schema files
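A minimal sketch of this schema-driven prompt generation; the repository's actual prompt wording and its snake_case-to-Title-Case field conversion are not reproduced here:

```python
# Sketch only: the repository's prompt text and its snake_case-to-Title-Case
# field conversion are omitted for brevity.
import json
import os

def generate_prompt_from_schema(category: str, text: str) -> str:
    schema_path = os.path.join("schemas", f"{category}.json")
    if not os.path.exists(schema_path):
        raise FileNotFoundError(f"Schema not found for category: {category}")
    with open(schema_path) as f:
        schema = json.load(f)
    # Show the model the exact JSON shape it is expected to return
    example = json.dumps({field: "" for field in schema}, indent=2)
    return (
        "Extract the following fields from the document and return only "
        f"valid JSON in this structure:\n{example}\n\nDocument text:\n{text}"
    )
```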
Example:
```python
prompt = generate_prompt_from_schema('asset', document_text)
print(len(prompt))  # Length of generated prompt
```

Description: `structure_response` (utils/structure_output.py) validates and structures LLM responses according to predefined schemas
Parameters:
- `category` (str): Document category for schema validation
- `llm_response` (str): Raw JSON response from the LLM
- `content_extracted` (str): Original document text content
Return Value: dict - Structured output with category, content, and extracted data
Code Flow:
- Parses LLM response as JSON
- Loads category-specific schema for field validation
- Creates output dictionary with all required schema fields
- Handles missing fields by setting them to empty strings
- Constructs final result structure with metadata
- Saves complete result to JSON file for persistence
- Returns structured result for immediate use
Exceptions: json.JSONDecodeError for invalid JSON, FileNotFoundError for missing schemas
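A minimal sketch of this structuring step; the saved filename and any metadata keys besides `output` are assumptions (the README's example below accesses `result['output']`):

```python
# Sketch only: the saved filename and any metadata keys besides 'output'
# are assumptions.
import json
import os

def structure_response(category: str, llm_response: str, content_extracted: str) -> dict:
    data = json.loads(llm_response)  # raises json.JSONDecodeError on malformed output
    with open(os.path.join("schemas", f"{category}.json")) as f:
        schema = json.load(f)
    # Keep every schema field, defaulting missing ones to empty strings
    output = {field: data.get(field, "") for field in schema}
    result = {"category": category, "content": content_extracted, "output": output}
    os.makedirs("static/output", exist_ok=True)
    json_path = os.path.join("static/output", f"{category}_result.json")  # illustrative name
    with open(json_path, "w") as f:
        json.dump(result, f, indent=2)
    return result
```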
Example:
```python
result = structure_response('asset', llm_json, document_text)
print(result['output']['Vendor_Name'])  # Access extracted data
```

Description: `json_to_excel` (utils/json_to_excel.py) converts structured JSON data into Excel format for analysis
Parameters:
- `json_path` (str): Path to JSON file containing structured data
- `output_dir` (str): Directory for Excel file output (default: `"static/output"`)
Return Value: str - Path to the created Excel file
Code Flow:
- Reads and parses JSON file containing structured data
- Extracts category name for Excel sheet naming
- Validates output data structure
- Creates new Excel workbook with category as sheet name
- Writes field names as column headers
- Populates data row with extracted values
- Generates descriptive filename based on category and original file
- Ensures output directory exists
- Saves Excel workbook and returns file path
Exceptions: ValueError for invalid output structure, FileNotFoundError for missing JSON
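A minimal sketch of this conversion, assuming openpyxl and the `category`/`output` result structure produced by `structure_response`; the repository's filename scheme may differ:

```python
# Sketch only: assumes openpyxl and the 'category'/'output' keys shown above;
# the repository's filename scheme may differ.
import json
import os
from openpyxl import Workbook

def json_to_excel(json_path: str, output_dir: str = "static/output") -> str:
    with open(json_path) as f:
        result = json.load(f)
    output = result.get("output")
    if not isinstance(output, dict):
        raise ValueError("JSON file does not contain a valid 'output' structure")
    wb = Workbook()
    ws = wb.active
    ws.title = str(result.get("category", "Sheet1"))[:31]  # Excel limits sheet names to 31 chars
    ws.append(list(output.keys()))    # field names as column headers
    ws.append(list(output.values()))  # single data row of extracted values
    os.makedirs(output_dir, exist_ok=True)
    base_name = os.path.splitext(os.path.basename(json_path))[0]
    excel_path = os.path.join(output_dir, f"{base_name}.xlsx")
    wb.save(excel_path)
    return excel_path
```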
Example:
```python
excel_path = json_to_excel("output.json")
print(f"Excel file created at: {excel_path}")
```

The `resources/` directory contains sample documents for testing:
- Asset purchase orders
- Electricity bills
- Lease agreements
- Rental invoices
- Utility bills
- Start the server: `uvicorn main:app --reload`
- Open http://localhost:8000/docs
- Use the interactive interface to upload sample documents
- Download and verify the generated outputs