A Python-based document processing system that uses Large Language Models (LLMs) to extract structured data from various document types, including PDF and DOCX files. The system supports multiple document categories and provides both JSON and Excel outputs.
- Multi-format Support: Process PDF and DOCX documents
- Category-based Processing: Support for asset, electricity, lease, rental, and utility documents
- AI-Powered Extraction: Uses OpenAI's GPT-4 for intelligent data extraction
- Structured Output: Provides both JSON and Excel formats
- Batch Processing: Handle multiple files in a single request
- Schema Validation: Ensures extracted data conforms to predefined schemas
- RESTful API: Clean, documented API endpoints
| Category | Description | Schema Fields |
|---|---|---|
| `asset` | Asset purchase orders and related documents | 21 fields including vendor info, costs, delivery details |
| `electricity` | Electricity bills and utility statements | 45 fields including billing info, consumption data |
| `lease` | Lease agreements and rental contracts | Custom fields based on lease schema |
| `rental` | Rental invoices and payment documents | Custom fields based on rental schema |
| `util` | General utility bills and service documents | Custom fields based on utility schema |
- Python 3.9 or higher
- OpenAI API key
- Windows/Linux/macOS
```bash
git clone https://github.com/rajarshidattapy/AI_document_intelligence.git
cd AI_document_intelligence
```

```bash
# Windows
python -m venv venv
venv\Scripts\activate

# Linux/macOS
python3 -m venv venv
source venv/bin/activate
```

```bash
pip install -r requirements.txt
```

Create a `.env` file in the project root:

```
OPENAI_API_KEY=your_openai_api_key_here
```

Important: Replace `your_openai_api_key_here` with your actual OpenAI API key. You can obtain one from OpenAI's platform.

```bash
mkdir -p static/output
```

```bash
# Development (auto-reload)
uvicorn main:app --reload --host 0.0.0.0 --port 8000

# Production
uvicorn main:app --host 0.0.0.0 --port 8000
```

The API will be available at http://localhost:8000
Once the server is running, you can access:
- Interactive API Docs: http://localhost:8000/docs
- ReDoc Documentation: http://localhost:8000/redoc
The `POST /extract` endpoint extracts structured data from uploaded documents.
Parameters:
- `category` (form field): Document category (`asset`, `electricity`, `lease`, `rental`, `util`)
- `files` (file upload): One or more document files (PDF, DOCX)
```bash
curl -X POST "http://localhost:8000/extract" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "category=asset" \
  -F "files=@document1.pdf" \
  -F "files=@document2.docx" \
  --output output_files.zip
```

```python
import requests

url = "http://localhost:8000/extract"
files = [
    ('files', ('document1.pdf', open('document1.pdf', 'rb'), 'application/pdf')),
    ('files', ('document2.docx', open('document2.docx', 'rb'), 'application/vnd.openxmlformats-officedocument.wordprocessingml.document')),
]
data = {'category': 'asset'}

response = requests.post(url, files=files, data=data)

# Save the ZIP file
with open('output_files.zip', 'wb') as f:
    f.write(response.content)
```

Returns a ZIP file containing:
- JSON files with extracted structured data
- Excel files with the same data in spreadsheet format
```
document_intelligence/
├── main.py                    # FastAPI application entry point
├── requirements.txt           # Python dependencies
├── README.md                  # This file
├── .env                       # Environment variables (create this)
├── schemas/                   # JSON schemas for each document category
│   ├── categories.py          # Supported categories configuration
│   ├── asset.json             # Asset document schema
│   ├── electricity.json       # Electricity bill schema
│   ├── lease.json             # Lease agreement schema
│   ├── rental.json            # Rental invoice schema
│   └── util.json              # Utility bill schema
├── utils/                     # Core processing modules
│   ├── extract_text.py        # Document text extraction
│   ├── validate_category.py   # Category validation and prompt generation
│   ├── llm_processor.py       # OpenAI LLM integration
│   ├── structure_output.py    # Response structuring and validation
│   └── json_to_excel.py       # JSON to Excel conversion
├── static/                    # Static files and outputs
│   └── output/                # Generated output files
└── resources/                 # Sample documents and templates
    ├── Asset PO/              # Asset purchase order samples
    ├── Electricity/           # Electricity bill samples
    ├── Lease/                 # Lease agreement samples
    ├── Rental/                # Rental invoice samples
    └── Utility/               # Utility bill samples
```
- Document Upload: Files are uploaded via the API endpoint
- Text Extraction: LangChain loaders extract text from PDF/DOCX files
- Prompt Generation: Category-specific prompts are generated from JSON schemas
- LLM Processing: OpenAI GPT-4 extracts structured data
- Response Validation: Extracted data is validated against schemas
- Output Generation: Both JSON and Excel formats are created
- File Delivery: Results are packaged in a ZIP file for download
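The same flow can be sketched by chaining the helper functions documented later in this README. This is a minimal illustration only: the module paths, exact signatures, and the location of the intermediate JSON file are assumptions, and the repository's main.py may wire the steps together differently.

```python
# Sketch only: module paths, signatures, and file locations are assumptions;
# the repository's main.py may wire these steps together differently.
from utils.extract_text import extract_text_from_file
from utils.validate_category import generate_prompt_from_schema
from utils.llm_processor import call_llm
from utils.structure_output import structure_response
from utils.json_to_excel import json_to_excel

def process_document(filepath: str, category: str) -> dict:
    text = extract_text_from_file(filepath)                     # Text Extraction
    prompt = generate_prompt_from_schema(category, text)        # Prompt Generation
    llm_response = call_llm(prompt)                             # LLM Processing
    result = structure_response(category, llm_response, text)   # Validation + JSON output
    json_to_excel("static/output/result.json")                  # Excel conversion (path illustrative)
    return result
```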
The API includes comprehensive error handling for:
- Invalid document categories
- Unsupported file formats
- Missing API keys
- LLM processing errors
- File I/O errors
| Variable | Description | Required |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key for LLM processing | Yes |
To add new document categories or modify existing ones:
- Add the category name to `schemas/categories.py`
- Create a corresponding JSON schema file in `schemas/`
- The schema should define all expected fields with empty string defaults
Example schema structure:
```json
{
  "Field_Name_1": "",
  "Field_Name_2": "",
  "Field_Name_3": ""
}
```

Description: `POST /extract` (implemented in `main.py`) is the primary API endpoint for document data extraction.
Parameters:
- `category` (str): Document category for schema selection
- `files` (list[UploadFile]): List of uploaded document files
Return Value: StreamingResponse containing ZIP file with JSON and Excel outputs
Code Flow:
- Validates category against supported categories
- Processes each uploaded file through the extraction pipeline
- Extracts text content using LangChain document loaders
- Generates LLM prompts based on category-specific schemas
- Calls OpenAI API for structured data extraction
- Validates and structures the LLM response
- Converts results to Excel format
- Packages all outputs into a ZIP file for download
Exceptions: JSONResponse(400) for invalid categories, file processing errors
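For orientation, here is a condensed sketch of how such an endpoint can be wired up in FastAPI. It is not the repository's main.py: the `CATEGORIES` set and the placeholder pipeline call are assumptions standing in for the real category configuration and extraction logic.

```python
# Sketch only: not the repository's main.py. CATEGORIES and the placeholder
# pipeline call stand in for the real configuration and extraction logic.
import io
import zipfile
from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import JSONResponse, StreamingResponse

app = FastAPI()
CATEGORIES = {"asset", "electricity", "lease", "rental", "util"}  # assumed to mirror schemas/categories.py

@app.post("/extract")
async def extract(category: str = Form(...), files: list[UploadFile] = File(...)):
    if category not in CATEGORIES:
        return JSONResponse(status_code=400, content={"error": f"Invalid category: {category}"})
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for upload in files:
            # ...run the extraction pipeline here and add its JSON/Excel outputs...
            zf.writestr(f"{upload.filename}.json", "{}")  # placeholder output
    buffer.seek(0)
    return StreamingResponse(
        buffer,
        media_type="application/zip",
        headers={"Content-Disposition": "attachment; filename=output_files.zip"},
    )
```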
Description: `extract_text_from_file` (utils/extract_text.py) extracts text content from PDF and DOCX documents using LangChain
Parameters:
- `filepath` (str): Path to the document file
Return Value: str - Extracted text content from the document
Code Flow:
- Determines file type based on extension
- Selects appropriate LangChain loader (UnstructuredPDFLoader for PDFs, Docx2txtLoader for DOCX)
- Loads document with optimal strategy for text extraction
- Joins all document elements into unified text string
- Returns cleaned text content for LLM processing
Exceptions: ValueError for unsupported file types
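A minimal sketch of this loader-selection logic, assuming the LangChain community loaders named above (import paths vary between LangChain versions, so the repository's implementation may differ):

```python
# Sketch only: loader selection as described above. Import paths assume
# langchain_community; older LangChain versions expose these loaders elsewhere.
import os
from langchain_community.document_loaders import Docx2txtLoader, UnstructuredPDFLoader

def extract_text_from_file(filepath: str) -> str:
    ext = os.path.splitext(filepath)[1].lower()
    if ext == ".pdf":
        loader = UnstructuredPDFLoader(filepath)
    elif ext == ".docx":
        loader = Docx2txtLoader(filepath)
    else:
        raise ValueError(f"Unsupported file type: {ext}")
    documents = loader.load()
    # Join all document elements into a single text string for the LLM
    return "\n".join(doc.page_content for doc in documents)
```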
Example:
```python
text = extract_text_from_file("document.pdf")
print(len(text))  # Character count of extracted text
```

Description: `call_llm` (utils/llm_processor.py) communicates with OpenAI's GPT-4 model for document data extraction
Parameters:
- `prompt` (str): Formatted prompt containing document text and extraction instructions
Return Value: str - LLM response containing extracted structured data
Code Flow:
- Loads OpenAI API key from environment variables
- Configures OpenAI client with authentication
- Sends structured prompt to GPT-4.1-mini-2025-04-14 model
- Uses low temperature (0.2) for consistent, factual responses
- Extracts and cleans response content
- Returns structured JSON data
Exceptions: EnvironmentError for missing API key, OpenAI API errors
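A minimal sketch of such a call, assuming the openai>=1.x client and python-dotenv; the repository's prompt and error handling may be more involved:

```python
# Sketch only: assumes the openai>=1.x client interface and python-dotenv.
import os
from dotenv import load_dotenv
from openai import OpenAI

def call_llm(prompt: str) -> str:
    load_dotenv()  # read OPENAI_API_KEY from .env
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise EnvironmentError("OPENAI_API_KEY is not set")
    client = OpenAI(api_key=api_key)
    response = client.chat.completions.create(
        model="gpt-4.1-mini-2025-04-14",
        temperature=0.2,  # low temperature for consistent, factual responses
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```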
Example:
```python
response = call_llm("Extract vendor information from this document...")
print(response)  # JSON formatted extracted data
```

Description: `generate_prompt_from_schema` (utils/validate_category.py) creates structured prompts for LLM processing based on JSON schemas
Parameters:
- `category` (str): Document category for schema selection
- `text` (str): Extracted document text content
Return Value: str - Formatted prompt string for LLM processing
Code Flow:
- Constructs path to category-specific JSON schema file
- Validates schema file existence
- Loads JSON schema defining expected output structure
- Converts field names from snake_case to Title Case
- Creates JSON example showing expected output format
- Builds comprehensive prompt with extraction guidelines
- Returns formatted prompt with clear instructions
Exceptions: FileNotFoundError for missing schema files
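A minimal sketch of this schema-driven prompt generation; the repository's actual prompt wording and its snake_case-to-Title-Case field conversion are not reproduced here:

```python
# Sketch only: the repository's prompt text and its snake_case-to-Title-Case
# field conversion are omitted for brevity.
import json
import os

def generate_prompt_from_schema(category: str, text: str) -> str:
    schema_path = os.path.join("schemas", f"{category}.json")
    if not os.path.exists(schema_path):
        raise FileNotFoundError(f"Schema not found for category: {category}")
    with open(schema_path) as f:
        schema = json.load(f)
    # Show the model the exact JSON shape it is expected to return
    example = json.dumps({field: "" for field in schema}, indent=2)
    return (
        "Extract the following fields from the document and return only "
        f"valid JSON in this structure:\n{example}\n\nDocument text:\n{text}"
    )
```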
Example:
```python
prompt = generate_prompt_from_schema('asset', document_text)
print(len(prompt))  # Length of generated prompt
```

Description: `structure_response` (utils/structure_output.py) validates and structures LLM responses according to predefined schemas
Parameters:
- `category` (str): Document category for schema validation
- `llm_response` (str): Raw JSON response from the LLM
- `content_extracted` (str): Original document text content
Return Value: dict - Structured output with category, content, and extracted data
Code Flow:
- Parses LLM response as JSON
- Loads category-specific schema for field validation
- Creates output dictionary with all required schema fields
- Handles missing fields by setting them to empty strings
- Constructs final result structure with metadata
- Saves complete result to JSON file for persistence
- Returns structured result for immediate use
Exceptions: json.JSONDecodeError for invalid JSON, FileNotFoundError for missing schemas
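A minimal sketch of this structuring step; the saved filename and any metadata keys besides `output` are assumptions (the README's example below accesses `result['output']`):

```python
# Sketch only: the saved filename and any metadata keys besides 'output'
# are assumptions.
import json
import os

def structure_response(category: str, llm_response: str, content_extracted: str) -> dict:
    data = json.loads(llm_response)  # raises json.JSONDecodeError on malformed output
    with open(os.path.join("schemas", f"{category}.json")) as f:
        schema = json.load(f)
    # Keep every schema field, defaulting missing ones to empty strings
    output = {field: data.get(field, "") for field in schema}
    result = {"category": category, "content": content_extracted, "output": output}
    os.makedirs("static/output", exist_ok=True)
    json_path = os.path.join("static/output", f"{category}_result.json")  # illustrative name
    with open(json_path, "w") as f:
        json.dump(result, f, indent=2)
    return result
```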
Example:
```python
result = structure_response('asset', llm_json, document_text)
print(result['output']['Vendor_Name'])  # Access extracted data
```

Description: `json_to_excel` (utils/json_to_excel.py) converts structured JSON data into Excel format for analysis
Parameters:
- `json_path` (str): Path to JSON file containing structured data
- `output_dir` (str): Directory for Excel file output (default: `"static/output"`)
Return Value: str - Path to the created Excel file
Code Flow:
- Reads and parses JSON file containing structured data
- Extracts category name for Excel sheet naming
- Validates output data structure
- Creates new Excel workbook with category as sheet name
- Writes field names as column headers
- Populates data row with extracted values
- Generates descriptive filename based on category and original file
- Ensures output directory exists
- Saves Excel workbook and returns file path
Exceptions: ValueError for invalid output structure, FileNotFoundError for missing JSON
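A minimal sketch of this conversion, assuming openpyxl and the `category`/`output` result structure produced by `structure_response`; the repository's filename scheme may differ:

```python
# Sketch only: assumes openpyxl and the 'category'/'output' keys shown above;
# the repository's filename scheme may differ.
import json
import os
from openpyxl import Workbook

def json_to_excel(json_path: str, output_dir: str = "static/output") -> str:
    with open(json_path) as f:
        result = json.load(f)
    output = result.get("output")
    if not isinstance(output, dict):
        raise ValueError("JSON file does not contain a valid 'output' structure")
    wb = Workbook()
    ws = wb.active
    ws.title = str(result.get("category", "Sheet1"))[:31]  # Excel limits sheet names to 31 chars
    ws.append(list(output.keys()))    # field names as column headers
    ws.append(list(output.values()))  # single data row of extracted values
    os.makedirs(output_dir, exist_ok=True)
    base_name = os.path.splitext(os.path.basename(json_path))[0]
    excel_path = os.path.join(output_dir, f"{base_name}.xlsx")
    wb.save(excel_path)
    return excel_path
```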
Example:
```python
excel_path = json_to_excel("output.json")
print(f"Excel file created at: {excel_path}")
```

The `resources/` directory contains sample documents for testing:
- Asset purchase orders
- Electricity bills
- Lease agreements
- Rental invoices
- Utility bills
- Start the server: `uvicorn main:app --reload`
- Open http://localhost:8000/docs
- Use the interactive interface to upload sample documents
- Download and verify the generated outputs