
AI-Powered Document Intelligence

A Python-based document processing system that uses large language models (LLMs) to extract structured data from various document types, including PDF and DOCX files. The system supports multiple document categories and provides both JSON and Excel outputs.

Features

  • Multi-format Support: Process PDF and DOCX documents
  • Category-based Processing: Support for asset, electricity, lease, rental, and utility documents
  • AI-Powered Extraction: Uses OpenAI's GPT-4.1 mini model for intelligent data extraction
  • Structured Output: Provides both JSON and Excel formats
  • Batch Processing: Handle multiple files in a single request
  • Schema Validation: Ensures extracted data conforms to predefined schemas
  • RESTful API: Clean, documented API endpoints

Supported Document Categories

| Category    | Description                                 | Schema Fields                                                  |
|-------------|---------------------------------------------|----------------------------------------------------------------|
| asset       | Asset purchase orders and related documents | 21 fields, including vendor info, costs, and delivery details  |
| electricity | Electricity bills and utility statements    | 45 fields, including billing info and consumption data         |
| lease       | Lease agreements and rental contracts       | Custom fields based on the lease schema                        |
| rental      | Rental invoices and payment documents       | Custom fields based on the rental schema                       |
| util        | General utility bills and service documents | Custom fields based on the utility schema                      |

Prerequisites

  • Python 3.9 or higher
  • OpenAI API key
  • Windows/Linux/macOS

Installation

1. Clone the Repository

git clone https://github.com/rajarshidattapy/AI_document_intelligence.git
cd AI_document_intelligence

2. Create Virtual Environment

# Windows
python -m venv venv
venv\Scripts\activate

# Linux/macOS
python3 -m venv venv
source venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Environment Configuration

Create a .env file in the project root:

OPENAI_API_KEY=your_openai_api_key_here

Important: Replace your_openai_api_key_here with your actual OpenAI API key. You can obtain one from OpenAI's platform.

5. Create Required Directories

mkdir -p static/output

Running the Application

Development Server

uvicorn main:app --reload --host 0.0.0.0 --port 8000

Production Server

uvicorn main:app --host 0.0.0.0 --port 8000

The API will be available at http://localhost:8000

API Documentation

Once the server is running, you can access:

  • Interactive API Docs: http://localhost:8000/docs
  • ReDoc Documentation: http://localhost:8000/redoc

API Usage

Endpoint: POST /extract

Extract structured data from uploaded documents.

Request Parameters

  • category (form field): Document category (asset, electricity, lease, rental, util)
  • files (file upload): One or more document files (PDF, DOCX)

Example Request

curl -X POST "http://localhost:8000/extract" \
  -H "accept: application/json" \
  -F "category=asset" \
  -F "files=@document1.pdf" \
  -F "files=@document2.docx" \
  --output output_files.zip

Note: do not set the Content-Type header manually here. When -F is used, curl generates the multipart/form-data header itself, including the required boundary parameter; overriding it breaks the request.

Example Python Request

import requests

url = "http://localhost:8000/extract"
files = [
    ('files', ('document1.pdf', open('document1.pdf', 'rb'), 'application/pdf')),
    ('files', ('document2.docx', open('document2.docx', 'rb'), 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'))
]
data = {'category': 'asset'}

response = requests.post(url, files=files, data=data)
response.raise_for_status()  # fail fast on HTTP errors

# Save the returned ZIP file
with open('output_files.zip', 'wb') as f:
    f.write(response.content)

Response

Returns a ZIP file containing:

  • JSON files with extracted structured data
  • Excel files with the same data in spreadsheet format
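To inspect the archive programmatically, a minimal sketch using the standard library (the member file names shown in the comment are illustrative assumptions, not guaranteed output names):

import zipfile

with zipfile.ZipFile("output_files.zip") as zf:
    print(zf.namelist())  # e.g. ['document1.json', 'document1.xlsx'] (illustrative)
    zf.extractall("extracted_outputs")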

Project Structure

document_intelligence/
├── main.py                 # FastAPI application entry point
├── requirements.txt        # Python dependencies
├── README.md              # This file
├── .env                   # Environment variables (create this)
├── schemas/               # JSON schemas for each document category
│   ├── categories.py      # Supported categories configuration
│   ├── asset.json         # Asset document schema
│   ├── electricity.json   # Electricity bill schema
│   ├── lease.json         # Lease agreement schema
│   ├── rental.json        # Rental invoice schema
│   └── util.json          # Utility bill schema
├── utils/                 # Core processing modules
│   ├── extract_text.py    # Document text extraction
│   ├── validate_category.py # Category validation and prompt generation
│   ├── llm_processor.py   # OpenAI LLM integration
│   ├── structure_output.py # Response structuring and validation
│   └── json_to_excel.py   # JSON to Excel conversion
├── static/                # Static files and outputs
│   └── output/            # Generated output files
└── resources/             # Sample documents and templates
    ├── Asset PO/          # Asset purchase order samples
    ├── Electricity/       # Electricity bill samples
    ├── Lease/             # Lease agreement samples
    ├── Rental/            # Rental invoice samples
    └── Utility/           # Utility bill samples

Processing Pipeline

  1. Document Upload: Files are uploaded via the API endpoint
  2. Text Extraction: LangChain loaders extract text from PDF/DOCX files
  3. Prompt Generation: Category-specific prompts are generated from JSON schemas
  4. LLM Processing: The OpenAI GPT-4.1 mini model extracts structured data
  5. Response Validation: Extracted data is validated against schemas
  6. Output Generation: Both JSON and Excel formats are created
  7. File Delivery: Results are packaged in a ZIP file for download
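In code, this pipeline amounts to chaining the utility functions documented later in this README. The sketch below is illustrative glue, not the exact contents of main.py; in particular, the JSON path passed to json_to_excel is an assumption about where structure_response persists its output.

# Illustrative end-to-end sketch for a single file (not the exact main.py code)
from utils.extract_text import extract_text_from_file
from utils.validate_category import generate_prompt_from_schema
from utils.llm_processor import call_llm
from utils.structure_output import structure_response
from utils.json_to_excel import json_to_excel

def process_document(filepath: str, category: str) -> dict:
    text = extract_text_from_file(filepath)                    # step 2: text extraction
    prompt = generate_prompt_from_schema(category, text)       # step 3: prompt generation
    llm_response = call_llm(prompt)                            # step 4: LLM processing
    result = structure_response(category, llm_response, text)  # steps 5-6 (also writes JSON)
    json_to_excel("static/output/result.json")                 # step 6 (path is an assumption)
    return result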

Error Handling

The API includes comprehensive error handling for:

  • Invalid document categories
  • Unsupported file formats
  • Missing API keys
  • LLM processing errors
  • File I/O errors

Configuration

Environment Variables

| Variable       | Description                        | Required |
|----------------|------------------------------------|----------|
| OPENAI_API_KEY | OpenAI API key for LLM processing  | Yes      |

Customizing Schemas

To add new document categories or modify existing ones:

  1. Add the category name to schemas/categories.py
  2. Create a corresponding JSON schema file in schemas/
  3. The schema should define all expected fields with empty string defaults

Example schema structure:

{
    "Field_Name_1": "",
    "Field_Name_2": "",
    "Field_Name_3": ""
}
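For illustration, schemas/categories.py might hold little more than the list of supported category names; the variable name below is an assumption, not the repository's actual code:

# schemas/categories.py -- hypothetical sketch; the real variable name may differ
SUPPORTED_CATEGORIES = ["asset", "electricity", "lease", "rental", "util"]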

Detailed Function Documentation

Main Application (main.py)

Function: extract(category: str, files: list[UploadFile])

Description: Primary API endpoint for document data extraction
Parameters:

  • category (str): Document category for schema selection
  • files (list[UploadFile]): List of uploaded document files

Return Value: StreamingResponse containing a ZIP file with JSON and Excel outputs

Code Flow:

  1. Validates the category against the supported categories
  2. Processes each uploaded file through the extraction pipeline
  3. Extracts text content using LangChain document loaders
  4. Generates LLM prompts based on category-specific schemas
  5. Calls the OpenAI API for structured data extraction
  6. Validates and structures the LLM response
  7. Converts results to Excel format
  8. Packages all outputs into a ZIP file for download

Exceptions: JSONResponse(400) for invalid categories; file processing errors
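A minimal sketch of such an endpoint, assuming in-memory ZIP packaging and reusing the process_document glue from the Processing Pipeline section; the actual main.py may differ in its details:

import io
import json
import os
import zipfile

from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import JSONResponse, StreamingResponse

app = FastAPI()
SUPPORTED_CATEGORIES = {"asset", "electricity", "lease", "rental", "util"}

@app.post("/extract")
async def extract(category: str = Form(...), files: list[UploadFile] = File(...)):
    if category not in SUPPORTED_CATEGORIES:
        return JSONResponse(status_code=400,
                            content={"error": f"Unsupported category: {category}"})

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for upload in files:
            # Persist the upload so the file-based loaders can read it from disk
            tmp_path = os.path.join("static/output", upload.filename)
            with open(tmp_path, "wb") as f:
                f.write(await upload.read())
            result = process_document(tmp_path, category)  # pipeline sketch above
            zf.writestr(f"{upload.filename}.json", json.dumps(result, indent=2))
    buffer.seek(0)
    return StreamingResponse(buffer, media_type="application/zip")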

Text Extraction Module (utils/extract_text.py)

Function: extract_text_from_file(filepath: str) -> str

Description: Extracts text content from PDF and DOCX documents using LangChain
Parameters:

  • filepath (str): Path to the document file

Return Value: str - extracted text content from the document

Code Flow:

  1. Determines the file type from the extension
  2. Selects the appropriate LangChain loader (UnstructuredPDFLoader for PDFs, Docx2txtLoader for DOCX)
  3. Loads the document with a strategy suited to text extraction
  4. Joins all document elements into a single text string
  5. Returns the cleaned text content for LLM processing

Exceptions: ValueError for unsupported file types

Example:

text = extract_text_from_file("document.pdf")
print(len(text))  # character count of extracted text
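A hedged sketch of what this function might look like, using the loaders named in the code flow (import path per current langchain-community releases; the exact loader options in the real code are unknown):

from langchain_community.document_loaders import Docx2txtLoader, UnstructuredPDFLoader

def extract_text_from_file(filepath: str) -> str:
    """Sketch: choose a loader by extension and join the page contents."""
    if filepath.lower().endswith(".pdf"):
        loader = UnstructuredPDFLoader(filepath)
    elif filepath.lower().endswith(".docx"):
        loader = Docx2txtLoader(filepath)
    else:
        raise ValueError(f"Unsupported file type: {filepath}")
    docs = loader.load()  # list of Document objects with .page_content
    return "\n".join(doc.page_content for doc in docs).strip()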

LLM Processing Module (utils/llm_processor.py)

Function: call_llm(prompt: str) -> str

Description: Communicates with OpenAI's GPT-4.1 mini model for document data extraction
Parameters:

  • prompt (str): Formatted prompt containing document text and extraction instructions

Return Value: str - LLM response containing the extracted structured data

Code Flow:

  1. Loads the OpenAI API key from environment variables
  2. Configures the OpenAI client with authentication
  3. Sends the structured prompt to the gpt-4.1-mini-2025-04-14 model
  4. Uses a low temperature (0.2) for consistent, factual responses
  5. Extracts and cleans the response content
  6. Returns structured JSON data as a string

Exceptions: EnvironmentError for a missing API key; OpenAI API errors

Example:

response = call_llm("Extract vendor information from this document...")
print(response)  # JSON-formatted extracted data
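A minimal sketch of this call, assuming the current openai Python client and python-dotenv (model name and temperature per the code flow above):

import os

from dotenv import load_dotenv
from openai import OpenAI

def call_llm(prompt: str) -> str:
    """Sketch: send the prompt to the model named in the code flow."""
    load_dotenv()
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise EnvironmentError("OPENAI_API_KEY is not set")
    client = OpenAI(api_key=api_key)
    response = client.chat.completions.create(
        model="gpt-4.1-mini-2025-04-14",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # low temperature for consistent, factual output
    )
    return response.choices[0].message.content.strip()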

Category Validation Module (utils/validate_category.py)

Function: generate_prompt_from_schema(category: str, text: str) -> str

Description: Creates structured prompts for LLM processing based on JSON schemas
Parameters:

  • category (str): Document category for schema selection
  • text (str): Extracted document text content

Return Value: str - formatted prompt string for LLM processing

Code Flow:

  1. Constructs the path to the category-specific JSON schema file
  2. Validates that the schema file exists
  3. Loads the JSON schema defining the expected output structure
  4. Converts field names from snake_case to Title Case
  5. Creates a JSON example showing the expected output format
  6. Builds a comprehensive prompt with extraction guidelines
  7. Returns the formatted prompt with clear instructions

Exceptions: FileNotFoundError for missing schema files

Example:

prompt = generate_prompt_from_schema('asset', document_text)
print(len(prompt))  # length of the generated prompt
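As a sketch, the schema-to-prompt logic could look like this; the prompt wording is an assumption, and only the steps mirror the code flow above:

import json
import os

def generate_prompt_from_schema(category: str, text: str) -> str:
    """Sketch: build an extraction prompt from the category's JSON schema."""
    schema_path = os.path.join("schemas", f"{category}.json")
    if not os.path.exists(schema_path):
        raise FileNotFoundError(f"No schema found for category: {category}")
    with open(schema_path, "r", encoding="utf-8") as f:
        schema = json.load(f)
    # Title-case the snake_case field names for a readable field list
    field_list = "\n".join(f"- {name.replace('_', ' ').title()}" for name in schema)
    example = json.dumps(schema, indent=2)  # empty-string defaults double as the example
    return (
        f"Extract the following fields from the document below.\n"
        f"Fields:\n{field_list}\n\n"
        f"Return JSON exactly in this shape:\n{example}\n\n"
        f"Document:\n{text}"
    )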

Output Structuring Module (utils/structure_output.py)

Function: structure_response(category: str, llm_response: str, content_extracted: str) -> dict

Description: Validates and structures LLM responses according to predefined schemas
Parameters:

  • category (str): Document category for schema validation
  • llm_response (str): Raw JSON response from LLM
  • content_extracted (str): Original document text content

Return Value: dict - structured output with category, content, and extracted data

Code Flow:

  1. Parses the LLM response as JSON
  2. Loads the category-specific schema for field validation
  3. Creates an output dictionary with all required schema fields
  4. Handles missing fields by setting them to empty strings
  5. Constructs the final result structure with metadata
  6. Saves the complete result to a JSON file for persistence
  7. Returns the structured result for immediate use

Exceptions: json.JSONDecodeError for invalid JSON; FileNotFoundError for missing schemas

Example:

result = structure_response('asset', llm_json, document_text)
print(result['output']['Vendor_Name'])  # access extracted data
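A hedged sketch of the validation logic; the persisted JSON path and the result keys are assumptions based on the examples in this README:

import json

def structure_response(category: str, llm_response: str, content_extracted: str) -> dict:
    """Sketch: validate LLM JSON against the schema and fill missing fields."""
    extracted = json.loads(llm_response)  # raises json.JSONDecodeError on bad JSON
    with open(f"schemas/{category}.json", "r", encoding="utf-8") as f:
        schema = json.load(f)
    # Keep every schema field; anything the LLM omitted becomes an empty string
    output = {field: extracted.get(field, "") for field in schema}
    result = {"category": category, "content": content_extracted, "output": output}
    with open("static/output/result.json", "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2)  # output path is an assumption
    return result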

JSON to Excel Conversion Module (utils/json_to_excel.py)

Function: json_to_excel(json_path: str, output_dir: str = "static/output") -> str

Description: Converts structured JSON data into Excel format for analysis
Parameters:

  • json_path (str): Path to JSON file containing structured data
  • output_dir (str): Directory for Excel file output (default: "static/output")

Return Value: str - path to the created Excel file

Code Flow:

  1. Reads and parses the JSON file containing the structured data
  2. Extracts the category name for the Excel sheet name
  3. Validates the output data structure
  4. Creates a new Excel workbook with the category as the sheet name
  5. Writes field names as column headers
  6. Populates a data row with the extracted values
  7. Generates a descriptive filename based on the category and original file
  8. Ensures the output directory exists
  9. Saves the Excel workbook and returns the file path

Exceptions: ValueError for an invalid output structure; FileNotFoundError for a missing JSON file

Example:

excel_path = json_to_excel("output.json")
print(f"Excel file created at: {excel_path}")
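A minimal sketch with openpyxl; the output filename pattern is an assumption, while the steps follow the code flow above:

import json
import os

from openpyxl import Workbook

def json_to_excel(json_path: str, output_dir: str = "static/output") -> str:
    """Sketch: one header row and one data row, sheet named after the category."""
    with open(json_path, "r", encoding="utf-8") as f:
        result = json.load(f)
    output = result.get("output")
    if not isinstance(output, dict):
        raise ValueError("JSON file has no valid 'output' structure")

    wb = Workbook()
    ws = wb.active
    ws.title = result.get("category", "output")[:31]  # sheet names cap at 31 chars
    ws.append(list(output.keys()))    # field names as column headers
    ws.append(list(output.values()))  # one data row of extracted values

    os.makedirs(output_dir, exist_ok=True)
    excel_path = os.path.join(output_dir, f"{ws.title}_output.xlsx")
    wb.save(excel_path)
    return excel_path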

Testing

Sample Documents

The resources/ directory contains sample documents for testing:

  • Asset purchase orders
  • Electricity bills
  • Lease agreements
  • Rental invoices
  • Utility bills

Testing the API

  1. Start the server: uvicorn main:app --reload
  2. Open http://localhost:8000/docs
  3. Use the interactive interface to upload sample documents
  4. Download and verify the generated outputs
