feat: Add Invoice Processing at Scale Infrastructure #32
mrrahman1517 wants to merge 7 commits into `main` from `Feature/invoice processing at scale`
Conversation
Sprint 1 - Make invoice extraction reliable:
- Invoice-specific pipeline entrypoint (InvoicePipeline)
- Pydantic invoice model with strict JSON schema validation
- Invoice validator and normalizer processors
- Confidence routing + NEEDS_REVIEW status for human-in-the-loop

Sprint 2 - Handle real invoices:
- OCR processor with scanned PDF auto-detection
- Two-stage line item extraction (heuristic + LLM refinement)
- Vendor/customer normalization with deduplication keys

Sprint 3 - Scale infrastructure:
- Async job worker using existing Operation model
- Job queue with priorities and dependencies
- Rate limiting + batching for LLM calls
- Cost tracking per model

Sprint 4 - Export connectors:
- Webhook connector with HMAC signing
- S3 connector with encryption support
- Database connector with upsert
- CSV exporter with compression

Key components:
- docex/models/invoice.py: Pydantic models for invoice data
- docex/processors/invoice/: Extractor, normalizer, validator, pipeline
- docex/processors/pdf_ocr.py: OCR for scanned PDFs
- docex/services/invoice_service.py: High-level service API
- docex/jobs/: Worker, queue, rate limiter
- docex/connectors/: Webhook, S3, database, CSV export
- examples/invoice_processing_example.py: Complete usage example
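The confidence routing mentioned in Sprint 1 can be sketched roughly as follows. This is an illustrative sketch, not the actual DocEX code: the `route_extraction` helper and the `APPROVED`/`EXTRACTED` status names are assumptions; only `NEEDS_REVIEW` and the two thresholds (which also appear in the usage example below) come from the PR.

```python
# Hypothetical sketch of confidence routing for human-in-the-loop review.
# Status names other than NEEDS_REVIEW are illustrative.

def route_extraction(confidence, auto_approve_threshold=0.95,
                     confidence_threshold=0.8):
    """Route an extraction result based on its confidence score."""
    if confidence is None:
        # A later commit in this PR fixes handling of None confidence scores;
        # routing to human review is one safe interpretation.
        return "NEEDS_REVIEW"
    if confidence >= auto_approve_threshold:
        return "APPROVED"
    if confidence >= confidence_threshold:
        return "EXTRACTED"
    return "NEEDS_REVIEW"
```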
- Add _call_llm method that supports different adapter APIs:
  - LocalLLMService.generate_completion()
  - OllamaAdapter.generate()
  - OpenAI-style chat() method
- Fix confidence score handling when None
- Fix example to use correct DocEX API
- Pass actual document IDs to job queue example
Thanks for preparing this pull request. Can you first review the concept of connectors? DocEX already has storage components supporting both S3 and the file system, and it has a Postgres backend as well.
tommyGPT2S left a comment
Please review whether the connectors are necessary. Ideally, we should leverage the current storage components.
Address PR review comments:
- Removed S3Connector (docex/connectors/s3.py) as it duplicates existing S3Storage functionality in docex/storage/s3_storage.py
- Added StorageExporter that wraps existing storage infrastructure for structured data export use cases
- Updated connector documentation to clarify the distinction between Connectors (structured data export) and Storage (document content)
- Added convenience functions export_to_s3() and export_to_filesystem() that use existing storage components

For raw document storage, use docex.storage (S3Storage, FilesystemStorage).
For structured data export, use docex.connectors (WebhookConnector, DatabaseConnector, CSVExporter, StorageExporter).
@tommyGPT2S Thanks for the review and happy new year! I've addressed the feedback:

Changes Made
Clarification: Connectors vs Storage

After review, I kept 3 connectors because they serve a different purpose than storage:
Example Use Case

When processing invoices:
Verification
Let me know if you'd like me to make any other adjustments!
… storage

Updated documentation to clearly distinguish:
- DatabaseConnector: For exporting structured data to EXTERNAL systems (customer ERPs, data warehouses, reporting databases)
- docex.db.*: For DocEX INTERNAL storage (documents, metadata, operations)

This addresses PR feedback about potential confusion with the existing PostgreSQL backend.
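The "database connector with upsert" from the original commit list targets exactly these external systems. The pattern can be illustrated with standard SQL; this sqlite3 sketch is not the DocEX DatabaseConnector, and the `invoices` schema is hypothetical.

```python
import sqlite3

# Illustrative upsert pattern for exporting invoices to an external
# database, as a DatabaseConnector might do. Schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE invoices (
        invoice_number TEXT PRIMARY KEY,
        vendor TEXT,
        total REAL
    )
""")

def upsert_invoice(conn, invoice):
    # Insert a new row, or update the existing one on key conflict
    conn.execute(
        """
        INSERT INTO invoices (invoice_number, vendor, total)
        VALUES (:invoice_number, :vendor, :total)
        ON CONFLICT(invoice_number) DO UPDATE SET
            vendor = excluded.vendor,
            total = excluded.total
        """,
        invoice,
    )
    conn.commit()

upsert_invoice(conn, {"invoice_number": "INV-1", "vendor": "Acme", "total": 100.0})
upsert_invoice(conn, {"invoice_number": "INV-1", "vendor": "Acme", "total": 120.0})
```

Re-exporting the same invoice updates the existing row instead of failing on the primary key, which is what makes repeated export runs idempotent.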
FileSystemStorage.save() expects a file-like object with a read() method. Wrap JSON content in BytesIO for compatibility with existing storage.
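The fix in this commit can be shown in isolation: serialized JSON is bytes, so it is wrapped in `io.BytesIO` to satisfy a read()-based storage API. The exact FileSystemStorage.save() signature is assumed here.

```python
import io
import json

def to_file_like(record):
    """Wrap structured data as a file-like object so it can be passed
    to a storage backend that expects something with read()."""
    payload = json.dumps(record).encode("utf-8")
    return io.BytesIO(payload)

buf = to_file_like({"invoice_number": "INV-1"})
data = buf.read()  # storage backends call read() internally
```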
@tommyGPT2S Thanks for the review and happy new year! I've addressed the feedback:

Changes Made
Clarification: Connectors vs Storage vs Database

After review, the remaining connectors serve a different purpose than existing components:
Why We Did NOT Reimplement PostgreSQL
Example flow:
Verification

All tests pass:
Let me know if you'd like me to make any other adjustments!
Invoice Processing at Scale
Summary
This PR introduces a comprehensive invoice processing infrastructure for DocEX, enabling reliable extraction, validation, and processing of invoices at scale with human-in-the-loop support.
Key Features
Reliable Invoice Extraction
- `NEEDS_REVIEW` status for low-confidence extractions

Real-World Invoice Handling
Scalability Infrastructure
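The rate limiting for LLM calls could follow a token-bucket pattern like this sketch. It is not the actual docex/jobs implementation; the class name and parameters are illustrative, and the clock is injected so the sketch stays deterministic.

```python
class TokenBucket:
    """Illustrative token-bucket limiter for LLM calls: `rate` tokens
    are refilled per second up to `capacity`; each call consumes one
    token. `clock` is any zero-argument callable returning seconds."""

    def __init__(self, rate, capacity, clock):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        # Refill tokens for the time elapsed since the last call
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A worker would call `try_acquire()` before each LLM request and requeue (or sleep) when it returns False.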
Export Connectors
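The webhook connector's HMAC signing (listed in the commit summary) can be sketched with the standard library. The serialization choices and the idea of carrying the signature in a header are assumptions, not the DocEX wire format.

```python
import hashlib
import hmac
import json

def sign_payload(secret: bytes, payload: dict):
    """Return the serialized body and a hex HMAC-SHA256 signature,
    as a webhook connector might attach in a signature header."""
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode()
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature

def verify_payload(secret: bytes, body: bytes, signature: str) -> bool:
    """Receiver side: recompute and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

`hmac.compare_digest` is used instead of `==` to avoid timing side channels when validating incoming webhooks.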
Components Added
```
docex/
├── connectors/                  # Export connectors (webhook, S3, DB, CSV)
├── jobs/                        # Async worker, queue, rate limiter
├── models/invoice.py            # Pydantic invoice models
├── processors/invoice/          # Extractor, normalizer, validator, pipeline
├── processors/pdf_ocr.py        # OCR for scanned PDFs
└── services/invoice_service.py  # High-level service API
examples/
└── invoice_processing_example.py  # Complete usage example
```
Testing
All features tested end-to-end with Ollama (local LLM):
Usage Example
```python
from docex.services.invoice_service import InvoiceService
from docex.models.invoice import InvoiceProcessingConfig

# Configure
config = InvoiceProcessingConfig(
    confidence_threshold=0.8,
    auto_approve_threshold=0.95,
    enable_ocr=True
)

# Process invoice
service = InvoiceService(db, llm_adapter, config)
result = await service.process_invoice(document)
```
Breaking Changes
None - this is a new feature addition.
Dependencies
Optional dependencies for full functionality:
- `pytesseract` + `pdf2image`: OCR support
- `boto3`: S3 connector
- `httpx` or `aiohttp`: Webhook connector