Automates extraction and consolidation of financial data from ESG PDF reports using Gemini and a PDF processing library. Processes all reports in batches, extracts appendix sections, and returns structured JSON.
flowchart TB
A[Client] --> B[GET /process]
B --> C{Cache hit?}
C -- yes --> D[Return cached 200]
C -- no --> E[process_reports]
E --> F[Consolidate + create response data]
F --> G[Return 200]
flowchart LR
subgraph Artifacts
R[reports/*.pdf]
subgraph Parallel
PP[preprocessing/*.pdf]
PR[processed/*.json]
end
RJ[processed/result.json]
VJ[config/values.json]
end
R -- detect+extract --> PP
PP -- extract fields --> PR
PR -- consolidate --> RJ
VJ -. schema guides .-> PR
- Requirements: Python 3.10+ (recommended 3.11+)
- Install:
    make dev-install
- Create .env at repo root (required at minimum):
    LAST_MODIFIED=2025-01-01T00:00:00Z
    # Optional
    GEMINI_API_KEY=  # set to enable real Gemini calls
    GEMINI_MODEL=gemini-1.5-pro
    GEMINI_FALLBACK_ON_QUOTA=false
    PIPELINE_MAX_WORKERS=4
    GEMINI_CONCURRENCY=2
    GEMINI_TIMEOUT_SECONDS=30.0
    PER_FILE_TIMEOUT_SECONDS=180.0
- Ensure directories exist (created automatically on first run if missing):
  - ./reports (input PDFs)
  - ./preprocessing (appendix-only PDFs)
  - ./processed (per-file JSONs and result.json)
  - ./config/values.json (field schema)
- Run server:
    make run
    # or
    uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
- Endpoint: GET /process
- Query: force (bool, default false). When true, bypasses cache and reprocesses even if result.json matches .env LAST_MODIFIED.
- Success (200): Consolidated JSON excluding LAST_MODIFIED.
- Error (500): Any file system, AI processing, or data processing failures per PRD.
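A client call to this endpoint might look like the following sketch. It uses only the standard library; the base URL is an assumption taken from the uvicorn command in this README, and build_process_url and fetch_consolidated are hypothetical helper names, not part of the project.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the server runs locally on the port used by the uvicorn command.
BASE_URL = "http://localhost:8000"


def build_process_url(base_url: str = BASE_URL, force: bool = False) -> str:
    """Build the GET /process URL, adding force=true only when requested."""
    url = f"{base_url.rstrip('/')}/process"
    if force:
        url += "?" + urlencode({"force": "true"})
    return url


def fetch_consolidated(force: bool = False, timeout: float = 300.0) -> dict:
    """Call the endpoint and return the consolidated JSON body.

    urlopen raises urllib.error.HTTPError when the server answers 500.
    """
    with urlopen(build_process_url(force=force), timeout=timeout) as resp:
        return json.load(resp)
```

The generous client timeout is a guess: an uncached run processes every report before the response is returned.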
- Cache check against .env LAST_MODIFIED and existing ./processed/result.json.
- Parallel fan-out per PDF in ./reports (bounded by PIPELINE_MAX_WORKERS).
- Per-file sequential steps: processed check, preprocessing check, appendix range via Gemini, PDF slice, field extraction via Gemini, save ./processed/[file].json.
- Wait for all tasks and validate all expected JSONs.
- Consolidate to ./processed/result.json with last_modified = .env LAST_MODIFIED.
- Return response omitting LAST_MODIFIED.
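The flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: run_pipeline, cache_is_fresh, and the process_one callback are hypothetical names, per-file outputs are assumed to be named after the PDF stem, and the "reports" key in result.json is an invented shape. The per-file steps (processed check, preprocessing, Gemini calls) are left to the callback.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable

# Hypothetical per-file worker: (input PDF, output JSON path) -> None
ProcessOne = Callable[[Path, Path], None]


def cache_is_fresh(result_path: Path, last_modified: str) -> bool:
    """True when result.json exists and carries the current LAST_MODIFIED token."""
    if not result_path.is_file():
        return False
    try:
        cached = json.loads(result_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return False
    return isinstance(cached, dict) and cached.get("last_modified") == last_modified


def run_pipeline(
    reports_dir: Path,
    processed_dir: Path,
    last_modified: str,
    process_one: ProcessOne,
    max_workers: int = 4,
    force: bool = False,
) -> dict:
    result_path = processed_dir / "result.json"
    if not force and cache_is_fresh(result_path, last_modified):
        data = json.loads(result_path.read_text(encoding="utf-8"))
    else:
        processed_dir.mkdir(parents=True, exist_ok=True)
        pdfs = sorted(reports_dir.glob("*.pdf"))
        outputs = {pdf: processed_dir / f"{pdf.stem}.json" for pdf in pdfs}
        # Fan out one task per PDF, bounded by max_workers.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(process_one, pdf, out) for pdf, out in outputs.items()]
            for future in futures:
                future.result()  # waits, and re-raises a worker's exception
        # Validate that every expected per-file JSON was written.
        missing = [out.name for out in outputs.values() if not out.is_file()]
        if missing:
            raise RuntimeError(f"missing per-file JSON: {missing}")
        data = {
            "last_modified": last_modified,
            "reports": {
                pdf.stem: json.loads(out.read_text(encoding="utf-8"))
                for pdf, out in outputs.items()
            },
        }
        result_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    # The response omits the cache token.
    return {key: value for key, value in data.items() if key != "last_modified"}
```

Threads are used here because the per-file work is assumed to be dominated by network and disk waits; a process pool or asyncio would be equally valid choices.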
If GEMINI_API_KEY is empty, the pipeline uses a stub path that generates minimal JSON per file, enabling dev/test without external calls.
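A minimal sketch of how such a switch could work, assuming the schema in config/values.json is a mapping keyed by field name. The function names, the real_extract callback, and the shape of the stub JSON are all hypothetical; the actual stub output may differ.

```python
import os
from pathlib import Path
from typing import Callable, Optional


def stub_extract(pdf_path: Path, schema: dict) -> dict:
    """Placeholder result: every schema field is present, none has a value."""
    return {
        "source": pdf_path.name,
        "stub": True,
        "fields": {name: None for name in schema},
    }


def extract_fields(
    pdf_path: Path,
    schema: dict,
    real_extract: Callable[[Path, dict, str], dict],
    api_key: Optional[str] = None,
) -> dict:
    """Use the stub when no API key is configured, else the real extractor."""
    key = os.environ.get("GEMINI_API_KEY", "") if api_key is None else api_key
    if not key.strip():
        return stub_extract(pdf_path, schema)
    # real_extract stands in for whatever performs the Gemini call.
    return real_extract(pdf_path, schema, key)
```

Passing the extractor as a callback keeps the stub path testable without any network access.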
- Run tests:
    make test
- Lint/format:
    make lint
    make format
Environment variables (via .env):
- LAST_MODIFIED (required): The cache-busting timestamp token.
- GEMINI_API_KEY (optional): Enables real Gemini processing when set.
- GEMINI_MODEL (optional): Default gemini-1.5-pro.
- GEMINI_FALLBACK_ON_QUOTA (optional): When true, files that hit quota/rate-limit are stubbed instead of failing the batch.
- PIPELINE_MAX_WORKERS (optional): Max parallel files processed.
- GEMINI_CONCURRENCY (optional): Max concurrent Gemini calls.
- GEMINI_TIMEOUT_SECONDS, PER_FILE_TIMEOUT_SECONDS (optional): Timeouts in seconds.
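One way to read these variables into typed settings is sketched below. It assumes the .env values have already been loaded into the process environment (for example by a dotenv loader). Settings and load_settings are hypothetical names, and apart from GEMINI_MODEL the fallback values are taken from the sample .env in this README, not from documented defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUE_VALUES = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    last_modified: str
    gemini_api_key: str = ""
    gemini_model: str = "gemini-1.5-pro"
    gemini_fallback_on_quota: bool = False
    pipeline_max_workers: int = 4  # assumed default, from the sample .env
    gemini_concurrency: int = 2  # assumed default, from the sample .env
    gemini_timeout_seconds: float = 30.0  # assumed default
    per_file_timeout_seconds: float = 180.0  # assumed default


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (os.environ when none is given)."""
    source = os.environ if env is None else env

    def raw(name: str) -> str:
        return source.get(name, "").strip()

    last_modified = raw("LAST_MODIFIED")
    if not last_modified:
        raise ValueError("LAST_MODIFIED is required")

    defaults = Settings(last_modified=last_modified)
    return Settings(
        last_modified=last_modified,
        gemini_api_key=raw("GEMINI_API_KEY"),
        gemini_model=raw("GEMINI_MODEL") or defaults.gemini_model,
        gemini_fallback_on_quota=raw("GEMINI_FALLBACK_ON_QUOTA").lower() in _TRUE_VALUES,
        pipeline_max_workers=int(raw("PIPELINE_MAX_WORKERS") or defaults.pipeline_max_workers),
        gemini_concurrency=int(raw("GEMINI_CONCURRENCY") or defaults.gemini_concurrency),
        gemini_timeout_seconds=float(raw("GEMINI_TIMEOUT_SECONDS") or defaults.gemini_timeout_seconds),
        per_file_timeout_seconds=float(raw("PER_FILE_TIMEOUT_SECONDS") or defaults.per_file_timeout_seconds),
    )
```

Failing fast on a missing LAST_MODIFIED mirrors its "required" status above; how the real application reports that error is not specified here.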
- reports/: input PDFs
- preprocessing/: appendix-only PDFs
- processed/: per-file JSONs, plus result.json
- config/values.json: schema for field extraction
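Creating this layout on first run could be as simple as the sketch below. ensure_layout is a hypothetical helper; it only creates directories and does not write config/values.json, whose contents are not described here.

```python
from pathlib import Path

# Directory names from the layout above; config/ holds values.json.
DIRECTORIES = ("reports", "preprocessing", "processed", "config")


def ensure_layout(root: Path) -> list:
    """Create any missing directories under root and return the ones created."""
    created = []
    for name in DIRECTORIES:
        path = root / name
        if not path.is_dir():
            path.mkdir(parents=True)
            created.append(path)
    return created
```

Calling it a second time returns an empty list, so it is safe to run at every startup.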
- See troubleshooting/ for known issues and mitigations (e.g., Gemini 429 quota). Use /process?force=true after updating .env.
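The quota mitigation controlled by GEMINI_FALLBACK_ON_QUOTA can be pictured with the sketch below. QuotaExceeded is a hypothetical exception standing in for whatever the model client raises on HTTP 429; extract_with_fallback and its callbacks are likewise illustrative names.

```python
from typing import Callable


class QuotaExceeded(Exception):
    """Hypothetical marker for quota or rate-limit (HTTP 429) errors."""


def extract_with_fallback(
    extract: Callable[[], dict],
    make_stub: Callable[[], dict],
    fallback_on_quota: bool,
) -> dict:
    """Run extract(); on a quota error, return a stub only if fallback is enabled."""
    try:
        return extract()
    except QuotaExceeded:
        if not fallback_on_quota:
            raise  # fail the file, and with it the batch
        return make_stub()
```

Only quota errors are caught; any other failure still propagates, which matches the 500 behaviour described for the endpoint.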
Educational use only. See LICENSE for full terms.
