ESGExtract

Automates extraction and consolidation of financial data from ESG PDF reports using Gemini and a PDF processing library. Processes all reports in batches, extracts appendix sections, and returns structured JSON.

Flowchart

```mermaid
flowchart TB
  A[Client] --> B[GET /process]
  B --> C{Cache hit?}
  C -- yes --> D[Return cached 200]
  C -- no --> E[process_reports]
  E --> F[Consolidate + create response data]
  F --> G[Return 200]
```

```mermaid
flowchart LR
    subgraph Artifacts
      R[reports/*.pdf]
      subgraph Parallel
        PP[preprocessing/*.pdf]
        PR[processed/*.json]
      end
      RJ[processed/result.json]
      VJ[config/values.json]
    end
    R -- detect+extract --> PP
    PP -- extract fields --> PR
    PR -- consolidate --> RJ
    VJ -. schema guides .-> PR
```

Quickstart

Requirements: Python 3.10+ (recommended 3.11+)
Install:
```
make dev-install
```

Create a .env file at the repo root. Only LAST_MODIFIED is required:

```
LAST_MODIFIED=2025-01-01T00:00:00Z
# Optional
# Set GEMINI_API_KEY to enable real Gemini calls
GEMINI_API_KEY=
GEMINI_MODEL=gemini-1.5-pro
GEMINI_FALLBACK_ON_QUOTA=false
PIPELINE_MAX_WORKERS=4
GEMINI_CONCURRENCY=2
GEMINI_TIMEOUT_SECONDS=30.0
PER_FILE_TIMEOUT_SECONDS=180.0
```

Ensure the following paths exist (directories are created automatically on first run if missing):
- ./reports (input PDFs)
- ./preprocessing (appendix-only PDFs)
- ./processed (per-file JSONs and result.json)
- ./config/values.json (field schema)
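
As an illustration of the "created automatically if missing" behavior, here is a minimal sketch using pathlib. The function name `ensure_dirs` and the `REQUIRED_DIRS` constant are hypothetical and not taken from the codebase. The sketch only creates directories and does not generate config/values.json.

```python
from pathlib import Path

# Directory names come from the list above; the constant itself is hypothetical.
REQUIRED_DIRS = ("reports", "preprocessing", "processed", "config")


def ensure_dirs(root: Path) -> list:
    """Create any missing working directories under root and return all of them."""
    paths = [root / name for name in REQUIRED_DIRS]
    for path in paths:
        # exist_ok=True makes the call safe to repeat on every startup.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling `ensure_dirs(Path("."))` from the repo root would create the four directories if they are not already present.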

Run server:

```
make run
# or
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

API

Endpoint: GET /process
- Query: force (bool, default false). When true, bypasses the cache and reprocesses even if result.json already matches LAST_MODIFIED in .env.
- Success (200): the consolidated JSON, with LAST_MODIFIED omitted.
- Error (500): any file system, AI processing, or data processing failure, per the PRD.
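
A small client sketch for calling the endpoint, using only the standard library. The helper names `build_process_url` and `fetch_consolidated` and the 300-second default timeout are assumptions for illustration. The base URL matches the uvicorn command in Quickstart.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_process_url(base_url: str, force: bool = False) -> str:
    """Build the GET /process URL with the force query flag."""
    query = urlencode({"force": "true" if force else "false"})
    return f"{base_url.rstrip('/')}/process?{query}"


def fetch_consolidated(base_url: str, force: bool = False, timeout: float = 300.0) -> dict:
    """Call GET /process and decode the JSON body.

    urlopen raises urllib.error.HTTPError on a 500 response,
    so callers should be prepared to catch it.
    """
    with urlopen(build_process_url(base_url, force), timeout=timeout) as resp:
        return json.load(resp)
```

With the server running locally, `fetch_consolidated("http://localhost:8000", force=True)` would request a fresh run.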

Processing behavior (Steps 1–6)

1. Cache check against .env LAST_MODIFIED and existing ./processed/result.json.
2. Parallel fan-out per PDF in ./reports (bounded by PIPELINE_MAX_WORKERS).
3. Per-file sequential steps: processed check, preprocessing check, appendix range via Gemini, PDF slice, field extraction via Gemini, save ./processed/[file].json.
4. Wait for all tasks and validate all expected JSONs.
5. Consolidate to ./processed/result.json with last_modified = .env LAST_MODIFIED.
6. Return response omitting LAST_MODIFIED.
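
The steps above can be sketched roughly as follows. This is not the project's actual implementation: `run_pipeline`, the `process_one` callback (standing in for the per-file steps), and the `"reports"` key in the consolidated output are all assumptions. Only the `last_modified` key and the result.json location come from the description above.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def run_pipeline(
    reports_dir: Path,
    processed_dir: Path,
    last_modified: str,
    process_one: Callable[[Path], dict],  # hypothetical per-file worker (step 3)
    max_workers: int = 4,
    force: bool = False,
) -> dict:
    result_path = processed_dir / "result.json"

    # Step 1: return the cached result when its token matches and force is off.
    if not force and result_path.exists():
        cached = json.loads(result_path.read_text())
        if cached.get("last_modified") == last_modified:
            return {k: v for k, v in cached.items() if k != "last_modified"}

    # Steps 2-3: fan out one task per PDF, bounded by max_workers.
    pdfs = sorted(reports_dir.glob("*.pdf"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for every task; the first worker exception is re-raised here.
        outputs = list(pool.map(process_one, pdfs))

    # Step 4: validate that every file produced a JSON object.
    bad = [p.name for p, out in zip(pdfs, outputs) if not isinstance(out, dict)]
    if bad:
        raise RuntimeError(f"missing or invalid output for: {bad}")

    # Step 5: consolidate and persist with the cache token.
    consolidated = {
        "last_modified": last_modified,
        "reports": {p.stem: out for p, out in zip(pdfs, outputs)},  # key name assumed
    }
    result_path.write_text(json.dumps(consolidated, indent=2))

    # Step 6: respond without the cache token.
    return {k: v for k, v in consolidated.items() if k != "last_modified"}
```

A thread pool is used here because the per-file work is presumably dominated by network calls. An asyncio-based design would be an equally plausible choice.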

If GEMINI_API_KEY is empty, the pipeline uses a stub path that generates minimal JSON per file, enabling dev/test without external calls.
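
A minimal sketch of how such a stub path might look. The function name `extract_fields`, its parameters, and the shape of the stub JSON (`source`, `stub`, `fields`) are hypothetical. The real stub output may differ.

```python
from pathlib import Path
from typing import Optional


def extract_fields(pdf_path: Path, schema_fields: list, api_key: Optional[str] = None) -> dict:
    """Return stub JSON when no API key is configured (hypothetical shape)."""
    if not api_key:
        # Stub path: no external call, every schema field is left empty.
        return {
            "source": pdf_path.name,
            "stub": True,
            "fields": {name: None for name in schema_fields},
        }
    # The real Gemini call would go here; it is deliberately not sketched.
    raise NotImplementedError("real Gemini extraction is outside this sketch")
```

Branching on the key in one place keeps tests independent of network access and quota.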

Development

Run tests:
```
make test
```
Lint/format:
```
make lint
make format
```

Configuration

Environment variables (via .env):

- LAST_MODIFIED (required): The cache-busting timestamp token.
- GEMINI_API_KEY (optional): Enables real Gemini processing when set.
- GEMINI_MODEL (optional): Default gemini-1.5-pro.
- GEMINI_FALLBACK_ON_QUOTA (optional): When true, files that hit quota/rate-limit are stubbed instead of failing the batch.
- PIPELINE_MAX_WORKERS (optional): Max parallel files processed.
- GEMINI_CONCURRENCY (optional): Max concurrent Gemini calls.
- GEMINI_TIMEOUT_SECONDS, PER_FILE_TIMEOUT_SECONDS (optional): Timeouts in seconds.
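
One way these variables could be loaded into a typed settings object is sketched below. The `Settings` class and `load_settings` function are hypothetical. Only the GEMINI_MODEL default is documented above. The other defaults simply mirror the sample .env values from Quickstart and are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    last_modified: str
    gemini_api_key: Optional[str]
    gemini_model: str
    fallback_on_quota: bool
    max_workers: int
    gemini_concurrency: int
    gemini_timeout_seconds: float
    per_file_timeout_seconds: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping; LAST_MODIFIED is mandatory."""
    if not env.get("LAST_MODIFIED"):
        raise ValueError("LAST_MODIFIED is required")
    return Settings(
        last_modified=env["LAST_MODIFIED"],
        # An empty key is treated the same as an unset one (stub mode).
        gemini_api_key=env.get("GEMINI_API_KEY") or None,
        gemini_model=env.get("GEMINI_MODEL", "gemini-1.5-pro"),
        fallback_on_quota=env.get("GEMINI_FALLBACK_ON_QUOTA", "false").strip().lower() == "true",
        # Defaults below are assumed from the sample .env, not documented defaults.
        max_workers=int(env.get("PIPELINE_MAX_WORKERS", "4")),
        gemini_concurrency=int(env.get("GEMINI_CONCURRENCY", "2")),
        gemini_timeout_seconds=float(env.get("GEMINI_TIMEOUT_SECONDS", "30.0")),
        per_file_timeout_seconds=float(env.get("PER_FILE_TIMEOUT_SECONDS", "180.0")),
    )
```

Passing the mapping explicitly, for example `load_settings({"LAST_MODIFIED": "2025-01-01T00:00:00Z"})`, makes the loader easy to exercise in tests without touching the process environment.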

Directories

- reports/ (input PDFs)
- preprocessing/ (appendix-only PDFs)
- processed/ (per-file JSONs, plus result.json)
- config/values.json (schema for field extraction)

Troubleshooting

See troubleshooting/ for known issues and mitigations (e.g., Gemini 429 quota errors). After updating .env, call /process?force=true to bypass the cached result.

License

Educational use only. See LICENSE for full terms.
