
Junction Asia 2025 - Chillgorithm. Automates extraction and consolidation of financial data from ESG PDF reports using Gemini and a PDF processing library. Processes all reports in batches, extracts appendix sections, and returns structured JSON.

Chillgorithm/esgextract-be

ESGExtract

Automates extraction and consolidation of financial data from ESG PDF reports using Gemini and a PDF processing library. Processes all reports in batches, extracts appendix sections, and returns structured JSON.

Flowchart

flowchart TB
  A[Client] --> B[GET /process]
  B --> C{Cache hit?}
  C -- yes --> D[Return cached 200]
  C -- no --> E[process_reports]
  E --> F[Consolidate + create response data]
  F --> G[Return 200]
flowchart LR
    subgraph Artifacts
      R[reports/*.pdf]
      subgraph Parallel
        PP[preprocessing/*.pdf]
        PR[processed/*.json]
      end
      RJ[processed/result.json]
      VJ[config/values.json]
    end
    R -- detect+extract --> PP
    PP -- extract fields --> PR
    PR -- consolidate --> RJ
    VJ -. schema guides .-> PR

FastAPI

Quickstart

  • Requirements: Python 3.10+ (recommended 3.11+)

  • Install:

    make dev-install
  • Create .env at repo root (only LAST_MODIFIED is required; the other variables are optional):

    LAST_MODIFIED=2025-01-01T00:00:00Z
    # Optional
    GEMINI_API_KEY= # set to enable real Gemini calls
    GEMINI_MODEL=gemini-1.5-pro
    GEMINI_FALLBACK_ON_QUOTA=false
    PIPELINE_MAX_WORKERS=4
    GEMINI_CONCURRENCY=2
    GEMINI_TIMEOUT_SECONDS=30.0
    PER_FILE_TIMEOUT_SECONDS=180.0
  • Ensure the following paths exist (missing directories are created automatically on first run):

    • ./reports (input PDFs)
    • ./preprocessing (appendix-only PDFs)
    • ./processed (per-file JSONs and result.json)
    • ./config/values.json (field schema)
  • Run server:

    make run
    # or
    uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

API

  • Endpoint: GET /process
    • Query: force (bool, default false). When true, bypasses cache and reprocesses even if result.json matches .env LAST_MODIFIED.
    • Success (200): Consolidated JSON, excluding the last_modified field.
    • Error (500): Any file system, AI processing, or data processing failure, per the PRD.
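
A minimal client sketch for the endpoint above, using only the Python standard library. The helper names (build_process_url, fetch_consolidated) and the 300-second client timeout are illustrative assumptions, not part of this repository; the base URL assumes the server from Quickstart on port 8000.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def build_process_url(base_url: str, force: bool = False) -> str:
    """Build the GET /process URL; `force` maps to the documented query flag."""
    query = urllib.parse.urlencode({"force": "true" if force else "false"})
    return f"{base_url.rstrip('/')}/process?{query}"


def fetch_consolidated(base_url: str, force: bool = False, timeout: float = 300.0) -> dict:
    """Call the endpoint and return the consolidated JSON body.

    Raises RuntimeError on an HTTP error status (for example the documented 500)
    so callers can tell a failed run from a result.
    """
    url = build_process_url(base_url, force)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as exc:
        raise RuntimeError(f"GET {url} failed with HTTP {exc.code}") from exc
```

With the server running locally, fetch_consolidated("http://localhost:8000", force=True) would request a full reprocess; the default call reuses the cache when it is still valid.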

Processing behavior (Steps 1–6)

  1. Cache check against .env LAST_MODIFIED and existing ./processed/result.json.
  2. Parallel fan-out per PDF in ./reports (bounded by PIPELINE_MAX_WORKERS).
  3. Per-file sequential steps: processed check, preprocessing check, appendix range via Gemini, PDF slice, field extraction via Gemini, save ./processed/[file].json.
  4. Wait for all tasks and validate all expected JSONs.
  5. Consolidate to ./processed/result.json with last_modified = .env LAST_MODIFIED.
  6. Return the response, omitting last_modified.
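
Steps 1, 5, and 6 can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the function names are hypothetical, and the result.json layout (a last_modified token plus a reports object keyed by file name) is assumed, since the actual schema is defined elsewhere in the project.

```python
import json
from pathlib import Path
from typing import Optional

RESULT_NAME = "result.json"


def load_cached(processed_dir: Path, last_modified: str) -> Optional[dict]:
    """Step 1: return the cached result only if its token matches LAST_MODIFIED."""
    result_path = processed_dir / RESULT_NAME
    if not result_path.is_file():
        return None
    try:
        cached = json.loads(result_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None  # treat a corrupt cache file as a miss
    if not isinstance(cached, dict) or cached.get("last_modified") != last_modified:
        return None
    return cached


def consolidate(processed_dir: Path, last_modified: str) -> dict:
    """Step 5: merge per-file JSONs into result.json, stamped with the token."""
    reports = {}
    for path in sorted(processed_dir.glob("*.json")):
        if path.name == RESULT_NAME:
            continue  # never fold a previous result into the new one
        # Assumed layout: one entry per report, keyed by the file stem.
        reports[path.stem] = json.loads(path.read_text(encoding="utf-8"))
    result = {"last_modified": last_modified, "reports": reports}
    (processed_dir / RESULT_NAME).write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result


def to_response(result: dict) -> dict:
    """Step 6: drop the cache token before returning the body."""
    return {key: value for key, value in result.items() if key != "last_modified"}
```

Changing LAST_MODIFIED in .env makes load_cached return None on the next request, which is what sends the request down the reprocessing path.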

If GEMINI_API_KEY is empty, the pipeline uses a stub path that generates minimal JSON per file, enabling dev/test without external calls.
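
The stub path and the bounded fan-out from step 2 might look roughly like this. Everything here is a sketch: stub_extract, gemini_extract, and the shape of the stub JSON are hypothetical, and process_reports is written with hypothetical directory arguments rather than the signature used in this repository. The default of 4 workers is taken from the Quickstart example.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def stub_extract(pdf_path: Path) -> dict:
    """Hypothetical stub: minimal JSON for one report, no external calls."""
    return {"source": pdf_path.name, "stub": True, "fields": {}}


def gemini_extract(pdf_path: Path) -> dict:
    """Placeholder for the real Gemini-backed extraction (not shown here)."""
    raise NotImplementedError("real Gemini extraction is outside this sketch")


def process_reports(reports_dir: Path, processed_dir: Path) -> list:
    """Fan out over ./reports, writing one JSON per PDF into ./processed."""
    api_key = os.environ.get("GEMINI_API_KEY", "").strip()
    extract = gemini_extract if api_key else stub_extract  # empty key selects the stub
    max_workers = int(os.environ.get("PIPELINE_MAX_WORKERS", "4"))
    processed_dir.mkdir(parents=True, exist_ok=True)

    def handle(pdf_path: Path) -> Path:
        out_path = processed_dir / f"{pdf_path.stem}.json"
        if not out_path.exists():  # processed check: skip files already done
            out_path.write_text(json.dumps(extract(pdf_path)), encoding="utf-8")
        return out_path

    pdfs = sorted(reports_dir.glob("*.pdf"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for every task and re-raises the first failure, if any.
        return list(pool.map(handle, pdfs))
```

Because the stub never touches the network, tests can run the whole fan-out against a temporary directory of placeholder PDFs.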

Development

  • Run tests:
    make test
  • Lint/format:
    make lint
    make format

Configuration

Environment variables (via .env):

  • LAST_MODIFIED (required): The cache-busting timestamp token.
  • GEMINI_API_KEY (optional): Enables real Gemini processing when set.
  • GEMINI_MODEL (optional): Default gemini-1.5-pro.
  • GEMINI_FALLBACK_ON_QUOTA (optional): When true, files that hit quota/rate-limit are stubbed instead of failing the batch.
  • PIPELINE_MAX_WORKERS (optional): Max parallel files processed.
  • GEMINI_CONCURRENCY (optional): Max concurrent Gemini calls.
  • GEMINI_TIMEOUT_SECONDS, PER_FILE_TIMEOUT_SECONDS (optional): Timeouts in seconds.
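
One way to read these variables into a typed object is sketched below. It assumes the .env file has already been loaded into the process environment (for example by a dotenv loader) and does not parse the file itself. The Settings class and load_settings are hypothetical names; apart from the documented GEMINI_MODEL default, the fallback values are copied from the Quickstart example and may differ from the application's real defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _parse_bool(raw: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    last_modified: str
    gemini_api_key: str
    gemini_model: str
    gemini_fallback_on_quota: bool
    pipeline_max_workers: int
    gemini_concurrency: int
    gemini_timeout_seconds: float
    per_file_timeout_seconds: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, failing fast on the required key."""
    last_modified = env.get("LAST_MODIFIED", "").strip()
    if not last_modified:
        raise ValueError("LAST_MODIFIED is required")
    return Settings(
        last_modified=last_modified,
        gemini_api_key=env.get("GEMINI_API_KEY", "").strip(),
        gemini_model=env.get("GEMINI_MODEL", "gemini-1.5-pro"),
        gemini_fallback_on_quota=_parse_bool(env.get("GEMINI_FALLBACK_ON_QUOTA", "false")),
        # Assumed defaults below, taken from the Quickstart example values.
        pipeline_max_workers=int(env.get("PIPELINE_MAX_WORKERS", "4")),
        gemini_concurrency=int(env.get("GEMINI_CONCURRENCY", "2")),
        gemini_timeout_seconds=float(env.get("GEMINI_TIMEOUT_SECONDS", "30.0")),
        per_file_timeout_seconds=float(env.get("PER_FILE_TIMEOUT_SECONDS", "180.0")),
    )
```

Passing a plain dict instead of os.environ keeps the loader easy to test without touching the real environment.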

Directories

  • reports/ input PDFs
  • preprocessing/ appendix-only PDFs
  • processed/ per-file JSONs, plus result.json
  • config/values.json schema for field extraction
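
The layout above can be prepared and navigated with a small helper like the one below. Both functions are hypothetical, and the naming rule (each artifact reuses the report's file stem) is an assumption inferred from the preprocessing/*.pdf and processed/*.json patterns.

```python
from pathlib import Path

REQUIRED_DIRS = ("reports", "preprocessing", "processed", "config")


def ensure_layout(root: Path) -> list:
    """Create any missing directories under root and return the ones created."""
    created = []
    for name in REQUIRED_DIRS:
        directory = root / name
        if not directory.exists():
            directory.mkdir(parents=True)
            created.append(directory)
    return created


def artifact_paths(root: Path, report_name: str) -> dict:
    """Map one input report to its assumed intermediate and output paths."""
    stem = Path(report_name).stem
    return {
        "report": root / "reports" / report_name,
        "preprocessing": root / "preprocessing" / f"{stem}.pdf",
        "processed": root / "processed" / f"{stem}.json",
    }
```

This helper only creates directories; config/values.json is a file holding the field schema and still has to be supplied.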

Troubleshooting

  • See troubleshooting/ for known issues and mitigations (e.g., Gemini 429 quota). Use /process?force=true after updating .env.
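
For orientation, the quota behavior described under GEMINI_FALLBACK_ON_QUOTA and the cap set by GEMINI_CONCURRENCY could be combined as in the sketch below. QuotaExceededError and make_guarded_extractor are hypothetical names, and how the real client surfaces a 429 is not specified here; treat this as one possible shape rather than the mitigation documented in troubleshooting/.

```python
import threading
from typing import Callable


class QuotaExceededError(Exception):
    """Hypothetical marker for a Gemini 429 / rate-limit response."""


def make_guarded_extractor(
    extract: Callable[[str], dict],
    stub: Callable[[str], dict],
    concurrency: int,
    fallback_on_quota: bool,
) -> Callable[[str], dict]:
    """Wrap an extractor with a concurrency cap and an optional quota fallback."""
    gate = threading.BoundedSemaphore(concurrency)

    def guarded(name: str) -> dict:
        try:
            with gate:  # at most `concurrency` calls run at the same time
                return extract(name)
        except QuotaExceededError:
            if not fallback_on_quota:
                raise  # fail the batch, matching the default behavior
            return stub(name)  # stub this file instead of failing the batch

    return guarded
```

With fallback_on_quota set to False the error propagates unchanged, so the surrounding request would end in the documented 500.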

License

Educational use only. See LICENSE for full terms.
