ccmdi/osintbench

OSINTbench is a benchmark for evaluating how well large language models can perform open-source intelligence (OSINT) tasks. Categories include:

  • Geolocation: Spatial reasoning
  • Identification: Information synthesis, breadth of knowledge
  • Temporal: Temporal reasoning
  • Analysis: General reasoning

Installation

git clone https://github.com/ccmdi/osintbench.git
cd osintbench
pip install -r requirements.txt

Set up your .env based on SAMPLE.env for whichever model providers you wish to test (e.g. ANTHROPIC_API_KEY must be set to test Claude).
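A minimal .env might look like the following. Only ANTHROPIC_API_KEY is named in this README; the other variable names are assumptions, so check SAMPLE.env for the exact keys your providers require.

ANTHROPIC_API_KEY=<your_anthropic_key>
GEMINI_API_KEY=<your_gemini_key>
OPENAI_API_KEY=<your_openai_key>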

You will need to create a dataset manually. Datasets follow this schema:

"cases": [
    {
      "id": <case_number>,
      "images": [
        "images/<image_number>.<ext>"
      ],
      "info": "<context given to the model about the case>",
      "tasks": [
        {
          "id": 1,
          "type": "location",
          "prompt": "Find the exact location of the photo.",
          "answer": {
            "lat": <true_lat>,
            "lng": <true_lng>
          }
        },
        {
            "id": 2,
            "type": "identification",
            "prompt": "Who is this?",
            "answer": "<person_name>"
        }
      ]
    },
    ...
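Before a run, a quick sanity check along these lines can catch schema mistakes early. This is a minimal sketch based on the schema above; check_dataset is a hypothetical helper, not part of the repository.

import json
from pathlib import Path

def check_dataset(folder: str) -> None:
    """Lightweight, illustrative sanity check for a dataset folder."""
    root = Path(folder)
    meta = json.loads((root / "metadata.json").read_text())
    for case in meta["cases"]:
        # Every referenced image should exist relative to the dataset folder.
        for image in case.get("images", []):
            assert (root / image).exists(), f"case {case['id']}: missing {image}"
        for task in case["tasks"]:
            # Location answers need coordinates; other task types use a string answer.
            if task["type"] == "location":
                assert {"lat", "lng"} <= set(task["answer"]), f"case {case['id']}: location task needs lat/lng"

check_dataset("dataset/basic")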

A dataset folder should follow this structure:

dataset/
├─ basic/
│  ├─ metadata.json
│  ├─ images/
│  │  ├─ 2.jpg
│  │  ├─ 1.png
├─ advanced/
│  ├─ metadata.json

Each dataset's definition lives in its metadata.json.
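For instance, the basic dataset above could be scaffolded with:

mkdir -p dataset/basic/images
# then add images and describe them in dataset/basic/metadata.json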

Test a model

Caution

Most outputs are evaluated by a judge model. Double-check responses before finalizing results.

python osintbench.py --dataset <test name> --model <model name>

Models are referenced by their class name in models.py; Gemini 2.5 Flash, for instance, is Gemini2_5Flash.
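For example, assuming --dataset takes the dataset folder name (basic above), running Gemini 2.5 Flash would look like:

python osintbench.py --dataset basic --model Gemini2_5Flash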

Roadmap

  • Tool use
    • Google Search
    • EXIF extraction
    • Reverse image search (Google Lens)
    • Visit website
    • Overpass turbo
    • Google Street View
  • Computer use (as a replacement; long-term)
  • High quality, human-verified datasets
  • Higher prompt quality to improve performance
  • Prompt batching/parallel runs
  • Video support?
  • Recursive prompting/self-evaluation
  • Release

Note

Contributors are welcome! Check the roadmap.
