DatBench Evaluation Library

Official evaluation harness for DatBench, a high-fidelity vision-language benchmark with exact scoring implementations.

Quick Start

from datasets import load_dataset
from datbench import DatBenchEvaluator, VLMResponse

# Load dataset
capability = "math"  # Options: chart, counting, document, general, grounding, math, scene, spatial, table
dataset = load_dataset("DatologyAI/DatBench", capability, split="test")

# Initialize evaluator
evaluator = DatBenchEvaluator(dataset, capability)

# Get inference tasks
tasks = evaluator.get_inference_tasks()

# Run your VLM
def run_my_vlm(task):
    # task.image is a PIL.Image.Image
    # task.question is the formatted prompt ready for inference
    # Replace the placeholder below with your model's inference call;
    # it must return the model's raw output as a string.
    model_output = ...  # e.g. your_model.generate(task.image, task.question)
    return model_output

# Create VLM responses
vlm_responses = [
    VLMResponse(id=task.id, raw_output=run_my_vlm(task))
    for task in tasks
]

# Compute metrics
report = evaluator.compute_metrics(vlm_responses)

# View results
print(f"Accuracy: {report.summary['overall_accuracy']:.2%}")
print(f"Per-dataset: {report.summary['dataset_metrics']}")

# Save results
report.save("results.json")

Dataset Information

Available Datasets

DatBench provides two versions:

  • DatologyAI/DatBench: High-fidelity subset (~5K samples per capability, ~45K total)
  • DatologyAI/DatBench-Full: Complete dataset (~205K samples total); a loading sketch follows this list
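
A minimal loading sketch; it assumes DatBench-Full exposes the same capability configurations and split name as the high-fidelity subset:

from datasets import load_dataset

# High-fidelity subset (~5K samples per capability)
subset = load_dataset("DatologyAI/DatBench", "math", split="test")

# Complete dataset (~205K samples total); assumes identical capability configs
full = load_dataset("DatologyAI/DatBench-Full", "math", split="test")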

Capabilities

Nine evaluation capabilities cover diverse vision-language tasks; a sketch that loops over all of them follows the list:

  • chart: Chart understanding, infographic QA
  • counting: Object counting tasks
  • document: OCR, document parsing, key information extraction (KIE)
  • general: General VQA, reasoning
  • grounding: Referring expression grounding, point localization
  • math: Mathematical reasoning, geometry
  • scene: Scene text recognition, multi-scene OCR
  • spatial: Spatial reasoning, real-world QA
  • table: Table understanding, diagram QA
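
The loop below evaluates every capability with the workflow from the Quick Start. It covers direct-scored samples only; judge-mode samples additionally need the create_judge_tasks flow shown under API Reference, and run_my_vlm is the inference function you define yourself:

from datasets import load_dataset
from datbench import DatBenchEvaluator, VLMResponse

CAPABILITIES = [
    "chart", "counting", "document", "general", "grounding",
    "math", "scene", "spatial", "table",
]

reports = {}
for cap in CAPABILITIES:
    dataset = load_dataset("DatologyAI/DatBench", cap, split="test")
    evaluator = DatBenchEvaluator(dataset, cap)
    tasks = evaluator.get_inference_tasks()
    responses = [VLMResponse(id=t.id, raw_output=run_my_vlm(t)) for t in tasks]
    reports[cap] = evaluator.compute_metrics(responses)

for cap, report in reports.items():
    print(f"{cap}: {report.summary['overall_accuracy']:.2%}")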

Sample Structure

Each sample contains the following fields (a short inspection sketch follows the schema):

{
    "id": "db_math_000123",           # Unique identifier
    "image": PIL.Image,               # Image (loaded automatically by HF)
    "question": str,                  # Formatted prompt ready for inference
    "answer": str,                    # Ground truth answer
    "all_answers": List[str],         # Alternative valid answers
    "eval_mode": "direct",            # "direct" or "judge"
    "is_circular": bool,              # Circular evaluation variant
    "metadata": str,                  # JSON string with dataset-specific metadata
    "source_info": {
        "dataset": str,               # Source dataset name
        "original_idx": str           # Original sample ID
    }
}
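
Because metadata is stored as a JSON string and source_info is a nested record, inspecting a sample takes one extra decoding step. A minimal sketch using the field names listed above:

import json
from datasets import load_dataset

dataset = load_dataset("DatologyAI/DatBench", "math", split="test")
sample = dataset[0]

print(sample["id"], sample["eval_mode"], sample["answer"])
print(sample["source_info"]["dataset"])      # name of the source dataset
metadata = json.loads(sample["metadata"])    # dataset-specific metadata, decoded from the JSON string
sample["image"].save("example.png")          # image is decoded to a PIL.Image by the datasets library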

API Reference

DatBenchEvaluator

Main evaluation class.

Methods:

  • __init__(hf_dataset, capability) - Initialize with HF dataset
  • get_inference_tasks() - Get list of InferenceTask objects
  • create_judge_tasks(vlm_responses) - Create judge evaluation tasks for samples with eval_mode == "judge" (see the sketch below)
  • compute_metrics(vlm_responses, judge_responses=None) - Score responses and generate a DatBenchReport
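
For judge-mode samples the flow adds one round trip: build judge tasks from your VLM responses, run them through the judge model of your choice, and pass both response sets to compute_metrics. The sketch below assumes that judge tasks expose an id and a prompt your judge can answer, and that judge responses are packaged like VLMResponse objects; check the library's dataclasses for the exact types. run_my_judge is a hypothetical function you would supply:

# Direct-mode samples are scored exactly; judge-mode samples need a second pass.
judge_tasks = evaluator.create_judge_tasks(vlm_responses)

judge_responses = [
    # Assumed id/raw_output pairing, analogous to VLMResponse; run_my_judge is your judge call.
    VLMResponse(id=jt.id, raw_output=run_my_judge(jt))
    for jt in judge_tasks
]

report = evaluator.compute_metrics(vlm_responses, judge_responses=judge_responses)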

Dataclasses

InferenceTask: Input for your VLM

  • id: Sample identifier
  • image: PIL.Image.Image
  • question: Formatted prompt string
  • eval_mode: "direct" or "judge"

VLMResponse: Your VLM output

  • id: Sample identifier
  • raw_output: Full model response
  • parsed_answer: Optional pre-extracted answer (see the example below)
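
If you already extract the final answer from your model's output, you can pass it alongside the raw text. extract_final_answer is a hypothetical helper you would implement yourself:

raw_text = run_my_vlm(task)
response = VLMResponse(
    id=task.id,
    raw_output=raw_text,                           # full model response, always required
    parsed_answer=extract_final_answer(raw_text),  # optional; your own extraction logic
)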

DatBenchReport: Final results

  • summary: Dict with overall_accuracy, dataset_metrics, etc.
  • results: List[SampleScore] with per-sample details
  • save(path): Save report to a JSON file (see the sketch below)
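
A short sketch of reading a report. It relies only on the documented summary keys and save method, and assumes dataset_metrics maps source-dataset names to their metric values:

report = evaluator.compute_metrics(vlm_responses)

print(f"Overall accuracy: {report.summary['overall_accuracy']:.2%}")
for name, metrics in report.summary["dataset_metrics"].items():  # assumed to be a mapping
    print(name, metrics)

report.save("results.json")  # writes the full report, including per-sample results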

Citation

TBA
