Official evaluation harness for DatBench, a high-fidelity vision-language benchmark with exact scoring implementations.
```python
from datasets import load_dataset
from datbench import DatBenchEvaluator, VLMResponse

# Load dataset
capability = "math"  # Options: chart, counting, document, general, grounding, math, scene, spatial, table
dataset = load_dataset("DatologyAI/DatBench", capability, split="test")

# Initialize evaluator
evaluator = DatBenchEvaluator(dataset, capability)

# Get inference tasks
tasks = evaluator.get_inference_tasks()

# Run your VLM
def run_my_vlm(task):
    # task.image is a PIL.Image.Image
    # task.question is the formatted prompt
    # Return your model's output string
    return model_output

# Create VLM responses
vlm_responses = [
    VLMResponse(id=task.id, raw_output=run_my_vlm(task))
    for task in tasks
]

# Compute metrics
report = evaluator.compute_metrics(vlm_responses)

# View results
print(f"Accuracy: {report.summary['overall_accuracy']:.2%}")
print(f"Per-dataset: {report.summary['dataset_metrics']}")

# Save results
report.save("results.json")
```
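Before plugging in a real model, a useful sanity check (a sketch, not part of the official example; `answer_by_id` and `oracle_responses` are illustrative names) is to score the ground-truth answers themselves. Direct-mode samples should then come out at or near 100% accuracy; samples with `eval_mode == "judge"` may additionally require the judge step described in the API section below.

```python
# Sanity-check sketch: feed the ground-truth answers back through the evaluator
# to confirm the harness is wired correctly before running a real VLM.
answer_by_id = dict(zip(dataset["id"], dataset["answer"]))

oracle_responses = [
    VLMResponse(id=task.id, raw_output=answer_by_id[task.id])
    for task in tasks
]
oracle_report = evaluator.compute_metrics(oracle_responses)
print(f"Oracle accuracy: {oracle_report.summary['overall_accuracy']:.2%}")
```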
DatBench provides two versions (loading the full set is sketched after this list):
- DatologyAI/DatBench: High-fidelity subset (~5K samples per capability, ~45K total)
- DatologyAI/DatBench-Full: Complete dataset (~205K samples total)
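To use the complete pool instead of the curated subset shown in the quickstart, only the repository name changes. A minimal sketch, assuming DatBench-Full exposes the same capability configs and `test` split:

```python
# Sketch: load the complete pool instead of the high-fidelity subset.
# Assumes the same capability configs and "test" split as the quickstart.
full_dataset = load_dataset("DatologyAI/DatBench-Full", capability, split="test")
full_evaluator = DatBenchEvaluator(full_dataset, capability)
```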
DatBench spans nine evaluation capabilities covering diverse vision-language tasks (a loop over all nine is sketched after this list):
- chart: Chart understanding, infographic QA
- counting: Object counting tasks
- document: OCR, document parsing, KIE
- general: General VQA, reasoning
- grounding: Referring expression grounding, point localization
- math: Mathematical reasoning, geometry
- scene: Scene text recognition, multi-scene OCR
- spatial: Spatial reasoning, real-world QA
- table: Table understanding, diagram QA
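To cover the whole benchmark, one pattern (a sketch, not an official helper; it reuses the `run_my_vlm` stub from the quickstart) is to loop over the nine capability names and keep one report per capability:

```python
# Sketch: evaluate every capability and collect per-capability accuracy.
# Assumes run_my_vlm() from the quickstart is implemented for your model.
capabilities = [
    "chart", "counting", "document", "general", "grounding",
    "math", "scene", "spatial", "table",
]

accuracy_by_capability = {}
for cap in capabilities:
    cap_dataset = load_dataset("DatologyAI/DatBench", cap, split="test")
    cap_evaluator = DatBenchEvaluator(cap_dataset, cap)
    cap_tasks = cap_evaluator.get_inference_tasks()
    cap_responses = [
        VLMResponse(id=t.id, raw_output=run_my_vlm(t)) for t in cap_tasks
    ]
    cap_report = cap_evaluator.compute_metrics(cap_responses)
    accuracy_by_capability[cap] = cap_report.summary["overall_accuracy"]

print(accuracy_by_capability)
```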
Each sample contains:
```python
{
    "id": "db_math_000123",      # Unique identifier
    "image": PIL.Image,          # Image (loaded automatically by HF)
    "question": str,             # Formatted prompt ready for inference
    "answer": str,               # Ground truth answer
    "all_answers": List[str],    # Alternative valid answers
    "eval_mode": "direct",       # "direct" or "judge"
    "is_circular": bool,         # Circular evaluation variant
    "metadata": str,             # JSON string with dataset-specific metadata
    "source_info": {
        "dataset": str,          # Source dataset name
        "original_idx": str      # Original sample ID
    }
}
```
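Since `metadata` is stored as a JSON string, a quick way to inspect a sample is to decode it directly (a minimal sketch; only the fields from the schema above are used, and writing the image to disk is just for inspection):

```python
import json

# Inspect one sample and decode its metadata JSON string.
sample = dataset[0]
print(sample["id"], sample["eval_mode"], sample["answer"])
print(json.loads(sample["metadata"]))   # dataset-specific metadata as a dict
sample["image"].save("sample.png")      # PIL image loaded automatically by HF
```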
DatBenchEvaluator: Main evaluation class.
Methods:
- `__init__(hf_dataset, capability)` - Initialize with an HF dataset
- `get_inference_tasks()` - Get the list of InferenceTask objects
- `create_judge_tasks(vlm_responses)` - Create judge evaluation tasks (judge flow sketched after this list)
- `compute_metrics(vlm_responses, judge_responses=None)` - Score responses and generate a report
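For samples with `eval_mode == "judge"`, the flow is to build judge tasks from your model's responses, run them through a judge model, and pass the judge outputs back to `compute_metrics`. The sketch below assumes judge tasks expose an `id` and that judge outputs can be wrapped in `VLMResponse` objects; neither is documented here, and `run_my_judge` is a hypothetical helper standing in for your judge model.

```python
# Judge-flow sketch; the judge task/response structure is an assumption,
# not documented API.
judge_tasks = evaluator.create_judge_tasks(vlm_responses)

judge_responses = [
    # run_my_judge() is a hypothetical helper that calls your judge model.
    VLMResponse(id=jt.id, raw_output=run_my_judge(jt))
    for jt in judge_tasks
]

report = evaluator.compute_metrics(vlm_responses, judge_responses=judge_responses)
```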
InferenceTask: Input for your VLM
- `id`: Sample identifier
- `image`: PIL.Image.Image
- `question`: Formatted prompt string
- `eval_mode`: "direct" or "judge"
VLMResponse: Your VLM output
- `id`: Sample identifier
- `raw_output`: Full model response
- `parsed_answer`: Optional pre-extracted answer (see the sketch below)
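If your model wraps its answer in extra text and you prefer to extract it yourself rather than rely on the harness's parsing, you can supply `parsed_answer` alongside the raw output. A sketch follows; the "Answer:" convention and `extract_answer` helper are illustrative, not part of DatBench:

```python
def extract_answer(raw_output: str) -> str:
    # Illustrative parser: keep only the text after the last "Answer:" marker,
    # falling back to the full raw output if the marker is absent.
    if "Answer:" in raw_output:
        return raw_output.rsplit("Answer:", 1)[-1].strip()
    return raw_output.strip()

raw = run_my_vlm(tasks[0])
response = VLMResponse(
    id=tasks[0].id,
    raw_output=raw,
    parsed_answer=extract_answer(raw),
)
```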
DatBenchReport: Final results
- `summary`: Dict with overall_accuracy, dataset_metrics, etc. (usage sketched below)
- `results`: List[SampleScore] with per-sample details
- `save(path)`: Save report to JSON file
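A short sketch of working with the report; only the fields documented above are used, and the shape of each `dataset_metrics` entry is assumed to be a name-to-metrics mapping:

```python
# Print per-dataset metrics and persist the full report.
for dataset_name, metrics in report.summary["dataset_metrics"].items():
    print(dataset_name, metrics)

print(f"Overall: {report.summary['overall_accuracy']:.2%}")
report.save("datbench_math_results.json")
```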
TBA