Zengbin Wang1,2*, Xuecai Hu1†, Yong Wang1†, Feng Xiong1, Man Zhang2, Xiangxiang Chu1
1AMAP, Alibaba Group,
2Beijing University of Posts and Telecommunications
†Project leads and corresponding authors.
- 🔥 [2026.01]: 🎉🎉🎉 Congratulations! Our paper is accepted by ICLR 2026.
- 🔥 [2026.01]: The evaluation data and code are released.
- We introduce SpatialGenEval, a new benchmark to systematically evaluate complex spatial intelligence in T2I models. It comprises 1,230 information-dense prompts, each covering 10 spatial sub-domains and paired with 10 omni-dimensional multiple-choice questions (12,300 in total), to evaluate a model's understanding beyond what to generate, extending to where, how, and why.
- Our extensive evaluation of 23 state-of-the-art models reveals a universal performance bottleneck in spatial reasoning. While models excel at basic object composition, their accuracy drops on tasks requiring higher-order spatial understanding, such as relative positioning, occlusion, and causality, marking this as a primary barrier to current T2I capabilities.
- Beyond evaluation, we explore a spatial-aware dataset (SpatialT2I), designed as a practical data-centric solution to improve the spatial intelligence of existing models. Fine-tuning on SpatialT2I yields significant and consistent performance gains.
- Clone the repository and install the required environment.

```bash
git clone https://github.com/AMAP-ML/SpatialGenEval.git
cd SpatialGenEval

# You can also use your own environment.
conda create -n spatialgeneval python==3.10 -y
conda activate spatialgeneval
pip install torch transformers diffusers modelscope vllm
```
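As a quick sanity check (a minimal sketch, nothing repo-specific), you can confirm the core dependencies import correctly:

```python
# Sanity-check the core packages installed above.
import torch
import transformers
import diffusers
import vllm

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("diffusers:", diffusers.__version__)
print("vllm:", vllm.__version__)
```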
- Use the T2I prompts from `eval/SpatialGenEval_T2I_Prompts.jsonl` to generate images with your T2I models. Each record contains the following fields:

```
"id": "index of the prompt."
"scene": "name of the real-world scene."
"prompt": "text-to-image prompt (the average length is around 66 words, balancing compatibility with CLIP encoders (limited to 77 tokens) against the need for high information density)."
"question_type": "10 spatial dimensions."
"questions": "10 omni-dimensional questions."
"answers": "10 ground-truth answers."
```
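For reference, a minimal sketch of loading the prompt file and inspecting one record (field names as listed above):

```python
import json

# Read all 1,230 prompt records from the JSONL file (one JSON object per line).
with open("eval/SpatialGenEval_T2I_Prompts.jsonl", "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

sample = records[0]
print(sample["id"], "|", sample["scene"])
print(sample["prompt"])         # ~66-word text-to-image prompt
print(sample["question_type"])  # 10 spatial dimensions
print(sample["questions"])      # 10 omni-dimensional multiple-choice questions
print(sample["answers"])        # 10 ground-truth answers
```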
- The image generation demo is in `scripts/generation_demo.py`. It supports splitting the generation task across GPUs for parallel generation.

```bash
json_file="eval/SpatialGenEval_T2I_Prompts.jsonl"
model_name="Qwen/Qwen-Image"               # e.g., Qwen/Qwen-Image
save_folder="<YOUR_GENERATED_IMAGE_PATH>"  # e.g., ./images/Qwen-Image
TOTAL_GPUS=8                               # i.e., the number of GPUs to use

for i in $(seq 0 $((TOTAL_GPUS - 1)))
do
    CUDA_VISIBLE_DEVICES=$i \
    python scripts/generation_demo.py \
        --model_name=${model_name} \
        --json_file=${json_file} \
        --save_folder=${save_folder} \
        --total_gpus=${TOTAL_GPUS} \
        --gpu_id=$i &
done
wait
```
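For intuition, here is a rough sketch of what each worker process does: it takes an interleaved shard of the prompts based on `--gpu_id` and `--total_gpus`, generates one image per prompt, and saves it under a zero-padded ID. This is an illustration under assumptions (the pipeline class, dtype, and exact file naming are guesses consistent with the folder layout below), not the actual contents of `scripts/generation_demo.py`:

```python
import argparse, json, os
import torch
from diffusers import DiffusionPipeline

# Illustrative worker: shard prompts across GPUs by taking every
# `total_gpus`-th record starting at index `gpu_id`.
parser = argparse.ArgumentParser()
parser.add_argument("--model_name", default="Qwen/Qwen-Image")
parser.add_argument("--json_file", default="eval/SpatialGenEval_T2I_Prompts.jsonl")
parser.add_argument("--save_folder", default="./images/Qwen-Image")
parser.add_argument("--total_gpus", type=int, default=8)
parser.add_argument("--gpu_id", type=int, default=0)
args = parser.parse_args()

os.makedirs(args.save_folder, exist_ok=True)
with open(args.json_file, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

# CUDA_VISIBLE_DEVICES restricts each process to a single GPU, so "cuda" is that GPU.
pipe = DiffusionPipeline.from_pretrained(args.model_name, torch_dtype=torch.bfloat16).to("cuda")

for rec in records[args.gpu_id::args.total_gpus]:
    image = pipe(prompt=rec["prompt"]).images[0]
    image.save(os.path.join(args.save_folder, f"{int(rec['id']):06d}.png"))
```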
- The generated images should be organized as `<YOUR_GENERATED_IMAGE_PATH>/<ID>.png`, as follows:

```
<YOUR_GENERATED_IMAGE_PATH>/
├── 000001.png
├── 000002.png
├── ...
├── 001230.png
```
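Before running the evaluation, a small sketch like this can verify the folder is complete (it assumes the zero-padded naming shown above):

```python
import os

image_dir = "<YOUR_GENERATED_IMAGE_PATH>"  # e.g., ./images/Qwen-Image

# IDs run from 000001.png to 001230.png, one image per prompt.
missing = [i for i in range(1, 1231)
           if not os.path.isfile(os.path.join(image_dir, f"{i:06d}.png"))]
print(f"{1230 - len(missing)}/1230 images found")
if missing:
    print("first missing IDs:", missing[:10])
```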
🚀 Run Evaluation with Qwen2.5-VL-72B-Instruct (based on vLLM)
- The vLLM-based evaluation script is `eval_with_qwen25_vl_72b.sh`. It takes three arguments:
  - `EVAL_MODEL_PATH`: path to the evaluation (judge) model, e.g., `./models/Qwen2.5-VL-72B-Instruct`.
  - `YOUR_GENERATED_IMAGE_PATH`: path to the generated image folder.
  - `OUTPUT_JSON_PATH`: output JSON path for the evaluation results.

```bash
EVAL_MODEL_PATH="<EVAL_MODEL_PATH>"                      # e.g., ./models/Qwen2.5-VL-72B-Instruct
YOUR_GENERATED_IMAGE_PATH="<YOUR_GENERATED_IMAGE_PATH>"  # e.g., ./images/Qwen-Image
OUTPUT_JSON_PATH="<OUTPUT_JSON_PATH>"                    # e.g., ./logs/Qwen-Image-results.jsonl

bash eval_with_qwen25_vl_72b.sh \
    "${YOUR_GENERATED_IMAGE_PATH}" \
    "${OUTPUT_JSON_PATH}" \
    "${EVAL_MODEL_PATH}"
```
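The shell script drives the full evaluation; conceptually, the judge model answers each image's 10 multiple-choice questions and its answers are compared against the ground truth. The snippet below is only a sketch of that idea using vLLM's OpenAI-compatible server (started separately); it is not what `eval_with_qwen25_vl_72b.sh` does internally, and the prompt wording is an assumption:

```python
import base64
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already serving the judge model,
# e.g.: vllm serve ./models/Qwen2.5-VL-72B-Instruct --tensor-parallel-size 8
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask_judge(image_path: str, question: str) -> str:
    """Ask one multiple-choice question about one generated image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="./models/Qwen2.5-VL-72B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": question + "\nAnswer with the option letter only."},
            ],
        }],
        max_tokens=8,
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip()
```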
- The final evaluation JSON results will be saved to `<OUTPUT_JSON_PATH>`.
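If you want to aggregate the scores yourself, something along these lines works; note that the `question_type` and `correct` field names below are hypothetical placeholders, so adapt them to the schema actually written to `<OUTPUT_JSON_PATH>`:

```python
import json
from collections import defaultdict

# Hypothetical schema: one JSON object per image, with parallel lists of the
# 10 spatial dimensions and per-question correctness flags.
per_dim = defaultdict(lambda: [0, 0])  # dimension -> [num_correct, num_total]
with open("<OUTPUT_JSON_PATH>", "r", encoding="utf-8") as f:  # e.g., ./logs/Qwen-Image-results.jsonl
    for line in f:
        rec = json.loads(line)
        for dim, correct in zip(rec["question_type"], rec["correct"]):
            per_dim[dim][0] += int(correct)
            per_dim[dim][1] += 1

for dim, (hit, total) in sorted(per_dim.items()):
    print(f"{dim}: {hit / total:.3f}")
```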
- Evaluation results using Qwen2.5-VL-72B-Instruct as the judge model.
- Evaluation results using GPT-4o as the judge model.
If you find this codebase useful in your research, please consider giving us a star ⭐ and citing our work 📝:
```bibtex
@article{wang2026everything,
  title={Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models},
  author={Wang, Zengbin and Hu, Xuecai and Wang, Yong and Xiong, Feng and Zhang, Man and Chu, Xiangxiang},
  journal={arXiv preprint arXiv:2601.20354},
  year={2026}
}
```






