Zengbin Wang1,2*, Xuecai Hu1†, Yong Wang1†, Feng Xiong1, Man Zhang2, Xiangxiang Chu1
1AMAP, Alibaba Group,
2Beijing University of Posts and Telecommunications
†Project leads and corresponding authors.
- 🔥 [2026.01]: 🎉🎉🎉 Congratulations! Our paper is accepted by ICLR 2026.
- 🔥 [2026.01]: The evaluation data and code are released.
- We introduce SpatialGenEval, a new benchmark to systematically evaluate complex spatial intelligence in T2I models. It comprises 1,230 information-dense prompts, each covering 10 spatial sub-domains and paired with 10 omni-dimensional multiple-choice questions (12,300 in total), to evaluate a model's understanding beyond what to generate, extending to where, how, and why.
- Our extensive evaluation of 23 state-of-the-art models reveals a universal performance bottleneck in spatial reasoning. While models excel at basic object composition, their accuracy drops on tasks requiring higher-order spatial understanding, such as relative positioning, occlusion, and causality, marking this as a primary barrier to current T2I capabilities.
- Beyond evaluation, we explore a spatial-aware dataset (SpatialT2I), designed as a practical data-centric solution to improve the spatial intelligence of existing models. Fine-tuning on SpatialT2I yields significant and consistent performance gains.
- Clone the repository and install the required environment.

```bash
git clone https://github.com/AMAP-ML/SpatialGenEval.git
cd SpatialGenEval

# You can also use your own environment.
conda create -n spatialgeneval python==3.10 -y
conda activate spatialgeneval
pip install torch transformers diffusers modelscope vllm
```
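As a quick sanity check (a minimal sketch, nothing repo-specific), you can confirm the core dependencies import correctly:

```python
# Sanity-check the core packages installed above.
import torch
import transformers
import diffusers
import vllm

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("diffusers:", diffusers.__version__)
print("vllm:", vllm.__version__)
```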
- Use the T2I prompts from `eval/SpatialGenEval_T2I_Prompts.jsonl` to generate images with your T2I models. Each record contains the following fields:

```
"id": "index of the prompt."
"scene": "name of the real-world scene."
"prompt": "text-to-image prompt (the average length is around 66 words, balancing compatibility with CLIP encoders (limited to 77 tokens) against the need for high information density)."
"question_type": "10 spatial dimensions."
"questions": "10 omni-dimensional questions."
"answers": "10 ground-truth answers."
```
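For reference, a minimal sketch of loading the prompt file and inspecting one record (field names as listed above):

```python
import json

# Read all 1,230 prompt records from the JSONL file (one JSON object per line).
with open("eval/SpatialGenEval_T2I_Prompts.jsonl", "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

sample = records[0]
print(sample["id"], "|", sample["scene"])
print(sample["prompt"])         # ~66-word text-to-image prompt
print(sample["question_type"])  # 10 spatial dimensions
print(sample["questions"])      # 10 omni-dimensional multiple-choice questions
print(sample["answers"])        # 10 ground-truth answers
```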
- The image generation demo is in `scripts/generation_demo.py`. It supports splitting the generation task across GPUs for parallel generation.

```bash
json_file="eval/SpatialGenEval_T2I_Prompts.jsonl"
model_name="Qwen/Qwen-Image"               # e.g., Qwen/Qwen-Image
save_folder="<YOUR_GENERATED_IMAGE_PATH>"  # e.g., ./images/Qwen-Image
TOTAL_GPUS=8                               # i.e., the number of GPUs to use

for i in $(seq 0 $((TOTAL_GPUS - 1)))
do
    CUDA_VISIBLE_DEVICES=$i \
    python scripts/generation_demo.py \
        --model_name=${model_name} \
        --json_file=${json_file} \
        --save_folder=${save_folder} \
        --total_gpus=${TOTAL_GPUS} \
        --gpu_id=$i &
done
wait
```
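For intuition, here is a rough sketch of what each worker process does: it takes an interleaved shard of the prompts based on `--gpu_id` and `--total_gpus`, generates one image per prompt, and saves it under a zero-padded ID. This is an illustration under assumptions (the pipeline class, dtype, and exact file naming are guesses consistent with the folder layout below), not the actual contents of `scripts/generation_demo.py`:

```python
import argparse, json, os
import torch
from diffusers import DiffusionPipeline

# Illustrative worker: shard prompts across GPUs by taking every
# `total_gpus`-th record starting at index `gpu_id`.
parser = argparse.ArgumentParser()
parser.add_argument("--model_name", default="Qwen/Qwen-Image")
parser.add_argument("--json_file", default="eval/SpatialGenEval_T2I_Prompts.jsonl")
parser.add_argument("--save_folder", default="./images/Qwen-Image")
parser.add_argument("--total_gpus", type=int, default=8)
parser.add_argument("--gpu_id", type=int, default=0)
args = parser.parse_args()

os.makedirs(args.save_folder, exist_ok=True)
with open(args.json_file, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

# CUDA_VISIBLE_DEVICES restricts each process to a single GPU, so "cuda" is that GPU.
pipe = DiffusionPipeline.from_pretrained(args.model_name, torch_dtype=torch.bfloat16).to("cuda")

for rec in records[args.gpu_id::args.total_gpus]:
    image = pipe(prompt=rec["prompt"]).images[0]
    image.save(os.path.join(args.save_folder, f"{int(rec['id']):06d}.png"))
```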
- The generated images should be organized as `<YOUR_GENERATED_IMAGE_PATH>/<ID>.png`, as follows:

```
<YOUR_GENERATED_IMAGE_PATH>/
├── 000001.png
├── 000002.png
├── ...
├── 001230.png
```
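Before running the evaluation, a small sketch like this can verify the folder is complete (it assumes the zero-padded naming shown above):

```python
import os

image_dir = "<YOUR_GENERATED_IMAGE_PATH>"  # e.g., ./images/Qwen-Image

# IDs run from 000001.png to 001230.png, one image per prompt.
missing = [i for i in range(1, 1231)
           if not os.path.isfile(os.path.join(image_dir, f"{i:06d}.png"))]
print(f"{1230 - len(missing)}/1230 images found")
if missing:
    print("first missing IDs:", missing[:10])
```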
🚀 Run Evaluation with Qwen2.5-VL-72B-Instruct (based on vLLM)
- The vLLM-based evaluation script is `eval_with_qwen25_vl_72b.sh`. It takes three arguments:
  - `EVAL_MODEL_PATH`: path to the evaluation (judge) model, e.g., `./models/Qwen2.5-VL-72B-Instruct`.
  - `YOUR_GENERATED_IMAGE_PATH`: path to the generated image folder.
  - `OUTPUT_JSON_PATH`: output JSON path for the evaluation results.

```bash
EVAL_MODEL_PATH="<EVAL_MODEL_PATH>"                      # e.g., ./models/Qwen2.5-VL-72B-Instruct
YOUR_GENERATED_IMAGE_PATH="<YOUR_GENERATED_IMAGE_PATH>"  # e.g., ./images/Qwen-Image
OUTPUT_JSON_PATH="<OUTPUT_JSON_PATH>"                    # e.g., ./logs/Qwen-Image-results.jsonl

bash eval_with_qwen25_vl_72b.sh \
    "${YOUR_GENERATED_IMAGE_PATH}" \
    "${OUTPUT_JSON_PATH}" \
    "${EVAL_MODEL_PATH}"
```
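The shell script drives the full evaluation; conceptually, the judge model answers each image's 10 multiple-choice questions and its answers are compared against the ground truth. The snippet below is only a sketch of that idea using vLLM's OpenAI-compatible server (started separately); it is not what `eval_with_qwen25_vl_72b.sh` does internally, and the prompt wording is an assumption:

```python
import base64
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already serving the judge model,
# e.g.: vllm serve ./models/Qwen2.5-VL-72B-Instruct --tensor-parallel-size 8
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask_judge(image_path: str, question: str) -> str:
    """Ask one multiple-choice question about one generated image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="./models/Qwen2.5-VL-72B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": question + "\nAnswer with the option letter only."},
            ],
        }],
        max_tokens=8,
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip()
```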
- The final evaluation JSON results will be saved to `<OUTPUT_JSON_PATH>`.
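If you want to aggregate the scores yourself, something along these lines works; note that the `question_type` and `correct` field names below are hypothetical placeholders, so adapt them to the schema actually written to `<OUTPUT_JSON_PATH>`:

```python
import json
from collections import defaultdict

# Hypothetical schema: one JSON object per image, with parallel lists of the
# 10 spatial dimensions and per-question correctness flags.
per_dim = defaultdict(lambda: [0, 0])  # dimension -> [num_correct, num_total]
with open("<OUTPUT_JSON_PATH>", "r", encoding="utf-8") as f:  # e.g., ./logs/Qwen-Image-results.jsonl
    for line in f:
        rec = json.loads(line)
        for dim, correct in zip(rec["question_type"], rec["correct"]):
            per_dim[dim][0] += int(correct)
            per_dim[dim][1] += 1

for dim, (hit, total) in sorted(per_dim.items()):
    print(f"{dim}: {hit / total:.3f}")
```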
- Evaluation results using Qwen2.5-VL-72B-Instruct as the judge model.
- Evaluation results using GPT-4o as the judge model.
If you find this codebase useful in your research, please consider giving us a star ⭐ and citing our work 📝:
```bibtex
@article{wang2026everything,
  title={Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models},
  author={Wang, Zengbin and Hu, Xuecai and Wang, Yong and Xiong, Feng and Zhang, Man and Chu, Xiangxiang},
  journal={arXiv preprint arXiv:2601.20354},
  year={2026}
}
```






