[ICCV 2025] GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). Yet its efficacy in training vision-language model (VLM) agents for goal-directed action reasoning in visual environments is less established. This work investigates the problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we term thought collapse: a rapid loss of diversity in the agent's thoughts, state-irrelevant and incomplete reasoning, and consequently invalid actions that incur negative rewards. To counteract thought collapse, we highlight the necessity of process guidance and propose an automated corrector that evaluates and refines the agent's reasoning at each RL step. This simple and scalable framework, GTR (Guided Thought Reinforcement), trains reasoning and action simultaneously without dense, per-step human labeling. Our experiments demonstrate that GTR significantly enhances the performance and generalization of the LLaVA-7B model across various visual environments, achieving 3-5 times higher task success rates than SoTA models while using a notably smaller model.
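As a rough mental model of the method described above, one GTR update combines an outcome-reward PPO step with process guidance from an automated corrector. The sketch below is conceptual only: every helper name is a placeholder, not the repo's actual API (see `GTR_gymcards` / `GTR_alf` for the real implementation).

```python
# Conceptual sketch of one GTR training step, in PyTorch style.
# All names (agent, corrector, env, loss methods) are placeholders.
def gtr_step(agent, corrector, env, optimizer, obs):
    # 1. The VLM agent produces a chain-of-thought and an action.
    thought, action = agent.act(obs)

    # 2. Process guidance: an automated corrector (a GPT-4o call in this
    #    repo, via gpt4o_interface.py) evaluates and refines the thought,
    #    counteracting collapse into state-irrelevant reasoning.
    corrected_thought = corrector.refine(obs, thought)

    # 3. The environment returns a verifiable outcome reward for the action.
    next_obs, reward, terminated, truncated, info = env.step(action)

    # 4. Reasoning and action are trained simultaneously: a guidance loss
    #    pulls the generated thought toward the corrected one, while a PPO
    #    loss reinforces the action with the outcome reward.
    loss = agent.thought_loss(obs, corrected_thought) + agent.ppo_loss(obs, action, reward)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return next_obs, terminated or truncated
```

In practice PPO collects rollouts before updating; the single-step form above is only meant to show how the thought-guidance term and the outcome-reward term are trained together.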
This repository contains:
- `LLaVA`: a slightly modified version of LLaVA, in accordance with RL4VLM.
- `gym-cards`: the GymCards environment for the Points24 task, in accordance with RL4VLM (a usage sketch follows this list).
- `GTR_gymcards`: code for training the agent on GymCards tasks, including Points24.
- `GTR_alf`: code for training the agent on ALFWorld tasks.
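As a quick illustration of what `gym-cards` provides once installed (see setup below), here is a hypothetical usage sketch. The environment id is an assumption based on RL4VLM's naming; verify against `gym_cards`' env registration.

```python
# Hypothetical usage sketch of the GymCards Points24 environment.
# The id "gym_cards/Points24-v0" is assumed from RL4VLM's naming.
import gymnasium as gym
import gym_cards  # assumed to register the card-game environments on import

env = gym.make("gym_cards/Points24-v0")
obs, info = env.reset()
print(type(obs), info)
```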
The SFT-initialized models for each task can be found here.
GymCards (Points24):
- Set up the environment:
```bash
cd <path-to-this-repo>
pip install -e ../LLaVA
pip install -e ../gym-cards
pip install gymnasium[atari,accept-rom-license]
pip install stable-baselines3 nltk deepspeed sentencepiece git+https://github.com/openai/CLIP.git
pip install xformers
```
- Set your OpenAI API key in `gpt4o_interface.py` (a sketch of a typical pattern follows this list).
- Run the script:
```bash
cd scripts
bash run_p24.sh
```

For multi-GPU training, you may change `num_processes` in `config_zero2.yaml`.
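How `gpt4o_interface.py` stores the key is specific to this repo, so check the file for the variable it expects. As a hedged illustration only, a typical pattern with the official `openai` Python client looks like this:

```python
# Illustrative sketch, NOT the actual contents of gpt4o_interface.py:
# read the key from the environment instead of hardcoding it in source.
import os
from openai import OpenAI  # official openai-python (v1) client

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# A GPT-4o chat completion then has this shape:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```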
ALFWorld:
- Set up the environment:
```bash
cd <path-to-this-repo>
conda env create -f alf_conda.yml
conda activate vrenv-alf
pip install -e ../LLaVA
pip install -e ../gym-cards
pip install git+https://github.com/openai/CLIP.git
pip install numpy==1.23.5
pip install protobuf==3.20.3
pip install pydantic==1.10.14
pip install pydantic-core==2.16.3
pip install nltk
pip uninstall frozenlist gradio murmurhash preshed spacy srsly thinc weasel aiosignal annotated-types blis catalogue cloudpathlib cymem
export ALFWORLD_DATA=<storage_path>
alfworld-download
```
You may test the installation by running (an optional Python-side check is sketched after this list):
```bash
alfworld-play-thor
```
- Set your OpenAI API key in `gpt4o_interface.py` (see the pattern sketched above).
- Run the script:
```bash
cd scripts
bash run_alf.sh
```

We strongly recommend using only 1 GPU to prevent NCCL timeouts during synchronization.
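Optionally, you can sanity-check the data path from Python. This assumes the `alfworld` package is what `alf_conda.yml` installs (an assumption; the CLI tools above come from it):

```python
# Optional sanity check (assumptions noted above): verifies that
# ALFWORLD_DATA is set and points at an existing directory, and that
# the alfworld package imports cleanly.
import os
from pathlib import Path

import alfworld  # assumed to be installed by alf_conda.yml

data_dir = Path(os.environ["ALFWORLD_DATA"])
print("ALFWORLD_DATA:", data_dir, "| exists:", data_dir.exists())
```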
If you find our work useful, please kindly cite:
```bibtex
@article{wei2025gtr,
  title={GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training},
  author={Wei, Tong and Yang, Yijun and Xing, Junliang and Shi, Yuanchun and Lu, Zongqing and Ye, Deheng},
  journal={arXiv preprint arXiv:2503.08525},
  year={2025}
}
```
Our code adopts the basic environment setting and RL framework from RL4VLM, which uses LLaVA as the backbone and PPO as the RL algorithm.