[ICCV 2025] GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). Yet its efficacy in training vision-language model (VLM) agents for goal-directed action reasoning in visual environments is less established. This work investigates the problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we term thought collapse: a rapid loss of diversity in the agent's thoughts, state-irrelevant and incomplete reasoning, and consequently invalid actions that incur negative rewards. To counteract thought collapse, we highlight the necessity of process guidance and propose an automated corrector that evaluates and refines the agent's reasoning at each RL step. This simple and scalable framework, GTR (Guided Thought Reinforcement), trains reasoning and action simultaneously without dense, per-step human labeling. Our experiments demonstrate that GTR significantly enhances the performance and generalization of the LLaVA-7B model across various visual environments, achieving 3-5 times higher task success rates than SoTA models while using a notably smaller model.
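As a rough mental model of the method described above, one GTR update combines an outcome-reward PPO step with process guidance from an automated corrector. The sketch below is conceptual only: every helper name is a placeholder, not the repo's actual API (see `GTR_gymcards` / `GTR_alf` for the real implementation).

```python
# Conceptual sketch of one GTR training step, in PyTorch style.
# All names (agent, corrector, env, loss methods) are placeholders.
def gtr_step(agent, corrector, env, optimizer, obs):
    # 1. The VLM agent produces a chain-of-thought and an action.
    thought, action = agent.act(obs)

    # 2. Process guidance: an automated corrector (a GPT-4o call in this
    #    repo, via gpt4o_interface.py) evaluates and refines the thought,
    #    counteracting collapse into state-irrelevant reasoning.
    corrected_thought = corrector.refine(obs, thought)

    # 3. The environment returns a verifiable outcome reward for the action.
    next_obs, reward, terminated, truncated, info = env.step(action)

    # 4. Reasoning and action are trained simultaneously: a guidance loss
    #    pulls the generated thought toward the corrected one, while a PPO
    #    loss reinforces the action with the outcome reward.
    loss = agent.thought_loss(obs, corrected_thought) + agent.ppo_loss(obs, action, reward)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return next_obs, terminated or truncated
```

In practice PPO collects rollouts before updating; the single-step form above is only meant to show how the thought-guidance term and the outcome-reward term are trained together.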
This repository contains:
- `LLaVA`: a slightly modified version of LLaVA, in accordance with RL4VLM.
- `gym-cards`: the GymCards environment for the Points24 task, in accordance with RL4VLM (a usage sketch follows this list).
- `GTR_gymcards`: code for training the agent on GymCards tasks, including Points24.
- `GTR_alf`: code for training the agent on ALFWorld tasks.
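As a quick illustration of what `gym-cards` provides once installed (see setup below), here is a hypothetical usage sketch. The environment id is an assumption based on RL4VLM's naming; verify against `gym_cards`' env registration.

```python
# Hypothetical usage sketch of the GymCards Points24 environment.
# The id "gym_cards/Points24-v0" is assumed from RL4VLM's naming.
import gymnasium as gym
import gym_cards  # assumed to register the card-game environments on import

env = gym.make("gym_cards/Points24-v0")
obs, info = env.reset()
print(type(obs), info)
```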
The SFT-initialized models for each task can be found here.
GymCards (Points24):
- Set up the environment:
```bash
cd <path-to-this-repo>
pip install -e ../LLaVA
pip install -e ../gym-cards
pip install gymnasium[atari,accept-rom-license]
pip install stable-baselines3 nltk deepspeed sentencepiece git+https://github.com/openai/CLIP.git
pip install xformers
```
- Set your OpenAI API key in `gpt4o_interface.py` (a sketch of a typical pattern follows this list).
- Run the script:
```bash
cd scripts
bash run_p24.sh
```

For multi-GPU training, you may change `num_processes` in `config_zero2.yaml`.
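How `gpt4o_interface.py` stores the key is specific to this repo, so check the file for the variable it expects. As a hedged illustration only, a typical pattern with the official `openai` Python client looks like this:

```python
# Illustrative sketch, NOT the actual contents of gpt4o_interface.py:
# read the key from the environment instead of hardcoding it in source.
import os
from openai import OpenAI  # official openai-python (v1) client

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# A GPT-4o chat completion then has this shape:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```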
ALFWorld:
- Set up the environment:
```bash
cd <path-to-this-repo>
conda env create -f alf_conda.yml
conda activate vrenv-alf
pip install -e ../LLaVA
pip install -e ../gym-cards
pip install git+https://github.com/openai/CLIP.git
pip install numpy==1.23.5
pip install protobuf==3.20.3
pip install pydantic==1.10.14
pip install pydantic-core==2.16.3
pip install nltk
pip uninstall frozenlist gradio murmurhash preshed spacy srsly thinc weasel aiosignal annotated-types blis catalogue cloudpathlib cymem
export ALFWORLD_DATA=<storage_path>
alfworld-download
```
You may test the installation by running (an optional Python-side check is sketched after this list):
```bash
alfworld-play-thor
```
- Set your OpenAI API key in `gpt4o_interface.py` (see the pattern sketched above).
- Run the script:
```bash
cd scripts
bash run_alf.sh
```

We strongly recommend using only 1 GPU to prevent NCCL timeouts during synchronization.
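Optionally, you can sanity-check the data path from Python. This assumes the `alfworld` package is what `alf_conda.yml` installs (an assumption; the CLI tools above come from it):

```python
# Optional sanity check (assumptions noted above): verifies that
# ALFWORLD_DATA is set and points at an existing directory, and that
# the alfworld package imports cleanly.
import os
from pathlib import Path

import alfworld  # assumed to be installed by alf_conda.yml

data_dir = Path(os.environ["ALFWORLD_DATA"])
print("ALFWORLD_DATA:", data_dir, "| exists:", data_dir.exists())
```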
If you find our work useful, please kindly cite:
```bibtex
@article{wei2025gtr,
  title={GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training},
  author={Wei, Tong and Yang, Yijun and Xing, Junliang and Shi, Yuanchun and Lu, Zongqing and Ye, Deheng},
  journal={arXiv preprint arXiv:2503.08525},
  year={2025}
}
```
Our code adopts the basic environment setting and RL framework from RL4VLM, which uses LLaVA as the backbone and PPO as the RL algorithm.