Wei Sun, Wen Yang, Pu Jian,
Qianlong Du,
Fuwei Cui,
Shuo Ren,
Jiajun Zhang
Institute of Automation, Chinese Academy of Sciences
Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models (LLMs), even without supervised fine-tuning (SFT). However, prevalent reinforcement learning algorithms such as GRPO and its variants (e.g., DAPO) suffer from a coarse-granularity issue when computing the advantage: they compute rollout-level advantages that assign identical values to every token within a sequence, failing to capture token-specific contributions. To address this limitation, we propose Key-token Advantage Estimation (KTAE), an algorithm that estimates fine-grained token-level advantages without introducing extra models: it applies statistical analysis to the correctness of sampled rollouts to quantify each token's importance and combines this importance with the original rollout-level advantage.
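For context, the sketch below shows how a GRPO-style rollout-level advantage is computed and then broadcast so that every token in a rollout receives the same value; the function names are illustrative, not from this repo.

```python
import numpy as np

def grpo_rollout_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style group normalization: one scalar advantage per rollout."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def broadcast_to_tokens(advantage: float, num_tokens: int) -> np.ndarray:
    """Every token in the rollout is assigned the identical advantage value."""
    return np.full(num_tokens, advantage)

# Example: 4 rollouts sampled for one prompt, scored by a rule-based 0/1 reward.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
advantages = grpo_rollout_advantages(rewards)
print(broadcast_to_tokens(advantages[0], num_tokens=5))
# -> five identical values: no signal about which tokens actually mattered
```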
In summary, the KTAE algorithm offers several advantages:

- KTAE provides more fine-grained advantage information without introducing extra models, resulting in lower training costs.
- KTAE directly computes the importance differences between tokens using statistical analysis methods, offering strong interpretability (see the sketch after this list).
- KTAE's key-token value is computed from the correctness of the final answer and retains the original rollout-level advantage, making it less susceptible to reward hacking.
- KTAE makes the model pay more attention to key tokens and reduces the learning of irrelevant tokens, which effectively shortens response length.
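To make the idea concrete, here is a minimal Python sketch of key-token scoring and advantage combination. It assumes a 2x2 contingency table over token occurrence vs. rollout correctness scored with Fisher's exact test, plus a simple combination rule; both are illustrative assumptions for exposition, not the exact formulas from the paper or this repo.

```python
import numpy as np
from scipy.stats import fisher_exact

def key_token_score(present: np.ndarray, correct: np.ndarray) -> float:
    """Score one token's association with final-answer correctness.

    present[i]: True if the token appears in rollout i.
    correct[i]: True if rollout i reached the correct final answer.
    (Illustrative scoring; the paper's statistical analysis may differ.)
    """
    # 2x2 contingency table: token presence vs. rollout correctness.
    table = [
        [int(np.sum(present & correct)), int(np.sum(present & ~correct))],
        [int(np.sum(~present & correct)), int(np.sum(~present & ~correct))],
    ]
    _, p_value = fisher_exact(table)
    return 1.0 - p_value  # lower p-value -> stronger association -> higher score

def ktae_token_advantages(rollout_adv: float, token_scores: np.ndarray) -> np.ndarray:
    """Combine the retained rollout-level advantage with per-token importance.

    Key tokens receive a larger share of the learning signal while the
    rollout-level advantage is kept as the base (illustrative combination rule).
    """
    return rollout_adv * (1.0 + token_scores)

# Example: did token t appear in each of 4 rollouts, and was each correct?
present = np.array([True, True, False, False])
correct = np.array([True, False, True, False])
print(key_token_score(present, correct))
```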
- [21/05/2025] 🔥 Key-token Advantage Estimation is coming!
| Model Size | DAPO+KTAE | GRPO+KTAE |
|---|---|---|
| 1.5B | | |
| 7B | | |
Please follow the instructions below to install the required packages.

- Clone this repository

```bash
git clone https://github.com/xiaolizh1/KTAE.git
```

- Install packages

```bash
conda create -n KTAE python=3.10 -y
conda activate KTAE
cd KTAE
pip install -r requirements.txt
```

Our training is mostly performed on the Verl codebase with some modifications.
GRPO+KTAE training:

```bash
bash examples/grpo_trainer/run_qwen2.5_7b.sh         # train the 7B model
bash examples/grpo_trainer/run_qwen2.5_math_1.5b.sh  # train the 1.5B model
```

DAPO+KTAE training:

```bash
bash recipe/dapo/run_dapo_qwen2.5_7b.sh    # train the 7B model
bash recipe/dapo/run_dapo_qwen2.5_1.5b.sh  # train the 1.5B model
```

Merge checkpoints after training:

```bash
cd scripts
bash merge_model.sh  # merge checkpoint
```

Our evaluation code is based on Dr.GRPO.
```bash
cd eval
bash run_eval.sh
```

We provide some results in this section. More detailed results can be found in our paper.
- Method validation results.
- Comparison with baselines on Accuracy.
- Comparison with baselines on Efficiency.
- Ablation analysis.
- Visualization example.
If you find this repo useful for your research, please consider citing the paper:

```bibtex
@misc{sun2025ktaemodelfreealgorithmkeytokens,
      title={KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning},
      author={Wei Sun and Wen Yang and Pu Jian and Qianlong Du and Fuwei Cui and Shuo Ren and Jiajun Zhang},
      year={2025},
      eprint={2505.16826},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.16826},
}
```
We would like to thank the following repos for their great work:

- Verl for providing the training framework
- vLLM for the efficient, high-throughput inference engine
- transformers for providing the model base and fine-tuning framework
This project is released under the Apache 2.0 license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.