Wei Sun, Wen Yang, Pu Jian,
Qianlong Du,
Fuwei Cui,
Shuo Ren,
Jiajun Zhang
Institute of Automation, Chinese Academy of Sciences
Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models (LLMs), even without supervised fine-tuning (SFT). However, prevalent reinforcement learning algorithms such as GRPO and its variants (e.g., DAPO) suffer from a coarse-granularity issue when computing the advantage: they compute rollout-level advantages that assign identical values to every token within a sequence, failing to capture token-specific contributions. To address this limitation, we propose Key-token Advantage Estimation (KTAE), an algorithm that estimates fine-grained token-level advantages without introducing extra models: it applies statistical analysis to the correctness of sampled rollouts to quantify each token's importance and combines this importance with the original rollout-level advantage.
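For context, the sketch below shows how a GRPO-style rollout-level advantage is computed and then broadcast so that every token in a rollout receives the same value; the function names are illustrative, not from this repo.

```python
import numpy as np

def grpo_rollout_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style group normalization: one scalar advantage per rollout."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def broadcast_to_tokens(advantage: float, num_tokens: int) -> np.ndarray:
    """Every token in the rollout is assigned the identical advantage value."""
    return np.full(num_tokens, advantage)

# Example: 4 rollouts sampled for one prompt, scored by a rule-based 0/1 reward.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
advantages = grpo_rollout_advantages(rewards)
print(broadcast_to_tokens(advantages[0], num_tokens=5))
# -> five identical values: no signal about which tokens actually mattered
```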
In summary, the KTAE algorithm offers several advantages:

- KTAE provides more fine-grained advantage information without introducing extra models, resulting in lower training costs.
- KTAE directly computes the importance differences between tokens using statistical analysis methods, offering strong interpretability (see the sketch after this list).
- KTAE's key-token value is computed from the correctness of the final answer and retains the original rollout-level advantage, making it less susceptible to reward hacking.
- KTAE makes the model pay more attention to key tokens and reduces the learning of irrelevant tokens, which effectively shortens response length.
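To make the idea concrete, here is a minimal Python sketch of key-token scoring and advantage combination. It assumes a 2x2 contingency table over token occurrence vs. rollout correctness scored with Fisher's exact test, plus a simple combination rule; both are illustrative assumptions for exposition, not the exact formulas from the paper or this repo.

```python
import numpy as np
from scipy.stats import fisher_exact

def key_token_score(present: np.ndarray, correct: np.ndarray) -> float:
    """Score one token's association with final-answer correctness.

    present[i]: True if the token appears in rollout i.
    correct[i]: True if rollout i reached the correct final answer.
    (Illustrative scoring; the paper's statistical analysis may differ.)
    """
    # 2x2 contingency table: token presence vs. rollout correctness.
    table = [
        [int(np.sum(present & correct)), int(np.sum(present & ~correct))],
        [int(np.sum(~present & correct)), int(np.sum(~present & ~correct))],
    ]
    _, p_value = fisher_exact(table)
    return 1.0 - p_value  # lower p-value -> stronger association -> higher score

def ktae_token_advantages(rollout_adv: float, token_scores: np.ndarray) -> np.ndarray:
    """Combine the retained rollout-level advantage with per-token importance.

    Key tokens receive a larger share of the learning signal while the
    rollout-level advantage is kept as the base (illustrative combination rule).
    """
    return rollout_adv * (1.0 + token_scores)

# Example: did token t appear in each of 4 rollouts, and was each correct?
present = np.array([True, True, False, False])
correct = np.array([True, False, True, False])
print(key_token_score(present, correct))
```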
- [21/05/2025] 🔥 Key-token Advantage Estimation is coming!
| Model Size | DAPO+KTAE | GRPO+KTAE |
|---|---|---|
| 1.5B | | |
| 7B | | |
Please follow the instructions below to install the required packages.

- Clone this repository

```bash
git clone https://github.com/xiaolizh1/KTAE.git
```

- Install packages

```bash
conda create -n KTAE python=3.10 -y
conda activate KTAE
cd KTAE
pip install -r requirements.txt
```

Our training is mostly performed on the Verl codebase with some modifications.
GRPO+KTAE training:

```bash
bash examples/grpo_trainer/run_qwen2.5_7b.sh         # train the 7B model
bash examples/grpo_trainer/run_qwen2.5_math_1.5b.sh  # train the 1.5B model
```

DAPO+KTAE training:

```bash
bash recipe/dapo/run_dapo_qwen2.5_7b.sh    # train the 7B model
bash recipe/dapo/run_dapo_qwen2.5_1.5b.sh  # train the 1.5B model
```

Merge checkpoints after training:

```bash
cd scripts
bash merge_model.sh  # merge checkpoint
```

Our evaluation code is based on Dr.GRPO.
```bash
cd eval
bash run_eval.sh
```

We provide some results in this section. More detailed results can be found in our paper.
- Method validation results.
- Comparison with baselines on Accuracy.
- Comparison with baselines on Efficiency.
- Ablation analysis.
- Visualization example.
If you find this repo useful for your research, please consider citing the paper:

```bibtex
@misc{sun2025ktaemodelfreealgorithmkeytokens,
      title={KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning},
      author={Wei Sun and Wen Yang and Pu Jian and Qianlong Du and Fuwei Cui and Shuo Ren and Jiajun Zhang},
      year={2025},
      eprint={2505.16826},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.16826},
}
```
We would like to thank the following repos for their great work:

- Verl for providing the training framework
- vLLM for the efficient, high-throughput inference engine
- transformers for providing the model base and fine-tuning framework
This project is released under the Apache 2.0 license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.