Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation
Yanqi Dai1,2, Yuxiang Ji3, Xiao Zhang4, Yong Wang2†, Guanhua Chen3, Xiangxiang Chu2, Zhiwu Lu1
1Gaoling School of Artificial Intelligence, Renmin University of China
2AMAP, Alibaba Group
3Xiamen University
4Dalian University of Technology
†Project lead.
- [Jan 31, 2026]: 🛠️ Code and augmented data are released.
- [Jan 29, 2026]: 🔥 Our paper was released on arXiv and HuggingFace, and became the #1 Paper of the Day in HuggingFace Daily Papers.
- [Jan 26, 2026]: 🎉 Our paper was accepted to ICLR 2026.
We propose MathForge, a dual framework that improves mathematical reasoning by targeting harder questions from both the algorithmic and the data perspective. It comprises a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and a Multi-Aspect Question Reformulation (MQR) strategy. Overall, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO effectively learns from the augmented data.
Algorithmically, the widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance: the magnitude of policy updates is lower for harder questions. DGPO first rectifies this implicit imbalance via difficulty-balanced group advantage estimation (DGAE), and then further prioritizes harder questions through difficulty-aware question-level weighting (DQW).
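To make the intuition concrete, here is a minimal NumPy sketch of the two ideas for binary rewards. The specific balancing term, the weighting function, and `alpha` below are illustrative assumptions of ours, not the paper's exact DGAE/DQW formulas:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standard GRPO: normalize rewards within one rollout group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def dgpo_advantages_sketch(rewards: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Hypothetical sketch of DGAE + DQW for binary (0/1) rewards."""
    p = rewards.mean()  # empirical pass rate; small p = hard question
    adv = grpo_advantages(rewards)
    # DGAE (sketch): with binary rewards, a GRPO group's total |advantage|
    # mass scales with sqrt(p * (1 - p)), so hard questions drive smaller
    # updates; dividing it back out balances magnitudes across difficulty.
    adv = adv / (np.sqrt(p * (1.0 - p)) + 1e-8)
    # DQW (sketch): additionally upweight harder questions (lower pass rate).
    return (1.0 - p) ** alpha * adv

# Example: a hard question where 1 of 8 rollouts is correct.
print(dgpo_advantages_sketch(np.array([1.0, 0, 0, 0, 0, 0, 0, 0])))
```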
Data-wise, existing augmentation approaches primarily rephrase questions to enhance diversity, without systematically increasing intrinsic difficulty. MQR instead reformulates questions along multiple aspects to increase difficulty while preserving the original gold answer.
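As a rough illustration of the idea (not the paper's actual MQR instructions), one could ask a strong LLM for an answer-preserving, harder variant of a question through an OpenAI-compatible API. The prompt wording, aspect list, and model name below are all hypothetical:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint is configured

# Hypothetical reformulation aspects, for illustration only.
ASPECTS = ["insert an extra reasoning step", "obscure a given quantity",
           "embed the question in a longer scenario"]

def reformulate(question: str, gold_answer: str, aspect: str) -> str:
    """Request a harder variant that keeps the same gold answer."""
    prompt = (
        f"Rewrite this math question to be more difficult by the following "
        f"change: {aspect}. The rewritten question must remain well-posed "
        f"and must still have the exact answer {gold_answer}.\n\n"
        f"Question: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```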
The main comparative results with Qwen2.5-Math-7B trained on the MATH dataset are presented in the following table, demonstrating the effectiveness of DGPO, MQR, and the overall MathForge framework.
| Methods | AIME24 | AIME25 | AMC23 | MATH500 | Minerva | Olympiad | Avg. (Δ vs. GRPO) |
|---|---|---|---|---|---|---|---|
| Base Model | 12.19 | 4.79 | 35.23 | 48.60 | 15.07 | 16.33 | 22.04 |
| GRPO | 20.94 | 8.44 | 58.98 | 72.20 | 27.76 | 37.33 | 37.61 |
| Dr.GRPO | 21.04 | 8.23 | 58.59 | 72.05 | 28.58 | 35.89 | 37.40 (−0.21) |
| GPG | 21.98 | 9.06 | 59.61 | 72.05 | 27.21 | 37.67 | 37.93 (+0.32) |
| DAPO | 21.25 | 8.75 | 58.20 | 72.70 | 29.50 | 37.22 | 37.94 (+0.33) |
| GSPO | 19.38 | 8.33 | 60.16 | 73.00 | 28.12 | 37.26 | 37.71 (+0.10) |
| GRPO-AD | 21.56 | 9.48 | 59.06 | 73.25 | 29.14 | 37.07 | 38.26 (+0.65) |
| DGPO | 23.85 | 10.21 | 61.02 | 74.25 | 31.07 | 38.33 | 39.79 (+2.18) |
| MQR | 25.00 | 11.77 | 59.38 | 77.85 | 31.43 | 40.81 | 41.04 (+3.43) |
| MathForge | 24.58 | 12.60 | 59.84 | 79.95 | 33.36 | 42.67 | 42.17 (+4.56) |
You can find the datasets constructed in this work at the following links (a loading sketch follows the list):
- MathForge_MATH-augmented: We augmented the training questions of the MATH dataset using our proposed MQR strategy, resulting in a dataset 4 times the size of the original training set.
- MathForge_GEOQA-R1V-revised: We revised the GEOQA_R1V_Train_8K dataset by correcting unit errors in the original gold answers, reformatting the data, and randomly splitting it into training and test sets.
- YanqiDai/MathForge_NuminaMath-CoT-sample80k: We randomly sampled 80k examples from the NuminaMath-CoT dataset for supervised fine-tuning of DeepSeek-Math-7B.
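If the first two datasets are hosted under the same HuggingFace namespace as the third (an assumption; the release links above are authoritative), they can be loaded with the `datasets` library:

```python
from datasets import load_dataset

# Repo id and split name are assumptions, following the naming of the
# sample80k dataset listed above; check the release links for exact ids.
ds = load_dataset("YanqiDai/MathForge_MATH-augmented", split="train")
print(len(ds), ds[0])
```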
Create a conda environment with the required dependencies:
```bash
conda create -n mathforge python=3.10
conda activate mathforge
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0
pip install vllm==0.8.5.post1
pip install flash-attn==2.8.2 --no-build-isolation
```

Clone this repository and install open-r1 and trl from our modified branches:
```bash
git clone https://github.com/AMAP-ML/MathForge.git
# install open-r1
cd MathForge
pip install -e ".[dev]"
# install trl==0.20.0
cd trl-0.20.0
pip install -e .
```

Please refer to the scripts in the scripts_mathforge folder for training various models using GRPO, DGPO, or MathForge (DGPO + MQR).
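Before launching a run, you can sanity-check the environment; a minimal check (assuming a CUDA-capable GPU is visible) is:

```python
# Minimal environment check: CUDA-enabled torch and an importable vllm.
import torch
import vllm  # noqa: F401

print(torch.__version__, torch.cuda.is_available())
```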
To quickly start training, you can use the following command as an example:
```bash
bash scripts_mathforge/Qwen2.5-7B_MATH/run_mathforge.sh
```

For mathematical reasoning evaluation, we recommend using the Lighteval toolkit. You can use the following command to evaluate a trained model on multiple mathematical benchmarks, including AIME24, AIME25, AMC23, MATH500, Minerva, and Olympiad:
```bash
CUDA_VISIBLE_DEVICES=0 bash eval/evaluate_math.sh <model_path>
```

For geometric reasoning evaluation on the GEOQA dataset, you can use the following command:
```bash
CUDA_VISIBLE_DEVICES=0 bash eval/evaluate_geoqa.sh <model_path>
```

This work builds upon several open-source projects, including Open-R1, TRL, R1-V, MATH, and Lighteval. We are grateful to these projects.
If you find MathForge useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{dai2026harder,
  title={Harder is better: Boosting mathematical reasoning via difficulty-aware {GRPO} and multi-aspect question reformulation},
  author={Dai, Yanqi and Ji, Yuxiang and Zhang, Xiao and Wang, Yong and Chu, Xiangxiang and Lu, Zhiwu},
  journal={arXiv preprint arXiv:2601.20614},
  year={2026}
}
```

