
Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

Yanqi Dai1,2, Yuxiang Ji3, Xiao Zhang4, Yong Wang2†, Guanhua Chen3, Xiangxiang Chu2, Zhiwu Lu1
1Gaoling School of Artificial Intelligence, Renmin University of China    2AMAP, Alibaba Group    3Xiamen University    4Dalian University of Technology
†Project lead.

News

  • [Jan 31, 2026]: 🛠️ Code and augmented data are released.
  • [Jan 29, 2026]: 🔥 Our paper was published on arXiv and HuggingFace, and reached #1 Paper of the Day in HuggingFace Daily Papers.
  • [Jan 26, 2026]: 🎉 Our paper is accepted by ICLR 2026.

Contents

  • Introduction
  • Main Results
  • Datasets
  • Installation
  • Training
  • Evaluation
  • Acknowledgement
  • Citation

Introduction

We propose MathForge, a dual framework that boosts mathematical reasoning by targeting harder questions from both the algorithmic and the data perspective. It comprises a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and a Multi-Aspect Question Reformulation (MQR) strategy. Overall, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO effectively learns from the augmented data.

Difficulty-Aware Group Policy Optimization (DGPO)

Algorithmically, the widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance: the magnitude of policy updates is lower for harder questions. DGPO first rectifies this imbalance via difficulty-balanced group advantage estimation (DGAE), and then further prioritizes harder questions through difficulty-aware question-level weighting (DQW), as sketched below.
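
For intuition only, here is a minimal NumPy sketch of the kind of rebalancing DGPO performs on one question's group of rollout rewards. The exact DGAE and DQW formulations are given in the paper; the std-free advantage and the (1 − pass rate) weight below are illustrative assumptions, not the paper's definitions.

import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standard GRPO: normalize one group's rewards by its mean and std."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def dgpo_advantages(rewards: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Illustrative DGPO-style estimate for one question's rollout group.

    DGAE (assumed form): center rewards without per-group std scaling,
    so skewed (hard) groups are not implicitly rescaled.
    DQW (assumed form): weight the whole group by its empirical
    difficulty, here 1 - pass rate, so harder questions contribute more.
    """
    p = rewards.mean()                # pass rate with 0/1 rewards
    advantages = rewards - p          # difficulty-balanced advantages
    weight = 1.0 + alpha * (1.0 - p)  # up-weight harder questions
    return weight * advantages

# Example: a hard question where only 1 of 8 rollouts is correct.
rewards = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print(grpo_advantages(rewards))
print(dgpo_advantages(rewards))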

Multi-Aspect Question Reformulation (MQR)

Data-wise, existing augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty. MQR instead reformulates questions across multiple aspects to increase difficulty while preserving the original gold answer; the core instructions for these reformulation strategies are provided in the paper. A sketch of the augmentation loop follows.
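
As a rough illustration of the data flow (not the paper's actual prompts), the sketch below pairs each reformulated question with the unchanged gold answer. The aspect list, the prompt wording, and the llm_generate callable are all hypothetical placeholders.

ASPECTS = [
    "add an intermediate reasoning step",
    "abstract the given numeric values",
    "embed the question in a richer context",
]

def reformulate(question: str, gold_answer: str, llm_generate) -> list[dict]:
    """Produce harder variants of a question that keep the same gold answer."""
    variants = []
    for aspect in ASPECTS:
        prompt = (
            f"Rewrite the following math question to make it harder "
            f"({aspect}) without changing its final answer.\n\n"
            f"Question: {question}"
        )
        new_question = llm_generate(prompt)
        # The gold answer is kept unchanged, so reward checking during
        # RL training works exactly as it does on the original data.
        variants.append({"question": new_question, "answer": gold_answer})
    return variants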

Main Results

The main comparative results on the MATH dataset using Qwen2.5-Math-7B are presented in the following table, demonstrating the effectiveness of DGPO, MQR, and the overall MathForge framework.

| Methods | AIME24 | AIME25 | AMC23 | MATH500 | Minerva | Olympiad | Avg. ($\Delta_\text{GRPO}$) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Base Model | 12.19 | 4.79 | 35.23 | 48.60 | 15.07 | 16.33 | 22.04 |
| GRPO | 20.94 | 8.44 | 58.98 | 72.20 | 27.76 | 37.33 | 37.61 |
| Dr.GRPO | 21.04 | 8.23 | 58.59 | 72.05 | 28.58 | 35.89 | 37.40 (−0.21) |
| GPG | 21.98 | 9.06 | 59.61 | 72.05 | 27.21 | 37.67 | 37.93 (+0.32) |
| DAPO | 21.25 | 8.75 | 58.20 | 72.70 | 29.50 | 37.22 | 37.94 (+0.33) |
| GSPO | 19.38 | 8.33 | 60.16 | 73.00 | 28.12 | 37.26 | 37.71 (+0.10) |
| GRPO-AD | 21.56 | 9.48 | 59.06 | 73.25 | 29.14 | 37.07 | 38.26 (+0.65) |
| DGPO | 23.85 | 10.21 | 61.02 | 74.25 | 31.07 | 38.33 | 39.79 (+2.18) |
| MQR | 25.00 | 11.77 | 59.38 | 77.85 | 31.43 | 40.81 | 41.04 (+3.43) |
| MathForge | 24.58 | 12.60 | 59.84 | 79.95 | 33.36 | 42.67 | 42.17 (+4.56) |

Datasets

The augmented datasets constructed in this work are released alongside the code in this repository.

Installation

Create a conda environment with the required dependencies:

conda create -n mathforge python=3.10
conda activate mathforge
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0  
pip install vllm==0.8.5.post1
pip install flash-attn==2.8.2 --no-build-isolation

Clone this repository and install open-r1 and trl from our modified branches:

git clone https://github.com/AMAP-ML/MathForge.git
# install open-r1
cd MathForge
pip install -e ".[dev]"
# install trl==0.20.0
cd trl-0.20.0
pip install -e .

Training

Please refer to the scripts in the scripts_mathforge folder for training various models using GRPO, DGPO, or MathForge (DGPO + MQR). To quickly start training, you can use the following command as an example:

bash scripts_mathforge/Qwen2.5-7B_MATH/run_mathforge.sh

Evaluation

For mathematical reasoning evaluation, we recommend using the Lighteval toolkit. You can use the following command to evaluate a trained model on multiple mathematical benchmarks, including AIME24, AIME25, AMC23, MATH500, Minerva, and Olympiad:

CUDA_VISIBLE_DEVICES=0 bash eval/evaluate_math.sh <model_path>

For geometric reasoning evaluation on the GeoQA dataset, you can use the following command:

CUDA_VISIBLE_DEVICES=0 bash eval/evaluate_geoqa.sh <model_path>

Acknowledgement

This work builds upon several open-source projects, including Open-R1, TRL, R1-V, MATH, and Lighteval. We are grateful to their authors and contributors.

Citation

If you find MathForge useful for your research and applications, please cite using this BibTeX:

@article{dai2026harder,
    title={Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware {GRPO} and Multi-Aspect Question Reformulation},
    author={Dai, Yanqi and Ji, Yuxiang and Zhang, Xiao and Wang, Yong and Chen, Guanhua and Chu, Xiangxiang and Lu, Zhiwu},
    journal={arXiv preprint arXiv:2601.20614},
    year={2026}
}
