RLVRDecomposition.mp4
This animation visualizes the decomposition of the RLVR objective into PSR (Positive Sample Reinforcement) and NSR (Negative Sample Reinforcement) components.
This repository implements and investigates the effectiveness of Negative Sample Reinforcement (NSR), Weighted REINFORCE (W-REINFORCE), and Positive Sample Reinforcement (PSR) objectives on mathematical reasoning tasks using the MATH dataset. The work extends the paper "The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning" (Zhu et al., 2025) and explores how these reinforcement learning objectives perform across different model sizes and capabilities (priors).
How do NSR and W-REINFORCE objectives change with the model's prior knowledge and capabilities?
We investigate whether smaller models with less prior mathematical knowledge benefit differently from negative reinforcement compared to larger, more capable models. This study examines the relationship between model size, mathematical reasoning ability, and the effectiveness of different RL objectives.
We conduct experiments across the Qwen2.5 instruction-tuned model family:
- Qwen2.5-0.5B-Instruct (0.49B parameters)
- Qwen2.5-1.5B-Instruct (1.5B parameters)
- Qwen2.5-3B-Instruct (3B parameters)
These models span a useful range of capabilities while sharing the same architecture and training methodology, allowing a controlled comparison of how model priors affect RL objective effectiveness.
- Max Tokens: 512
- Training Steps: 80
- Rollout Batch Size: 64
- Mini-batch Size: 32 (rollout batch of 64 split in two)
- Temperature: 1.0 (consistent across all experiments)
- Rollouts per Prompt: 2
- Learning Rate: 1e-6
- Gradient Accumulation Steps: 2
We could not replicate the paper's hyperparameters due to limited compute.
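For reference, the settings above can be summarized as a single configuration object; the sketch below is purely illustrative, and the field names are assumptions rather than the repository's actual interface.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Hyperparameters used in these experiments (illustrative field names)."""
    max_tokens: int = 512                 # max generated tokens per rollout
    training_steps: int = 80
    rollout_batch_size: int = 64          # completions sampled per RL step
    mini_batch_size: int = 32             # rollout batch split in two
    temperature: float = 1.0
    rollouts_per_prompt: int = 2
    learning_rate: float = 1e-6
    gradient_accumulation_steps: int = 2
```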
- Source: MATH dataset (Hendrycks et al.)
- Processing: Stratified sampling (25% subset) maintaining original train/test distributions
- Format: Problems with step-by-step solutions and `\boxed{}` final answers
- Evaluation: Answer verification using the `math-verify` library
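As a rough illustration of the evaluation step, the sketch below checks a model completion's `\boxed{}` answer against the reference with math-verify's `parse`/`verify` helpers; the exact wiring inside the training script may differ.

```python
from math_verify import parse, verify

# Illustrative strings: a reference answer and a model completion.
gold = parse(r"\boxed{\frac{7}{3}}")
pred = parse(r"Adding the roots gives 7/3, so the answer is \boxed{7/3}.")

# verify() checks mathematical equivalence, so 7/3 and \frac{7}{3} match.
is_correct = verify(gold, pred)

# e.g. map correctness to a +1 / -1 verifiable reward.
reward = 1.0 if is_correct else -1.0
```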
- NSR (Negative Sample Reinforcement): Trains only on incorrect samples with -1.0 reward
- PSR (Positive Sample Reinforcement): Trains only on correct samples with +1.0 reward
- W-REINFORCE: Weighted approach with +λ for correct, -1.0 for incorrect samples
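The sketch below shows one way the three objectives could be expressed as per-sample weights in a REINFORCE-style surrogate loss, with λ = 0.1 for W-REINFORCE (the 0.1× down-weighting referenced in the results). Variable names and the exact loss form are assumptions, not the repository's implementation.

```python
import torch

def objective_weights(is_correct: torch.Tensor, objective: str, lam: float = 0.1) -> torch.Tensor:
    """Per-trajectory weight for each RL objective (illustrative)."""
    correct = is_correct.float()
    if objective == "PSR":           # only correct samples, reward +1
        return correct
    if objective == "NSR":           # only incorrect samples, reward -1
        return -(1.0 - correct)
    if objective == "W_REINFORCE":   # +lam for correct, -1 for incorrect
        return lam * correct - (1.0 - correct)
    raise ValueError(f"unknown objective: {objective}")

def reinforce_loss(logprobs: torch.Tensor, is_correct: torch.Tensor, objective: str) -> torch.Tensor:
    """REINFORCE surrogate: minimize -(weight * trajectory log-prob).

    logprobs: (batch, seq_len) token log-probabilities of the sampled completions.
    """
    weights = objective_weights(is_correct, objective)   # (batch,)
    traj_logprob = logprobs.sum(dim=-1)                  # (batch,)
    return -(weights * traj_logprob).mean()

# Example: a rollout batch of 4 completions, 2 verified correct.
loss = reinforce_loss(torch.randn(4, 512), torch.tensor([1, 0, 1, 0]), "W_REINFORCE")
```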
The following visualization shows the training progress across different model sizes and RL objectives. Because the raw data points were noisy, we smoothed the training curves with a moving average (window = 2).
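The smoothing itself is just a short rolling mean; a minimal pandas sketch, using the 0.5B curve from the first table below:

```python
import pandas as pd

# 0.5B accuracy curve from the first table below (percent).
accuracy = pd.Series([9.1, 9.3, 10.4, 9.0, 8.9, 9.6, 9.3, 10.1, 10.3])
smoothed = accuracy.rolling(window=2, min_periods=1).mean()
```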
The bar chart below compares the final performance of different RL objectives across model sizes:
The following tables show the detailed accuracy measurements at each training step for all RL objectives and model sizes:
| Model | Step 0 | Step 10 | Step 20 | Step 30 | Step 40 | Step 50 | Step 60 | Step 70 | Step 80 |
|---|---|---|---|---|---|---|---|---|---|
| 0.5B | 9.1% | 9.3% | 10.4% | 9.0% | 8.9% | 9.6% | 9.3% | 10.1% | 10.3% |
| 1.5B | 26.2% | 26.6% | 24.5% | 24.5% | 24.6% | 24.6% | 26.8% | 25.7% | 26.3% |
| 3B | 38.4% | 36.4% | 38.2% | 38.0% | 36.7% | 36.7% | 37.7% | 36.2% | 35.1% |
| Model | Step 0 | Step 10 | Step 20 | Step 30 | Step 40 | Step 50 | Step 60 | Step 70 | Step 80 |
|---|---|---|---|---|---|---|---|---|---|
| 0.5B | 9.1% | 8.7% | 9.5% | 9.6% | 10.7% | 8.9% | 9.1% | 9.6% | 10.0% |
| 1.5B | 26.2% | 25.0% | 26.9% | 25.8% | 25.7% | 25.3% | 25.9% | 26.4% | 25.8% |
| 3B | 38.4% | 38.0% | 38.4% | 37.6% | 38.6% | 38.7% | 37.4% | 37.2% | 37.5% |
| Model | Step 0 | Step 10 | Step 20 | Step 30 | Step 40 | Step 50 | Step 60 | Step 70 | Step 80 |
|---|---|---|---|---|---|---|---|---|---|
| 0.5B | 9.1% | 9.5% | 8.6% | 9.7% | 9.8% | 10.1% | 10.4% | 9.5% | 9.8% |
| 1.5B | 23.7% | 25.4% | 24.9% | 27.6% | 24.6% | 25.8% | 26.2% | 27.2% | 27.6% |
| 3B | 35.9% | 38.0% | 36.3% | 36.6% | 36.9% | 37.6% | 37.5% | 36.0% | 36.9% |
Overall, we find that under a resource-constrained setting, the three reinforcement learning methods—NSR, W-REINFORCE, and PSR—either fail to yield meaningful gains or only marginally improve upon the base model.
- For the 0.5B and 1.5B models, all three methods largely maintained baseline performance, achieving only minor accuracy increases over 80 training steps.
- For the 3B model, performance degraded.
We attribute these outcomes primarily to the limited number of effective training samples. The rollout batch size was already small, and NSR and PSR shrink it further by discarding correct and incorrect trajectories, respectively. W-REINFORCE suffers from a similar issue: although it retains all samples, it heavily down-weights positive examples (by 0.1×), concentrating updates on a small set of negative samples. This reduction in effective signal weakens gradient updates and results in minimal or negative performance changes.
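As a rough back-of-the-envelope illustration, using the ~26% baseline accuracy of the 1.5B model from the tables above (the counts below are illustrative, not logged statistics):

```python
rollout_batch = 64        # completions sampled per RL step
p_correct = 0.26          # approximate 1.5B baseline accuracy on MATH

psr_samples = rollout_batch * p_correct         # ~17 trajectories carry gradient under PSR
nsr_samples = rollout_batch * (1 - p_correct)   # ~47 trajectories carry gradient under NSR

# W-REINFORCE keeps all 64 samples but scales positives by 0.1,
# so its effective update is dominated by the same ~47 negatives.
w_reinforce_effective = 0.1 * psr_samples + nsr_samples   # ~49
```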
Given the resource constraints, it remains inconclusive whether model-specific priors influence the relative effectiveness of these methods.
```bash
# NSR with 0.5B model
python MATH_experiment.py --objective NSR --model_id Qwen/Qwen2.5-0.5B-Instruct

# W-REINFORCE with 3B model
python MATH_experiment.py --objective W_REINFORCE --model_id Qwen/Qwen2.5-3B-Instruct

# PSR with 1.5B model
python MATH_experiment.py --objective PSR --model_id Qwen/Qwen2.5-1.5B-Instruct
```

- PyTorch >= 2.0.0
- Transformers >= 4.35.0
- vLLM >= 0.2.0
- math-verify >= 0.1.0
- Datasets >= 2.14.0
- TensorBoard >= 2.14.0
- Inspired by "The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning" by Zhu et al.
- Built on the MATH dataset by Hendrycks et al.
- Uses Qwen2.5 models by Alibaba Cloud

