SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling
Official implementation of "SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling" accepted at AAAI 2026.
📄 Paper: SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling
🤗 Dataset: YangXiao-nlp/DualThinking
This repository contains the official implementation of our AAAI 2026 paper on SCALE (Selective Resource Allocation), a novel four-stage adaptive reasoning framework for mathematical problem solving. Our approach dynamically selects between thinking and non-thinking modes based on step-level difficulty assessment, achieving significant improvements in computational efficiency while maintaining high accuracy.
The SCALE (Selective Resource Allocation) system implements a four-stage mathematical reasoning pipeline:
- Problem Decomposition: Generates multiple candidate decompositions and selects the most coherent step-by-step outline via majority voting
- Difficulty Assessment: Assigns a difficulty score in [0, 1] to each sub-problem conditioned on accumulated context
- Adaptive Mode Selection: Chooses between fast System 1 processing and deliberate System 2 reasoning based on a user-configurable threshold
- Sequential Execution with Context Propagation: Solves sub-problems in order while propagating intermediate results to maintain a coherent reasoning chain
- Adaptive Mode Selection: Dynamically chooses between thinking and non-thinking modes based on step difficulty
- Majority Voting: Generates multiple outline versions and selects the optimal one
- Extensible Architecture: Base class design supports easy integration of different language models
- Incremental Processing: Supports checkpointing and resumption of long-running inference tasks
The core implementation is in inference/cot_dual_thinking.py, which provides:
System3Base (Base Class)
├── thinking_outline_majority_voting() # Stage 1: Problem decomposition via majority voting
├── thinking_scores() # Stage 2: Difficulty assessment for each sub-problem
├── thinking_details() # Stages 3-4: Adaptive mode selection and sequential execution
└── extract_final_answer() # Extract and validate final answer
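Taken together, a driver invokes these stages in order. The sketch below is illustrative only: the method names come from the tree above, but the exact signatures and return values are assumptions, not guaranteed by the repository.

```python
# Illustrative driver over the four stages; question/args are built as in the
# usage example further below. Exact method signatures may differ from the repo code.
from inference.cot_dual_thinking import System3Base

system3 = System3Base(question, args)
system3.thinking_outline_majority_voting()   # Stage 1: decompose the problem into sub-problems
system3.thinking_scores()                    # Stage 2: score each sub-problem's difficulty
system3.thinking_details()                   # Stages 3-4: pick a mode per step and solve sequentially
answer = system3.extract_final_answer()      # extract and validate the final answer
```

The model-specific variants extend this base class: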
System3Base (cot_dual_thinking.py)
├── System3Distill (cot_dual_thinking_distill.py) # DeepSeek Distill variant
└── System3QwQ (cot_dual_thinking_qwq.py) # QwQ model variant
Subclasses override get_thinking_request_func() and get_non_thinking_request_func() to use different language models.
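As a sketch, a new backend could be added roughly like this; the class name and the two request helpers below are hypothetical placeholders, not part of the repository.

```python
# Hypothetical subclass showing the two override points; System3MyModel,
# my_thinking_request, and my_non_thinking_request are placeholders, not repository code.
from inference.cot_dual_thinking import System3Base

class System3MyModel(System3Base):
    def get_thinking_request_func(self):
        # Callable used for deliberate System 2 steps (thinking mode).
        return my_thinking_request

    def get_non_thinking_request_func(self):
        # Callable used for fast System 1 steps (non-thinking mode).
        return my_non_thinking_request
```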
- Python 3.8+
- OpenAI-compatible API endpoint (vLLM recommended)
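Once an endpoint is running and the environment variables from the configuration section below are set, a quick connectivity check might look like the following (the default values here are illustrative):

```python
# Minimal connectivity check against a local OpenAI-compatible endpoint (e.g. vLLM).
# Variable names match the configuration section below; defaults are illustrative.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("QWEN_THINKING_BASE", "http://localhost:8000/v1"),
    api_key=os.environ.get("QWEN_OPENAI_API_KEY", "EMPTY"),
)
print([model.id for model in client.models.list().data])  # models served by the endpoint
```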
# Clone the repository
git clone https://github.com/yourusername/DualThinking.git
cd DualThinking
# Install dependencies
pip install -r requirements.txt

The requirements.txt file lists the main dependencies: openai>=1.40.0 and requests>=2.31.0.
Set the following environment variables for model endpoints:
# For Qwen models (default)
export QWEN_THINKING_BASE="http://localhost:8000/v1"
export QWEN_NONTHINKING_BASE="http://localhost:8000/v1"
export QWEN_OPENAI_API_KEY="your-api-key"
# For DeepSeek Distill variant
export DEEPSEEK_BASE="http://localhost:8001/v1"
export DEEPSEEK_TOKENIZE="http://localhost:8001/tokenize"
# For QwQ variant
export QWQ_BASE="http://localhost:8002/v1"

from inference.cot_dual_thinking import inference_sysytem3
import argparse
# Configure arguments
args = argparse.Namespace(
    output_folder="./output",
    max_tokens=32768,
    top_k=20,
    difficulty=0.5,  # Difficulty threshold for thinking mode
    cot_name="dual_thinking",
    model_name="qwen",
    prompt_version="v1"
)
# Prepare question
question = {
    "id": "math_001",
    "sampling_id": 0,
    "problem": "Your mathematical problem here",
    "condition": "Given conditions",
    "question": "What to find?",
    "answer": "Expected answer"
}
# Run inference
success = inference_sysytem3(question, args)

# Example: Run dual thinking inference on a dataset
python -m inference.cot_dual_thinking \
--input data/eval/aime24/test_converted.jsonl \
--output ./output \
--max_tokens 32768 \
--difficulty 0.5 \
--model_name qwen

# Using DeepSeek Distill variant
from inference.cot_dual_thinking_distill import System3Distill
system3 = System3Distill(question, args)
# ... run inference
# Using QwQ variant
from inference.cot_dual_thinking_qwq import System3QwQ
system3 = System3QwQ(question, args)
# ... run inference

DualThinking/
├── inference/
│ ├── cot_dual_thinking.py # Core implementation (System3Base)
│ ├── cot_dual_thinking_distill.py # DeepSeek Distill variant
│ ├── cot_dual_thinking_qwq.py # QwQ model variant
│ ├── qwen.py # Qwen model interface
│ ├── deepseek_distill.py # DeepSeek Distill interface
│ ├── qwq.py # QwQ model interface
│ ├── util.py # Utility functions
│ └── convert_problem_format.py # Data preprocessing
├── data/
│ ├── eval/ # Evaluation datasets
│ │ ├── aime24/ # AIME 2024 test set
│ │ └── amc23/ # AMC 2023 test set
│ └── train/ # SCALE-generated reasoning traces
│ ├── limo/ # Traces from LIMOPro dataset using QwQ
│ └── limo_v2/ # Enhanced version of LIMOPro traces
├── paper/
│ └── AAAI26.pdf # Paper PDF
├── requirements.txt # Python dependencies
└── README.md # This file
SCALE samples eight alternative outlines for each problem and selects the most coherent version via a self-evaluation prompt.
def thinking_outline_majority_voting(self) -> None:
    outlines = {}
    for version_id in range(1, self.step_outline_versions + 1):
        version = self.thinking_outline_single(generate_id=version_id * sampling_id)
        outlines[f"version{version_id}"] = json.loads(version)
    selected_version = evaluate_and_select(outlines, problem)
    self.steps = outlines[selected_version]

This mirrors the paper's decomposition module by producing an ordered reasoning outline before any detailed solving begins.
Each sub-problem receives a difficulty score in [0, 1] conditioned on the accumulated solution context:
- Scores near 0 denote routine operations.
- Scores near 1 denote challenging reasoning steps.
The difficulty threshold is user-configurable through --difficulty (default: 0.5), allowing practitioners to tune how selectively SCALE engages deliberate reasoning.
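Stage 2 is implemented by thinking_scores(); the sketch below conveys the idea only, with an assumed prompt, request helper, and response parsing rather than the repository's actual implementation.

```python
# Illustrative sketch of step-level difficulty scoring; the prompt wording, request
# helper, and response parsing are assumptions rather than the repository's actual code.
def score_step(request_func, conditions, solved_so_far, sub_problem):
    prompt = (
        f"Given conditions: {conditions}\n"
        f"Current process: {solved_so_far}\n"
        f"Rate the difficulty of the next sub-problem from 0 (routine) to 1 (hard):\n"
        f"{sub_problem}"
    )
    reply = request_func(messages=[{"role": "user", "content": prompt}])
    score = float(reply["content"].strip())
    return min(max(score, 0.0), 1.0)  # clamp into [0, 1]
```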
SCALE compares each difficulty score against the threshold to decide whether to invoke fast System 1 processing or deliberate System 2 reasoning.
enable_thinking = self.steps_scores[step] > self.threshold
request_func = (
    self.get_thinking_request_func()
    if enable_thinking
    else self.get_non_thinking_request_func()
)

Lower thresholds (e.g., --difficulty 0.3) encourage broader use of System 2, while higher thresholds (e.g., --difficulty 0.7) reserve it for the most demanding sub-problems.
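For intuition, here is how a fixed set of step scores maps to modes under different thresholds (the scores 0.3 and 0.7 mirror the output example later in this README):

```python
# How the --difficulty threshold gates thinking mode for a fixed set of step scores.
steps_scores = {"step1": 0.3, "step2": 0.7}
for threshold in (0.2, 0.5, 0.8):
    modes = {
        step: ("thinking" if score > threshold else "non-thinking")
        for step, score in steps_scores.items()
    }
    print(threshold, modes)
# threshold 0.2 -> both steps use thinking mode
# threshold 0.5 -> only step2 (score 0.7) uses thinking mode
# threshold 0.8 -> neither step uses thinking mode
```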
Reasoning proceeds step-by-step while propagating intermediate results so that downstream sub-problems retain full contextual knowledge.
for step_idx, step in enumerate(self.steps, start=1):
    enable_thinking = self.steps_scores[step] > self.threshold
    request_func = (
        self.get_thinking_request_func()
        if enable_thinking
        else self.get_non_thinking_request_func()
    )
    current_process = " ".join(
        f"{self.steps[f'step{i}']}\nstep answer:{self.steps_content[f'step{i}']}"
        for i in range(1, step_idx)
    )
    query = f"Given conditions: {conditions}\nCurrent process: {current_process}\nQuestion: {self.steps[step]}"
    results = request_func(thinking=enable_thinking, messages=messages, ...)
    self.steps_detail[f"step{step_idx}"] = results["reasoning_content"]
    self.steps_content[f"step{step_idx}"] = results["content"]

This stage ensures the final answer is derived from a coherent reasoning chain that accumulates both sub-problems and their solutions.
The training data consists of high-quality synthetic reasoning traces generated using our SCALE framework:
- Source: Problems from the LIMOPro (Xiao et al. 2025) dataset
- Generation: Applied the SCALE framework with the QwQ model to synthesize reasoning traces
- Filtering: Removed instances where SCALE-generated answers differ from the original LIMOPro answers (see the sketch after this list)
- Result: 800 high-quality question-response pairs with SCALE-generated reasoning traces
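The answer-consistency filter can be sketched as follows; the record fields ("id", "final_answer") mirror the output format shown later in this README and are assumptions here, not the repository's actual data schema.

```python
# Sketch of the answer-consistency filter; field names are assumptions.
def filter_consistent_traces(scale_records, reference_answers):
    """Keep only SCALE traces whose final answer matches the LIMOPro reference."""
    kept = []
    for record in scale_records:
        reference = reference_answers.get(record["id"])
        if reference is not None and str(record["final_answer"]) == str(reference):
            kept.append(record)
    return kept
```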
This synthetic data is used for supervised fine-tuning of base models (Qwen2.5-14B/32B/72B-Instruct, Llama3.3-70B-Instruct) to enhance non-reasoning model performance.
The system has been evaluated on the following benchmarks:
- AIME 2024: American Invitational Mathematics Examination
- AMC 2023: American Mathematics Competitions
Dataset and evaluation data are available at: 🤗 YangXiao-nlp/DualThinking
Results and detailed analysis are available in our paper.
The system saves results in JSON format with the following structure:
{
  "id": "math_001",
  "sampling_id": 0,
  "problem": "Original problem text",
  "steps": {
    "step1": "First subproblem",
    "step2": "Second subproblem",
    ...
  },
  "steps_scores": {
    "step1": 0.3,
    "step2": 0.7,
    ...
  },
  "steps_detail": {
    "step1": "Detailed reasoning for step 1",
    ...
  },
  "steps_content": {
    "step1": "Solution for step 1",
    ...
  },
  "final_answer": "42",
  "is_correct": true,
  "cot_name": "dual_thinking",
  "model_name": "qwen"
}
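A small sketch for aggregating these output files; it assumes one JSON record per file under the output folder, which is an assumption about the on-disk layout rather than documented behavior.

```python
# Aggregate saved outputs and compute accuracy; assumes one JSON record per file.
import glob
import json

records = []
for path in glob.glob("./output/*.json"):
    with open(path) as f:
        records.append(json.load(f))

accuracy = sum(r["is_correct"] for r in records) / max(len(records), 1)
print(f"{len(records)} problems, accuracy {accuracy:.2%}")
```

The main configuration parameters are summarized below.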
| Parameter | Default | Description |
|---|---|---|
| max_tokens | 32768 | Maximum tokens for response generation |
| difficulty | 0.5 | Difficulty threshold for thinking mode activation (0.0-1.0). Steps with scores > threshold use thinking mode. Adjust based on resource constraints: lower values (e.g., 0.3) use thinking mode more aggressively; higher values (e.g., 0.7) use it more conservatively. |
| top_k | 20 | Top-k sampling parameter |
| step_outline_versions | 8 | Number of outline versions for majority voting |
If you find this work useful, please cite our paper:
@misc{xiao2025scaleselectiveresourceallocation,
title={SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling},
author={Yang Xiao and Chunpu Xu and Ruifeng Yuan and Jiashuo Wang and Wenjie Li and Pengfei Liu},
year={2025},
eprint={2512.00466},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.00466},
}

Please also cite the LIMOPro dataset, which provides the source problems for our training data:
@article{xiao2025limopro,
title={LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling},
author={Xiao, Yang and Wang, Jiashuo and Yuan, Ruifeng and Xu, Chunpu and Xu, Kaishuai and Li, Wenjie and Liu, Pengfei},
journal={arXiv preprint arXiv:2505.19187},
year={2025}
}

Our training and evaluation datasets are publicly available on Hugging Face at 🤗 YangXiao-nlp/DualThinking.
The dataset includes:
- Training data: High-quality synthetic reasoning traces generated by applying the SCALE framework with the QwQ model to problems from the LIMOPro dataset (800 curated question-response pairs)
- Evaluation sets: AIME 2024 and AMC 2023 test sets for benchmark evaluation
- Converted formats: Preprocessed data formats for easy integration with the inference pipeline
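For reference, one way to pull the released data from the Hub (the configuration and split names may differ, so check the dataset card):

```python
# Illustrative download of the released data via the `datasets` library.
from datasets import load_dataset

dataset = load_dataset("YangXiao-nlp/DualThinking")
print(dataset)
```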
This project is licensed under the MIT License - see the LICENSE file for details.
For questions and feedback, please open an issue on GitHub or contact the authors.
We thank the AAAI 2026 reviewers for their valuable feedback and suggestions.