Description
In the DeepSeek-R1 paper, it's mentioned that for smaller models (<=32B), knowledge distillation is more effective than RL. Based on this, can we hypothesize that training DeepSeek-R1 on the KK dataset and then generating a large amount of CoT data (or generating CoT data directly with o3-mini-high), and finally distilling it into the Qwen2.5-7B model via SFT, would in theory yield better results?
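If that distillation route were taken, the teacher's CoT traces would need to be formatted into SFT training pairs before fine-tuning. A minimal sketch of that preprocessing step is below; the field names, the prompt wording, and the example record are my own assumptions for illustration, not taken from the paper or this repo:

```python
# Hypothetical sketch: turn teacher-generated CoT traces into SFT examples
# for distillation. Field names, tags, and the toy record are assumptions,
# not the actual format used by DeepSeek-R1 or this project.

def build_sft_example(question: str, cot: str, answer: str) -> dict:
    """Format one teacher CoT trace as a prompt/completion pair for SFT."""
    prompt = f"Question: {question}\nThink step by step, then answer."
    completion = f"<think>{cot}</think>\n<answer>{answer}</answer>"
    return {"prompt": prompt, "completion": completion}

# Toy K&K-style record (content is made up for illustration):
record = build_sft_example(
    question="A says B is a knave; B says A and B are the same type. Who lies?",
    cot="Assume A is a knight; then B is a knave, so B's claim must be false...",
    answer="A is a knight, B is a knave",
)
print(record["prompt"])
print(record["completion"])
```

Pairs in this shape could then be fed to a standard SFT trainer; the student never sees the teacher's logits, only its sampled CoT text, which is the "distillation via SFT" setup the paper describes.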
Meanwhile, this project achieved excellent performance on the KK dataset using a combination of curriculum learning and pure reinforcement learning. This demonstrates that pure RL can be effective on small models, which seems somewhat "in conflict" with the view presented in the DeepSeek-R1 paper.
Given these points, how should we systematically choose a strategy for improving the performance of smaller models? For tasks where large models (like DeepSeek-R1) already perform well (both reasoning and non-reasoning tasks), should we directly use the large model to generate data and distill it into the small model? And for custom tasks where large models perform poorly (e.g., specialized reasoning tasks), should we instead train the small models ourselves with reinforcement learning?
Any help would be appreciated!