Description
In the DeepSeek-R1 paper, it's mentioned that for smaller models (<=32B), knowledge distillation is more effective than RL. Based on this, can we hypothesize that training DeepSeek-R1 on the KK dataset and then generating a large amount of CoT data (or generating CoT data directly with o3-mini-high), and finally distilling it into the Qwen2.5-7B model via SFT, would in theory yield better results?
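If that distillation route were taken, the teacher's CoT traces would need to be formatted into SFT training pairs before fine-tuning. A minimal sketch of that preprocessing step is below; the field names, the prompt wording, and the example record are my own assumptions for illustration, not taken from the paper or this repo:

```python
# Hypothetical sketch: turn teacher-generated CoT traces into SFT examples
# for distillation. Field names, tags, and the toy record are assumptions,
# not the actual format used by DeepSeek-R1 or this project.

def build_sft_example(question: str, cot: str, answer: str) -> dict:
    """Format one teacher CoT trace as a prompt/completion pair for SFT."""
    prompt = f"Question: {question}\nThink step by step, then answer."
    completion = f"<think>{cot}</think>\n<answer>{answer}</answer>"
    return {"prompt": prompt, "completion": completion}

# Toy K&K-style record (content is made up for illustration):
record = build_sft_example(
    question="A says B is a knave; B says A and B are the same type. Who lies?",
    cot="Assume A is a knight; then B is a knave, so B's claim must be false...",
    answer="A is a knight, B is a knave",
)
print(record["prompt"])
print(record["completion"])
```

Pairs in this shape could then be fed to a standard SFT trainer; the student never sees the teacher's logits, only its sampled CoT text, which is the "distillation via SFT" setup the paper describes.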
Meanwhile, this project achieved excellent performance on the KK dataset using a combination of curriculum learning and pure reinforcement learning. This demonstrates that pure RL can be effective on small models, which seems somewhat "in conflict" with the view presented in the DeepSeek-R1 paper.
Given these points, how should we systematically choose a strategy for improving the performance of smaller models? For tasks where large models (like DeepSeek-R1) already perform well (both reasoning and non-reasoning tasks), should we directly use the large model to generate data and distill it into the small model? And for custom tasks where large models perform poorly (e.g., specialized reasoning tasks), should we instead train the small models ourselves with reinforcement learning?
Any help would be appreciated!