Dear author, I'm a bit confused about the training procedure for the LRM. There is a graph showing performance before and after RL. Do you first run SFT on the QWQ model and then apply RL? If so, will the SFT dataset be released?