Question about LRM model training

Dear author, I'm a bit confused about the training procedure for the LRM. There is a graph showing performance before and after RL. Is the QWQ model first fine-tuned with SFT and then trained with RL? If so, will the SFT dataset be released?