Thank you for your excellent work and the insightful paper. We are attempting to reproduce your results with the kk-datasets configuration, but we observe no language-mixing phenomena even after the model converges.
What is the average response length (in tokens) observed during language-mixing events?
Experiment settings:
- model: qwen2.5-7b-base
- max_response_length: 4096
- datasets: kk 5-8ppl
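For context on how we measure it on our side, a minimal sketch of computing average response length in tokens. The `encode` callable is a stand-in for any tokenizer's encode function (e.g. a Hugging Face `AutoTokenizer` for Qwen2.5); the whitespace split below is only a placeholder:

```python
from statistics import mean
from typing import Callable, List


def avg_response_length(responses: List[str],
                        encode: Callable[[str], List[str]]) -> float:
    """Average token count across responses; `encode` maps text to tokens/ids."""
    return mean(len(encode(r)) for r in responses)


# Whitespace "tokenizer" as a placeholder; in practice swap in e.g.
# AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B").encode.
print(avg_response_length(["a b c", "a b"], str.split))  # 2.5
```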