Observation: Training loss occasionally diverges during flow-matching optimization when high learning rates or large batch sizes are used.
Proposed Fix:
Add torch.nn.utils.clip_grad_norm_ to the training loop, applied after backward() and before optimizer.step() (suggested max norm: 1.0).
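A minimal sketch of where the clipping call would sit in a training step. The model, optimizer, and loss below are placeholders (the actual flow-matching model and loss are not shown in this report); only the clip_grad_norm_ call and its position are the point.

```python
import torch
import torch.nn as nn

# Placeholder model/optimizer standing in for the actual flow-matching setup.
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(8, 16)
target = torch.randn(8, 16)

optimizer.zero_grad()
# Stand-in for the vector field loss.
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
# Clip the global gradient norm to 1.0 *after* backward(), *before* step().
# Returns the pre-clip norm, which is useful to log for divergence debugging.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Logging the returned `grad_norm` alongside the loss makes it easy to see whether divergence events coincide with gradient spikes.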
Maintain an Exponential Moving Average (EMA) copy of the model weights and use it for inference, to improve the robustness of the generated speech samples.
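A sketch of a simple EMA helper, assuming the standard decay-based update (the `EMA` class, the decay value, and the toy `nn.Linear` model are all illustrative, not part of the original proposal):

```python
import copy
import torch
import torch.nn as nn

class EMA:
    """Hypothetical helper: keeps a shadow copy of the model weights,
    updated as shadow = decay * shadow + (1 - decay) * current."""

    def __init__(self, model: nn.Module, decay: float = 0.999):
        self.decay = decay
        # Shadow model holds the averaged weights; never trained directly.
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: nn.Module) -> None:
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# Toy example: one training step moves the weights, then EMA smooths it.
model = nn.Linear(4, 4)
ema = EMA(model, decay=0.9)
with torch.no_grad():
    for p in model.parameters():
        p.add_(1.0)  # simulate an optimizer update
ema.update(model)
# Generate samples from ema.shadow rather than the raw model.
```

In the training loop, `ema.update(model)` would be called once after each `optimizer.step()`, and sampling/evaluation would use `ema.shadow`.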
Verification: Monitor the vector field loss over 50k+ training steps and confirm it converges smoothly, without the divergence spikes seen before.