Objective: Extend the current implementation to support cross-lingual style transfer (e.g., using an English voice prompt to generate Spanish speech).
Technical Requirements:
Integrate a multilingual phonemizer (e.g., espeak-ng or gruut).
Update the audio_to_text alignment logic to handle non-English character sets.
Verify the flow-matching objective's performance across different language embeddings.
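
For the character-set requirement, one possible starting point is a Unicode-aware transcript normalizer that runs before alignment. The sketch below is only an illustration: the function name normalize_transcript is hypothetical and is not part of the existing audio_to_text code, and it assumes the aligner works on lowercase text made of letters and single spaces. It uses only the Python standard library.

```python
import unicodedata


def normalize_transcript(text: str) -> str:
    """Hypothetical helper: prepare a transcript in any script for alignment.

    Assumption: the aligner expects lowercase letters separated by single
    spaces. This is not taken from the repo's audio_to_text code.
    """
    # Lowercase first, then compose to NFC so that "n" + U+0303 and the
    # precomposed "ñ" map to the same single code point.
    text = unicodedata.normalize("NFC", text.lower())

    kept = []
    for ch in text:
        # Unicode general categories starting with "L" are letters and those
        # starting with "M" are combining marks, in any script.
        if unicodedata.category(ch)[0] in ("L", "M"):
            kept.append(ch)
        else:
            # Punctuation (including "¿" and "¡"), digits, and whitespace
            # all become word breaks.
            kept.append(" ")

    # Collapse runs of spaces and trim the ends.
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    print(normalize_transcript("¿Dónde está el niño?"))
```

Filtering on Unicode categories rather than an ASCII range such as [a-z] keeps accented and non-Latin letters intact. Treating every non-letter as a word break is a simplification: apostrophes and digits, for example, may need language-specific handling.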
Context: This would bring the repo closer to the full functionality described in the Meta Voicebox paper (Le et al., 2023).