Optimized scripts for running HeartMuLa music generation with maximum performance on NVIDIA GPUs, especially A100
- Flash Attention 2 integration for faster transformer inference
- TF32 optimization for A100 GPU
- Mixed precision support (FP16/BF16)
- Optimized CUDA memory management
- Ready-to-use Jupyter notebook and Python script
- Simple lyrics and tags customization
- NVIDIA GPU with at least 24GB VRAM (tested on A100 80GB)
- CUDA 11.8 or higher
- Python 3.8+
- PyTorch 2.0+
- CUDA Toolkit
- FFmpeg
- Upload
heartlib.ipynbto your environment - Run the cells in order:
- Cell 1: Setup Environment
- Cell 2: Download Checkpoints
- Cell 3: Run Generation
# Clone this repo
git clone https://github.com/theelderemo/HeartLib-Google-Colab.git
cd HeartLib-Google-Colab
# Run the script
python heartlib.pyEdit the lyrics and tags in the generation cell/script:
my_lyrics = """
[Verse]
Your lyrics here
[Chorus]
Your chorus here
"""
my_tags = "piano,happy,pop"The script automatically applies these optimizations:
- TF32 precision: Enabled by default on A100
- Flash Attention 2: Auto-enabled if installed
- Memory optimization: CUDA allocator configured for efficiency
--model_path (required): Path to the pretrained model checkpoint
--lyrics: Path to lyrics file (default: ./assets/lyrics.txt)
--tags: Path to tags file (default: ./assets/tags.txt)
--save_path: Output audio file path (default: ./assets/output.mp3)
--max_audio_length_ms: Maximum audio length in milliseconds (default: 240000)
--topk: Top-k sampling parameter for generation (default: 50)
--temperature: Sampling temperature for generation (default: 1.0)
--cfg_scale: Classifier-free guidance scale (default: 1.5)
--version: The version of HeartMuLa, choose between [3B, 7B]. (default: 3B) # 7B version not released yet.
As it sits, with what I have, it generates great output, and auto adjust output lenght based on the amount of text/lyrics.
Available versions:
3B(default) - 3 billion parameter model- Check HeartMuLa repo for other versions
All optimizations are enabled by default.
- V100/A10: TF32 not available, but Flash Attention still helps
- RTX 3090/4090: Excellent performance with FP16
- T4/Smaller GPUs: May need to reduce batch size or use smaller models
- Check GPU utilization:
nvidia-smi -l 1 - Verify CUDA version:
nvcc --version - Monitor memory: Ensure you're not hitting OOM errors
- Try without Flash Attention: Comment out the flash-attn installation if it causes issues
heartlib-optimized/
├── README.md # This file
├── heartlib.ipynb # Jupyter notebook version
├── heartlib.py # Standalone Python script
Generated music is saved to ./heartlib/assets/output.mp3 by default.
# Use xformers as fallback
pip install xformers- Reduce lyrics length
- Use a smaller model version
- Clear GPU cache:
torch.cuda.empty_cache()
The first run compiles kernels and may take longer. Subsequent runs will be faster.
Tested on various hardware configurations:
| GPU | VRAM | Time (baseline) | Time (optimized) | Speedup |
|---|---|---|---|---|
| A100 80GB | 80GB | ~120s | ~35s | 3.4x |
| A100 40GB | 40GB | ~120s | ~38s | 3.2x |
| RTX 4090 | 24GB | ~180s | ~65s | 2.8x |
| V100 | 32GB | ~200s | ~85s | 2.4x |
Times are approximate for a 2-minute song with standard lyrics
- Original model: HeartMuLa
- Optimizations: Based on PyTorch 2.0+ best practices and Flash Attention 2
This repository contains only wrapper scripts. Please refer to the original HeartMuLa license for model usage terms.
Contributions are welcome! Please feel free to submit a Pull Request.
- Support for batch generation
- Additional optimization techniques
- Integration with other music generation tools
- Web UI wrapper
- Issues: Open an issue in this repo
- Original Model Issues: Report to HeartMuLa repo
- Optimization Questions: Check PyTorch and Flash Attention documentation
If you find this useful, please star the repo!
Note: This is an optimization wrapper for HeartMuLa. All model weights and core functionality belong to the original HeartMuLa project.