
Fix: unwrap DDP before enabling gradient checkpointing for HF compatibility #31

Open
magictut (magicwang1111) wants to merge 1 commit into Lightricks:main from magicwang1111:main

Conversation

@magicwang1111

This PR introduces three key improvements:

  1. Dependency Standardization

    • Added a requirements.txt file based on the original uv.lock to ensure reproducible installations across environments.
    • Pins versions of diffusers, accelerate, transformers, the CUDA toolkit packages, and other core libraries.
  2. Support for Custom Master Port

    • Added a new --main_process_port option in scripts/train_distributed.py to allow explicit control over the master port used by Accelerate’s distributed launcher.
    • Prevents port conflicts when launching multiple distributed training jobs on the same host (a hypothetical wiring sketch follows the list).

    python scripts/train_distributed.py configs/your_config.yaml \
        --num_processes 2 \
        --main_process_port 29600

  3. Fix for HF DDP Compatibility

    • In src/ltxv_trainer/trainer.py, unwrap the base model from DistributedDataParallel before calling the gradient-checkpointing API (see the sketch below the list).
    • Prevents the runtime AttributeError: 'DistributedDataParallel' object has no attribute 'enable_gradient_checkpointing'.
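
For item 2, here is a minimal, hypothetical sketch of how the new `--main_process_port` flag could be threaded through a launcher script like `scripts/train_distributed.py`, assuming the script shells out to `accelerate launch` (the actual implementation in this PR may differ; `scripts/train.py` below is a placeholder for the real training entry point):

```python
# Hypothetical wiring, not the code from this PR: parse the new flag and
# forward it to `accelerate launch`, whose --main_process_port option is
# what actually controls the rendezvous port.
import argparse
import subprocess
import sys


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("config", help="Path to the training config YAML")
    parser.add_argument("--num_processes", type=int, default=1)
    parser.add_argument(
        "--main_process_port",
        type=int,
        default=29500,
        help="Master port used by Accelerate's distributed launcher",
    )
    args = parser.parse_args()

    cmd = [
        "accelerate", "launch",
        "--num_processes", str(args.num_processes),
        "--main_process_port", str(args.main_process_port),
        "scripts/train.py",  # placeholder entry point
        args.config,
    ]
    # Each concurrent job on the host can now bind its own port.
    sys.exit(subprocess.call(cmd))


if __name__ == "__main__":
    main()
```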
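
For item 3, a minimal sketch of the unwrap, assuming the model exposes the Hugging Face/diffusers `enable_gradient_checkpointing()` method; the exact change in `src/ltxv_trainer/trainer.py` may be shaped differently (e.g. via `accelerator.unwrap_model`):

```python
# Sketch only: reach through the DDP wrapper before calling the HF
# gradient-checkpointing API, since DDP does not proxy that method.
from torch.nn.parallel import DistributedDataParallel as DDP


def enable_gradient_checkpointing(model) -> None:
    # DDP stores the wrapped model on `.module`; calling the method on the
    # wrapper itself raises the AttributeError quoted above.
    base_model = model.module if isinstance(model, DDP) else model
    base_model.enable_gradient_checkpointing()
```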
