RuntimeError: CUDA error: invalid argument #27

@horrybe

(animatediff) root@ops:/data/hbx/txt2vid/AnimateDiff-MotionDirector# CUDA_LAUNCH_BLOCKING=1 python train.py --config configs/training/motion_director/my_video.yaml
{'image_finetune': False, 'output_dir': 'outputs', 'pretrained_model_path': 'models/StableDiffusion/stable-diffusion-v1-5', 'motion_module_path': 'models/Motion_Module/animatediff/v3_sd15_mm.ckpt', 'domain_adapter_path': 'models/Motion_Module/animatediff/v3_sd15_adapter.ckpt', 'unet_additional_kwargs': {'use_inflated_groupnorm': True, 'use_motion_module': True, 'motion_module_resolutions': [1, 2, 4, 8], 'motion_module_mid_block': False, 'motion_module_type': 'Vanilla', 'motion_module_kwargs': {'num_attention_heads': 8, 'num_transformer_block': 1, 'attention_block_types': ['Temporal_Self', 'Temporal_Self'], 'temporal_position_encoding': True, 'temporal_position_encoding_max_len': 32, 'temporal_attention_dim_div': 1, 'zero_initialize': True}}, 'noise_scheduler_kwargs': {'num_train_timesteps': 1000, 'beta_start': 0.00085, 'beta_end': 0.012, 'beta_schedule': 'linear', 'clip_sample': False}, 'use_text_augmenter': False, 'dataset_types': ['single_video'], 'cfg_random_null_text_ratio': 0, 'train_data': {'manual_sample_size': True, 'sample_size': [480, 480], 'width': 384, 'height': 384, 'use_bucketing': True, 'sample_start_idx': 0, 'fps': 0, 'frame_step': 3, 'n_sample_frames': 16, 'sample_n_frames': 16, 'single_video_path': 'examples/birdstep.mp4', 'single_video_prompt': 'a bird is walking on the grass ', 'fallback_prompt': '', 'max_chunks': 1}, 'validation_data': {'prompts': ['a bird is walking on the grass'], 'num_inference_steps': 25, 'guidance_scale': 9, 'spatial_scale': 0.5, 'validation_seed': 44}, 'lora_name': 'birdwalking', 'use_motion_lora_format': True, 'lora_rank': 64, 'lora_unet_dropout': 0.1, 'single_spatial_lora': True, 'train_sample_validation': False, 'unet_checkpoint_path': '', 'learning_rate': 0.0005, 'learning_rate_spatial': 0.0001, 'adam_weight_decay': 0.01, 'cache_latents': True, 'train_batch_size': 1, 'use_lion_optim': True, 'use_offset_noise': False, 'max_train_epoch': 500, 'max_train_steps': -1, 'checkpointing_epochs': -1, 'checkpointing_steps': 100, 
'validation_steps': 50, 'validation_steps_tuple': [2, 50], 'global_seed': 33, 'mixed_precision_training': True, 'enable_xformers_memory_efficient_attention': True, 'gradient_checkpointing': True, 'is_debug': False}
loaded 3D unet's pretrained weights from models/StableDiffusion/stable-diffusion-v1-5/unet ...

missing keys: 560;

unexpected keys: 0;

Motion Module Parameters: 417.1376 M

Caching Latents.: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00, 7.01s/it]
load motion module from models/Motion_Module/animatediff/v3_sd15_mm.ckpt
load domain lora from models/Motion_Module/animatediff/v3_sd15_adapter.ckpt
Using LoRA Version: cloneofsimo
Lora successfully injected into UNet3DConditionModel.
Using LoRA Version: cloneofsimo
Lora successfully injected into UNet3DConditionModel.
07/04/2024 09:32:26 - INFO - root - ***** Running training *****
07/04/2024 09:32:26 - INFO - root - Num examples = 1
07/04/2024 09:32:26 - INFO - root - Num Epochs = 500
07/04/2024 09:32:26 - INFO - root - Instantaneous batch size per device = 1
07/04/2024 09:32:26 - INFO - root - Total train batch size (w. parallel, distributed & accumulation) = 1
07/04/2024 09:32:26 - INFO - root - Gradient Accumulation steps = 1
07/04/2024 09:32:26 - INFO - root - Total optimization steps = 500
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:01<00:00, 13.47it/s]
Dataset sanity check...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.85s/it]
Traceback (most recent call last):
  File "/data/hbx/txt2vid/AnimateDiff-MotionDirector/train.py", line 1038, in <module>
    main(name=name, use_wandb=args.wandb, **config)
  File "/data/hbx/txt2vid/AnimateDiff-MotionDirector/train.py", line 868, in main
    scaler.scale(loss_spatial).backward(retain_graph=True)
  File "/data/hbx/envs/animatediff/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/data/hbx/envs/animatediff/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/data/hbx/envs/animatediff/lib/python3.10/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/data/hbx/envs/animatediff/lib/python3.10/site-packages/torch/autograd/function.py", line 414, in wrapper
    outputs = fn(ctx, *args)
  File "/data/hbx/envs/animatediff/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 111, in backward
    grads = _memory_efficient_attention_backward(
  File "/data/hbx/envs/animatediff/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 382, in _memory_efficient_attention_backward
    grads = op.apply(ctx, inp, grad)
  File "/data/hbx/envs/animatediff/lib/python3.10/site-packages/xformers/ops/fmha/cutlass.py", line 184, in apply
    (grad_q, grad_k, grad_v,) = cls.OPERATOR(
  File "/data/hbx/envs/animatediff/lib/python3.10/site-packages/torch/_ops.py", line 442, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: invalid argument

Judging from the output, the configuration loads fine and training starts, but the first backward pass on the spatial loss fails with the error above. Any help would be greatly appreciated.
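The last frames of the traceback are in xformers/ops/fmha/cutlass.py, so one way to narrow this down might be to rerun with `enable_xformers_memory_efficient_attention` turned off and see whether the backward pass still fails. Below is a minimal sketch of that A/B toggle. It assumes the keys in the YAML file map one-to-one onto the dict printed at startup, and `without_xformers` is a hypothetical helper, not something from the AnimateDiff-MotionDirector repo.

```python
import copy

# Hypothetical helper (not part of AnimateDiff-MotionDirector): returns a copy
# of the training config with xformers attention disabled, leaving the
# original dict untouched so both variants can be compared.
def without_xformers(config):
    patched = copy.deepcopy(config)
    patched["enable_xformers_memory_efficient_attention"] = False
    return patched

# Subset of the config printed at startup in the log above.
config = {
    "mixed_precision_training": True,
    "enable_xformers_memory_efficient_attention": True,
    "gradient_checkpointing": True,
}

patched = without_xformers(config)
print(patched["enable_xformers_memory_efficient_attention"])  # False
```

This would not fix the kernel error itself. It only checks whether the xformers backward path is the trigger, and memory use will probably be higher with it disabled.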
