
Multi-GPU Inference on H20 Causes Rotary Position Embedding Dimension Mismatch #26

@pikanzz

Description


Title:
Dimension mismatch in rotary position embedding when using multi-GPU inference with VideoAlign-based video reward model

Body:
Hi,

I am working with a video reward model based on VideoAlign. My training configuration uses FPS = 4.
During inference on videos longer than 5 seconds, single-GPU inference on an A100-80G runs out of memory (OOM).

To avoid the OOM, I tried multi-GPU inference on H20 GPUs.
I modified my inference code to wrap the model in nn.DataParallel when multiple GPUs are detected:

import os

import torch
import torch.nn as nn


class VideoVLMRewardInference:
    def __init__(self, load_from_pretrained, load_from_pretrained_step=-1, device=None, dtype=torch.bfloat16):
        config_path = os.path.join(load_from_pretrained, "VideoReward/model_config.json")
        data_config, _, model_config, peft_lora_config, inference_config = load_configs_from_json(config_path)
        data_config = DataConfig(**data_config)
        model_config = ModelConfig(**model_config)
        peft_lora_config = PEFTLoraConfig(**peft_lora_config)

        training_args = TrainingConfig(
            load_from_pretrained=load_from_pretrained,
            load_from_pretrained_step=load_from_pretrained_step,
            gradient_checkpointing=False,
            disable_flash_attn2=False,
            bf16=(dtype == torch.bfloat16),
            fp16=(dtype == torch.float16),
            output_dir="",
        )
        
        model, processor, peft_config = create_model_and_processor(
            model_config=model_config,
            peft_lora_config=peft_lora_config,
            training_args=training_args,
        )

        model, checkpoint_step = load_model_from_checkpoint(
            model,
            load_from_pretrained,
            load_from_pretrained_step
        )
        model.eval()

        if torch.cuda.device_count() > 1:
            print(f"Using {torch.cuda.device_count()} GPUs for inference...")
            self.device = "cuda"
            model = nn.DataParallel(model)
        else:
            self.device = "cuda:0" if torch.cuda.is_available() else "cpu"

        self.model = model.to(self.device)
        self.processor = processor
        self.data_config = data_config
        self.inference_config = inference_config

When running multi-GPU inference, I get the following error:

  File ".../transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 372, in forward
    q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
  File ".../transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 255, in apply_rotary_pos_emb_vision
    output = (tensor * cos) + (rotate_half(tensor) * sin)
RuntimeError: The size of tensor a (8960) must match the size of tensor b (20480) at non-singleton dimension 1
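For context, the failing line in apply_rotary_pos_emb_vision is an elementwise broadcast, `(tensor * cos) + (rotate_half(tensor) * sin)`, so the mismatch can be reproduced in isolation. The shapes below are stand-ins taken directly from the error message, not the model's real tensor layouts (a minimal sketch):

```python
import torch

def reproduce_rotary_broadcast_error():
    """Reproduce the broadcast failure from apply_rotary_pos_emb_vision.

    The per-replica q/k tensor ends up with width 8960 while the cos/sin
    tables were built with width 20480, so elementwise multiply cannot
    broadcast at dimension 1. Shapes here are illustrative stand-ins.
    """
    tensor = torch.randn(1, 8960)   # stand-in for the replica's q projection
    cos = torch.randn(1, 20480)     # stand-in for the precomputed cos table
    try:
        _ = tensor * cos            # same op as (tensor * cos) in the traceback
        return None
    except RuntimeError as e:
        return str(e)

msg = reproduce_rotary_broadcast_error()
print(msg)
```

This isolates the error to the two operands disagreeing in size at a non-singleton dimension, which is consistent with the rotary tables and the scattered inputs being built from different sequence extents.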

Question:
How can I fix this rotary position embedding dimension mismatch when switching from single-GPU to multi-GPU (nn.DataParallel) inference?
