
issue in multi-gpu training with DreamSim as loss #32

@JHLew

Description


DreamSim works as a perceptual loss, as demonstrated in the demo in README.md.
However, in multi-GPU training, although the run completes without raising errors, there appears to be an issue.

I ran a 4-GPU job. Each GPU holds about 60G of VRAM for its own process, which is as expected, but one GPU, namely gpu:0 (which hosts the main process in multi-GPU training), carries 3 extra processes that take up additional VRAM. It seems the distributed worker processes are each accessing the main process's GPU. Replacing the loss with other losses never causes this, so I am fairly sure it comes from the dreamsim implementation, but I am not sure how to get around it. I spent days trying various ways to bypass it, but failed.

Has anyone run into a similar issue and resolved it?

Below is the nvidia-smi output showing what it actually looks like.

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name              GPU Memory     |
|        ID   ID                                                Usage         |
|=============================================================================|
|    0   N/A  N/A   3683677      C   python                       63888MiB    |
|    0   N/A  N/A   3683678      C   python                         526MiB    |
|    0   N/A  N/A   3683679      C   python                         526MiB    |
|    0   N/A  N/A   3683680      C   python                         526MiB    |
|    1   N/A  N/A   3683678      C   python                       63908MiB    |
|    2   N/A  N/A   3683679      C   python                       63908MiB    |
|    3   N/A  N/A   3683680      C   python                       63908MiB    |
+-----------------------------------------------------------------------------+
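For what it's worth, small (~500MiB) phantom processes on gpu:0 are the classic signature of each distributed worker creating a stray CUDA context on the default device, since a bare "cuda" device string resolves to cuda:0 in every process. A minimal sketch of a possible workaround, assuming a standard torchrun/DDP setup; the `device=` argument to `dreamsim()` is an assumption about that package's API, not something confirmed here:

```python
# Sketch: pin each DDP worker to its own GPU before any CUDA call, so no
# worker initializes a context on cuda:0. Assumes torchrun (or a similar
# launcher) sets LOCAL_RANK per worker.
import os

import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))

if torch.cuda.is_available():
    # Must run before any tensor or model is moved to "cuda"; afterwards a
    # bare "cuda" resolves to this worker's GPU instead of cuda:0.
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")
else:
    device = torch.device("cpu")  # fallback so this sketch runs on CPU-only hosts

# Hypothetical usage: pass the per-rank device explicitly rather than "cuda".
# from dreamsim import dreamsim
# model, preprocess = dreamsim(pretrained=True, device=device)
```

If dreamsim hard-codes "cuda" internally (e.g. when loading pretrained weights), calling `torch.cuda.set_device(local_rank)` early in each worker is usually enough, since it changes what "cuda" resolves to for that process.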
