
issue in multi-gpu training with DreamSim as loss #32

@JHLew

Description


DreamSim works as a perceptual loss, as demonstrated in the demo in README.md.
However, in multi-GPU training, although the run completes without raising errors, there appears to be an issue.

I ran a 4-GPU job. Each GPU holds about 60G of VRAM for its own process, which is as expected, but one GPU, namely gpu:0 (which hosts the main process in multi-GPU training), carries 3 extra processes that take up additional VRAM. It seems the distributed worker processes are each accessing the main process's GPU. Replacing the loss with other losses never causes this, so I am fairly sure it comes from the dreamsim implementation, but I am not sure how to get around it. I spent days trying various ways to bypass it, but failed.

Has anyone run into a similar issue and resolved it?

Below is the nvidia-smi output showing what it actually looks like.

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name              GPU Memory     |
|        ID   ID                                                Usage         |
|=============================================================================|
|    0   N/A  N/A   3683677      C   python                       63888MiB    |
|    0   N/A  N/A   3683678      C   python                         526MiB    |
|    0   N/A  N/A   3683679      C   python                         526MiB    |
|    0   N/A  N/A   3683680      C   python                         526MiB    |
|    1   N/A  N/A   3683678      C   python                       63908MiB    |
|    2   N/A  N/A   3683679      C   python                       63908MiB    |
|    3   N/A  N/A   3683680      C   python                       63908MiB    |
+-----------------------------------------------------------------------------+
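For what it's worth, small (~500MiB) phantom processes on gpu:0 are the classic signature of each distributed worker creating a stray CUDA context on the default device, since a bare "cuda" device string resolves to cuda:0 in every process. A minimal sketch of a possible workaround, assuming a standard torchrun/DDP setup; the `device=` argument to `dreamsim()` is an assumption about that package's API, not something confirmed here:

```python
# Sketch: pin each DDP worker to its own GPU before any CUDA call, so no
# worker initializes a context on cuda:0. Assumes torchrun (or a similar
# launcher) sets LOCAL_RANK per worker.
import os

import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))

if torch.cuda.is_available():
    # Must run before any tensor or model is moved to "cuda"; afterwards a
    # bare "cuda" resolves to this worker's GPU instead of cuda:0.
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")
else:
    device = torch.device("cpu")  # fallback so this sketch runs on CPU-only hosts

# Hypothetical usage: pass the per-rank device explicitly rather than "cuda".
# from dreamsim import dreamsim
# model, preprocess = dreamsim(pretrained=True, device=device)
```

If dreamsim hard-codes "cuda" internally (e.g. when loading pretrained weights), calling `torch.cuda.set_device(local_rank)` early in each worker is usually enough, since it changes what "cuda" resolves to for that process.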
