Description
DreamSim seems to work fine as a perceptual loss, as demonstrated by the demo in README.md.
However, with multi-GPU training, although it runs without raising errors, there seems to be an issue.
I ran a 4-GPU job. Each GPU holds about 60 GB of VRAM for its own process, which is as expected, but gpu:0 (the main process in multi-GPU training) also carries three extra processes that consume additional VRAM. It looks as if the distributed processes each allocate something on the main process's GPU. Replacing the loss with other losses never causes this, so I am fairly sure it is related to the DreamSim implementation, but I am not sure how to work around it. I spent days trying various ways to bypass it, without success.
I was wondering if anyone has run into a similar issue and resolved it.
Below is an nvidia-smi snapshot of what this looks like.
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                                    Usage |
|=========================================================================================|
|    0   N/A  N/A   3683677      C   python                                      63888MiB |
|    0   N/A  N/A   3683678      C   python                                        526MiB |
|    0   N/A  N/A   3683679      C   python                                        526MiB |
|    0   N/A  N/A   3683680      C   python                                        526MiB |
|    1   N/A  N/A   3683678      C   python                                      63908MiB |
|    2   N/A  N/A   3683679      C   python                                      63908MiB |
|    3   N/A  N/A   3683680      C   python                                      63908MiB |
+-----------------------------------------------------------------------------------------+
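For what it's worth, a small (~500 MiB) extra allocation on gpu:0 from every worker is typically the signature of each process creating a CUDA context on the default device `cuda:0` before it is pinned to its own GPU, e.g. if the loss model is constructed with `device="cuda"` instead of the worker's local device. A minimal sketch of the usual mitigation, assuming a `torchrun`-style `LOCAL_RANK` environment variable and DreamSim's `device` keyword (both are assumptions, not confirmed from the repo):

```python
import os

def local_rank_from_env() -> int:
    # torchrun exports LOCAL_RANK for every worker; default to 0 otherwise.
    return int(os.environ.get("LOCAL_RANK", "0"))

def pick_device(local_rank: int) -> str:
    # Device string each DDP worker should pin itself to, so no worker
    # ever touches cuda:0 implicitly.
    return f"cuda:{local_rank}"

# Hypothetical usage inside each worker, before any CUDA allocation
# (the dreamsim() call signature is assumed from its README):
#
#   import torch
#   from dreamsim import dreamsim
#   rank = local_rank_from_env()
#   torch.cuda.set_device(rank)  # pin default device first
#   model, preprocess = dreamsim(pretrained=True, device=pick_device(rank))
```

Whether this matches the actual cause in the DreamSim code would need checking, but it is the first thing to rule out.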