Description
First, I want to say thanks to @city96 for your GGUF loaders for CLIP and Model. They are a godsend.
I am running a 7900XTX with ROCm 7.1 on Windows, and I found no way to accelerate FP8: the weights auto-convert to BF16, which uses 2 bytes per parameter and lots of VRAM. And ROCm isn't good at VRAM management; AMD is improving, but slowly.
With your GGUF loader my 7900XTX can use its INT8 hardware acceleration, which finally gets the memory footprint down and stabilizes workflow execution. At least with Zimage, it works so much better using GGUF for both CLIP and Model.
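For a sense of the footprint difference, here is a quick back-of-the-envelope in Python (the 6B parameter count is just an illustrative placeholder, not any specific model): BF16 stores 2 bytes per parameter, while GGUF Q8_0 stores blocks of 32 INT8 weights plus one FP16 scale, about 1.06 bytes per parameter.

```python
# Rough weight-memory estimate; the parameter count is a placeholder.
params = 6e9  # hypothetical 6B-parameter diffusion model

bf16_bytes = params * 2        # BF16: 2 bytes per parameter
q8_0_bytes = params * 34 / 32  # GGUF Q8_0: 32 int8 weights + 1 fp16 scale per block

print(f"BF16: {bf16_bytes / 2**30:.1f} GiB")  # ~11.2 GiB
print(f"Q8_0: {q8_0_bytes / 2**30:.1f} GiB")  # ~5.9 GiB
```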
VAE Issues under ROCm
A persistent issue I have with ROCm acceleration is poor performance on VAE decode. I'm under the impression that VAE decode is almost instant under Nvidia CUDA, which may be why nobody seems to have looked into GGUF quantization for VAE models: as far as I can tell there are no VAE quants out there, only FP32, FP16, or BF16 models.
On AMD ROCm, VAE decode is a slow and expensive step requiring lots of extra VRAM and causing RAM spillage.
On ROCm 6.4 I found a workaround; on 7.1, Flux and Zimage Turbo VAE decode work, even though they spill into RAM even at a moderate resolution of 1024px.
The Qwen Image VAE for some reason requires much more memory: even at 1024px it fills the 24GB of VRAM and 64GB of RAM on my system, then fails with an OOM and a segmentation fault.
VAE GGUF Loader
Would it be possible to have a VAE GGUF loader that feeds the VAE Encode and VAE Decode nodes, along with GGUF quantization support for VAE models?
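To make the request concrete, here is an untested sketch of what such a node might look like. It assumes quantized VAE .gguf files with ComfyUI-compatible tensor names exist, and it leans on gguf-py's GGUFReader and quants.dequantize helpers. As written it dequantizes everything to FP32 at load time, so it would mainly save disk space and load bandwidth; keeping tensors quantized in VRAM would need on-the-fly dequantization like your existing Model/CLIP loaders do.

```python
# Untested sketch of a GGUF VAE loader node for ComfyUI.
# Assumptions: VAE .gguf files sit in the standard "vae" folder and use
# tensor names that comfy.sd.VAE already understands.
import torch
import gguf  # gguf-py, used here to read and dequantize the file

import comfy.sd
import folder_paths


class VAELoaderGGUF:
    @classmethod
    def INPUT_TYPES(cls):
        files = [f for f in folder_paths.get_filename_list("vae") if f.endswith(".gguf")]
        return {"required": {"vae_name": (files,)}}

    RETURN_TYPES = ("VAE",)
    FUNCTION = "load_vae"
    CATEGORY = "loaders"

    def load_vae(self, vae_name):
        path = folder_paths.get_full_path("vae", vae_name)
        reader = gguf.GGUFReader(path)
        sd = {}
        for t in reader.tensors:
            # Dequantize to float32 on the CPU; GGUF stores dims reversed
            # relative to torch, so flatten and reshape accordingly.
            data = gguf.quants.dequantize(t.data, t.tensor_type)
            shape = tuple(int(d) for d in reversed(t.shape))
            sd[t.name] = torch.from_numpy(data.copy()).reshape(shape)
        return (comfy.sd.VAE(sd=sd),)


NODE_CLASS_MAPPINGS = {"VAELoaderGGUF": VAELoaderGGUF}
NODE_DISPLAY_NAME_MAPPINGS = {"VAELoaderGGUF": "VAE Loader (GGUF)"}
```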