Conversation

@tvukovic-amd
Adds a memory precheck before VAE decode to prevent Windows 0xC0000005 access violation crashes, particularly on devices with limited VRAM.

Problem

VAE decode could trigger 0xC0000005 (access violation) crashes when:

  • GPU memory was insufficient for full decode
  • Models couldn't be offloaded to CPU (due to --highvram, --gpu-only, or insufficient CPU RAM)

The existing OOM exception handling couldn't catch these crashes because they occur at the driver/system level before PyTorch can raise an exception.

Solution

Added a proactive memory check (use_tiled_vae_decode()) that evaluates memory conditions before attempting decode:

  1. Check if GPU has enough space for full decode (with reserves)
  2. Check if models can be offloaded to CPU (respects --highvram, --gpu-only flags)
  3. Check if CPU has enough space to receive offloaded models (respects --disable-smart-memory)

If any condition fails, switch to tiled VAE decode preemptively.
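For illustration, a minimal sketch of the shape such a precheck could take. The helper names (estimate_decode_vram, get_free_vram, can_offload_to_cpu, cpu_free_ram, offload_size) and the reserve constant are hypothetical stand-ins, not the functions used in the PR:

```python
# Hypothetical sketch of the precheck described above; helper names are
# illustrative stand-ins, not the PR's actual code.
def use_tiled_vae_decode(vae, samples, device):
    required = estimate_decode_vram(vae, samples)   # estimated VRAM for a full decode
    reserve = 1 * 1024**3                           # headroom kept free on the GPU

    # 1. Enough free VRAM for a full decode (plus reserve)?
    if get_free_vram(device) >= required + reserve:
        return False                                # full decode is safe

    # 2. Can other models be offloaded at all?
    #    (--highvram / --gpu-only pin them to the GPU)
    if not can_offload_to_cpu():
        return True                                 # fall back to tiled decode

    # 3. Does system RAM have room to receive the offloaded models?
    #    (--disable-smart-memory affects this)
    if cpu_free_ram() < offload_size():
        return True

    return False
```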

@rattus128
Contributor

We have made a lot of VAE VRAM fixes recently so pre-emptively tiling is going to false positive without a fairly major audit of the VAE estimates. Some of them (like LTX) have non-trivial non-linear VRAM consumption patterns.

The comfy-aimdo project is trying to take things the other way and control the allocator under VRAM pressure to get us away from the maintenance of accurate model estimates.

#11845
https://pypi.org/project/comfy-aimdo/

No AMD support yet though.

Is there a path forward on pytorch being able to allocate with clean exception on OOM?

Which VAEs are the worst offenders and how much VRAM are you generally trying to support?

@MeiYi-dev commented Jan 29, 2026

> Which VAEs are the worst offenders and how much VRAM are you generally trying to support?

Not the OP, but the most VRAM is consumed by the video VAEs; all the new image VAEs are very efficient by default. On another note: even though you recently reduced the LTX 2 VAE's VRAM consumption by a third, ComfyUI still offloads the whole model before decoding with the LTX 2 VAE (likely because the VAE memory estimation wasn't updated). This is critical for 16 GB VRAM users, since the model and text encoders cannot be kept within 32 GB of RAM and the model spills onto the pagefile. A custom node that sets how much of the model to offload before decoding would be very nice to have, since the LTX 2 VAE takes only about 3 GB of VRAM with tiled decoding, yet ComfyUI still offloads the whole model into RAM and the pagefile.
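Purely as a sketch of the kind of node being asked for, assuming ComfyUI's comfy.model_management.free_memory(bytes, device) helper and the standard custom-node class conventions; the node itself is hypothetical and does not exist today:

```python
import comfy.model_management as mm

# Hypothetical custom node: free only a requested amount of VRAM before a VAE
# decode, instead of relying on ComfyUI offloading the entire diffusion model
# into RAM/pagefile. Names and behaviour here are illustrative only.
class PartialOffloadBeforeDecode:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {
            "samples": ("LATENT",),
            "free_gb": ("FLOAT", {"default": 3.0, "min": 0.0, "max": 64.0}),
        }}

    RETURN_TYPES = ("LATENT",)
    FUNCTION = "run"
    CATEGORY = "latent"

    def run(self, samples, free_gb):
        device = mm.get_torch_device()
        # Ask the model manager to free roughly what the VAE decode will need.
        mm.free_memory(free_gb * (1024 ** 3), device)
        return (samples,)
```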

# NOTE: We don't know what tensors were allocated to stack variables at the time of the
# exception, and the exception itself refs them all until we get out of this except block.
# So we just set a flag for tiler fallback so that tensor gc can happen once the
# exception is fully off the books.
@asagi4 (Contributor) commented Jan 29, 2026
I think you need to set do_tile = True here to actually do the tiled VAE retry.
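For illustration, a minimal sketch of the flag-based retry the NOTE above describes, including the do_tile assignment being pointed out here; decode_full/decode_tiled, vae, and samples are placeholders, not the PR's actual names:

```python
import torch

# Sketch of the flag-based fallback: don't retry inside the except block, where
# the in-flight exception still references the partially allocated tensors; just
# record the decision and retry after the block, once those tensors can be freed.
do_tile = False
try:
    images = vae.decode_full(samples)      # placeholder for the full decode path
except torch.cuda.OutOfMemoryError:
    do_tile = True                         # the missing assignment noted above

if do_tile:
    torch.cuda.empty_cache()               # dead tensors are now collectable
    images = vae.decode_tiled(samples)     # placeholder for the tiled fallback
```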

I think this patch would be fairly helpful on AMD especially. Some VAE VRAM estimates with AMD seem to be kind of bonkers; the Flux VAE requests 11.6GB of VRAM to decode a 1 megapixel image and somehow I don't think it actually uses anywhere near that much.

EDIT: I just did a quick memory dump after a VAE decode. Torch maximum memory usage was about 6.6GB, and that would probably include the loaded VAE model and anything else that might be in VRAM. I'm not sure how to accurately tell what the actual VAE decoding used, but clearly not 11.6GB.
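For what it's worth, a sketch of one way to bracket just the decode and read the peak (PyTorch keeps the torch.cuda namespace under ROCm, so this applies on AMD too); vae.decode(latent) is a placeholder for whatever call is being measured:

```python
import torch

device = torch.device("cuda")
torch.cuda.reset_peak_memory_stats(device)   # reset *after* the model is loaded

images = vae.decode(latent)                  # placeholder for the decode being measured
torch.cuda.synchronize(device)

peak_alloc = torch.cuda.max_memory_allocated(device)    # tensors actually allocated
peak_reserved = torch.cuda.max_memory_reserved(device)  # including allocator cache
print(f"peak allocated: {peak_alloc / 2**30:.2f} GiB, "
      f"peak reserved: {peak_reserved / 2**30:.2f} GiB")
```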
