Conversation

@maggie-lou
Collaborator

@maggie-lou maggie-lou commented Dec 8, 2025

While investigating corrupted snapshots that have `Failed to update balloon stats, missing descriptor` in the guest kernel logs, I noticed a pattern:

  1. A workload finishes running
  2. We try to expand the balloon to reclaim 90% of the available memory in the VM
  3. The balloon fails to fully inflate, and then deflates back down
  4. We save the snapshot
  5. The next time a workload tries to start from this snapshot, it fails to connect to the vmexec server

I think there's a bug either in our balloon handler implementation or in the Firecracker balloon code. I created a Firecracker issue to help debug this further.

In the meantime, we shouldn't save the snapshot when the balloon fails to fully inflate. Hopefully this will help reduce snapshot corruption.

@maggie-lou maggie-lou marked this pull request as ready for review December 8, 2025 21:17
@maggie-lou maggie-lou requested a review from bduffany December 9, 2025 18:25
@maggie-lou maggie-lou merged commit 6a80528 into master Dec 11, 2025
8 of 9 checks passed
@maggie-lou maggie-lou deleted the balloon_log branch December 11, 2025 22:25
maggie-lou added a commit that referenced this pull request Dec 16, 2025
…10850)" (#10936)

This fully reverts commit 6a80528. Even with #10927, snapshot
sharing was still broken, because one additional callsite of
`updateBalloon` was still propagating the error.

After this revert is cherry-picked into the release, I'll put up an
alternate fix