Skip to content

Conversation

@yaroslav-tykhonchuk
Copy link
Collaborator

Problem

We had a race condition during quarantine that could create duplicate messages in quarantine with a reference to one blob.

Problem Analysis

Failure Scenario:

  1. Message with BlobRef quarantined → quarantine queue has copy
  2. DeleteMessage fails (service crash/network issue) → message stays in main queue
  3. Message becomes visible again → dequeued with DequeueCount++
  4. Same message quarantined again → 2 identical messages in quarantine with same blob ref
  5. First quarantine message processed → blob deleted
  6. Second quarantine message processed → blob not found → poison message

Solution

Create a new blob on message quarantine and copy its content there. This way - every message in quarantine queue has a unique ref to blob. We do not deserialize message content - we check if it's a blob or not.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses a race condition during message quarantine where duplicate messages could reference the same blob, leading to "blob not found" errors when processing the second quarantine message. The solution creates a copy of the blob content for each quarantined message, ensuring unique blob references.

Key Changes:

  • Modified QuarantineMessage to detect blob-backed messages and copy blob content before quarantining
  • Added CopyBlobForQuarantine method to create new blob copies with unique names
  • Updated message deletion logic to use the original blob reference

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Member

@vansha vansha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. On the one hand, it feels safer to duplicate blob content and not have to worry later when receiving a message without a blob (is it empty due to a bug?).

On the other hand, we are intentionally doing a wasteful job by copying/duplicating (and removing) blob content and making the code more complicated.

Perhaps we could ignore those messages with missing blobs in the client code?

@yaroslav-tykhonchuk
Copy link
Collaborator Author

yaroslav-tykhonchuk commented Nov 18, 2025

I don't know. On the one hand, it feels safer to duplicate blob content and not have to worry later when receiving a message without a blob (is it empty due to a bug?).

@vansha It's not empty. Please read the PR description. We can have 2 (or more, but less likely) duplicate messages in quarantine with a reference to a single blob. When the first message is processed, we delete the blob. Then the second message has an orphan reference.

@tarasparfeniuk
Copy link

On the other hand, we are intentionally doing a wasteful job by copying/duplicating (and removing) blob content and making the code more complicated.

I think this solution is a good middle ground -- it doesn't bring too much complexity, but preserves safety. I'm a bit scared of the possibility of a non-existent blob in the message. If we skip them on the client side, we can easily miss another bug (if it appears), being left with no chance to check what the message content is

@yaroslav-tykhonchuk
Copy link
Collaborator Author

yaroslav-tykhonchuk commented Nov 18, 2025

We have very few cases of quarantining the message during Receive (because we also try to quarantine the message during Flush in TimerBatch). And not all messages have refs to blobs.
image

So this operation is very rare by design. I think it's fine for us to download/upload the blob content once in a while, but be sure to not loose the data.

 We had a race condition during quarantine, that could create duplicate messages in quarantine with a reference to one blob.

Problem Analysis

Failure Scenario:
  1. Message with BlobRef quarantined → quarantine queue has copy
  2. DeleteMessage fails (service crash/network issue) → message stays in main queue
  3. Message becomes visible again → dequeued with DequeueCount++
  4. Same message quarantined again → 2 identical messages in quarantine with same blob ref
  5. First quarantine message processed → blob deleted
  6. Second quarantine message processed → blob not found → poison message

Solution: create new blob on quarantine and copy its content there
@yaroslav-tykhonchuk yaroslav-tykhonchuk force-pushed the prevent-duplicate-blob-refs-in-quarantine branch from c532b8e to 09e352e Compare November 20, 2025 09:57
@yaroslav-tykhonchuk yaroslav-tykhonchuk merged commit 7365e67 into main Nov 20, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants