@erskingardner
Member

@erskingardner erskingardner commented Jan 22, 2026

Summary

This PR implements better handling for retryable message states and ensures message context (like message_event_id) is preserved when messages fail processing.

Fixes #160

Changes

  • Updated process_message in mdk-core to look up existing records and preserve message_event_id, epoch, and mls_group_id when creating a Failed state record.
  • Added logic to allow reprocessing of messages marked as Retryable.
  • Updated storage traits and implementations to support these state transitions.
  • Fixed tests to align with the new behavior.

Reasoning

Previously, failing messages might lose context about their origin, making it difficult to retry them correctly or correlate them with the original send event. This change ensures that even if processing fails, we keep the link to the original message event.

This PR adds a Retryable processed-message state and updates message processing to preserve original message context (message_event_id, epoch, mls_group_id) when failures occur, enabling safe retries and better correlation with the original send event. It also adds storage support to mark failed messages as retryable and updates tests and changelogs to reflect the new behavior.

What changed:

  • Implemented preservation of message context when constructing Failed ProcessedMessage records so message_event_id, epoch, and mls_group_id are retained (mdk-core).
  • Added support for reprocessing messages in the Retryable state: cached content is recovered and messages can transition back to Processed on success (mdk-core).
  • Introduced ProcessedMessageState::Retryable and wiring for serialization/parsing (mdk-storage-traits).
  • Added MessageStorage::mark_processed_message_retryable(&self, event_id: &EventId) -> Result<(), MessageError> to the storage trait (mdk-storage-traits).
  • Implemented mark_processed_message_retryable in MdkMemoryStorage and MdkSqliteStorage, updating persistence and preserving failure_reason when transitioning Failed -> Retryable (mdk-memory-storage, mdk-sqlite-storage).
  • Added MessageError::NotFound to represent missing messages or unexpected states (mdk-storage-traits).
  • Added and updated unit tests validating identifier preservation, Retryable transitions, and error paths (mdk-core, mdk-memory-storage, mdk-sqlite-storage).

Security impact:

  • No cryptographic or key-management changes detected; no changes to SQLCipher or file-permission handling; no new error messages that leak sensitive identifiers were introduced.

Protocol changes:

  • None to MLS protocol semantics or external protocol integrations.

API surface:

  • Breaking change: new enum variant ProcessedMessageState::Retryable is added to a non-#[non_exhaustive] enum, requiring downstream exhaustive matches to be updated (mdk-storage-traits).
  • New public trait method: MessageStorage::mark_processed_message_retryable(&self, event_id: &EventId) -> Result<(), MessageError> (mdk-storage-traits).
  • New public error variant: MessageError::NotFound (mdk-storage-traits).
  • Storage implementations now persist the Retryable state and must implement the new method (mdk-memory-storage, mdk-sqlite-storage).
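To illustrate why the new variant is a breaking change, here is a minimal sketch, not the actual mdk-storage-traits definition. The variant set is taken from the states named in this PR (Created, Processed, Failed, Retryable); the `as_str` helper and the `"created"` label are assumptions made for illustration, while `'failed'`, `'retryable'`, and `"processed"` are the labels that appear in the walkthrough.

```rust
/// Sketch of the processed-message states named in this PR.
/// The real enum in mdk-storage-traits may have other variants and derives.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProcessedMessageState {
    Created,
    Processed,
    Failed,
    Retryable,
}

impl ProcessedMessageState {
    /// Hypothetical serialization helper (name and "created" label are assumptions).
    pub fn as_str(self) -> &'static str {
        // No wildcard arm: without #[non_exhaustive], downstream code can write
        // matches like this one, and each of them stops compiling until it
        // gains an arm for a newly added variant such as Retryable.
        match self {
            ProcessedMessageState::Created => "created",
            ProcessedMessageState::Processed => "processed",
            ProcessedMessageState::Failed => "failed",
            ProcessedMessageState::Retryable => "retryable",
        }
    }
}
```

Downstream crates that match with a `_ =>` arm keep compiling, but then silently route Retryable through their fallback branch, which is worth checking during the upgrade.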

Testing:

  • Tests added/updated to verify preservation of message_event_id, epoch, and mls_group_id when saving Failed ProcessedMessage records (mdk-core).
  • Tests added for mark_processed_message_retryable covering successful Failed -> Retryable transitions and NotFound error cases (mdk-memory-storage, mdk-sqlite-storage).
  • Rollback/epoch-recovery code paths covered by updated tests to ensure Retryable flow behaves as expected.

@coderabbitai

coderabbitai bot commented Jan 22, 2026

📝 Walkthrough

Walkthrough

Adds a Retryable processed-message state, a storage API to mark failed messages retryable, and updates core message-processing failure and recovery paths to preserve and reuse message context (message_event_id, mls_group_id, epoch) across failures and retries.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Storage trait & types: crates/mdk-storage-traits/src/messages/types.rs, crates/mdk-storage-traits/src/messages/mod.rs, crates/mdk-storage-traits/src/messages/error.rs | Added ProcessedMessageState::Retryable with parsing/serde support, added MessageStorage::mark_processed_message_retryable(&EventId) signature, and introduced MessageError::NotFound. |
| Core message processing: crates/mdk-core/src/messages.rs | Preserve existing ProcessedMessage context when persisting failures; handle Retryable to attempt reprocessing from cached Message content; mark related messages retryable during rollback/epoch-change paths; added tests for context preservation and retry flows. |
| Memory storage implementation: crates/mdk-memory-storage/src/messages.rs, crates/mdk-memory-storage/CHANGELOG.md | Implemented mark_processed_message_retryable for in-memory storage and tests for success / not-found / invalid-state cases; added changelog entry. |
| SQLite storage implementation: crates/mdk-sqlite-storage/src/messages.rs, crates/mdk-sqlite-storage/CHANGELOG.md | Implemented SQL-backed mark_processed_message_retryable (updates only when state == 'failed'), with tests validating transition, NotFound on missing or non-failed rows; added changelog entry. |
| Changelogs: crates/mdk-core/CHANGELOG.md, crates/mdk-storage-traits/CHANGELOG.md, crates/mdk-sqlite-storage/CHANGELOG.md | Added "Retryable/Message Retry Support" entries; noted breaking change in storage-traits due to enum and trait addition. |

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Processor as Processor (mdk-core)
participant Storage as MessageStorage (trait impl)
participant DB as DB backend (SQLite / Memory)
Processor->>Storage: mark_processed_message_retryable(event_id)
Storage->>DB: SELECT processed_message WHERE event_id AND state='failed'
alt found failed
DB-->>Storage: processed_message row
Storage->>DB: UPDATE state='retryable' (preserve failure_reason)
DB-->>Storage: OK
Storage-->>Processor: Ok(())
else not found or not failed
DB-->>Storage: 0 rows
Storage-->>Processor: Err(MessageError::NotFound)
end
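The Failed -> Retryable transition in the diagram above could look roughly like the following in-memory sketch. It is an illustration under simplifying assumptions, not the MdkMemoryStorage code: event IDs are plain strings, the store is a hypothetical `MemoryStore` that takes `&mut self` (the real trait method takes `&self`), and the record carries only the two fields relevant here.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Processed,
    Failed,
    Retryable,
}

#[derive(Debug, PartialEq, Eq)]
pub enum MessageError {
    /// Message not found or not in expected state.
    NotFound,
}

#[derive(Debug, Clone)]
pub struct ProcessedMessage {
    pub state: State,
    pub failure_reason: Option<String>,
}

/// Hypothetical store keyed by event id (simplified to String).
#[derive(Default)]
pub struct MemoryStore {
    processed: HashMap<String, ProcessedMessage>,
}

impl MemoryStore {
    pub fn insert(&mut self, event_id: &str, record: ProcessedMessage) {
        self.processed.insert(event_id.to_string(), record);
    }

    pub fn get(&self, event_id: &str) -> Option<&ProcessedMessage> {
        self.processed.get(event_id)
    }

    /// Only a record currently in Failed moves to Retryable.
    /// failure_reason is deliberately left untouched.
    pub fn mark_processed_message_retryable(&mut self, event_id: &str) -> Result<(), MessageError> {
        match self.processed.get_mut(event_id) {
            Some(record) if record.state == State::Failed => {
                record.state = State::Retryable;
                Ok(())
            }
            // Missing record and wrong state both map to NotFound, as in the diagram.
            _ => Err(MessageError::NotFound),
        }
    }
}
```

Calling the method twice on the same record returns NotFound the second time, because the record is then Retryable rather than Failed.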

mermaid
sequenceDiagram
participant Processor as Processor (mdk-core)
participant Storage as MessageStorage
participant Cache as Message Cache / Messages table
participant DB as DB backend
Processor->>Storage: on ProcessedMessage::Retryable(event_id)
Storage->>DB: load processed_message by event_id (get message_event_id)
Storage->>Cache: load original Message by message_event_id
alt cached message found
Cache-->>Storage: Message content
Storage->>DB: mark Message as Processed
DB-->>Storage: OK
Storage->>DB: update processed_messages state -> "processed", clear failure_reason, set timestamps
DB-->>Storage: OK
Storage-->>Processor: reprocessed -> continue
else no cached content
Cache-->>Storage: Not found
Storage-->>Processor: Unprocessable(group_id)
end
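The recovery path in the diagram above can be sketched as a single function. This is a simplified model rather than the mdk-core implementation: the `retry_from_cache` name, the `RetryOutcome` type, string event IDs, a `HashMap` standing in for the messages table, and a caller-supplied `now` timestamp are all assumptions made so the example is self-contained.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProcessedState {
    Processed,
    Retryable,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MessageState {
    Created,
    Processed,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Message {
    pub content: String,
    pub state: MessageState,
}

#[derive(Debug, Clone)]
pub struct ProcessedMessage {
    pub message_event_id: Option<String>,
    pub state: ProcessedState,
    pub failure_reason: Option<String>,
    pub processed_at: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RetryOutcome {
    ApplicationMessage(Message),
    Unprocessable,
}

/// Recover a Retryable record from cached content, or report it as unprocessable.
pub fn retry_from_cache(
    record: &mut ProcessedMessage,
    cache: &mut HashMap<String, Message>,
    now: u64,
) -> RetryOutcome {
    // Without a link to the original message there is nothing to recover.
    let Some(id) = record.message_event_id.clone() else {
        return RetryOutcome::Unprocessable;
    };
    let Some(message) = cache.get_mut(&id) else {
        return RetryOutcome::Unprocessable;
    };

    // Mark the cached message as processed.
    message.state = MessageState::Processed;

    // Clear stale failure metadata so consumers do not see an old reason or timestamp.
    record.state = ProcessedState::Processed;
    record.failure_reason = None;
    record.processed_at = now;

    // Return the message already in scope instead of reading it back.
    RetryOutcome::ApplicationMessage(message.clone())
}
```

In this sketch a missing `message_event_id` and a missing cache entry are treated the same way, which matches the "no cached content" branch of the diagram.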

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

breaking-change, storage

Suggested reviewers

  • dannym-arx
  • mubarakcoded
  • jgmontoya
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main changes: implementing a retryable state for messages and preserving message context during failures, which aligns with the extensive modifications across multiple files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| No Sensitive Identifier Leakage | ✅ Passed | No sensitive identifiers (mls_group_id, nostr_group_id, GroupId, encryption keys) are exposed in logging, error messages, panic statements, or debug implementations across all modified message files. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

@github-actions

✅ Coverage: 88.74% → 89.99% (+1.25%)

@github-actions

✅ Coverage: 88.74% → 90.68% (+1.94%)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/mdk-core/src/messages.rs (1)

1787-1804: Preserve epoch/mls_group_id from the existing record when transitioning to Failed.
Right now only message_event_id is preserved; epoch and mls_group_id are overwritten with the current group values. For wrong-epoch/mismatch paths this records the wrong context and defeats the preservation goal. Consider reusing existing values with fallback to the current group.

💡 Suggested pattern (apply to the failure branches above)
-                let message_event_id = existing_record.as_ref().and_then(|r| r.message_event_id);
+                let message_event_id = existing_record.as_ref().and_then(|r| r.message_event_id);
+                let mls_group_id = existing_record
+                    .as_ref()
+                    .and_then(|r| r.mls_group_id.clone())
+                    .or_else(|| Some(group.mls_group_id.clone()));
+                let epoch = existing_record
+                    .as_ref()
+                    .and_then(|r| r.epoch)
+                    .or_else(|| Some(group.epoch));
...
-                    epoch: Some(group.epoch),
-                    mls_group_id: Some(group.mls_group_id.clone()),
+                    epoch,
+                    mls_group_id,

Also applies to: 1839-1856, 1868-1885, 1897-1914, 1939-1956
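The suggested fallback pattern can be tried in isolation with the sketch below. The `FailureContext` and `CurrentGroup` types and the `context_for_failure` function are hypothetical stand-ins (string event IDs, byte-vector group IDs); only the field names and the prefer-existing-then-fall-back logic come from the suggestion above.

```rust
/// Hypothetical subset of the fields kept on a Failed record.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FailureContext {
    pub message_event_id: Option<String>,
    pub mls_group_id: Option<Vec<u8>>,
    pub epoch: Option<u64>,
}

/// Hypothetical view of the group the message is currently being processed for.
pub struct CurrentGroup {
    pub mls_group_id: Vec<u8>,
    pub epoch: u64,
}

/// Prefer context from an existing record; fall back to the current group.
pub fn context_for_failure(
    existing: Option<&FailureContext>,
    group: &CurrentGroup,
) -> FailureContext {
    FailureContext {
        // No fallback here: a record that never had an event id stays without one.
        message_event_id: existing.and_then(|r| r.message_event_id.clone()),
        mls_group_id: existing
            .and_then(|r| r.mls_group_id.clone())
            .or_else(|| Some(group.mls_group_id.clone())),
        epoch: existing.and_then(|r| r.epoch).or(Some(group.epoch)),
    }
}
```

With an existing record from epoch 3 and a group now at epoch 7, the result keeps epoch 3, which is the case the wrong-epoch paths care about.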

🤖 Fix all issues with AI agents
In `@crates/mdk-core/src/messages.rs`:
- Around line 1615-1656: When marking a Retryable processed_message as
MessageState::Processed (in the retry branch around
processed_message.message_event_id handling), also clear/overwrite its prior
failure metadata: set processed_message.failure_reason to None (or empty) and
update processed_message.processed_at to the current timestamp before calling
self.storage().save_processed_message(processed_message.clone()). Likewise
ensure the corresponding stored message state is updated to
MessageState::Processed (already done via message.state) and persisted with
save_message; use the same timestamp source you use elsewhere (e.g., Utc::now())
so downstream consumers do not see stale failure_reason/processed_at when
ProcessedMessageState::Processed is stored.
🧹 Nitpick comments (1)
crates/mdk-memory-storage/src/messages.rs (1)

751-847: Move use statements to module scope (guideline).
In-function imports at Lines 753, 800, and 814 violate the “use statements at top of scope” rule. Consider hoisting them to the test module level.

♻️ Proposed refactor
 #[cfg(test)]
 mod tests {
     use std::collections::BTreeSet;

     use mdk_storage_traits::groups::GroupStorage;
     use mdk_storage_traits::groups::types::{Group, GroupState};
+    use mdk_storage_traits::messages::error::MessageError;
+    use mdk_storage_traits::messages::types::ProcessedMessage;
     use nostr::Keys;

     use super::*;
@@
     #[test]
     fn test_mark_processed_message_retryable() {
-        use mdk_storage_traits::messages::types::ProcessedMessage;
-
         let storage = MdkMemoryStorage::new();
@@
     #[test]
     fn test_mark_nonexistent_message_retryable_fails() {
-        use mdk_storage_traits::messages::error::MessageError;
-
         let storage = MdkMemoryStorage::new();
@@
     #[test]
     fn test_mark_non_failed_message_retryable_fails() {
-        use mdk_storage_traits::messages::error::MessageError;
-        use mdk_storage_traits::messages::types::ProcessedMessage;
-
         let storage = MdkMemoryStorage::new();
As per coding guidelines, keep `use` statements at the top of their scope.

@erskingardner erskingardner marked this pull request as draft January 22, 2026 17:44
@github-actions

✅ Coverage: 88.74% → 90.63% (+1.89%)

- Preserve message_event_id, epoch, and mls_group_id when transitioning to Failed state
- Allow reprocessing of messages in Retryable state
- Update message processing logic to handle retry scenarios
- Fix message state persistence in storage implementations
@github-actions

✅ Coverage: 88.74% → 90.63% (+1.89%)

@erskingardner erskingardner marked this pull request as ready for review January 23, 2026 07:34
@erskingardner
Member Author

@coderabbitai review

@coderabbitai

coderabbitai bot commented Jan 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@crates/mdk-core/src/messages.rs`:
- Around line 1615-1658: The current pattern "if let Some(message_event_id) =
processed_message.message_event_id && let Ok(Some(mut message)) =
self.get_message(...)" hides storage errors by treating Err as a missing
message; change this to explicitly match the result of
self.get_message(&group.mls_group_id, &message_event_id) (or use the ? operator)
so that Err(e) is propagated as a storage error (convert to Error::Message or
propagate the underlying Error) while preserving the Ok(Some(_)) branch for
using cached content and treating Ok(None) as the unprocessable case; update
references in this block (ProcessedMessageState::Retryable, get_message,
storage().save_message, save_processed_message, and
MessageProcessingResult::Unprocessable) accordingly.

In `@crates/mdk-storage-traits/CHANGELOG.md`:
- Around line 28-30: Move the "Retryable Message State" changelog entry into the
"Breaking changes" section and update the text to explicitly state that adding
the `Retryable` variant to `ProcessedMessageState` (without `#[non_exhaustive]`)
and adding the `mark_processed_message_retryable` method to the `MessageStorage`
trait are breaking changes; reference both `ProcessedMessageState::Retryable`
and the `MessageStorage::mark_processed_message_retryable` method so downstream
users know to update exhaustive matches and trait implementations.
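The first item above, about `if let Ok(Some(..))` hiding storage errors, can be reproduced with a small sketch. The `StorageResult` alias, the `Lookup` enum, and both function names are hypothetical; the point is only the difference between pattern-matching past an `Err` and propagating it.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Lookup {
    Cached(String),
    Missing,
}

/// Hypothetical storage lookup result: Err = backend failure, Ok(None) = no record.
pub type StorageResult = Result<Option<String>, String>;

/// The flagged shape: any Err falls into the same branch as Ok(None).
pub fn lookup_swallowing(result: StorageResult) -> Lookup {
    if let Ok(Some(content)) = result {
        Lookup::Cached(content)
    } else {
        Lookup::Missing
    }
}

/// The requested shape: `?` surfaces the backend failure to the caller,
/// and only Ok(None) is treated as "no cached content".
pub fn lookup_propagating(result: StorageResult) -> Result<Lookup, String> {
    match result? {
        Some(content) => Ok(Lookup::Cached(content)),
        None => Ok(Lookup::Missing),
    }
}
```

Given a backend error, the first function reports the message as missing, while the second returns the error, so a transient storage failure is no longer indistinguishable from an unrecoverable message.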

…ng changes

- Changed get_message error handling in Retryable branch from if-let pattern
  that silently swallowed Err to explicit match with proper error propagation
- Moved Retryable Message State changelog entry to Breaking changes section
- Updated changelog to explicitly reference ProcessedMessageState::Retryable
  and MessageStorage::mark_processed_message_retryable() as breaking changes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions

✅ Coverage: 88.74% → 90.6% (+1.86%)

@github-actions

✅ Coverage: 88.74% → 90.71% (+1.97%)

Contributor

@jgmontoya jgmontoya left a comment

Nice! Leaving some comments 👀

Contributor

This file is huge. Even if most of it is tests, it's still a couple thousand lines of non-test code. Is there a way we could refactor it so that it's easier to work with?

Comment on lines +1640 to +1691
// Retryable messages are ones that previously failed due to wrong epoch keys
// but have been marked for retry after a rollback. For our own messages,
// we should have cached content - try to retrieve and return it.
tracing::debug!(target: "mdk_core::messages::process_message", "Retrying own message after rollback");

if let Some(message_event_id) = processed_message.message_event_id {
    match self
        .get_message(&group.mls_group_id, &message_event_id)
        .map_err(|e| Error::Message(e.to_string()))?
    {
        Some(mut message) => {
            // Update states to mark as successfully processed
            message.state = message_types::MessageState::Processed;
            self.storage()
                .save_message(message)
                .map_err(|e| Error::Message(e.to_string()))?;

            processed_message.state =
                message_types::ProcessedMessageState::Processed;
            processed_message.failure_reason = None;
            processed_message.processed_at = Timestamp::now();
            self.storage()
                .save_processed_message(processed_message.clone())
                .map_err(|e| Error::Message(e.to_string()))?;

            tracing::info!(
                target: "mdk_core::messages::process_message",
                "Successfully retried own cached message after rollback"
            );
            let message = self
                .get_message(&group.mls_group_id, &message_event_id)
                .map_err(|e| Error::Message(e.to_string()))?
                .ok_or(Error::MessageNotFound)?;
            return Ok(MessageProcessingResult::ApplicationMessage(
                message,
            ));
        }
        None => {
            // No cached content available - fall through to Unprocessable
        }
    }
}

// No cached content available - this shouldn't happen for our own messages,
// but if it does, we can't recover
tracing::warn!(
    target: "mdk_core::messages::process_message",
    "Retryable own message has no cached content - cannot recover"
);
Ok(MessageProcessingResult::Unprocessable {
    mls_group_id: group.mls_group_id.clone(),
})
Contributor

Could this perhaps be extracted? This whole method is now over 300 lines ☠️ It looks similar to the message_types::ProcessedMessageState::Created branch 👀

Comment on lines +1669 to +1675
let message = self
    .get_message(&group.mls_group_id, &message_event_id)
    .map_err(|e| Error::Message(e.to_string()))?
    .ok_or(Error::MessageNotFound)?;
return Ok(MessageProcessingResult::ApplicationMessage(
    message,
));
Contributor

Why do we need to refetch it here? Can't we return the message we already had in scope?

Comment on lines +1491 to +1492
/// Helper to save a failed message while preserving context from existing records
/// or using explicit context if provided.
Contributor

Just above these lines there was this doc comment:

/// Saves a failed processed message record to prevent reprocessing
///

This used to "belong" to the save_failed_processed_message method. Should we move it back there?

Comment on lines +12 to +13
/// Message not found or not in expected state
NotFound,
Contributor

Should we maybe have two distinct errors (NotFound / InvalidState) instead of using NotFound for both?
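One possible shape for that split, as a sketch only: the `MarkRetryableError` type, its `InvalidState` payload, and the `next_state` helper are hypothetical and not part of the PR, which uses a single MessageError::NotFound for both cases.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Processed,
    Failed,
    Retryable,
}

/// Hypothetical error type that separates the two failure modes.
#[derive(Debug, PartialEq, Eq)]
pub enum MarkRetryableError {
    /// No processed-message record exists for the event id.
    NotFound,
    /// A record exists but is not in Failed; carries the state that was found.
    InvalidState(State),
}

/// Decide the transition from the record's current state (None = no record).
pub fn next_state(current: Option<State>) -> Result<State, MarkRetryableError> {
    match current {
        None => Err(MarkRetryableError::NotFound),
        Some(State::Failed) => Ok(State::Retryable),
        Some(other) => Err(MarkRetryableError::InvalidState(other)),
    }
}
```

The trade-off is that a backend which issues one conditional update and checks the affected row count, as the SQLite implementation described above appears to do, would need an extra lookup to tell the two cases apart.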

}

#[test]
fn test_save_failed_processed_message_preserves_message_event_id() {
Contributor

Nice test! How about also asserting that failure_reason is set to the expected sanitized value ("own_message" per classify_failure_reason)? This would catch regressions if someone changes the sanitization logic.

Development

Successfully merging this pull request may close these issues.

Add "Retryable" state for failed messages.

3 participants