feat: implement retryable state and preserve message context on failure #161

erskingardner · 2026-01-22T17:23:24Z

Summary

This PR implements better handling for retryable message states and ensures message context (like message_event_id) is preserved when messages fail processing.

Fixes #160

Changes

Updated process_message in mdk-core to look up existing records and preserve message_event_id, epoch, and mls_group_id when creating a Failed state record.
Added logic to allow reprocessing of messages marked as Retryable.
Updated storage traits and implementations to support these state transitions.
Fixed tests to align with the new behavior.

Reasoning

Previously, failing messages might lose context about their origin, making it difficult to retry them correctly or correlate them with the original send event. This change ensures that even if processing fails, we keep the link to the original message event.

This PR adds a Retryable processed-message state and updates message processing to preserve original message context (message_event_id, epoch, mls_group_id) when failures occur, enabling safe retries and better correlation with the original send event. It also adds storage support to mark failed messages as retryable and updates tests and changelogs to reflect the new behavior.

What changed:

Implemented preservation of message context when constructing Failed ProcessedMessage records so message_event_id, epoch, and mls_group_id are retained (mdk-core).
Added support for reprocessing messages in the Retryable state: cached content is recovered and messages can transition back to Processed on success (mdk-core).
Introduced ProcessedMessageState::Retryable and wiring for serialization/parsing (mdk-storage-traits).
Added MessageStorage::mark_processed_message_retryable(&self, event_id: &EventId) -> Result<(), MessageError> to the storage trait (mdk-storage-traits).
Implemented mark_processed_message_retryable in MdkMemoryStorage and MdkSqliteStorage, updating persistence and preserving failure_reason when transitioning Failed -> Retryable (mdk-memory-storage, mdk-sqlite-storage).
Added MessageError::NotFound to represent missing messages or unexpected states (mdk-storage-traits).
Added and updated unit tests validating identifier preservation, Retryable transitions, and error paths (mdk-core, mdk-memory-storage, mdk-sqlite-storage).

Security impact:

No cryptographic or key-management changes detected; no changes to SQLCipher or file-permission handling; no new error messages that leak sensitive identifiers were introduced.

Protocol changes:

None to MLS protocol semantics or external protocol integrations.

API surface:

Breaking change: new enum variant ProcessedMessageState::Retryable is added to a non-#[non_exhaustive] enum, requiring downstream exhaustive matches to be updated (mdk-storage-traits).
New public trait method: MessageStorage::mark_processed_message_retryable(&self, event_id: &EventId) -> Result<(), MessageError> (mdk-storage-traits).
New public error variant: MessageError::NotFound (mdk-storage-traits).
Storage implementations now persist the Retryable state and must implement the new method (mdk-memory-storage, mdk-sqlite-storage).

Testing:

Tests added/updated to verify preservation of message_event_id, epoch, and mls_group_id when saving Failed ProcessedMessage records (mdk-core).
Tests added for mark_processed_message_retryable covering successful Failed -> Retryable transitions and NotFound error cases (mdk-memory-storage, mdk-sqlite-storage).
Rollback/epoch-recovery code paths covered by updated tests to ensure Retryable flow behaves as expected.

coderabbitai · 2026-01-22T17:23:45Z

📝 Walkthrough

Walkthrough

Adds a Retryable processed-message state, a storage API to mark failed messages retryable, and updates core message-processing failure and recovery paths to preserve and reuse message context (message_event_id, mls_group_id, epoch) across failures and retries.

Changes

Cohort / File(s)	Summary
Storage trait & types `crates/mdk-storage-traits/src/messages/types.rs`, `crates/mdk-storage-traits/src/messages/mod.rs`, `crates/mdk-storage-traits/src/messages/error.rs`	Added `ProcessedMessageState::Retryable` with parsing/serde support, added `MessageStorage::mark_processed_message_retryable(&EventId)` signature, and introduced `MessageError::NotFound`.
Core message processing `crates/mdk-core/src/messages.rs`	Preserve existing ProcessedMessage context when persisting failures; handle `Retryable` to attempt reprocessing from cached Message content; mark related messages retryable during rollback/epoch-change paths; added tests for context preservation and retry flows.
Memory storage implementation `crates/mdk-memory-storage/src/messages.rs`, `crates/mdk-memory-storage/CHANGELOG.md`	Implemented `mark_processed_message_retryable` for in-memory storage and tests for success / not-found / invalid-state cases; added changelog entry.
SQLite storage implementation `crates/mdk-sqlite-storage/src/messages.rs`, `crates/mdk-sqlite-storage/CHANGELOG.md`	Implemented SQL-backed `mark_processed_message_retryable` (updates only when state == 'failed'), with tests validating transition, NotFound on missing or non-failed rows; added changelog entry.
Changelogs `crates/mdk-core/CHANGELOG.md`, `crates/mdk-storage-traits/CHANGELOG.md`, `crates/mdk-sqlite-storage/CHANGELOG.md`	Added "Retryable/Message Retry Support" entries; noted breaking change in storage-traits due to enum and trait addition.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Processor as Processor (mdk-core)
participant Storage as MessageStorage (trait impl)
participant DB as DB backend (SQLite / Memory)
Processor->>Storage: mark_processed_message_retryable(event_id)
Storage->>DB: SELECT processed_message WHERE event_id AND state='failed'
alt found failed
DB-->>Storage: processed_message row
Storage->>DB: UPDATE state='retryable' (preserve failure_reason)
DB-->>Storage: OK
Storage-->>Processor: Ok(())
else not found or not failed
DB-->>Storage: 0 rows
Storage-->>Processor: Err(MessageError::NotFound)
end

mermaid
sequenceDiagram
participant Processor as Processor (mdk-core)
participant Storage as MessageStorage
participant Cache as Message Cache / Messages table
participant DB as DB backend
Processor->>Storage: on ProcessedMessage::Retryable(event_id)
Storage->>DB: load processed_message by event_id (get message_event_id)
Storage->>Cache: load original Message by message_event_id
alt cached message found
Cache-->>Storage: Message content
Storage->>DB: mark Message as Processed
DB-->>Storage: OK
Storage->>DB: update processed_messages state -> "processed", clear failure_reason, set timestamps
DB-->>Storage: OK
Storage-->>Processor: reprocessed -> continue
else no cached content
Cache-->>Storage: Not found
Storage-->>Processor: Unprocessable(group_id)
end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Fix: Return Unprocessable instead of error for previously failed messages #156 — Related changes to core failure handling and previously-failed message behavior; overlaps retry/state flow adjustments.
Fix: Persist failed message processing state to prevent DoS via repeated reprocessing #116 — Modifies message persistence and failure paths in core processing; closely related to context-preservation changes.
Fix: Add admin authorization check for MLS commit messages #130 — Adjusts persistence of sanitized/failed ProcessedMessage records; relates to how failures are saved and recovered.

Suggested labels

breaking-change, storage

Suggested reviewers

dannym-arx
mubarakcoded
jgmontoya

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main changes: implementing a retryable state for messages and preserving message context during failures, which aligns with the extensive modifications across multiple files.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
No Sensitive Identifier Leakage	✅ Passed	No sensitive identifiers (mls_group_id, nostr_group_id, GroupId, encryption keys) are exposed in logging, error messages, panic statements, or debug implementations across all modified message files.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-01-22T17:26:57Z

✅ Coverage: 88.74% → 89.99% (+1.25%)

github-actions · 2026-01-22T17:33:56Z

✅ Coverage: 88.74% → 90.68% (+1.94%)

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/mdk-core/src/messages.rs (1)

1787-1804: Preserve epoch/mls_group_id from the existing record when transitioning to Failed.
Right now only message_event_id is preserved; epoch and mls_group_id are overwritten with the current group values. For wrong-epoch/mismatch paths this records the wrong context and defeats the preservation goal. Consider reusing existing values with fallback to the current group.

💡 Suggested pattern (apply to the failure branches above)

-                let message_event_id = existing_record.as_ref().and_then(|r| r.message_event_id);
+                let message_event_id = existing_record.as_ref().and_then(|r| r.message_event_id);
+                let mls_group_id = existing_record
+                    .as_ref()
+                    .and_then(|r| r.mls_group_id.clone())
+                    .or_else(|| Some(group.mls_group_id.clone()));
+                let epoch = existing_record
+                    .as_ref()
+                    .and_then(|r| r.epoch)
+                    .or_else(|| Some(group.epoch));
...
-                    epoch: Some(group.epoch),
-                    mls_group_id: Some(group.mls_group_id.clone()),
+                    epoch,
+                    mls_group_id,

Also applies to: 1839-1856, 1868-1885, 1897-1914, 1939-1956

🤖 Fix all issues with AI agents

In `@crates/mdk-core/src/messages.rs`:
- Around line 1615-1656: When marking a Retryable processed_message as
MessageState::Processed (in the retry branch around
processed_message.message_event_id handling), also clear/overwrite its prior
failure metadata: set processed_message.failure_reason to None (or empty) and
update processed_message.processed_at to the current timestamp before calling
self.storage().save_processed_message(processed_message.clone()). Likewise
ensure the corresponding stored message state is updated to
MessageState::Processed (already done via message.state) and persisted with
save_message; use the same timestamp source you use elsewhere (e.g., Utc::now())
so downstream consumers do not see stale failure_reason/processed_at when
ProcessedMessageState::Processed is stored.

🧹 Nitpick comments (1)

crates/mdk-memory-storage/src/messages.rs (1)

751-847: Move use statements to module scope (guideline).
In-function imports at Lines 753, 800, and 814 violate the “use statements at top of scope” rule. Consider hoisting them to the test module level.

♻️ Proposed refactor

 #[cfg(test)]
 mod tests {
     use std::collections::BTreeSet;

     use mdk_storage_traits::groups::GroupStorage;
     use mdk_storage_traits::groups::types::{Group, GroupState};
+    use mdk_storage_traits::messages::error::MessageError;
+    use mdk_storage_traits::messages::types::ProcessedMessage;
     use nostr::Keys;

     use super::*;
@@
     #[test]
     fn test_mark_processed_message_retryable() {
-        use mdk_storage_traits::messages::types::ProcessedMessage;
-
         let storage = MdkMemoryStorage::new();
@@
     #[test]
     fn test_mark_nonexistent_message_retryable_fails() {
-        use mdk_storage_traits::messages::error::MessageError;
-
         let storage = MdkMemoryStorage::new();
@@
     #[test]
     fn test_mark_non_failed_message_retryable_fails() {
-        use mdk_storage_traits::messages::error::MessageError;
-        use mdk_storage_traits::messages::types::ProcessedMessage;
-
         let storage = MdkMemoryStorage::new();

As per coding guidelines, keep `use` statements at the top of their scope.

github-actions · 2026-01-23T07:25:11Z

✅ Coverage: 88.74% → 90.63% (+1.89%)

- Preserve message_event_id, epoch, and mls_group_id when transitioning to Failed state - Allow reprocessing of messages in Retryable state - Update message processing logic to handle retry scenarios - Fix message state persistence in storage implementations

github-actions · 2026-01-23T07:29:50Z

✅ Coverage: 88.74% → 90.63% (+1.89%)

erskingardner · 2026-01-23T07:34:16Z

@coderabbitai review

coderabbitai · 2026-01-23T07:34:24Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@crates/mdk-core/src/messages.rs`:
- Around line 1615-1658: The current pattern "if let Some(message_event_id) =
processed_message.message_event_id && let Ok(Some(mut message)) =
self.get_message(...)" hides storage errors by treating Err as a missing
message; change this to explicitly match the result of
self.get_message(&group.mls_group_id, &message_event_id) (or use the ? operator)
so that Err(e) is propagated as a storage error (convert to Error::Message or
propagate the underlying Error) while preserving the Ok(Some(_)) branch for
using cached content and treating Ok(None) as the unprocessable case; update
references in this block (ProcessedMessageState::Retryable, get_message,
storage().save_message, save_processed_message, and
MessageProcessingResult::Unprocessable) accordingly.

In `@crates/mdk-storage-traits/CHANGELOG.md`:
- Around line 28-30: Move the "Retryable Message State" changelog entry into the
"Breaking changes" section and update the text to explicitly state that adding
the `Retryable` variant to `ProcessedMessageState` (without `#[non_exhaustive]`)
and adding the `mark_processed_message_retryable` method to the `MessageStorage`
trait are breaking changes; reference both `ProcessedMessageState::Retryable`
and the `MessageStorage::mark_processed_message_retryable` method so downstream
users know to update exhaustive matches and trait implementations.

crates/mdk-core/src/messages.rs

crates/mdk-storage-traits/CHANGELOG.md

…ng changes - Changed get_message error handling in Retryable branch from if-let pattern that silently swallowed Err to explicit match with proper error propagation - Moved Retryable Message State changelog entry to Breaking changes section - Updated changelog to explicitly reference ProcessedMessageState::Retryable and MessageStorage::mark_processed_message_retryable() as breaking changes Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

github-actions · 2026-01-23T08:14:05Z

✅ Coverage: 88.74% → 90.6% (+1.86%)

github-actions · 2026-01-23T08:48:16Z

✅ Coverage: 88.74% → 90.71% (+1.97%)

jgmontoya

Nice! Leaving some comments 👀

jgmontoya · 2026-01-26T03:08:46Z

crates/mdk-core/src/messages.rs

This file is huge. Even if most of it is tests it's still a couple thousand lines of non-test code. Is there a way we could refactor it so that its easier to work with?

jgmontoya · 2026-01-26T03:10:32Z

crates/mdk-core/src/messages.rs

+                        // Retryable messages are ones that previously failed due to wrong epoch keys
+                        // but have been marked for retry after a rollback. For our own messages,
+                        // we should have cached content - try to retrieve and return it.
+                        tracing::debug!(target: "mdk_core::messages::process_message", "Retrying own message after rollback");
+
+                        if let Some(message_event_id) = processed_message.message_event_id {
+                            match self
+                                .get_message(&group.mls_group_id, &message_event_id)
+                                .map_err(|e| Error::Message(e.to_string()))?
+                            {
+                                Some(mut message) => {
+                                    // Update states to mark as successfully processed
+                                    message.state = message_types::MessageState::Processed;
+                                    self.storage()
+                                        .save_message(message)
+                                        .map_err(|e| Error::Message(e.to_string()))?;
+
+                                    processed_message.state =
+                                        message_types::ProcessedMessageState::Processed;
+                                    processed_message.failure_reason = None;
+                                    processed_message.processed_at = Timestamp::now();
+                                    self.storage()
+                                        .save_processed_message(processed_message.clone())
+                                        .map_err(|e| Error::Message(e.to_string()))?;
+
+                                    tracing::info!(
+                                        target: "mdk_core::messages::process_message",
+                                        "Successfully retried own cached message after rollback"
+                                    );
+                                    let message = self
+                                        .get_message(&group.mls_group_id, &message_event_id)
+                                        .map_err(|e| Error::Message(e.to_string()))?
+                                        .ok_or(Error::MessageNotFound)?;
+                                    return Ok(MessageProcessingResult::ApplicationMessage(
+                                        message,
+                                    ));
+                                }
+                                None => {
+                                    // No cached content available - fall through to Unprocessable
+                                }
+                            }
+                        }
+
+                        // No cached content available - this shouldn't happen for our own messages,
+                        // but if it does, we can't recover
+                        tracing::warn!(
+                            target: "mdk_core::messages::process_message",
+                            "Retryable own message has no cached content - cannot recover"
+                        );
+                        Ok(MessageProcessingResult::Unprocessable {
+                            mls_group_id: group.mls_group_id.clone(),
+                        })


Could this perhaps be extracted? This whole method is now over 300 lines ☠️ It looks similar to the message_types::ProcessedMessageState::Created branch 👀

jgmontoya · 2026-01-26T03:11:08Z

crates/mdk-core/src/messages.rs

+                                    let message = self
+                                        .get_message(&group.mls_group_id, &message_event_id)
+                                        .map_err(|e| Error::Message(e.to_string()))?
+                                        .ok_or(Error::MessageNotFound)?;
+                                    return Ok(MessageProcessingResult::ApplicationMessage(
+                                        message,
+                                    ));


Why do we need to refetch it here? Can't we return the message we already had in scope?

jgmontoya · 2026-01-26T03:16:48Z

crates/mdk-core/src/messages.rs

+    /// Helper to save a failed message while preserving context from existing records
+    /// or using explicit context if provided.


Just above these lines there was this piece of comment:

/// Saves a failed processed message record to prevent reprocessing ///

This used to "belong" to the save_failed_processed_message method. Should we move it back there?

jgmontoya · 2026-01-26T03:19:43Z

crates/mdk-storage-traits/src/messages/error.rs

+    /// Message not found or not in expected state
+    NotFound,


Should we maybe have two distinct errors (NotFound / InvalidState) instead of using NotFound for both?

jgmontoya · 2026-01-26T03:21:27Z

crates/mdk-core/src/messages.rs

    }

+    #[test]
+    fn test_save_failed_processed_message_preserves_message_event_id() {


Nice test! How about also asserting that failure_reason is set to the expected sanitized value ("own_message" per classify_failure_reason). This would catch regressions if someone changes the sanitization logic.

erskingardner force-pushed the fix-retryable-state branch from fd4fc49 to 067ca95 Compare January 22, 2026 17:31

coderabbitai bot requested changes Jan 22, 2026

View reviewed changes

erskingardner marked this pull request as draft January 22, 2026 17:44

erskingardner force-pushed the fix-retryable-state branch from 067ca95 to d5d1b8d Compare January 23, 2026 07:22

erskingardner force-pushed the fix-retryable-state branch from d5d1b8d to b0c6793 Compare January 23, 2026 07:27

erskingardner marked this pull request as ready for review January 23, 2026 07:34

coderabbitai bot approved these changes Jan 23, 2026

View reviewed changes

coderabbitai bot requested changes Jan 23, 2026

View reviewed changes

crates/mdk-core/src/messages.rs Show resolved Hide resolved

crates/mdk-storage-traits/CHANGELOG.md Outdated Show resolved Hide resolved

coderabbitai bot approved these changes Jan 23, 2026

View reviewed changes

refactor(core): use helper for saving failed messages with context

453f133

erskingardner requested review from dannym-arx, jgmontoya and mubarakcoded January 23, 2026 09:00

jgmontoya reviewed Jan 26, 2026

View reviewed changes

		/// Helper to save a failed message while preserving context from existing records
		/// or using explicit context if provided.

feat: implement retryable state and preserve message context on failure #161

Are you sure you want to change the base?

feat: implement retryable state and preserve message context on failure #161

Conversation

erskingardner commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Reasoning

Uh oh!

coderabbitai bot commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

github-actions bot commented Jan 22, 2026

✅ Coverage: 88.74% → 89.99% (+1.25%)

Uh oh!

github-actions bot commented Jan 22, 2026

✅ Coverage: 88.74% → 90.68% (+1.94%)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Jan 23, 2026

✅ Coverage: 88.74% → 90.63% (+1.89%)

Uh oh!

github-actions bot commented Jan 23, 2026

✅ Coverage: 88.74% → 90.63% (+1.89%)

Uh oh!

erskingardner commented Jan 23, 2026

Uh oh!

coderabbitai bot commented Jan 23, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Jan 23, 2026

✅ Coverage: 88.74% → 90.6% (+1.86%)

Uh oh!

github-actions bot commented Jan 23, 2026

✅ Coverage: 88.74% → 90.71% (+1.97%)

Uh oh!

jgmontoya left a comment

Choose a reason for hiding this comment

Uh oh!

jgmontoya Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

jgmontoya Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

jgmontoya Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

jgmontoya Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

jgmontoya Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

jgmontoya Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

erskingardner commented Jan 22, 2026 •

edited

Loading

coderabbitai bot commented Jan 22, 2026 •

edited

Loading