Conversation

@ericevans-nv
Contributor

@ericevans-nv ericevans-nv commented Feb 2, 2026

Description

Closes #1513

Summary

  • Persist WebSocketMessageHandler instances in a module-level dictionary keyed by conversation_id
  • On WebSocket reconnect, swap the socket reference on the disconnected handler and copy its state to the new handler
  • Re-send pending HITL prompts when a client reconnects so the UI displays them again

Problem

When a user refreshes the chat UI during a Human-In-The-Loop (HITL) interaction, the WebSocket connection resets and a new handler instance is created. The original workflow remains blocked waiting for a response, but nothing routes the user's response from the new connection to the original pending interaction. Users therefore have to abandon in-progress runs and restart from scratch.

Solution

Introduced a _conversation_handlers dictionary that persists handler instances by conversation_id. When a new WebSocket connection is established with an existing conversation_id:

  1. The disconnected handler's _socket is updated to the new connection (so the running workflow can send messages through it)
  2. The new handler copies the disconnected handler's state (_user_interaction, _running_workflow_task, etc.)
  3. Any pending HITL prompt is re-sent to trigger the UI to display it again

Test Plan

  • Start a workflow with HITL (e.g., simple_calculator_hitl)
  • Trigger the HITL prompt and wait for it to display
  • Refresh the browser page
  • Verify the HITL prompt reappears after reconnection
  • Submit a response and verify the workflow completes successfully

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Per-conversation WebSocket reconnection support to resume in-progress workflows after disconnects.
  • Bug Fixes

    • Re-delivery of Human-In-The-Loop prompts on reconnect to prevent lost prompts.
  • Improvements

    • More reliable handling of pending human responses and conversation state to ensure prompts and responses persist across reconnects.

Signed-off-by: Eric Evans <194135482+ericevans-nv@users.noreply.github.com>
@ericevans-nv ericevans-nv self-assigned this Feb 2, 2026
@ericevans-nv ericevans-nv requested a review from a team as a code owner February 2, 2026 03:21
@ericevans-nv ericevans-nv added the "feature request" (New feature or request) and "non-breaking" (Non-breaking change) labels Feb 2, 2026
@coderabbitai

coderabbitai bot commented Feb 2, 2026

Walkthrough

Persist pending HITL interaction state across WebSocket reconnections by registering conversation-scoped handlers and carrying forward in-flight prompt futures and content so a reconnected client with the same conversation_id can resume the original workflow.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| HITL state model & handler updates: packages/nvidia_nat_core/src/nat/front_ends/fastapi/message_handler.py | Added UserInteraction BaseModel (contains future and prompt_content, allows arbitrary types). Replaced raw future with self._user_interaction, added _initialize_workflow_request and _restore_execution_state, integrated worker reference into handler, updated HITL flow to set/clear UserInteraction and re-send pending prompts on reconnect. |
| Conversation handler registry & worker integration: packages/nvidia_nat_core/src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py | Added _conversation_handlers: dict[str, WebSocketMessageHandler] and accessors get_conversation_handler, set_conversation_handler, remove_conversation_handler. WebSocket handler is now constructed with the worker instance to enable restoration/registration. |
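
For orientation, the registry accessors named in the table might look something like the sketch below. The method names and the `_conversation_handlers` attribute come from the walkthrough; the `Worker` class, the placeholder `WebSocketMessageHandler`, the signatures, and the no-op behaviour when removing an unknown ID are assumptions made for illustration.

```python
from typing import Optional


class WebSocketMessageHandler:
    """Placeholder for the real handler class; only object identity matters here."""


class Worker:
    """Hypothetical stand-in for the FastAPI front-end plugin worker."""

    def __init__(self) -> None:
        # One handler per conversation_id.
        self._conversation_handlers: dict = {}

    def get_conversation_handler(self, conversation_id: str) -> Optional[WebSocketMessageHandler]:
        # Returns None when no handler has been registered for this conversation.
        return self._conversation_handlers.get(conversation_id)

    def set_conversation_handler(self, conversation_id: str, handler: WebSocketMessageHandler) -> None:
        # Overwrites any handler previously stored under the same ID.
        self._conversation_handlers[conversation_id] = handler

    def remove_conversation_handler(self, conversation_id: str) -> None:
        # Assumed behaviour: removing an unknown ID is a silent no-op.
        self._conversation_handlers.pop(conversation_id, None)
```

Keeping the dictionary behind accessors on the worker, rather than exposing it directly, gives one place to add cleanup or locking later.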

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant WebSocket as WebSocket Handler
    participant Store as Conversation Store
    participant Workflow as Workflow Engine

    rect rgba(100, 150, 200, 0.5)
    Note over Client,Workflow: Initial HITL Interaction
    Client->>WebSocket: connect (conversation_id)
    WebSocket->>Store: set_conversation_handler(conversation_id, handler)
    WebSocket->>Workflow: process_workflow_request
    Workflow->>WebSocket: human_interaction_callback (HITL prompt)
    WebSocket->>Store: store UserInteraction (future + prompt)
    WebSocket->>Client: send HITL prompt
    end

    rect rgba(200, 100, 100, 0.5)
    Note over Client,WebSocket: Connection Loss & Reconnection
    Client->>Client: refresh/reconnect
    Client->>WebSocket: new connection (same conversation_id)
    WebSocket->>Store: get_conversation_handler(conversation_id)
    WebSocket->>WebSocket: _restore_execution_state (attach to original handler, re-send prompt if pending)
    end

    rect rgba(100, 200, 100, 0.5)
    Note over Client,Workflow: Resume Workflow
    Client->>WebSocket: submit response
    WebSocket->>Store: resolve UserInteraction.future
    Workflow->>Workflow: resume from await
    Workflow->>Client: continue workflow
    end
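
The "resume from await" step in the diagram relies on ordinary asyncio behaviour: a task awaiting a future stays suspended until something sets a result on it, regardless of which connection delivers that result. A minimal sketch, with hypothetical function names and no real WebSocket involved:

```python
import asyncio


async def workflow(prompt_future: asyncio.Future) -> str:
    # The workflow blocks here until a human response arrives.
    answer = await prompt_future
    return f"workflow resumed with: {answer}"


async def main() -> str:
    loop = asyncio.get_running_loop()
    pending = loop.create_future()

    # Initial connection: the workflow starts and suspends on the HITL prompt.
    task = asyncio.create_task(workflow(pending))
    await asyncio.sleep(0)  # give the workflow a chance to reach its await

    # Connection lost: the task and the future are untouched and still pending.
    assert not task.done()

    # Reconnected client submits a response: resolving the same future
    # wakes the original workflow.
    pending.set_result("yes")
    return await task


result = asyncio.run(main())
```

This is why the handler only needs to keep the future (and the prompt text) reachable across reconnects; the workflow task itself never has to be restarted.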

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 52.63% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Preserve workflow state across WebSocket reconnections' is concise, descriptive, uses imperative mood, and directly aligns with the main change of persisting workflow state during reconnections. |
| Linked Issues check | ✅ Passed | All coding requirements from issue #1513 are met: workflow state persisted via conversation_id tracking, reconnection handler restoration with state copying, pending HITL prompts re-sent on reconnect. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to WebSocket reconnection support for HITL workflows; no unrelated modifications detected in the two modified files. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@packages/nvidia_nat_core/src/nat/front_ends/fastapi/message_handler.py`:
- Around line 102-112: The code stores self into _conversation_handlers using
self._conversation_id without checking for None; update
_initialize_workflow_request to validate message.conversation_id (and thus
self._conversation_id) before using it as a dict key: if message.conversation_id
is not None (or truthy) then set _conversation_handlers[self._conversation_id] =
self, otherwise skip the mapping and optionally emit a warning or handle the
missing conversation id path so you don't register under a None key.
- Around line 168-171: Replace the runtime assert with explicit validation: in
the WebSocketUserInteractionResponseMessage branch (handling validated_message
and calling _process_websocket_user_interaction_response_message), check that
self._user_interaction is not None and raise a clear exception (or
return/handle) instead of using assert; before calling
self._user_interaction.future.set_result(user_content) guard with if not
self._user_interaction.future.done() to avoid InvalidStateError (otherwise
log/ignore the late message or set_exception as appropriate).
- Around line 114-134: The handler registry isn't updated after restoring state
so future lookups still return the old handler; after you copy the disconnected
handler's fields in _restore_execution_state (conversation_id,
_user_interaction, _message_parent_id, _workflow_schema_type,
_running_workflow_task) update _conversation_handlers[conversation_id] = self so
the registry points to the new handler; ensure you perform this replacement
immediately after the field copies (and before re-sending any pending HITL
prompt) so subsequent reconnections and HITL prompts operate against the new
handler instance.
- Around line 62-73: The module-level _conversation_handlers dict retains
WebSocketMessageHandler instances forever; update _initialize_workflow_request
to store the handler keyed by the conversation_id as before but add cleanup to
remove that entry when the workflow ends: in _run_workflow (or attach a done
callback to the asyncio.Task you spawn) ensure you delete
_conversation_handlers[conversation_id] in the task's finally block (or
callback) so the handler is removed on normal completion, cancellation, or
error; reference WebSocketMessageHandler and the conversation_id key when
implementing the removal to avoid leaking handlers.
🧹 Nitpick comments (1)
packages/nvidia_nat_core/src/nat/front_ends/fastapi/message_handler.py (1)

66-72: Consider expanding the docstring with attribute descriptions.

While functional, the docstring could be more informative for maintainability.

📝 Suggested improvement
 class UserInteraction(BaseModel):
-    """User interaction state."""
+    """
+    User interaction state for pending HITL prompts.
+
+    Attributes:
+        future: Awaitable future that resolves when user responds.
+        prompt_content: The prompt content sent to the user.
+    """

     model_config = ConfigDict(arbitrary_types_allowed=True)

     future: asyncio.Future[TextContent]
     prompt_content: HumanPrompt

Signed-off-by: Eric Evans <194135482+ericevans-nv@users.noreply.github.com>
@ericevans-nv ericevans-nv changed the title Preserve workflow state across WebSocket reconnections Add WebSocket session recovery for chat reconnections Feb 2, 2026
@ericevans-nv ericevans-nv changed the title Add WebSocket session recovery for chat reconnections Preserve workflow state across WebSocket reconnections Feb 2, 2026

Labels

feature request (New feature or request), non-breaking (Non-breaking change)

Development

Successfully merging this pull request may close these issues.

Support WebSocket reconnection for HITL interactions

2 participants