fix(network): fallback peer lookup for Put response routing #2282

sanity · 2025-12-13T03:04:55Z

Summary

Adds fallback peer lookup in connection_manager.get_peer_location_by_addr() when peek_target_peer returns None for Put/Get operations
Changes log level from error to warn since connection timeouts can legitimately occur

Root Cause

peek_target_peer returns None for all non-Connect operations (op_state_manager.rs:510-521) because Put/Get operations only store upstream addresses, not full PeerKeyLocation info. When routing responses, this caused the code to fail even when the connection still existed.

The fix adds a fallback lookup in connection_manager by address. This works when:

✅ Connection exists but operation state lacks full peer info (the fix)
❌ Connection was already dropped (neither lookup will find it - logged at warn level)

Test plan

Existing test_put_contract passes
Existing test_put_contract_three_hop_returns_response passes
All operations tests pass (9 passed, 3 ignored)
Clippy passes

Fixes #2279

🤖 Generated with Claude Code

[AI-assisted - Claude]

sanity · 2025-12-13T03:27:26Z

@claude Please do a skeptical review of this PR

claude · 2025-12-13T03:27:46Z

Claude finished @sanity's task in 1m 53s —— View job

Skeptical Code Review

Tasks

Read PR changes and understand modifications
Read issue fix: Put operations from peers behind NAT timeout - gateway doesn't send PutResponse #2279 for background context
Analyze the fix for potential issues
Provide skeptical review feedback

Analysis

I've conducted a thorough review of this PR. Here are my findings:

⚠️ Critical Issues

1. The fallback won't work for the stated problem

The PR description claims this fixes routing when "a connection is dropped before a Put response can be routed back." However, the fallback lookup in connection_manager.get_peer_location_by_addr() only searches connections_by_location - which tracks currently active connections (see connection_manager.rs:382-390).

If the connection was dropped (which is the stated problem), it won't be in connections_by_location anymore. The fallback will return None just like peek_target_peer did.

Evidence:

// connection_manager.rs:380-394
pub fn get_peer_location_by_addr(&self, addr: SocketAddr) -> Option<PeerKeyLocation> {
    let connections = self.connections_by_location.read();
    for conns in connections.values() {
        for conn in conns {
            if conn.location.socket_addr() == Some(addr) {
                return Some(conn.location.clone());
            }
        }
    }
    drop(connections);
    
    // Transient connections don't have full PeerKeyLocation info
    None  // ← If connection dropped, we still get None
}

2. Logic contradiction in the comments

The new comment says:

"This can happen when:

A transient connection was dropped before response routing"

But then logs:

"This may indicate the upstream connection was dropped"

If a connection is dropped, it's removed from connections_by_location (see connection_manager.rs:628-636), so the fallback lookup won't find it. The fix only helps when peek_target_peer returns None but the connection still exists - which happens for Put/Get operations that don't store full peer info, not for dropped connections.

🤔 What This Actually Fixes

Looking at op_state_manager.rs:510-521, peek_target_peer returns None for all operations except Connect:

pub fn peek_target_peer(&self, id: &Transaction) -> Option<PeerKeyLocation> {
    // ...
    match id.transaction_type() {
        TransactionType::Connect => {
            self.ops.connect.get(id).and_then(|op| op.get_target_peer())
        }
        // Other operations only store addresses, not full peer info
        _ => None,  // ← Always None for Put/Get
    }
}

So this PR actually fixes: Put/Get responses being dropped when the upstream connection still exists but peek_target_peer can't find peer info because Put/Get operations only store addresses, not full PeerKeyLocation.

This is different from the "connection dropped" scenario described in the PR.

📝 Additional Concerns

3. Performance: Linear scan through all connections

get_peer_location_by_addr() does a linear scan through all connections in all location buckets. For a node with many connections, this could be slow. Consider adding a reverse index addr -> PeerKeyLocation if this becomes a hot path.

4. Error handling inconsistency

The code marks the operation as completed (ctx.bridge.op_manager.completed(tx)) when it can't find peer info, but this might be premature. What if the connection comes back? Should there be retry logic instead?

5. Test coverage gap

The PR says tests pass, but I don't see evidence that the specific scenario this fixes (Put response routing when peek_target_peer returns None but connection exists) is actually tested. The existing tests (test_put_contract, test_put_contract_three_hop_returns_response) might pass for other reasons.

✅ What's Good

Changing log level from error to warn is appropriate since this can legitimately happen
The improved comment explains the situation better
Tests do pass, indicating no regressions

🔧 Recommendations

Fix the PR description - The root cause explanation is misleading. It should say: "Put operations only store upstream addresses, not full PeerKeyLocation info. When routing responses, peek_target_peer returns None, so we need to look up the peer by address in the connection manager."
Update the code comment at p2p_protoc.rs:481-495 - Remove mention of "dropped connections" since that's not what this fixes. Focus on the actual issue: operations storing addresses vs full peer info.
Add a specific test - Create a test that verifies Put response routing works when peek_target_peer returns None but the connection exists.
Consider performance - If this lookup becomes frequent, add a reverse index in ConnectionManager.
Question the design - Why do Put/Get operations only store addresses while Connect stores full peer info? Should we make operation state tracking consistent? The current inconsistency is what necessitates this workaround.

Verdict: The fix itself is reasonable and improves the situation, but the explanation of what it fixes is incorrect. The code will help with Put/Get response routing when connections exist but operation state doesn't have full peer info - not when connections are dropped.

|

Copilot

Pull request overview

This PR fixes a bug where Put/Get operation responses were silently dropped when the upstream connection was closed before the response could be routed back. The fix adds a fallback peer lookup mechanism and adjusts logging to reflect that dropped connections are a legitimate occurrence rather than an error condition.

Key Changes:

Adds fallback to connection_manager.get_peer_location_by_addr() when peek_target_peer returns None for Put/Get operations
Changes log level from error to warn since connection drops can legitimately occur with transient connections
Improves log message and comments to explain when and why this fallback is needed

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-13T03:30:00Z

crates/core/src/node/network_bridge/p2p_protoc.rs

                                            active_connections = ctx.connections.len(),
-                                            phase = "error",
-                                            "Cannot establish connection - no peer info available in operation state"
+                                            phase = "warn",


The phase field should describe the operational stage, not the log level. Based on the context and similar logging at line 505, this should be phase = "connect" since the code is attempting to establish a connection. Other phase values in this file include "send", "connect", "error", "timeout", "abort", etc., but never a log level like "warn".

Suggested change

phase = "warn",

phase = "connect",

Good catch - fixed in 5e234cf. Changed phase = "warn" to phase = "connect" to correctly describe the operational stage.

[AI-assisted - Claude]

When a connection is dropped before a Put response can be routed back, peek_target_peer returns None (since Put operations only store addresses, not full peer info). This caused responses to be silently dropped. Add fallback lookup in connection_manager.get_peer_location_by_addr() when peek_target_peer returns None. Also change log level from error to warn since this can happen legitimately with transient connections. Fixes #2279 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

sanity · 2025-12-13T03:40:06Z

Response to Skeptical Review

Thanks for the thorough analysis. You've raised several valid points that I've addressed:

✅ Fixed: Misleading comments about "dropped connections"

You correctly identified that the fallback lookup in connection_manager.get_peer_location_by_addr() only searches active connections. The original comments were misleading - this fix helps when:

The connection still exists but peek_target_peer returns None (because Put/Get ops only store addresses, not full PeerKeyLocation)

NOT when:

The connection was already dropped (in which case neither lookup will find it)

Updated comments in 5e234cf to clarify this distinction.

✅ Fixed: `phase = "warn"` → `phase = "connect"`

Per Copilot's inline suggestion, changed the phase field to describe the operational stage, not the log level.

Acknowledged: What this actually fixes

Your analysis is correct. peek_target_peer returns None for all non-Connect operations (op_state_manager.rs:510-521). This fix enables Put/Get response routing when the connection still exists but operation state lacks full peer info.

The error path (when both lookups fail) now correctly states the connection was dropped.

Re: Performance concern

Good point about the linear scan. If this becomes a hot path, we could add a reverse index. For now, the number of active connections per node is typically small enough that this shouldn't be a bottleneck.

Re: Design question

You asked why Put/Get operations only store addresses while Connect stores full peer info. This is historical - Connect needs PeerKeyLocation for ring topology decisions, while Put/Get just need to know where to route responses. The inconsistency could be addressed in a future refactor, but this fix addresses the immediate symptom.

[AI-assisted - Claude]

iduartgomez · 2025-12-13T22:13:44Z

@claude after all last changes, have been the PR review issues been addressed successfully?

claude · 2025-12-13T22:14:02Z

Claude finished @iduartgomez's task in 1m 12s —— View job

PR Review Issues - Assessment

Tasks

Read the current state of modified file
Review original concerns from skeptical review
Verify if fixes address the concerns
Check if PR description was updated
Provide final assessment

Assessment: ⚠️ Partially Addressed

I've reviewed the changes and compared them against the skeptical review feedback. Here's what I found:

✅ Successfully Fixed

1. Code comments are now accurate (p2p_protoc.rs:467-471)

The new comments correctly explain what the fallback does:

"If peek_target_peer returns None (e.g., for Put/Get operations which only store addresses, not full peer info), try looking up the peer in the connection_manager by address. This enables response routing when the connection still exists but operation state lacks full peer info."

This accurately describes the fix - it helps when the connection exists but operation state lacks full peer info.

2. Error path comment clarified (p2p_protoc.rs:481-486)

The comment when both lookups fail now correctly states:

"This happens when the connection was dropped before we could route the response. Since the peer is no longer in connection_manager, we can't re-establish the connection."

This is accurate - if we reach this point, the connection is truly gone.

3. Log level changed from error to warn (p2p_protoc.rs:487)

Appropriately reflects that connection timeouts are legitimate, not errors.

4. Phase field fixed (p2p_protoc.rs:491)

Changed from phase = "warn" to phase = "connect" as suggested by Copilot's review.

❌ Not Addressed

Critical: PR description still contains misleading explanation

The PR description's "Root Cause" section still says:

"peek_target_peer returns None for all non-Connect operations (op_state_manager.rs:510-521) because Put/Get operations only store upstream addresses, not full PeerKeyLocation info. When routing responses, this caused the code to fail even when the connection still existed."

But then the "Summary" bullet says:

"Adds fallback peer lookup in connection_manager.get_peer_location_by_addr() when peek_target_peer returns None for Put/Get operations"

And:

"Changes log level from error to warn since connection timeouts can legitimately occur"

The issue: While the Root Cause section is now accurate (thanks to @sanity's update), the opening summary is confusing because it doesn't clearly explain the problem being solved. A reader might think this is just about timeouts, when it's really about Put/Get operations not storing full peer info.

📝 Recommendation

The code changes are correct and well-commented. The only remaining issue is communication clarity in the PR description. Consider updating the Summary section to be more explicit:

Suggested Summary:

## Summary

**Problem:** Put/Get operations only store upstream addresses, not full `PeerKeyLocation` info. When routing responses, `peek_target_peer` returns None, causing responses to be dropped even when the connection still exists.

**Fix:** Added fallback lookup in `connection_manager.get_peer_location_by_addr()` to find peer info by address when `peek_target_peer` returns None.

**Also:** Changed log level from error to warn since connection drops can legitimately occur before response routing.

Conclusion

All code-level review issues have been successfully addressed. The comments are accurate, the logic is sound, and the fix correctly handles the scenario where Put/Get operations lack full peer info but the connection still exists.

The only remaining improvement would be clarifying the PR description to help future readers understand the problem and solution more quickly - but this is optional polish, not a blocking issue.

sanity requested a review from Copilot December 13, 2025 03:27

Copilot started reviewing on behalf of sanity December 13, 2025 03:27 View session

Copilot AI reviewed Dec 13, 2025

View reviewed changes

sanity force-pushed the fix-2279-put-response branch from 3219dfb to 5e234cf Compare December 13, 2025 03:39

iduartgomez approved these changes Dec 13, 2025

View reviewed changes

iduartgomez added this pull request to the merge queue Dec 13, 2025

Merged via the queue into main with commit fab8b66 Dec 13, 2025
11 checks passed

iduartgomez deleted the fix-2279-put-response branch December 13, 2025 22:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

fix(network): fallback peer lookup for Put response routing #2282

fix(network): fallback peer lookup for Put response routing #2282

Uh oh!

sanity commented Dec 13, 2025 •

edited

Loading

Uh oh!

sanity commented Dec 13, 2025

Uh oh!

claude bot commented Dec 13, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Dec 13, 2025

Uh oh!

sanity Dec 13, 2025

Uh oh!

sanity commented Dec 13, 2025

Uh oh!

iduartgomez commented Dec 13, 2025

Uh oh!

claude bot commented Dec 13, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

fix(network): fallback peer lookup for Put response routing #2282

fix(network): fallback peer lookup for Put response routing #2282

Uh oh!

Conversation

sanity commented Dec 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause

Test plan

Uh oh!

sanity commented Dec 13, 2025

Uh oh!

claude bot commented Dec 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Skeptical Code Review

Tasks

Analysis

⚠️ Critical Issues

🤔 What This Actually Fixes

📝 Additional Concerns

✅ What's Good

🔧 Recommendations

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Dec 13, 2025

Choose a reason for hiding this comment

Uh oh!

sanity Dec 13, 2025

Choose a reason for hiding this comment

Uh oh!

sanity commented Dec 13, 2025

Response to Skeptical Review

✅ Fixed: Misleading comments about "dropped connections"

✅ Fixed: phase = "warn" → phase = "connect"

Acknowledged: What this actually fixes

Re: Performance concern

Re: Design question

Uh oh!

iduartgomez commented Dec 13, 2025

Uh oh!

claude bot commented Dec 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review Issues - Assessment

Tasks

Assessment: ⚠️ Partially Addressed

✅ Successfully Fixed

❌ Not Addressed

📝 Recommendation

Conclusion

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sanity commented Dec 13, 2025 •

edited

Loading

claude bot commented Dec 13, 2025 •

edited

Loading

✅ Fixed: `phase = "warn"` → `phase = "connect"`

claude bot commented Dec 13, 2025 •

edited

Loading