Skip to content

Conversation

@sanity
Copy link
Collaborator

@sanity sanity commented Dec 13, 2025

Summary

  • Adds fallback peer lookup in connection_manager.get_peer_location_by_addr() when peek_target_peer returns None for Put/Get operations
  • Changes log level from error to warn since connection timeouts can legitimately occur

Root Cause

peek_target_peer returns None for all non-Connect operations (op_state_manager.rs:510-521) because Put/Get operations only store upstream addresses, not full PeerKeyLocation info. When routing responses, this caused the code to fail even when the connection still existed.

The fix adds a fallback lookup in connection_manager by address. This works when:

  • ✅ Connection exists but operation state lacks full peer info (the fix)
  • ❌ Connection was already dropped (neither lookup will find it - logged at warn level)

Test plan

  • Existing test_put_contract passes
  • Existing test_put_contract_three_hop_returns_response passes
  • All operations tests pass (9 passed, 3 ignored)
  • Clippy passes

Fixes #2279

🤖 Generated with Claude Code

[AI-assisted - Claude]

@sanity sanity requested a review from Copilot December 13, 2025 03:27
@sanity
Copy link
Collaborator Author

sanity commented Dec 13, 2025

@claude Please do a skeptical review of this PR

@claude
Copy link

claude bot commented Dec 13, 2025

Claude finished @sanity's task in 1m 53s —— View job


Skeptical Code Review

Tasks


Analysis

I've conducted a thorough review of this PR. Here are my findings:

⚠️ Critical Issues

1. The fallback won't work for the stated problem

The PR description claims this fixes routing when "a connection is dropped before a Put response can be routed back." However, the fallback lookup in connection_manager.get_peer_location_by_addr() only searches connections_by_location - which tracks currently active connections (see connection_manager.rs:382-390).

If the connection was dropped (which is the stated problem), it won't be in connections_by_location anymore. The fallback will return None just like peek_target_peer did.

Evidence:

// connection_manager.rs:380-394
pub fn get_peer_location_by_addr(&self, addr: SocketAddr) -> Option<PeerKeyLocation> {
    let connections = self.connections_by_location.read();
    for conns in connections.values() {
        for conn in conns {
            if conn.location.socket_addr() == Some(addr) {
                return Some(conn.location.clone());
            }
        }
    }
    drop(connections);
    
    // Transient connections don't have full PeerKeyLocation info
    None  // ← If connection dropped, we still get None
}

2. Logic contradiction in the comments

The new comment says:

"This can happen when:

  1. A transient connection was dropped before response routing"

But then logs:

"This may indicate the upstream connection was dropped"

If a connection is dropped, it's removed from connections_by_location (see connection_manager.rs:628-636), so the fallback lookup won't find it. The fix only helps when peek_target_peer returns None but the connection still exists - which happens for Put/Get operations that don't store full peer info, not for dropped connections.

🤔 What This Actually Fixes

Looking at op_state_manager.rs:510-521, peek_target_peer returns None for all operations except Connect:

pub fn peek_target_peer(&self, id: &Transaction) -> Option<PeerKeyLocation> {
    // ...
    match id.transaction_type() {
        TransactionType::Connect => {
            self.ops.connect.get(id).and_then(|op| op.get_target_peer())
        }
        // Other operations only store addresses, not full peer info
        _ => None,  // ← Always None for Put/Get
    }
}

So this PR actually fixes: Put/Get responses being dropped when the upstream connection still exists but peek_target_peer can't find peer info because Put/Get operations only store addresses, not full PeerKeyLocation.

This is different from the "connection dropped" scenario described in the PR.

📝 Additional Concerns

3. Performance: Linear scan through all connections

get_peer_location_by_addr() does a linear scan through all connections in all location buckets. For a node with many connections, this could be slow. Consider adding a reverse index addr -> PeerKeyLocation if this becomes a hot path.

4. Error handling inconsistency

The code marks the operation as completed (ctx.bridge.op_manager.completed(tx)) when it can't find peer info, but this might be premature. What if the connection comes back? Should there be retry logic instead?

5. Test coverage gap

The PR says tests pass, but I don't see evidence that the specific scenario this fixes (Put response routing when peek_target_peer returns None but connection exists) is actually tested. The existing tests (test_put_contract, test_put_contract_three_hop_returns_response) might pass for other reasons.

✅ What's Good

  • Changing log level from error to warn is appropriate since this can legitimately happen
  • The improved comment explains the situation better
  • Tests do pass, indicating no regressions

🔧 Recommendations

  1. Fix the PR description - The root cause explanation is misleading. It should say: "Put operations only store upstream addresses, not full PeerKeyLocation info. When routing responses, peek_target_peer returns None, so we need to look up the peer by address in the connection manager."

  2. Update the code comment at p2p_protoc.rs:481-495 - Remove mention of "dropped connections" since that's not what this fixes. Focus on the actual issue: operations storing addresses vs full peer info.

  3. Add a specific test - Create a test that verifies Put response routing works when peek_target_peer returns None but the connection exists.

  4. Consider performance - If this lookup becomes frequent, add a reverse index in ConnectionManager.

  5. Question the design - Why do Put/Get operations only store addresses while Connect stores full peer info? Should we make operation state tracking consistent? The current inconsistency is what necessitates this workaround.


Verdict: The fix itself is reasonable and improves the situation, but the explanation of what it fixes is incorrect. The code will help with Put/Get response routing when connections exist but operation state doesn't have full peer info - not when connections are dropped.


|

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a bug where Put/Get operation responses were silently dropped when the upstream connection was closed before the response could be routed back. The fix adds a fallback peer lookup mechanism and adjusts logging to reflect that dropped connections are a legitimate occurrence rather than an error condition.

Key Changes:

  • Adds fallback to connection_manager.get_peer_location_by_addr() when peek_target_peer returns None for Put/Get operations
  • Changes log level from error to warn since connection drops can legitimately occur with transient connections
  • Improves log message and comments to explain when and why this fallback is needed

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

active_connections = ctx.connections.len(),
phase = "error",
"Cannot establish connection - no peer info available in operation state"
phase = "warn",
Copy link

Copilot AI Dec 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The phase field should describe the operational stage, not the log level. Based on the context and similar logging at line 505, this should be phase = "connect" since the code is attempting to establish a connection. Other phase values in this file include "send", "connect", "error", "timeout", "abort", etc., but never a log level like "warn".

Suggested change
phase = "warn",
phase = "connect",

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - fixed in 5e234cf. Changed phase = "warn" to phase = "connect" to correctly describe the operational stage.

[AI-assisted - Claude]

When a connection is dropped before a Put response can be routed back,
peek_target_peer returns None (since Put operations only store addresses,
not full peer info). This caused responses to be silently dropped.

Add fallback lookup in connection_manager.get_peer_location_by_addr()
when peek_target_peer returns None. Also change log level from error to
warn since this can happen legitimately with transient connections.

Fixes #2279

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@sanity sanity force-pushed the fix-2279-put-response branch from 3219dfb to 5e234cf Compare December 13, 2025 03:39
@sanity
Copy link
Collaborator Author

sanity commented Dec 13, 2025

Response to Skeptical Review

Thanks for the thorough analysis. You've raised several valid points that I've addressed:

✅ Fixed: Misleading comments about "dropped connections"

You correctly identified that the fallback lookup in connection_manager.get_peer_location_by_addr() only searches active connections. The original comments were misleading - this fix helps when:

  1. The connection still exists but peek_target_peer returns None (because Put/Get ops only store addresses, not full PeerKeyLocation)

NOT when:

  1. The connection was already dropped (in which case neither lookup will find it)

Updated comments in 5e234cf to clarify this distinction.

✅ Fixed: phase = "warn"phase = "connect"

Per Copilot's inline suggestion, changed the phase field to describe the operational stage, not the log level.

Acknowledged: What this actually fixes

Your analysis is correct. peek_target_peer returns None for all non-Connect operations (op_state_manager.rs:510-521). This fix enables Put/Get response routing when the connection still exists but operation state lacks full peer info.

The error path (when both lookups fail) now correctly states the connection was dropped.

Re: Performance concern

Good point about the linear scan. If this becomes a hot path, we could add a reverse index. For now, the number of active connections per node is typically small enough that this shouldn't be a bottleneck.

Re: Design question

You asked why Put/Get operations only store addresses while Connect stores full peer info. This is historical - Connect needs PeerKeyLocation for ring topology decisions, while Put/Get just need to know where to route responses. The inconsistency could be addressed in a future refactor, but this fix addresses the immediate symptom.

[AI-assisted - Claude]

@iduartgomez
Copy link
Collaborator

@claude after all last changes, have been the PR review issues been addressed successfully?

@claude
Copy link

claude bot commented Dec 13, 2025

Claude finished @iduartgomez's task in 1m 12s —— View job


PR Review Issues - Assessment

Tasks

  • Read the current state of modified file
  • Review original concerns from skeptical review
  • Verify if fixes address the concerns
  • Check if PR description was updated
  • Provide final assessment

Assessment: ⚠️ Partially Addressed

I've reviewed the changes and compared them against the skeptical review feedback. Here's what I found:

✅ Successfully Fixed

1. Code comments are now accurate (p2p_protoc.rs:467-471)

The new comments correctly explain what the fallback does:

"If peek_target_peer returns None (e.g., for Put/Get operations which only store addresses, not full peer info), try looking up the peer in the connection_manager by address. This enables response routing when the connection still exists but operation state lacks full peer info."

This accurately describes the fix - it helps when the connection exists but operation state lacks full peer info.

2. Error path comment clarified (p2p_protoc.rs:481-486)

The comment when both lookups fail now correctly states:

"This happens when the connection was dropped before we could route the response. Since the peer is no longer in connection_manager, we can't re-establish the connection."

This is accurate - if we reach this point, the connection is truly gone.

3. Log level changed from error to warn (p2p_protoc.rs:487)

Appropriately reflects that connection timeouts are legitimate, not errors.

4. Phase field fixed (p2p_protoc.rs:491)

Changed from phase = "warn" to phase = "connect" as suggested by Copilot's review.

❌ Not Addressed

Critical: PR description still contains misleading explanation

The PR description's "Root Cause" section still says:

"peek_target_peer returns None for all non-Connect operations (op_state_manager.rs:510-521) because Put/Get operations only store upstream addresses, not full PeerKeyLocation info. When routing responses, this caused the code to fail even when the connection still existed."

But then the "Summary" bullet says:

"Adds fallback peer lookup in connection_manager.get_peer_location_by_addr() when peek_target_peer returns None for Put/Get operations"

And:

"Changes log level from error to warn since connection timeouts can legitimately occur"

The issue: While the Root Cause section is now accurate (thanks to @sanity's update), the opening summary is confusing because it doesn't clearly explain the problem being solved. A reader might think this is just about timeouts, when it's really about Put/Get operations not storing full peer info.

📝 Recommendation

The code changes are correct and well-commented. The only remaining issue is communication clarity in the PR description. Consider updating the Summary section to be more explicit:

Suggested Summary:

## Summary

**Problem:** Put/Get operations only store upstream addresses, not full `PeerKeyLocation` info. When routing responses, `peek_target_peer` returns None, causing responses to be dropped even when the connection still exists.

**Fix:** Added fallback lookup in `connection_manager.get_peer_location_by_addr()` to find peer info by address when `peek_target_peer` returns None.

**Also:** Changed log level from error to warn since connection drops can legitimately occur before response routing.

Conclusion

All code-level review issues have been successfully addressed. The comments are accurate, the logic is sound, and the fix correctly handles the scenario where Put/Get operations lack full peer info but the connection still exists.

The only remaining improvement would be clarifying the PR description to help future readers understand the problem and solution more quickly - but this is optional polish, not a blocking issue.


@iduartgomez iduartgomez added this pull request to the merge queue Dec 13, 2025
Merged via the queue into main with commit fab8b66 Dec 13, 2025
11 checks passed
@iduartgomez iduartgomez deleted the fix-2279-put-response branch December 13, 2025 22:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: Put operations from peers behind NAT timeout - gateway doesn't send PutResponse

3 participants