-
-
Notifications
You must be signed in to change notification settings - Fork 110
fix(transport): reset gateway session when peer restarts with new identity #2281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ntity When a peer behind NAT restarts with a new identity but the NAT assigns the same external IP:port, the gateway would continue trying to use the old session encryption keys. The new peer's handshake packets were silently dropped because they failed decryption with the stale keys. This fix detects new RSA-encrypted intro packets (256 bytes) arriving from existing connections. When such a packet is successfully decrypted, it indicates a new peer identity is connecting from the same address. The gateway now resets the stale session and establishes a fresh connection with the new peer. Key changes: - Added identity detection check in packet handler for gateways - When a 256-byte packet from a known address decrypts as RSA intro, the old session is removed and a new gateway connection is started - Added unit test that reproduces the bug scenario Closes #2277 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
@claude Please do a skeptical review of this PR |
|
Claude finished @sanity's task in 2m 0s —— View job Skeptical PR ReviewTasks
SummaryThis PR addresses a real issue (#2277) where gateways fail to reconnect with peers that restart with new identities from the same NAT address. The core approach—detecting RSA intro packets from existing connections and resetting stale sessions—is sound. However, I've identified several critical security and correctness issues that need attention. Critical Issues🔴 1. RSA Decryption Side-Channel Attack Vector (Security)Location: let is_new_identity = self.is_gateway
&& size == 256
&& packet_data.try_decrypt_asym(&self.this_peer_keypair.secret).is_ok();Problem: This code attempts RSA decryption on every 256-byte packet from existing connections. This creates a significant security vulnerability:
Why this is critical:
Recommended fix: // Only attempt RSA decryption if the packet COULD be an intro packet
// Real intro packets have specific structure that can be validated cheaply first
let is_new_identity = if self.is_gateway && size == 256 {
// Quick structural validation before expensive RSA operation
// Real intro packets start with specific magic bytes or structure
if looks_like_intro_packet(&packet_data) {
packet_data.try_decrypt_asym(&self.this_peer_keypair.secret).is_ok()
} else {
false
}
} else {
false
};Alternatively, rate-limit RSA decryption attempts per connection to prevent abuse. 🔴 2. Silent Failure on Decryption Errors (Correctness)Location: && packet_data.try_decrypt_asym(&self.this_peer_keypair.secret).is_ok();Problem: If Why this matters:
Recommended fix: let is_new_identity = if self.is_gateway && size == 256 {
match packet_data.try_decrypt_asym(&self.this_peer_keypair.secret) {
Ok(_) => true,
Err(TransportError::DecryptionFailure { .. }) => false, // Wrong key, not new identity
Err(err) => {
// Unexpected error - log and drop packet
tracing::warn!(
peer_addr = %remote_addr,
error = %err,
"Unexpected error during RSA intro packet check"
);
continue; // Don't send to old handler OR fall through
}
}
} else {
false
};🟡 3. Race Condition: Packet Loss During Transition (Correctness)Location: // Don't reinsert - let the packet fall through to gateway_connection
// which will establish a fresh session with the new peerProblem: Between detecting the new identity and establishing the fresh session, there's a window where:
Example timeline: Evidence from code: Lines 531-537 show that Recommended fix: if is_new_identity {
tracing::info!(...);
// Insert tombstone to prevent duplicate sessions
self.remote_connections.insert(remote_addr, RemoteConnection {
inbound_packet_sender: /* closed channel */,
observed_addr: None,
});
// The intro packet will replace this with real connection
}Or use a separate "pending connections" map. 🟡 4. Assumption: RSA Intro Packets Are Always 256 Bytes (Brittleness)Location: && size == 256Problem: This hardcodes the RSA ciphertext size assumption. If:
...this code silently breaks without any warning. Evidence: I couldn't find a constant like Recommended fix: // In crypto.rs or packet_data.rs
pub const RSA_ENCRYPTED_PACKET_SIZE: usize = 256; // 2048-bit RSA with PKCS#1v1.5
// In connection_handler.rs
&& size == RSA_ENCRYPTED_PACKET_SIZEAdd a compile-time assertion that the key size matches: const_assert!(RSA_KEY_BITS == 2048);
const_assert!(RSA_ENCRYPTED_PACKET_SIZE == RSA_KEY_BITS / 8);Minor Issues5. Test Only Validates Happy Path (Testing)Location: Test at line 2560 The test
Missing test cases:
Recommendation: Add negative tests to verify the fix doesn't break legitimate traffic or create new attack vectors. 6. Inconsistent Error Handling Path (Code Quality)The new code path has three outcomes:
Path #1 and #4 both "don't reinsert" but for different reasons. The asymmetry makes the code harder to reason about. Consider extracting this into a helper function or using an enum for clarity. 7. Missing Metrics/Observability (Operations)When the gateway detects and resets a stale session (the main success case of this PR), there's only a single
What the PR Does Well✅ Correctly identifies the root cause: Gateway retention of stale sessions is indeed the problem ✅ Minimal change surface: Only modifies the packet routing logic, doesn't touch handshake protocol ✅ Includes a regression test: The new test would catch if someone accidentally removed the fix ✅ Good documentation: The PR description clearly explains the problem and solution Questions for Author
RecommendationDo not merge without addressing Issue #1 (RSA side-channel/DoS). This is a critical security vulnerability that could allow:
The other issues are important but non-blocking if you add follow-up tasks for them. Suggested Next Steps
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a critical bug where gateways fail to accept reconnections from peers that restart with a new identity but retain the same external IP:port (common with NAT). The fix detects new RSA-encrypted intro packets on existing connections and resets stale sessions to allow fresh handshakes.
Key changes:
- Added detection logic for new peer identities by checking if 256-byte packets decrypt successfully as RSA intro packets
- Restructured packet handling to check for new identities before routing to existing connections
- Added comprehensive test coverage for the same-address, new-identity reconnection scenario
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let is_new_identity = self.is_gateway | ||
| && size == 256 | ||
| && packet_data.try_decrypt_asym(&self.this_peer_keypair.secret).is_ok(); |
Copilot
AI
Dec 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RSA decryption is computationally expensive (typically 1-10ms) and is being performed on every 256-byte packet from an existing connection. This creates a denial-of-service vulnerability where an attacker could send many 256-byte packets to cause CPU exhaustion on the gateway.
Consider adding rate limiting or caching the last RSA decryption attempt timestamp per address to prevent repeated expensive decryption operations within a short time window.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Valid concern about CPU cost. However, a few mitigating factors:
- This only triggers for existing connections - an attacker would first need to establish a legitimate connection to reach this code path
- RSA decryption failing is fast - invalid ciphertext fails early in the decryption process; only properly-formatted RSA ciphertext incurs the full cost
- Already rate-limited by UDP - the socket read rate naturally limits throughput
That said, I'll add a per-address rate limit for RSA decryption attempts (e.g., max 1 attempt per second per IP). This prevents abuse while still handling legitimate reconnects quickly.
[AI-assisted - Claude]
| let is_new_identity = self.is_gateway | ||
| && size == 256 | ||
| && packet_data.try_decrypt_asym(&self.this_peer_keypair.secret).is_ok(); |
Copilot
AI
Dec 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code only checks if RSA decryption succeeds but doesn't validate that the decrypted data is actually a valid intro packet. According to the gateway_connection function (lines 791-801), a valid intro packet should contain:
- Protocol version (PROTOC_VERSION.len() bytes)
- Outbound key (16 bytes)
Without this validation, any 256-byte packet that successfully RSA decrypts will trigger session reset, even if it's not a genuine intro packet. This could cause unnecessary connection resets.
Consider validating the decrypted packet structure before resetting the session, similar to what's done in gateway_connection.
| let is_new_identity = self.is_gateway | |
| && size == 256 | |
| && packet_data.try_decrypt_asym(&self.this_peer_keypair.secret).is_ok(); | |
| let is_new_identity = if self.is_gateway && size == 256 { | |
| match packet_data.try_decrypt_asym(&self.this_peer_keypair.secret) { | |
| Ok(decrypted) => { | |
| // Validate intro packet structure: | |
| // 1. Protocol version (PCK_VERSION.len() bytes) | |
| // 2. Outbound key (16 bytes) | |
| let proto_len = PCK_VERSION.len(); | |
| if decrypted.len() >= proto_len + 16 | |
| && &decrypted[..proto_len] == PCK_VERSION | |
| { | |
| true | |
| } else { | |
| false | |
| } | |
| } | |
| Err(_) => false, | |
| } | |
| } else { | |
| false | |
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent suggestion - you're right that we should validate the decrypted structure. I'll adopt your suggestion with a minor tweak: use PROTOC_VERSION (the protocol version constant) instead of PCK_VERSION.
This ensures we only reset for legitimate intro packets, not any 256-byte data that happens to decrypt.
[AI-assisted - Claude]
Response to Claude's Skeptical ReviewThanks for the thorough analysis. Addressing each point: Critical Issues🔴 1. RSA Decryption Side-Channel/DoSAgreed this needs mitigation. However, some clarifications:
Action: Will add rate limiting + structural pre-validation. 🔴 2. Silent Failure on Decryption ErrorsGood catch on distinguishing "wrong key" from other errors. I'll add explicit error handling with logging for unexpected failures. Action: Will implement the suggested match pattern. 🟡 3. Race Condition During Session ResetI analyzed this more carefully:
The key insight is that Action: Will verify the behavior under load and add a test for rapid packet arrival. 🟡 4. Magic Constant 256Agreed—already addressed in inline comment. Will extract to Minor Issues5. Test CoverageWill add:
6. Code QualityAgreed the paths could be clearer. Will refactor after the security fixes. 7. MetricsWill add a counter for session resets. Full observability can be a follow-up. Answers to Questions
Implementation Plan
Will push these improvements shortly. [AI-assisted - Claude] |
Improves the gateway session reset fix (#2277) based on review feedback: 1. Extract magic number to RSA_INTRO_PACKET_SIZE constant with documentation 2. Add rate limiting for RSA decryption attempts (1/second per address) to prevent CPU exhaustion DoS attacks 3. Validate decrypted packet structure (check protocol version) before resetting session - prevents triggering on arbitrary 256-byte data 4. Use decrypted.data() accessor instead of direct indexing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pushed improvements based on review feedbackCommit 0b2164d addresses the main security and correctness concerns: Changes Made
What remains
The main security concerns (RSA DoS, arbitrary packet triggering reset) are now addressed. Tests pass locally. [AI-assisted - Claude] |
sanity
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 files reviewed, 3 unresolved discussions (waiting on @iduartgomez)
| // Rate limit RSA decryption attempts to prevent DoS | ||
| let now = Instant::now(); | ||
| let rate_limited = self | ||
| .last_rsa_attempt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we clean up from this map after connections are dropped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in commit 8224600 - added cleanup at all 4 sites where connections are removed. Thanks for catching this!
[AI-assisted - Claude]
|
@claude after last commits, have been all the issues pointer in PR reviews been addressed? |
|
Claude finished @iduartgomez's task in 1m 43s —— View job Review Issue Resolution StatusTasks
SummaryAfter reviewing the latest commits (particularly ✅ Issues Successfully Addressed1. Magic Constant Extracted (Issue #4)
2. Rate Limiting Added (Issue #1 - DoS Prevention)
3. Intro Packet Structure Validation (Issue #1 - False Positives)
4. Test Coverage
|
|
@sanity in the original issue I see
I don't think this is addressed in this PR which is the more worrying part, we shouldn't accumulate stale connections for sure they should be cleaned up after they are dropped. I don't think this is the case but if it is we should address the issue! |
@claude can you review this |
|
Claude finished @iduartgomez's task in 1m 51s —— View job Review: Stale Connection CleanupTasks
SummaryGood news and bad news regarding stale connection cleanup: ✅ GOOD: ❌ BAD: Detailed Analysis1.
|
|
Check comment #2281 (comment) If we already had cleanup in place, how is possible the original problem described in the issue did even happen? |
|
Claude finished @iduartgomez's task in 1m 40s —— View job Investigating Cleanup vs Original IssueTasks
Answer: The Existing Cleanup Didn't Cover This ScenarioYou've identified an important nuance. Here's what's happening: Pre-existing Cleanup (on
|
…moved Address PR review feedback from @iduartgomez: the last_rsa_attempt HashMap used for rate-limiting RSA decryption attempts was never cleaned up, creating a memory leak as entries persisted indefinitely. Added cleanup at all 4 sites where connections are removed: 1. When new peer identity is detected (session reset) 2. When channel is disconnected (connection dead) 3. When stale connections from same IP are removed (issue #2235) 4. When dead outbound connection is removed 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Addressed:
|
Problem
When a peer behind NAT restarts with a new identity but the NAT assigns the same external IP:port, the gateway fails to reconnect. This is because:
This is a significant issue in production because:
Solution
The fix detects when a new RSA-encrypted intro packet (256 bytes) arrives from an address that has an existing connection:
gateway_connection()to establish a fresh sessionWhy This Approach
Testing
Added a unit test
gateway_handles_peer_reconnection_same_addr_new_identitythat:All related tests pass:
gateway_handles_peer_reconnection_same_addr_new_identity(new test)gateway_handles_peer_reconnection_from_different_port(fix: Gateway retains stale crypto state on peer reconnection #2235)multiple_peers_behind_same_natFixes
Closes #2277
[AI-assisted - Claude]
This change is