-
Notifications
You must be signed in to change notification settings - Fork 0
fix(dns): implement staggered concurrent queries to reduce resolution latency #145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… latency Previously, the DNS resolver queried nameservers sequentially, waiting for the full timeout (5s) on each failure before trying the next server. With 3 internal nameservers, this caused 10-15s delays when initial servers were slow or unreachable (common in Docker environments). This change implements staggered concurrent queries: - Query first nameserver immediately - After 150ms, if no response, fire next nameserver in parallel - Continue staggering until all nameservers queried or valid response arrives - Use first successful response - On SERVFAIL/REFUSED/error, immediately trigger next nameserver (no wait) Before: ns0 timeout (5s) -> ns1 timeout (5s) -> ns2 responds = ~10s+ After: ns0 -> +150ms ns1 -> +150ms ns2 -> first response wins = <300ms Applied to both UDP (forwardQuery) and TCP (forwardQueryTCP) paths.
WalkthroughThe pull request introduces staggered concurrent DNS queries that replace sequential nameserver querying with asynchronous result channels and first-success-wins logic. Timeout-aware context handling is added to abort when query deadlines expire, while cache mechanics and error handling pathways are retained with updated TTL management. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Server
participant NSPool as Nameserver<br/>Pool
participant ResultChan as Result<br/>Channel
participant Cache
participant Timeout as Context<br/>Timeout
Client->>Server: DNS Query
Server->>Cache: Check cache
alt Cache Hit
Cache-->>Client: Cached response
else Cache Miss
Server->>Server: Start stagger timer
Server->>NSPool: Query NS[0] immediately
rect rgb(200, 220, 255)
Note over Server,NSPool: Concurrent staggered queries
Server->>Timeout: Set deadline context
NSPool->>ResultChan: NS[0] result (response/error)
end
par Stagger Wait
Server->>Server: Wait stagger delay
Server->>NSPool: Query NS[1]
NSPool->>ResultChan: NS[1] result
and Collect Results
ResultChan->>Server: First satisfactory result
end
alt Timeout expires
Timeout->>Server: Context deadline exceeded
Server->>Server: Send error response (SERVFAIL)
else Success received
Server->>Cache: Cache response + TTL
Server->>Client: Return response
else All queries fail
Server->>Server: Send error response
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dns/server_test.go (1)
1505-1571: Data race:callCountaccessed without synchronization.With the new staggered concurrent queries,
mockDialermay be invoked from multiple goroutines simultaneously. ThecallCount++increment on line 1509 and the read on line 1570 are not protected, creating a data race.Apply the same mutex pattern used in
TestDNSResolver_NameserverFallback:🔎 Proposed fix
func TestDNSResolver_NameserverFallback_ConnectionError(t *testing.T) { testLogger := logger.NewTestLogger() + var mu sync.Mutex var callCount int // Create a mock dialer where first nameserver fails with connection error mockDialer := func(ctx context.Context, network, address string) (net.Conn, error) { + mu.Lock() callCount++ + mu.Unlock() if address == "ns1.test:53" {And when reading:
// Both nameservers should have been called + mu.Lock() + gotCallCount := callCount + mu.Unlock() - if callCount != 2 { - t.Errorf("Expected 2 nameserver calls, got %d", callCount) + if gotCallCount != 2 { + t.Errorf("Expected 2 nameserver calls, got %d", gotCallCount) }
🧹 Nitpick comments (1)
dns/server.go (1)
451-557: Well-designed staggered query implementation with one robustness consideration.The staggered concurrent query approach effectively addresses the latency problem. The buffered channel prevents goroutine leaks, and immediate fallback on SERVFAIL/REFUSED is correct.
Timer Reset pattern: Calling
staggerTimer.Reset(0)(lines 504, 512, 524) without draining when the timer may have already fired can cause a stale value on the channel. On Go < 1.23, this could lead to spurious timer events. While the impact is minor (at worst, an extra nameserver is queried early), consider the defensive pattern for robustness:if !staggerTimer.Stop() { select { case <-staggerTimer.C: default: } } staggerTimer.Reset(0)This is optional if targeting Go 1.23+ where timer semantics were improved.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
dns/resolv.godns/server.godns/server_test.go
🧰 Additional context used
🧬 Code graph analysis (1)
dns/resolv.go (1)
dns/dns.go (1)
DefaultExternalDNSServers(45-49)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build
- GitHub Check: Analyze (go)
🔇 Additional comments (6)
dns/resolv.go (1)
114-139: LGTM - Formatting-only changes.The whitespace adjustments in
GetSystemNameservershave no functional impact. The logic remains correct.dns/server_test.go (2)
7-7: LGTM - Required import for test synchronization.
1405-1479: LGTM - Proper mutex synchronization for concurrent test access.The mutex correctly guards
callCountandsecondCalledwhich are now accessed by multiple goroutines due to the staggered query implementation. Reading intogotSecondCalledbefore unlock is the correct pattern.dns/server.go (3)
22-22: LGTM - Reasonable stagger delay.150ms provides good balance between reducing latency when first nameserver is slow and avoiding unnecessary parallel queries when it responds quickly.
650-655: LGTM - Clean result container for concurrent queries.
657-788: LGTM - Consistent staggered implementation for UDP path.Same well-designed pattern as
forwardQueryTCP. The same optional timer robustness improvement mentioned above applies here (lines 717, 725, 738).The CNAME resolution correctly preserves original headers and merges answer records. Error responses are properly sent when all nameservers fail.
Problem
The DNS resolver was experiencing significant delays (10-15+ seconds) when resolving domains, particularly in Docker environments. This was reliably reproducible with internal domains like
local-local.machine.agentuity.internal.Symptoms observed:
The query would timeout twice (~5s each) before eventually resolving. Subsequent requests were fast due to caching.
Root Cause
The DNS resolver was querying nameservers sequentially, waiting for the full query timeout (5s) on each failure before trying the next server:
When internal nameservers were slow or unreachable from within Docker, each one consumed the full timeout before moving on.
Solution
Implemented staggered concurrent queries - a common pattern used by production DNS resolvers (including Go's built-in
net.Resolver):staggerDelay), if no response yet, fire the next nameserver in parallelResult:
Before: 10-15+ seconds worst case
After: <300ms if any nameserver responds quickly, graceful fallback if not
Changes
staggerDelayconstant (150ms) for the stagger intervalnsResponsestruct to track concurrent query resultsforwardQuery()(UDP) with staggered query logicforwardQueryTCP()with same staggered query logicTesting
Summary by CodeRabbit
Improvements
Tests
✏️ Tip: You can customize this high-level summary in your review settings.