fix(dns): implement staggered concurrent queries to reduce resolution latency #145

jhaynie · 2025-12-23T17:28:26Z

Problem

The DNS resolver was experiencing significant delays (10-15+ seconds) when resolving domains, particularly in Docker environments. This was reliably reproducible with internal domains like local-local.machine.agentuity.internal.

Symptoms observed:

/ # dig local-local.machine.agentuity.internal -t AAAA
;; communications error to 127.0.0.1#53: timed out
;; communications error to 127.0.0.1#53: timed out

;; Query time: 5003 msec

The query would timeout twice (~5s each) before eventually resolving. Subsequent requests were fast due to caching.

Root Cause

The DNS resolver was querying nameservers sequentially, waiting for the full query timeout (5s) on each failure before trying the next server:

ns0.agentuity.com (timeout 5s) → ns1.agentuity.com (timeout 5s) → ns2.agentuity.com (responds) = ~10s+

When internal nameservers were slow or unreachable from within Docker, each one consumed the full timeout before moving on.

Solution

Implemented staggered concurrent queries - a common pattern used by production DNS resolvers (including Go's built-in net.Resolver):

Query first nameserver immediately
After 150ms (staggerDelay), if no response yet, fire the next nameserver in parallel
Continue staggering until all nameservers are queried or a valid response arrives
Use the first successful response
On SERVFAIL/REFUSED/connection error, immediately trigger the next nameserver (no 150ms wait)

Result:

t=0ms:   query ns0
t=150ms: if no response, also query ns1  
t=300ms: if no response, also query ns2
         → first successful response wins

Before: 10-15+ seconds worst case
After: <300ms if any nameserver responds quickly, graceful fallback if not

Changes

Added staggerDelay constant (150ms) for the stagger interval
Added nsResponse struct to track concurrent query results
Rewrote forwardQuery() (UDP) with staggered query logic
Rewrote forwardQueryTCP() with same staggered query logic
Updated tests to handle concurrent behavior with proper synchronization

Testing

All existing DNS tests pass
Nameserver fallback tests updated and passing
Ready for manual testing in Docker environment

Summary by CodeRabbit

Improvements
- DNS queries now resolve faster through concurrent nameserver attempts
- Enhanced fallback logic and timeout handling for more reliable resolution
- Optimized caching mechanism with improved TTL management
Tests
- Fixed race conditions in DNS resolver tests to ensure stability

_{✏️ Tip: You can customize this high-level summary in your review settings.}

… latency Previously, the DNS resolver queried nameservers sequentially, waiting for the full timeout (5s) on each failure before trying the next server. With 3 internal nameservers, this caused 10-15s delays when initial servers were slow or unreachable (common in Docker environments). This change implements staggered concurrent queries: - Query first nameserver immediately - After 150ms, if no response, fire next nameserver in parallel - Continue staggering until all nameservers queried or valid response arrives - Use first successful response - On SERVFAIL/REFUSED/error, immediately trigger next nameserver (no wait) Before: ns0 timeout (5s) -> ns1 timeout (5s) -> ns2 responds = ~10s+ After: ns0 -> +150ms ns1 -> +150ms ns2 -> first response wins = <300ms Applied to both UDP (forwardQuery) and TCP (forwardQueryTCP) paths.

coderabbitai · 2025-12-23T17:28:37Z

Walkthrough

The pull request introduces staggered concurrent DNS queries that replace sequential nameserver querying with asynchronous result channels and first-success-wins logic. Timeout-aware context handling is added to abort when query deadlines expire, while cache mechanics and error handling pathways are retained with updated TTL management.

Changes

Cohort / File(s)	Summary
Formatting adjustments `dns/resolv.go`	Whitespace and formatting changes around GetSystemNameservers; no logic or behavior changes.
Staggered concurrent DNS query engine `dns/server.go`	Replaces sequential per-nameserver querying with staggered concurrent queries. Introduces `nsResponse` type for result propagation, async result channels, and stagger timer logic. Adds timeout-aware context handling, updates TCP/UDP forwarding paths, maintains cache mechanics with revised TTL handling, preserves CNAME recursion, and refactors error paths for timeout/failure scenarios.
Test synchronization `dns/server_test.go`	Adds mutex-based synchronization to guard shared state (callCount, secondCalled) in tests to prevent race conditions. Includes minor field alignment adjustments in test struct definitions and a stagger timer timing comment.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Server
    participant NSPool as Nameserver<br/>Pool
    participant ResultChan as Result<br/>Channel
    participant Cache
    participant Timeout as Context<br/>Timeout

    Client->>Server: DNS Query
    Server->>Cache: Check cache
    alt Cache Hit
        Cache-->>Client: Cached response
    else Cache Miss
        Server->>Server: Start stagger timer
        Server->>NSPool: Query NS[0] immediately
        rect rgb(200, 220, 255)
            Note over Server,NSPool: Concurrent staggered queries
            Server->>Timeout: Set deadline context
            NSPool->>ResultChan: NS[0] result (response/error)
        end
        par Stagger Wait
            Server->>Server: Wait stagger delay
            Server->>NSPool: Query NS[1]
            NSPool->>ResultChan: NS[1] result
        and Collect Results
            ResultChan->>Server: First satisfactory result
        end
        alt Timeout expires
            Timeout->>Server: Context deadline exceeded
            Server->>Server: Send error response (SERVFAIL)
        else Success received
            Server->>Cache: Cache response + TTL
            Server->>Client: Return response
        else All queries fail
            Server->>Server: Send error response
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 With staggered hops and channels bright,
Concurrent queries race through the night,
First winner takes the prize so fast,
No more waiting for the last!
The timeout keeps us from delays,
DNS flows in swifter ways. 🌩️

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and accurately summarizes the main change: implementing staggered concurrent queries for DNS resolution. It is concise, specific, and clearly conveys the primary improvement.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch fix/dns-staggered-queries

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

dns/server_test.go (1)
1505-1571: Data race: callCount accessed without synchronization.

With the new staggered concurrent queries, mockDialer may be invoked from multiple goroutines simultaneously. The callCount++ increment on line 1509 and the read on line 1570 are not protected, creating a data race.

Apply the same mutex pattern used in TestDNSResolver_NameserverFallback:
🔎 Proposed fix
 func TestDNSResolver_NameserverFallback_ConnectionError(t *testing.T) {
 	testLogger := logger.NewTestLogger()

+	var mu sync.Mutex
 	var callCount int

 	// Create a mock dialer where first nameserver fails with connection error
 	mockDialer := func(ctx context.Context, network, address string) (net.Conn, error) {
+		mu.Lock()
 		callCount++
+		mu.Unlock()
 
 		if address == "ns1.test:53" {
And when reading:
 	// Both nameservers should have been called
+	mu.Lock()
+	gotCallCount := callCount
+	mu.Unlock()
-	if callCount != 2 {
-		t.Errorf("Expected 2 nameserver calls, got %d", callCount)
+	if gotCallCount != 2 {
+		t.Errorf("Expected 2 nameserver calls, got %d", gotCallCount)
 	}

🧹 Nitpick comments (1)

dns/server.go (1)
451-557: Well-designed staggered query implementation with one robustness consideration.

The staggered concurrent query approach effectively addresses the latency problem. The buffered channel prevents goroutine leaks, and immediate fallback on SERVFAIL/REFUSED is correct.

Timer Reset pattern: Calling staggerTimer.Reset(0) (lines 504, 512, 524) without draining when the timer may have already fired can cause a stale value on the channel. On Go < 1.23, this could lead to spurious timer events. While the impact is minor (at worst, an extra nameserver is queried early), consider the defensive pattern for robustness:
if !staggerTimer.Stop() {
    select {
    case <-staggerTimer.C:
    default:
    }
}
staggerTimer.Reset(0)
This is optional if targeting Go 1.23+ where timer semantics were improved.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 545750e and 3e89a2a.

📒 Files selected for processing (3)

dns/resolv.go
dns/server.go
dns/server_test.go

🧰 Additional context used

🧬 Code graph analysis (1)

dns/resolv.go (1)

dns/dns.go (1)

DefaultExternalDNSServers (45-49)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: build
GitHub Check: Analyze (go)

🔇 Additional comments (6)

dns/resolv.go (1)

114-139: LGTM - Formatting-only changes.

The whitespace adjustments in GetSystemNameservers have no functional impact. The logic remains correct.

dns/server_test.go (2)

7-7: LGTM - Required import for test synchronization.

1405-1479: LGTM - Proper mutex synchronization for concurrent test access.

The mutex correctly guards callCount and secondCalled which are now accessed by multiple goroutines due to the staggered query implementation. Reading into gotSecondCalled before unlock is the correct pattern.

dns/server.go (3)

22-22: LGTM - Reasonable stagger delay.

150ms provides good balance between reducing latency when first nameserver is slow and avoiding unnecessary parallel queries when it responds quickly.

650-655: LGTM - Clean result container for concurrent queries.

657-788: LGTM - Consistent staggered implementation for UDP path.

Same well-designed pattern as forwardQueryTCP. The same optional timer robustness improvement mentioned above applies here (lines 717, 725, 738).

The CNAME resolution correctly preserves original headers and merges answer records. Error responses are properly sent when all nameservers fail.

jhaynie requested a review from robindiddams December 23, 2025 17:28

lint the files

3e89a2a

coderabbitai bot reviewed Dec 23, 2025

View reviewed changes

jhaynie merged commit e85104a into main Dec 23, 2025
5 checks passed

jhaynie deleted the fix/dns-staggered-queries branch December 23, 2025 19:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(dns): implement staggered concurrent queries to reduce resolution latency #145

fix(dns): implement staggered concurrent queries to reduce resolution latency #145

jhaynie commented Dec 23, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Dec 23, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix(dns): implement staggered concurrent queries to reduce resolution latency #145

fix(dns): implement staggered concurrent queries to reduce resolution latency #145

Conversation

jhaynie commented Dec 23, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root Cause

Solution

Changes

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jhaynie commented Dec 23, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 23, 2025 •

edited

Loading