@jhaynie jhaynie commented Dec 23, 2025

Problem

The DNS resolver was experiencing significant delays (10-15+ seconds) when resolving domains, particularly in Docker environments. This was reliably reproducible with internal domains like local-local.machine.agentuity.internal.

Symptoms observed:

/ # dig local-local.machine.agentuity.internal -t AAAA
;; communications error to 127.0.0.1#53: timed out
;; communications error to 127.0.0.1#53: timed out

;; Query time: 5003 msec

The query would time out twice (~5s each) before eventually resolving. Subsequent requests were fast due to caching.

Root Cause

The DNS resolver was querying nameservers sequentially, waiting for the full query timeout (5s) on each failure before trying the next server:

ns0.agentuity.com (timeout 5s) → ns1.agentuity.com (timeout 5s) → ns2.agentuity.com (responds) = ~10s+

When internal nameservers were slow or unreachable from within Docker, each one consumed the full timeout before moving on.
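To make the root cause concrete, here is a minimal sketch of sequential fallback. It is not the resolver's actual code: `sequentialQuery`, `queryFunc`, `slowThenFast`, and `elapsedTimeouts` are hypothetical names, and a 50ms timeout stands in for the real 5s so the demo finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// queryFunc is a hypothetical stand-in for one DNS exchange with one server.
type queryFunc func(ctx context.Context, server string) (string, error)

// sequentialQuery tries each server in order and gives each one the full
// timeout before moving on, which is the behavior described above.
func sequentialQuery(servers []string, timeout time.Duration, query queryFunc) (string, error) {
	lastErr := errors.New("no nameservers configured")
	for _, server := range servers {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		answer, err := query(ctx, server)
		cancel()
		if err == nil {
			return answer, nil
		}
		lastErr = err
	}
	return "", lastErr
}

// slowThenFast simulates ns0 and ns1 hanging until the deadline and any
// other server answering at once.
func slowThenFast(ctx context.Context, server string) (string, error) {
	if server == "ns0" || server == "ns1" {
		<-ctx.Done()
		return "", ctx.Err()
	}
	return "answer from " + server, nil
}

// elapsedTimeouts reports how many whole timeouts passed before an answer arrived.
func elapsedTimeouts(timeout time.Duration) int {
	start := time.Now()
	_, _ = sequentialQuery([]string{"ns0", "ns1", "ns2"}, timeout, slowThenFast)
	return int(time.Since(start) / timeout)
}

func main() {
	// At least 2: both hung servers consumed a full timeout each.
	fmt.Println(elapsedTimeouts(50 * time.Millisecond))
}
```

With the real 5s timeout the same loop spends about 10s before it even asks the third nameserver.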

Solution

Implemented staggered concurrent queries, a hedging pattern comparable to the RFC 6555 fast-fallback (Happy Eyeballs) racing that Go's net.Dialer applies to connection attempts via FallbackDelay:

  1. Query first nameserver immediately
  2. After 150ms (staggerDelay), if no response yet, fire the next nameserver in parallel
  3. Continue staggering until all nameservers are queried or a valid response arrives
  4. Use the first successful response
  5. On SERVFAIL/REFUSED/connection error, immediately trigger the next nameserver (no 150ms wait)
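The five steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in dns/server.go: `staggeredQuery`, `queryFunc`, `fakeQuery`, and `runDemo` are hypothetical, and the `nsResponse` fields are guessed. Only the 150ms `staggerDelay` and the first-success-wins behavior come from the description.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// staggerDelay matches the 150ms interval named in the description.
const staggerDelay = 150 * time.Millisecond

// nsResponse carries one nameserver's outcome (field names are assumptions).
type nsResponse struct {
	answer string
	err    error
}

// queryFunc is a hypothetical stand-in for one UDP or TCP exchange.
type queryFunc func(ctx context.Context, server string) (string, error)

// staggeredQuery queries servers[0] at once, starts the next server when the
// stagger timer fires or an in-flight query fails, and returns the first success.
func staggeredQuery(ctx context.Context, servers []string, query queryFunc) (string, error) {
	if len(servers) == 0 {
		return "", errors.New("no nameservers configured")
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels queries still in flight once we return

	// Buffered so goroutines that finish after we return never block.
	results := make(chan nsResponse, len(servers))
	next, pending := 0, 0
	launch := func() {
		server := servers[next]
		next++
		pending++
		go func() {
			answer, err := query(ctx, server)
			results <- nsResponse{answer: answer, err: err}
		}()
	}

	launch()
	timer := time.NewTimer(staggerDelay)
	defer timer.Stop()

	for {
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-timer.C:
			if next < len(servers) {
				launch()
				timer.Reset(staggerDelay)
			}
		case r := <-results:
			pending--
			if r.err == nil {
				return r.answer, nil // first success wins
			}
			if next < len(servers) {
				launch() // failure: skip the stagger wait
				if !timer.Stop() {
					select { // drain a tick that already fired (needed before Go 1.23)
					case <-timer.C:
					default:
					}
				}
				timer.Reset(staggerDelay)
			} else if pending == 0 {
				return "", r.err // every nameserver failed
			}
		}
	}
}

// fakeQuery simulates ns0 hanging, ns1 failing fast, and ns2 answering.
func fakeQuery(ctx context.Context, server string) (string, error) {
	switch server {
	case "ns0":
		<-ctx.Done()
		return "", ctx.Err()
	case "ns1":
		return "", errors.New("SERVFAIL")
	default:
		return "answer from " + server, nil
	}
}

// runDemo returns the winning answer for the simulated nameservers.
func runDemo() (string, error) {
	return staggeredQuery(context.Background(), []string{"ns0", "ns1", "ns2"}, fakeQuery)
}

func main() {
	start := time.Now()
	answer, err := runDemo()
	fmt.Println(answer, err, time.Since(start).Round(50*time.Millisecond))
}
```

In the simulated run ns1 is started when the first stagger interval elapses, fails at once, and ns2 is started without a second wait, so the answer arrives after roughly one stagger interval.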

Result:

t=0ms:   query ns0
t=150ms: if no response, also query ns1  
t=300ms: if no response, also query ns2
         → first successful response wins

Before: 10-15+ seconds worst case
After: roughly 300ms plus one round trip even when only the third nameserver responds, graceful fallback if none do
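The before and after figures follow from simple arithmetic. The sketch below assumes two silent nameservers ahead of a responsive one and a 20ms round trip, which is an illustrative value and not a measurement. Both helper functions are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// sequentialLatency: every silent server burns the full timeout before the
// responsive server is asked.
func sequentialLatency(silent int, timeout, rtt time.Duration) time.Duration {
	return time.Duration(silent)*timeout + rtt
}

// staggeredLatency: the responsive server is started after one stagger
// interval per silent server ahead of it.
func staggeredLatency(silent int, stagger, rtt time.Duration) time.Duration {
	return time.Duration(silent)*stagger + rtt
}

func main() {
	rtt := 20 * time.Millisecond // assumed round trip, for illustration only

	fmt.Println(sequentialLatency(2, 5*time.Second, rtt))       // prints 10.02s
	fmt.Println(staggeredLatency(2, 150*time.Millisecond, rtt)) // prints 320ms
}
```

When an earlier nameserver fails fast with SERVFAIL, REFUSED, or a connection error, its stagger interval is skipped, so the staggered figure is an upper bound for this scenario.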

Changes

  • Added staggerDelay constant (150ms) for the stagger interval
  • Added nsResponse struct to track concurrent query results
  • Rewrote forwardQuery() (UDP) with staggered query logic
  • Rewrote forwardQueryTCP() with same staggered query logic
  • Updated tests to handle concurrent behavior with proper synchronization

Testing

  • All existing DNS tests pass
  • Nameserver fallback tests updated and passing
  • Ready for manual testing in Docker environment

Summary by CodeRabbit

  • Improvements

    • DNS queries now resolve faster through concurrent nameserver attempts
    • Enhanced fallback logic and timeout handling for more reliable resolution
    • Optimized caching mechanism with improved TTL management
  • Tests

    • Fixed race conditions in DNS resolver tests to ensure stability

… latency

Previously, the DNS resolver queried nameservers sequentially, waiting for
the full timeout (5s) on each failure before trying the next server. With
3 internal nameservers, this caused 10-15s delays when initial servers
were slow or unreachable (common in Docker environments).

This change implements staggered concurrent queries:
- Query first nameserver immediately
- After 150ms, if no response, fire next nameserver in parallel
- Continue staggering until all nameservers queried or valid response arrives
- Use first successful response
- On SERVFAIL/REFUSED/error, immediately trigger next nameserver (no wait)

Before: ns0 timeout (5s) -> ns1 timeout (5s) -> ns2 responds = ~10s+
After:  ns0 -> +150ms ns1 -> +150ms ns2 -> first response wins = <300ms

Applied to both UDP (forwardQuery) and TCP (forwardQueryTCP) paths.

coderabbitai bot commented Dec 23, 2025

Walkthrough

The pull request introduces staggered concurrent DNS queries that replace sequential nameserver querying with asynchronous result channels and first-success-wins logic. Timeout-aware context handling is added to abort when query deadlines expire, while cache mechanics and error handling pathways are retained with updated TTL management.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Formatting adjustments: dns/resolv.go | Whitespace and formatting changes around GetSystemNameservers; no logic or behavior changes. |
| Staggered concurrent DNS query engine: dns/server.go | Replaces sequential per-nameserver querying with staggered concurrent queries. Introduces nsResponse type for result propagation, async result channels, and stagger timer logic. Adds timeout-aware context handling, updates TCP/UDP forwarding paths, maintains cache mechanics with revised TTL handling, preserves CNAME recursion, and refactors error paths for timeout/failure scenarios. |
| Test synchronization: dns/server_test.go | Adds mutex-based synchronization to guard shared state (callCount, secondCalled) in tests to prevent race conditions. Includes minor field alignment adjustments in test struct definitions and a stagger timer timing comment. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Server
    participant NSPool as Nameserver<br/>Pool
    participant ResultChan as Result<br/>Channel
    participant Cache
    participant Timeout as Context<br/>Timeout

    Client->>Server: DNS Query
    Server->>Cache: Check cache
    alt Cache Hit
        Cache-->>Client: Cached response
    else Cache Miss
        Server->>Server: Start stagger timer
        Server->>NSPool: Query NS[0] immediately
        rect rgb(200, 220, 255)
            Note over Server,NSPool: Concurrent staggered queries
            Server->>Timeout: Set deadline context
            NSPool->>ResultChan: NS[0] result (response/error)
        end
        par Stagger Wait
            Server->>Server: Wait stagger delay
            Server->>NSPool: Query NS[1]
            NSPool->>ResultChan: NS[1] result
        and Collect Results
            ResultChan->>Server: First satisfactory result
        end
        alt Timeout expires
            Timeout->>Server: Context deadline exceeded
            Server->>Server: Send error response (SERVFAIL)
        else Success received
            Server->>Cache: Cache response + TTL
            Server->>Client: Return response
        else All queries fail
            Server->>Server: Send error response
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 With staggered hops and channels bright,
Concurrent queries race through the night,
First winner takes the prize so fast,
No more waiting for the last!
The timeout keeps us from delays,
DNS flows in swifter ways. 🌩️

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and accurately summarizes the main change: implementing staggered concurrent queries for DNS resolution. It is concise, specific, and clearly conveys the primary improvement. |

@jhaynie jhaynie requested a review from robindiddams December 23, 2025 17:28
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dns/server_test.go (1)

1505-1571: Data race: callCount accessed without synchronization.

With the new staggered concurrent queries, mockDialer may be invoked from multiple goroutines simultaneously. The callCount++ increment on line 1509 and the read on line 1570 are not protected, creating a data race.

Apply the same mutex pattern used in TestDNSResolver_NameserverFallback:

🔎 Proposed fix
 func TestDNSResolver_NameserverFallback_ConnectionError(t *testing.T) {
 	testLogger := logger.NewTestLogger()

+	var mu sync.Mutex
 	var callCount int

 	// Create a mock dialer where first nameserver fails with connection error
 	mockDialer := func(ctx context.Context, network, address string) (net.Conn, error) {
+		mu.Lock()
 		callCount++
+		mu.Unlock()
 
 		if address == "ns1.test:53" {

And when reading:

 	// Both nameservers should have been called
+	mu.Lock()
+	gotCallCount := callCount
+	mu.Unlock()
-	if callCount != 2 {
-		t.Errorf("Expected 2 nameserver calls, got %d", callCount)
+	if gotCallCount != 2 {
+		t.Errorf("Expected 2 nameserver calls, got %d", gotCallCount)
 	}
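The guard in the proposed fix can be seen in isolation in the sketch below. It is a standalone illustration, not part of dns/server_test.go: `callCounter` and `countConcurrentCalls` are hypothetical names, and the goroutines stand in for the mock dialer being invoked concurrently by the staggered queries.

```go
package main

import (
	"fmt"
	"sync"
)

// callCounter wraps the mutex pattern: every access to n holds mu.
type callCounter struct {
	mu sync.Mutex
	n  int
}

// inc records one call under the lock.
func (c *callCounter) inc() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}

// get copies the count out under the lock so the caller compares a snapshot.
func (c *callCounter) get() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// countConcurrentCalls simulates a mock dialer hit from several goroutines.
func countConcurrentCalls(workers int) int {
	var c callCounter
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.inc()
		}()
	}
	wg.Wait()
	return c.get()
}

func main() {
	fmt.Println(countConcurrentCalls(2)) // prints 2
}
```

Running the tests with `go test -race` is the usual way to confirm that an unguarded counter like the original `callCount` is flagged and that the guarded version is clean.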
🧹 Nitpick comments (1)
dns/server.go (1)

451-557: Well-designed staggered query implementation with one robustness consideration.

The staggered concurrent query approach effectively addresses the latency problem. The buffered channel prevents goroutine leaks, and immediate fallback on SERVFAIL/REFUSED is correct.

Timer Reset pattern: Calling staggerTimer.Reset(0) (lines 504, 512, 524) without draining when the timer may have already fired can cause a stale value on the channel. On Go < 1.23, this could lead to spurious timer events. While the impact is minor (at worst, an extra nameserver is queried early), consider the defensive pattern for robustness:

if !staggerTimer.Stop() {
    select {
    case <-staggerTimer.C:
    default:
    }
}
staggerTimer.Reset(0)

This is optional if targeting Go 1.23+ where timer semantics were improved.
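One way to apply the defensive pattern at all three call sites is a small helper. The sketch below is an assumption about how that could look: `resetTimer` and `staleTickAfterReset` are hypothetical names and do not appear in the PR.

```go
package main

import (
	"fmt"
	"time"
)

// resetTimer stops the timer, drains a tick that fired but was never read,
// and then re-arms it. The drain matters before Go 1.23, where an unread
// tick stays buffered in t.C across Reset.
func resetTimer(t *time.Timer, d time.Duration) {
	if !t.Stop() {
		select {
		case <-t.C:
		default:
		}
	}
	t.Reset(d)
}

// staleTickAfterReset lets a timer fire unread, re-arms it for a long delay,
// and reports whether an old tick is still readable afterwards.
func staleTickAfterReset() bool {
	t := time.NewTimer(time.Millisecond)
	defer t.Stop()
	time.Sleep(20 * time.Millisecond) // the timer has fired by now
	resetTimer(t, time.Hour)
	select {
	case <-t.C:
		return true // a stale tick leaked through
	default:
		return false
	}
}

func main() {
	fmt.Println(staleTickAfterReset()) // prints false
}
```

In the resolver the call would be `resetTimer(staggerTimer, 0)` in place of each bare `staggerTimer.Reset(0)`, assuming the timer is only touched from the goroutine that also reads its channel.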

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 545750e and 3e89a2a.

📒 Files selected for processing (3)
  • dns/resolv.go
  • dns/server.go
  • dns/server_test.go
🧰 Additional context used
🧬 Code graph analysis (1)
dns/resolv.go (1)
dns/dns.go (1)
  • DefaultExternalDNSServers (45-49)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: Analyze (go)
🔇 Additional comments (6)
dns/resolv.go (1)

114-139: LGTM - Formatting-only changes.

The whitespace adjustments in GetSystemNameservers have no functional impact. The logic remains correct.

dns/server_test.go (2)

7-7: LGTM - Required import for test synchronization.


1405-1479: LGTM - Proper mutex synchronization for concurrent test access.

The mutex correctly guards callCount and secondCalled which are now accessed by multiple goroutines due to the staggered query implementation. Reading into gotSecondCalled before unlock is the correct pattern.

dns/server.go (3)

22-22: LGTM - Reasonable stagger delay.

150ms provides good balance between reducing latency when first nameserver is slow and avoiding unnecessary parallel queries when it responds quickly.


650-655: LGTM - Clean result container for concurrent queries.


657-788: LGTM - Consistent staggered implementation for UDP path.

Same well-designed pattern as forwardQueryTCP. The same optional timer robustness improvement mentioned above applies here (lines 717, 725, 738).

The CNAME resolution correctly preserves original headers and merges answer records. Error responses are properly sent when all nameservers fail.

@jhaynie jhaynie merged commit e85104a into main Dec 23, 2025
5 checks passed
@jhaynie jhaynie deleted the fix/dns-staggered-queries branch December 23, 2025 19:31