
Conversation

@bbednarski9 (Contributor) commented Jan 25, 2026

Description

This PR brings the ai-dynamo Prometheus metrics API integration, together with Grafana dashboarding, into NAT. This is a core feature of the Dynamo optimization pipeline for the NAT/Dynamo integration. We also integrate the Prometheus API with the NAT profiler, so that we can begin end-to-end agentic header-injection optimization.
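
As a rough sketch of what the profiler-side Prometheus collection amounts to (the helper below is illustrative, not the PR implementation; the `/api/v1/query` endpoint is the standard Prometheus HTTP API, and `dynamo_frontend_requests_total` appears in the monitoring docs in this PR):

```python
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumption: Prometheus at its default local port

def query_prometheus(promql: str) -> list[dict]:
    """Run an instant query against the standard Prometheus HTTP API."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,
    )
    resp.raise_for_status()
    payload = resp.json()
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return payload["data"]["result"]

# Example: frontend request rate over the last minute.
for series in query_prometheus("rate(dynamo_frontend_requests_total[1m])"):
    print(series["metric"], series["value"])
```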
Closes

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Prometheus + Grafana monitoring stack, dashboards, recording rules and metrics collector for Dynamo
    • Optimized Thompson Sampling router (and processor) with Prometheus instrumentation
    • Multi-worker orchestration, MultiLRU tooling, KV event observer and demo scripts
    • Profiler integration: Dynamo metrics collection and convenience APIs
  • Documentation

    • Detailed monitoring, architecture, parameters, and deployment guides
  • Chores

    • Default HTTP port changed from 8099 to 8000; updated start/stop scripts and examples


Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
@bbednarski9 bbednarski9 requested a review from a team as a code owner January 25, 2026 20:37
@bbednarski9 bbednarski9 added improvement Improvement to existing functionality non-breaking Non-breaking change labels Jan 25, 2026
@coderabbitai bot commented Jan 25, 2026

Warning

Rate limit exceeded

@bbednarski9 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 23 minutes and 30 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

Walkthrough

Adds an optimized Thompson Sampling router and processors for Dynamo with dynamic discovery; extensive Prometheus/Grafana monitoring, dashboards, and metric collection tooling; multi-worker startup/stop/orchestration scripts and image-build tooling; NAT profiler integration for Dynamo metrics; transport-level LLM hint injection; numerous docs and demo utilities.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Documentation & Architecture**<br>`docs/source/components/integrations/frameworks.md`, `external/dynamo/optimized/ARCHITECTURE.md`, `external/dynamo/optimized/README.md`, `external/dynamo/optimized/PARAMETERS.md`, `external/dynamo/monitoring/README.md` | Updated framework link and added comprehensive architecture, parameters, README, and monitoring operation docs for the optimized Thompson Sampling router and Dynamo monitoring stack. |
| **Router Implementations**<br>`external/dynamo/optimized/router.py`, `external/dynamo/optimized/router_multilru.py` | New WorkloadAwareRouter implementations (contextual LinTS/Thompson Sampling, per-prefix reuse, learning, Prometheus metrics, CLI/config overrides, endpoints for find_worker/feedback). |
| **Processor Implementations**<br>`external/dynamo/optimized/processor.py`, `external/dynamo/optimized/processor_multilru.py`, `external/dynamo/generalized/processor.py` | New processor modules that register as `dynamo.backend.generate`, extract nvext annotations, perform worker selection via the router, stream responses, and emit KV-efficiency metrics; removed explicit service creation in the generalized processor. |
| **Monitoring Stack & Dashboards**<br>`external/dynamo/monitoring/docker-compose.yml`, `external/dynamo/monitoring/prometheus.yml`, `external/dynamo/monitoring/grafana/provisioning/...`, `external/dynamo/monitoring/rules/vllm-aliases.yml`, `external/dynamo/monitoring/grafana/provisioning/dashboards/json/dynamo-overview.json` | Adds Prometheus + Grafana compose, Prometheus scrape config, Grafana provisioning (datasource, provider, JSON dashboard), and recording rules to harmonize vLLM metrics for dashboards. |
| **Metrics Collection & Observability Tools**<br>`external/dynamo/collect_metrics.sh`, `external/dynamo/monitoring/scripts/kv_event_observer.py`, `external/dynamo/monitoring/rules/vllm-aliases.yml` | New Bash metrics collector script, ZMQ-based KV event observer script, and Prometheus recording rules to surface vLLM metric aliases and collect/aggregate metrics. |
| **Profiler Integration**<br>`src/nat/profiler/inference_optimization/dynamo_metrics.py`, `src/nat/data_models/profiler.py`, `src/nat/profiler/profile_runner.py` | Adds Dynamo metrics collector, data models (DynamoCoreMetrics, DynamoMetricsResult), a ProfilerConfig.dynamo_metrics field, and profile runner integration to optionally collect/attach Dynamo Prometheus metrics to profiling results. |
| **LLM Client Transport**<br>`src/nat/llm/dynamo_llm.py` | Replaces httpx event-hook header injection with a custom transport that injects Dynamo routing hints into both HTTP headers and nvext.annotations in request bodies (see the sketch after this table). |
| **Examples & Configs**<br>`examples/dynamo_integration/.../profile_rethinking_full_test.yml`, `examples/.../eval_config_no_rethinking_minimal_test.yml` | Enables Dynamo Prometheus metrics in the profiler config, updates DynamoLLM base_url endpoints from :8099 to :8000, and reduces eval concurrency in one config. |
| **Environment & Build**<br>`external/dynamo/.env.example`, `external/dynamo/build_multi_lru_image.sh` | Expands the .env example with HF credentials and many active Dynamo tunables; adds a script to build a MultiLRU-enabled vLLM image. |
| **Startup / Orchestration / Lifecycle Scripts**<br>`external/dynamo/start_dynamo_unified.sh`, `external/dynamo/start_dynamo_disagg.sh`, `external/dynamo/start_dynamo_unified_thompson_hints.sh`, `external/dynamo/start_dynamo_optimized_thompson_hints_sglang.sh`, `external/dynamo/start_dynamo_optimized_thompson_hints_vllm.sh`, `external/dynamo/stop_dynamo.sh`, `external/dynamo/monitor_dynamo.sh` | New and updated start/stop scripts for unified/disaggregated/optimized deployments: multi-worker GPU allocation, per-worker ports, HTTP port default changed 8099→8000, extended timeouts, ETCD/NATS and monitoring orchestration, plus enhanced stop options for metrics. |
| **Metrics Demo & Eviction Demo**<br>`external/dynamo/demo_priority_eviction.sh` | Adds a demo script to exercise MultiLRU priority eviction behavior and measure cache-hit deltas. |
| **Grafana Provisioning Files**<br>`external/dynamo/monitoring/grafana/provisioning/dashboards/dashboards.yml`, `external/dynamo/monitoring/grafana/provisioning/datasources/datasources.yml` | Adds Grafana provisioning files to auto-load the Dynamo dashboards and Prometheus datasource. |
| **Misc: Pre-commit**<br>`.pre-commit-config.yaml` | Switches a local hook to run as a Python script and adds an nbconvert dependency for clearing notebook outputs. |
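
To illustrate the transport-level hint injection noted in the LLM Client Transport row, here is a minimal sketch of a wrapping httpx transport. The header name, annotation format, and class internals are assumptions for illustration, not the actual `_DynamoTransport` implementation, and the sketch assumes non-streaming JSON request bodies:

```python
import json

import httpx


class HintInjectingTransport(httpx.AsyncBaseTransport):
    """Sketch only: wrap a transport and inject routing hints (names assumed)."""

    def __init__(self, inner: httpx.AsyncBaseTransport, prefix_id: str):
        self._inner = inner
        self._prefix_id = prefix_id

    async def handle_async_request(self, request: httpx.Request) -> httpx.Response:
        # Inject the hint as an HTTP header (header name is an assumption).
        request.headers["x-dynamo-prefix-id"] = self._prefix_id
        # Mirror the hint into the JSON body under nvext.annotations.
        if request.headers.get("content-type", "").startswith("application/json"):
            body = json.loads(request.content or b"{}")
            body.setdefault("nvext", {}).setdefault("annotations", []).append(
                f"prefix_id:{self._prefix_id}")
            headers = httpx.Headers(request.headers)
            if "content-length" in headers:
                del headers["content-length"]  # recomputed for the new body
            request = httpx.Request(request.method, request.url,
                                    headers=headers, content=json.dumps(body).encode())
        return await self._inner.handle_async_request(request)


# Usage sketch:
# client = httpx.AsyncClient(
#     transport=HintInjectingTransport(httpx.AsyncHTTPTransport(), "nat-dynamo-abc123"))
```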

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Frontend
    participant Processor
    participant Router
    participant Worker
    participant Prometheus

    Client->>Frontend: HTTP generate request (headers / nvext.annotations)
    Frontend->>Processor: route to backend.generate
    Processor->>Processor: extract hints (prefix_id, reuse_budget, osl, iat)
    Processor->>Router: find_worker(tokens, prefix_id, reuse_budget)
    Router->>Prometheus: record decision metric
    Router-->>Processor: worker_id, decision_id
    Processor->>Worker: forward/stream generate request
    Worker-->>Processor: stream response chunks
    Processor->>Prometheus: update latency/token metrics
    Processor->>Prometheus: async update KVE metrics
    Processor-->>Frontend: stream response back
    Frontend-->>Client: deliver response
    Processor->>Router: feedback(decision_id, latency, success)
    Router->>Prometheus: record feedback / update baselines
```
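
To make the find_worker/feedback loop in the diagram concrete, here is a deliberately minimal Beta-Bernoulli Thompson Sampling sketch. The router in this PR uses contextual LinTS with per-prefix state and Prometheus instrumentation, so treat this only as an illustration of the sampling idea, with all names assumed:

```python
import random

class MinimalThompsonRouter:
    """Illustrative only: Beta-Bernoulli Thompson Sampling over a fixed worker set."""

    def __init__(self, worker_ids: list[str]):
        # One (alpha, beta) posterior per worker, starting from a uniform prior.
        self.stats = {w: [1.0, 1.0] for w in worker_ids}

    def find_worker(self) -> str:
        # Sample a plausible success rate for each worker and route to the argmax.
        samples = {w: random.betavariate(a, b) for w, (a, b) in self.stats.items()}
        return max(samples, key=samples.get)

    def feedback(self, worker_id: str, success: bool) -> None:
        # Posterior update for the worker that served the request.
        self.stats[worker_id][0 if success else 1] += 1.0

router = MinimalThompsonRouter(["worker-0", "worker-1"])
worker = router.find_worker()
router.feedback(worker, success=True)
```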

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❌ Error | Pull request title exceeds the recommended 72-character limit by 6 characters (78 total). While the title is descriptive and uses imperative mood, it violates the conciseness requirement. | Consider shortening the title to fit within 72 characters, for example: 'feature: dynamo profiler integration with prometheus/grafana dashboard' or 'feature: integrate dynamo metrics into nat profiler' |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 54.55%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
@coderabbitai bot left a comment

Actionable comments posted: 11

Note

Due to the large number of review comments, Critical- and Major-severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
external/dynamo/start_dynamo_unified.sh (1)

17-25: Update the header comment to match the new multi‑worker defaults.

The header still mentions GPUs 0‑3 with TP=4, but defaults are now 0‑7 and TP=2 with multiple workers. Please refresh the description. As per coding guidelines.

📝 Example update
-#   - Unified Worker (GPUs 0,1,2,3, TP=4, no disaggregation)
+#   - Unified Workers (default GPUs 0-7, TP=2 per worker, no disaggregation)
🤖 Fix all issues with AI agents
In `@external/dynamo/collect_metrics.sh`:
- Around line 27-41: The curl calls that write to "$frontend_file" and
"$worker_file" should use --fail and --max-time (e.g., --max-time 10) and check
curl's exit status so failures don't create empty/partial files; fetch into temp
files, verify each curl succeeded, log an error or warning to stderr with
context ("frontend metrics" / "worker metrics") and either exit 1 or skip
inclusion into "$combined_file" based on desired behavior, then atomically move
the temp files to "$frontend_file" and "$worker_file" before building
"$combined_file".

In
`@external/dynamo/monitoring/grafana/provisioning/dashboards/json/dynamo-overview.json`:
- Around line 1-4: Add the SPDX Apache-2.0 header object according to the repo’s
JSON convention immediately before the existing dashboard content (i.e., before
the top-level "annotations" key); update the JSON to include the standard SPDX
fields (e.g., "spdx": { "license": "Apache-2.0", "copyright": "...", ... })
following the repository’s header schema so the file begins with the SPDX header
object then the existing dashboard object with "annotations".

In `@external/dynamo/monitoring/README.md`:
- Around line 1-3: Add the SPDX Apache-2.0 license header as the very first line
of the README before the "# Dynamo Monitoring Stack" title; update the README.md
for the "Dynamo Monitoring Stack" document to include the exact SPDX identifier
"SPDX-License-Identifier: Apache-2.0" on its own line at the top of the file.

In `@external/dynamo/optimized/ARCHITECTURE.md`:
- Around line 1-2: The ARCHITECTURE.md file titled "Optimized Thompson Sampling
Router Architecture" is missing the required SPDX Apache-2.0 header; add the
standard SPDX header comment (e.g., "SPDX-License-Identifier: Apache-2.0") at
the very top of the file before any content so the document complies with
licensing guidelines and linter checks.

In `@external/dynamo/optimized/PARAMETERS.md`:
- Around line 1-3: Add the standard SPDX Apache-2.0 header to the top of
PARAMETERS.md to comply with licensing guidelines; insert the SPDX identifier
line "SPDX-License-Identifier: Apache-2.0" (and optional year/owner comment
above or beside it per project convention) as the first lines of the file so the
new Markdown doc describing WorkloadAwareRouter in router.py includes the
required license header.

In `@external/dynamo/optimized/processor.py`:
- Around line 718-745: Add a return type annotation to parse_args() (->
argparse.Namespace) and add a new CLI argument "--kv-cache-block-size"
(type=int) that defaults from the DYNAMO_KV_BLOCK_SIZE environment variable or
64 when unset; ensure parse_args returns the Namespace with this value and then
thread that value into the call to register_llm(...) (or wherever the model/card
is registered) so register_llm receives kv_cache_block_size instead of the
hard-coded 1, keeping the argument name kv_cache_block_size consistent with
existing worker/frontend usage.

In `@external/dynamo/optimized/README.md`:
- Around line 1-3: Add the SPDX license header to the top of the README so the
file begins with the standard SPDX identifier before the existing title line "#
Optimized Thompson Sampling Router Architecture"; insert a single-line header
"SPDX-License-Identifier: Apache-2.0" (with no other characters) as the first
line of the file so the README follows the repository's Markdown licensing
guideline.

In `@external/dynamo/start_dynamo_optimized_thompson_hints.sh`:
- Around line 284-301: The script currently uses read -p to prompt for HF_TOKEN
which can echo the secret to the terminal; change the interactive prompt to use
a silent read (e.g., read -s -r HF_TOKEN) so the token is not displayed, add a
newline/confirmation echo after the silent read to keep UX clean, and ensure the
later checks that set HF_TOKEN="dummy" or echo "✓ HuggingFace token received"
remain unchanged; update the prompt logic around the HF_TOKEN and
LOCAL_MODEL_DIR checks to use the silent read and not print the token.
- Around line 77-83: Validate GPU/TP sizing after computing NUM_GPUS and before
using NUM_WORKERS: check that TP_SIZE is a positive integer and that NUM_GPUS >=
TP_SIZE and that NUM_GPUS is divisible by TP_SIZE so NUM_WORKERS computed from
NUM_GPUS/TP_SIZE is > 0; if the check fails, print a clear error using
WORKER_GPUS, TP_SIZE and NUM_GPUS context and exit non‑zero to avoid later
indexing in wait_for_worker or using CONTAINER_GPU_INDICES when NUM_WORKERS ==
0. Ensure the validation runs right after NUM_WORKERS is assigned and prevents
further execution on bad configs.

In `@external/dynamo/start_dynamo_unified.sh`:
- Around line 53-55: NUM_WORKERS calculation can yield 0 or incorrect values
when TP_SIZE > NUM_GPUS or when NUM_GPUS is not divisible by TP_SIZE; add
validation after computing NUM_WORKERS (and before accessing WORKER_PIDS) to
check that NUM_GPUS and TP_SIZE are positive integers, that TP_SIZE <= NUM_GPUS,
and that NUM_GPUS % TP_SIZE == 0, and if any check fails print a clear error and
exit (use the same logging/exit pattern as the script). Also consider computing
NUM_WORKERS as an integer division only after validation to avoid truncation
surprises and reference NUM_WORKERS, NUM_GPUS, TP_SIZE and WORKER_PIDS in the
error message so the failure is easy to debug.

In `@src/nat/llm/dynamo_llm.py`:
- Around line 393-410: The code always generates a prefix ID even when
prefix_template is intended to disable hints; change the logic in the block that
sets unique_id/prefix_id so that if prefix_template is None you set prefix_id =
None (do not generate unique_id or call uuid.uuid4()), otherwise generate
unique_id and compute prefix_id = prefix_template.format(uuid=unique_id) (or
fallback f"nat-dynamo-{unique_id}" when prefix_template is truthy but empty),
update the logger to only log the created prefix when prefix_id is not None, and
pass that prefix_id (possibly None) into _DynamoTransport so the transport can
skip injecting hints when prefix_id is None.
🟡 Minor comments (19)
examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/profile_rethinking_full_test.yml-243-246 (1)

243-246: Update Prometheus scrape interval reference to match actual config.

The comment states Prometheus scrapes every 5s, but the Prometheus configuration in external/dynamo/monitoring/prometheus.yml defines scrape_interval: 2s. Update the minimum range calculation accordingly.

Proposed fix
-        # Minimum: 15s (Prometheus scrapes every 5s, need ≥3 points for reliable rates)
+        # Minimum: 6s (Prometheus scrapes every 2s, need ≥3 points for reliable rates)
external/dynamo/optimized/ARCHITECTURE.md-13-14 (1)

13-14: Replace inanimate possessive ’s in Markdown.

Phrases like “Frontend’s built-in router” and “Dynamo’s --model-name/--model-path” violate the doc style rule. Please rephrase (e.g., “the built-in router of the frontend”, “the Dynamo --model-name/--model-path”). As per coding guidelines, ...

✍️ Example rephrases
-3. **Frontend's built-in router becomes irrelevant**
+3. **The built-in router of the frontend becomes irrelevant**

-| Model Mapping | Custom `FRONTEND_MODEL_MAPPING` | Dynamo's `--model-name`/`--model-path` |
+| Model Mapping | Custom `FRONTEND_MODEL_MAPPING` | the Dynamo `--model-name`/`--model-path` |

-> default frontend, but this is **irrelevant** in our architecture. The frontend's built-in
+> default frontend, but this is **irrelevant** in our architecture. The built-in router of the frontend

Also applies to: 216-224, 360-363

external/dynamo/optimized/ARCHITECTURE.md-15-21 (1)

15-21: Add language identifiers to fenced code blocks (MD040).

Markdownlint flags these fenced blocks for missing language tags. Please add text, bash, or json as appropriate.

🧩 Example fix
-```
+```text
 Frontend (built-in router: round-robin)
     → routes to dynamo.backend.generate
     → OUR PROCESSOR (intercepts!)
         → queries Thompson Sampling router
         → forwards to dynamo.worker.generate (actual SGLang workers)
-```
+```

Also applies to: 23-24, 297-304, 307-313

src/nat/llm/dynamo_llm.py-22-23 (1)

22-23: Docstring still references event hooks; update to transport-based injection.

The module now uses a custom transport, but the docstring still describes event hooks.

📝 Suggested update
-The implementation uses httpx event hooks to inject hints at the HTTP transport level,
+The implementation uses a custom httpx transport to inject hints at the HTTP level,
external/dynamo/monitoring/README.md-383-392 (1)

383-392: Wrap bare URLs in inline code or links (MD034).

Bare URLs are flagged by markdownlint; please wrap them in inline code or convert to links (e.g., http://localhost:3000).

🔗 Example fix
-Then open http://localhost:3000 in your browser.
+Then open `http://localhost:3000` in your browser.
...
-Example queries:
+Example queries:
 
-```promql
+```promql
 # Request rate (requests/second)
 rate(dynamo_frontend_requests_total[1m])

Also applies to: 414-488

external/dynamo/optimized/README.md-7-30 (1)

7-30: Specify languages for the untyped fenced blocks (MD040).

Add a language such as text to the architecture diagram and expected‑output snippet to satisfy markdownlint.

🧹 Example fix
-```
+```text
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  Client Request (with nvext.annotations)                                │
 ...
 └─────────────────────────────────────────────────────────────────────────┘
-```
+```
...
-```
+```text
 Step 3: Starting Custom Processor (Registers as backend.generate)...
 ...
 INFO processor.generate: Routing decision: worker=... decision=...
-```
+```

Also applies to: 157-174

external/dynamo/optimized/README.md-13-48 (1)

13-48: Avoid possessive 's for inanimate objects in docs.

Rephrase instances like “Processor's model card”, “Frontend's ModelWatcher”, and “frontend's expected model” to non‑possessive forms for doc style compliance. As per coding guidelines.

✍️ Example wording updates
-│       ↓ finds Processor's model card!                                   │
+│       ↓ finds the model card for the processor                           │
...
-3. **Frontend's ModelWatcher** discovers the processor's model card
+3. **Frontend ModelWatcher** discovers the model card for the processor
...
-- `--model-name`: Served model name (must match frontend's expected model)
+- `--model-name`: Served model name (must match the expected model for the frontend)

Also applies to: 86-88

external/dynamo/monitoring/README.md-24-55 (1)

24-55: Add language identifiers to fenced blocks flagged by markdownlint.

Several code fences (diagram, KVES formula, file tree, metric lists) are missing a language tag. Add text (or a more specific language) to satisfy MD040.

🧹 Example fixes
-```
+```text
 ┌──────────────────────────────────────────────────────────────────────────────┐
 │                              Dynamo Stack                                    │
 ...
 └──────────────────────────────────────────────────────────────────────────────┘
-```
+```
...
-```
+```text
 KVES = (TotalWork - ActualWork) / TotalWork ∈ [0,1]
 ...
 w_hit = (w_gpu_hit, w_cpu_hit, w_disk_hit)  # weights per hit source
-```
+```

Also applies to: 165-173, 295-308, 531-545
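
As a side note, the KVES formula quoted in that fix reduces to a small helper. A sketch under assumed semantics (the hit-source weights, default values, and function names here are illustrative, not from the PR):

```python
def kv_efficiency(total_work: float, actual_work: float) -> float:
    """KVES = (TotalWork - ActualWork) / TotalWork, clamped to [0, 1]."""
    if total_work <= 0:
        return 0.0
    return max(0.0, min(1.0, (total_work - actual_work) / total_work))


def weighted_actual_work(prompt_tokens: int, gpu_hits: int, cpu_hits: int, disk_hits: int,
                         w_hit: tuple[float, float, float] = (0.0, 0.3, 0.6)) -> float:
    """Assumed weighting: cached tokens cost a fraction of a full recompute per hit source."""
    w_gpu, w_cpu, w_disk = w_hit
    missed = prompt_tokens - gpu_hits - cpu_hits - disk_hits  # tokens with no cache hit
    return missed + w_gpu * gpu_hits + w_cpu * cpu_hits + w_disk * disk_hits


# Example: 1000-token prompt with 600 GPU-cache hits -> KVES = 0.6
print(kv_efficiency(1000, weighted_actual_work(1000, 600, 0, 0)))
```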

src/nat/profiler/profile_runner.py-230-251 (1)

230-251: Preserve stack traces and satisfy BLE001 when Dynamo metrics collection fails.

The broad catch logs with logger.warning, which drops the stack trace and still triggers BLE001. If the broad catch is intentional, switch to logger.exception and annotate with # noqa: BLE001 (or narrow the exception list). As per coding guidelines.

🛠️ Suggested change
-            except Exception as e:
-                logger.warning("Failed to collect Dynamo metrics: %s", e)
+            except Exception:  # noqa: BLE001
+                logger.exception("Failed to collect Dynamo metrics")
external/dynamo/start_dynamo_unified.sh-480-482 (1)

480-482: Fix the separator typo in the output banner.

The separator line prints ...ca...; this looks like an accidental keystroke.

🧹 Proposed fix
-    echo "=====================================ca===================="
+    echo "========================================================="
external/dynamo/monitoring/README.md-185-195 (1)

185-195: Doc style tweaks: avoid inanimate possessives and hyphenate compounds.

Rephrase “SGLang's/Dynamo's …” and change “KV cache related” → “KV cache‑related”; also remove the blank line inside the blockquote to satisfy MD028. As per coding guidelines.

✍️ Example wording updates
-> **Why use SGLang's native metric?** SGLang computes cache hit rate internally but doesn't include
+> **Why use the native metric from SGLang?** SGLang computes cache hit rate internally but doesn't include
 > `cached_tokens` in its API responses. The processor's `thompson_kve_*` counters will show 0
 > unless the underlying engine provides `usage.prompt_tokens_details.cached_tokens`.
-
-> **Note on Full KVES**: To implement the full KVES equation with CPU/disk hit weights, you would need
+> **Note on the full KVES**: To implement the full KVES equation with CPU/disk hit weights, you would need
 ...
-This section documents the working status of all KV cache related metrics across the Dynamo stack.
+This section documents the working status of all KV cache‑related metrics across the Dynamo stack.
external/dynamo/monitoring/grafana/provisioning/dashboards/json/dynamo-overview.json-96-104 (1)

96-104: Align the Requests panel query with the "req/min" unit.

The panel is labeled "Requests (1m)" with unit "reqpm", but the query uses a 5s window. Use a 1m increase window to match the declared per-minute metric.

Suggested query change
-          "expr": "sum(increase(dynamo_frontend_requests_total[5s]))",
+          "expr": "sum(increase(dynamo_frontend_requests_total[1m]))",
external/dynamo/optimized/router.py-1054-1062 (1)

1054-1062: Remove TODO in production code

The TODO for metrics querying violates the no-TODOs guideline. Either implement a stubbed query with an explicit config gate or track it in an issue. As per coding guidelines, avoid TODOs in committed code/comments.

Do you want me to draft a minimal metrics client stub (or open an issue template) for this block?

src/nat/profiler/inference_optimization/dynamo_metrics.py-1059-1062 (1)

1059-1062: Docstring typo: kv_cache_efficiency doesn’t exist

The field is kv_efficiency. This example will mislead users. As per coding guidelines, keep documentation accurate.

✅ Suggested fix
-            print(f"KV Efficiency: {core.kv_cache_efficiency:.2%}")
+            print(f"KV Efficiency: {core.kv_efficiency:.2%}")
external/dynamo/.env.example-108-115 (1)

108-115: Export KV tuning vars so child processes see them

Without export, these values won’t reach the startup scripts and defaults will be used.

✅ Suggested fix
-DYNAMO_KV_BLOCK_SIZE=64
+export DYNAMO_KV_BLOCK_SIZE=64
@@
-DYNAMO_MEM_FRACTION_STATIC=0.9
+export DYNAMO_MEM_FRACTION_STATIC=0.9
external/dynamo/optimized/processor.py-232-234 (1)

232-234: Remove hard-coded Dynamo version in docstring

Version numbers shouldn’t be embedded in .py docs. As per coding guidelines, keep docs version-agnostic.

✅ Suggested fix
-        # doesn't expose custom bucket configuration in Dynamo 0.7.1)
+        # doesn't expose custom bucket configuration in the current Dynamo release)
external/dynamo/optimized/router.py-1259-1328 (1)

1259-1328: Add return type hint for the CLI parser

Public functions require return type hints. Add -> argparse.Namespace to parse_args().

✅ Suggested fix
-def parse_args():
+def parse_args() -> argparse.Namespace:
external/dynamo/optimized/processor.py-488-491 (1)

488-491: Use logger.exception() to capture stack trace information

This exception handler catches and logs without re-raising, so it should use logger.exception() to preserve the full stack trace. Also, when using logger.exception(), omit the exception object from the format arguments—it's automatically captured:

Suggested fix
        except Exception as e:
-            logger.error("Failed to pick worker: %s", e)
+            logger.exception("Failed to pick worker")
             self._metrics.router_errors_total.inc()
             return None, None

Consider narrowing the exception type if possible based on what the router client can raise.

src/nat/profiler/inference_optimization/dynamo_metrics.py-717-720 (1)

717-720: Use logger.exception when swallowing errors without re-raising

These exception handlers catch exceptions without re-raising and don't capture stack traces. According to coding guidelines, use logger.exception() (or logger.error(exc_info=True)) to preserve full diagnostic information for exceptions caught without re-raising.

✅ Example adjustment for lines 717-720
                except Exception as e:
                    error_msg = f"Failed to collect {metric_name}: {e}"
-                   logger.warning(error_msg)
+                   logger.exception(error_msg)
                    result.errors.append(error_msg)

Also applies to:

  • Lines 972-974: Replace logger.debug() with logger.exception()
  • Lines 1001-1002 and 1008-1009: Add logger.exception() calls to capture stack traces before appending to the errors list
🧹 Nitpick comments (4)
external/dynamo/monitoring/docker-compose.yml (1)

33-42: Follow environment variable pattern for Grafana credentials and use safer defaults.

This file is documented for local development only, but using hardcoded admin credentials and enabling anonymous Admin access should still follow environment variable defaults for consistency with best practices. Additionally, fix the file ending: the file currently ends with 2 blank lines instead of 1.

Recommended changes
-      - GF_AUTH_ANONYMOUS_ENABLED=true
-      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
-      - GF_AUTH_DISABLE_LOGIN_FORM=true
-      - GF_SECURITY_ADMIN_USER=admin
-      - GF_SECURITY_ADMIN_PASSWORD=admin
+      - GF_AUTH_ANONYMOUS_ENABLED=${GRAFANA_ANON_ENABLED:-false}
+      - GF_AUTH_ANONYMOUS_ORG_ROLE=${GRAFANA_ANON_ROLE:-Viewer}
+      - GF_AUTH_DISABLE_LOGIN_FORM=${GRAFANA_DISABLE_LOGIN_FORM:-false}
+      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
+      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}

Remove trailing blank lines at end of file to have only a single newline.

src/nat/data_models/profiler.py (1)

44-105: Trim the DynamoMetricsConfig docstring to Google-style format.

Guidelines require Google-style docstrings for public classes; consider a concise summary + Attributes: and move the long how‑to content to external docs. As per coding guidelines.

external/dynamo/monitoring/grafana/provisioning/dashboards/json/dynamo-overview.json (1)

572-584: Replace hard-coded worker instance queries with dynamic per-instance aggregation.

The current queries for refId D and E pin specific instances (localhost:18081 and localhost:18082), which breaks the panel if the worker count changes. Use sum by (instance) to automatically show all worker instances:

♻️ Suggested query change
-        {
-          "expr": "rate(dynamo_component_requests_total{dynamo_namespace=\"workers\",dynamo_component=\"worker\",instance=\"localhost:18081\"}[5s])",
-          "legendFormat": "4. Worker 0 (18081) [rate(dynamo_component_requests_total{worker:18081}[5s])]",
-          "refId": "D"
-        },
-        {
-          "expr": "rate(dynamo_component_requests_total{dynamo_namespace=\"workers\",dynamo_component=\"worker\",instance=\"localhost:18082\"}[5s])",
-          "legendFormat": "4. Worker 1 (18082) [rate(dynamo_component_requests_total{worker:18082}[5s])]",
-          "refId": "E"
-        },
+        {
+          "expr": "sum by (instance) (rate(dynamo_component_requests_total{dynamo_namespace=\"workers\",dynamo_component=\"worker\"}[5s]))",
+          "legendFormat": "4. Worker ({{instance}}) [rate(dynamo_component_requests_total{worker}[5s])]",
+          "refId": "D"
+        },
external/dynamo/optimized/router.py (1)

472-557: Harden debug trace storage location

Defaulting to /tmp is vulnerable to symlink races on shared systems. Consider a user-scoped cache dir and restrictive permissions.

♻️ Suggested hardening
-        debug_trace_dir: str = "/tmp/dynamo_router_traces",
+        debug_trace_dir: str = "~/.cache/dynamo_router_traces",
@@
-        self.debug_trace_dir = str(debug_trace_dir)
+        self.debug_trace_dir = os.path.expanduser(str(debug_trace_dir))
@@
-            os.makedirs(self.debug_trace_dir, exist_ok=True)
+            os.makedirs(self.debug_trace_dir, mode=0o700, exist_ok=True)

Comment on lines +27 to +41
```bash
    # Collect frontend metrics
    curl -s http://localhost:8000/metrics > "$frontend_file" 2>/dev/null

    # Collect worker metrics
    curl -s http://localhost:8081/metrics > "$worker_file" 2>/dev/null

    # Create combined file with headers
    {
        echo "# Collected at: $(date -Iseconds)"
        echo "# === FRONTEND METRICS ==="
        cat "$frontend_file"
        echo ""
        echo "# === WORKER METRICS ==="
        cat "$worker_file"
    } > "$combined_file"
```

⚠️ Potential issue | 🟠 Major


Add timeout and failure handling for metric collection.

The curl commands lack timeouts and error detection. If endpoints are unavailable, curl can hang indefinitely or write empty files, which silently corrupts the combined and rolling metrics with misleading data. Add --fail, --max-time, and proper error handling.

🔧 Suggested approach
-    # Collect frontend metrics
-    curl -s http://localhost:8000/metrics > "$frontend_file" 2>/dev/null
+    # Collect frontend metrics
+    if ! curl -fsS --max-time 5 http://localhost:8000/metrics > "$frontend_file"; then
+        echo "  Warning: frontend metrics unavailable"
+        return 1
+    fi
     
-    # Collect worker metrics
-    curl -s http://localhost:8081/metrics > "$worker_file" 2>/dev/null
+    # Collect worker metrics
+    if ! curl -fsS --max-time 5 http://localhost:8081/metrics > "$worker_file"; then
+        echo "  Warning: worker metrics unavailable"
+        return 1
+    fi

Note: If returning early is too strict for a continuous collector, consider logging warnings instead and continuing, or use exit 1 to stop the script entirely on failures.


Comment on lines +1 to +4
```json
{
  "annotations": {
    "list": []
  },
```

⚠️ Potential issue | 🟠 Major

Add SPDX Apache-2.0 header to the dashboard JSON.

Repository policy requires SPDX headers for .json files; please add the standard header using the repo’s JSON convention before the dashboard content. As per coding guidelines.

📝 Example header fields
 {
+  "__comment_spdx": "SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.",
+  "__comment_license": "SPDX-License-Identifier: Apache-2.0",
   "annotations": {
     "list": []
   },

Comment on lines 1 to 3
```markdown
# Dynamo Monitoring Stack

This directory contains a Prometheus + Grafana monitoring setup for the Dynamo LLM inference stack with Thompson Sampling router.
```

⚠️ Potential issue | 🟠 Major

Add SPDX Apache-2.0 header at the top of this README.

The repository requires the SPDX header at the beginning of every .md file. Please add it before the title. As per coding guidelines.

📝 Proposed header
+<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
+<!-- SPDX-License-Identifier: Apache-2.0 -->
+
 # Dynamo Monitoring Stack

Comment on lines +1 to +2
```markdown
# Optimized Thompson Sampling Router Architecture
```

⚠️ Potential issue | 🟠 Major

Add SPDX Apache-2.0 header to the new architecture doc.

This Markdown file is missing the required SPDX header comment at the top. As per coding guidelines, ...

🧾 Proposed header addition
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
 # Optimized Thompson Sampling Router Architecture

Comment on lines +1 to +3
```markdown
# Thompson Sampling Router Parameters

This document describes all configurable parameters for the `WorkloadAwareRouter` in `router.py`.
```

⚠️ Potential issue | 🟠 Major

Add SPDX Apache-2.0 header to the new Markdown doc.

This file is missing the required SPDX header at the top. Please add the standard header comment. As per coding guidelines, ...

🧾 Proposed header addition
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
 # Thompson Sampling Router Parameters

Comment on lines +1 to +3
```markdown
# Optimized Thompson Sampling Router Architecture

This directory contains the optimized implementation of the Thompson Sampling router for Dynamo, using the "Processor-as-Backend" pattern with **Dynamic Discovery** to intercept requests from the default Dynamo frontend.
```

⚠️ Potential issue | 🟠 Major

Add SPDX Apache-2.0 header to this README.

Markdown files must begin with the SPDX header; this file currently starts with the title. Please add the standard header comment before the heading. As per coding guidelines.

📝 Proposed header
+<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
+<!-- SPDX-License-Identifier: Apache-2.0 -->
+
 # Optimized Thompson Sampling Router Architecture

Comment on lines +77 to +83
```bash
# Compute container-internal GPU indices (GPUs are renumbered 0,1,2,... inside the container)
NUM_GPUS=$(echo "$WORKER_GPUS" | tr ',' '\n' | wc -l)
CONTAINER_GPU_INDICES=$(seq -s, 0 $((NUM_GPUS - 1)))

# Calculate number of workers based on available GPUs and TP size
NUM_WORKERS=$((NUM_GPUS / TP_SIZE))
```

⚠️ Potential issue | 🟠 Major

Guard against invalid GPU/TP_SIZE combos that yield zero workers

If NUM_GPUS < TP_SIZE or not divisible, NUM_WORKERS becomes 0 and later wait_for_worker indexes an empty array. Add a fast-fail validation here.

🐛 Suggested fix
 NUM_WORKERS=$((NUM_GPUS / TP_SIZE))
+if (( NUM_GPUS % TP_SIZE != 0 )); then
+    echo "ERROR: NUM_GPUS ($NUM_GPUS) must be divisible by TP_SIZE ($TP_SIZE)"
+    exit 1
+fi
+if (( NUM_WORKERS < 1 )); then
+    echo "ERROR: NUM_WORKERS computed as $NUM_WORKERS; check DYNAMO_GPU_DEVICES and DYNAMO_TP_SIZE"
+    exit 1
+fi

Comment on lines +284 to +301
```bash
# Verify HF_TOKEN is set
if [ -z "${HF_TOKEN:-}" ]; then
    echo ""
    echo "⚠ HF_TOKEN environment variable is not set."
    echo ""
    if [ -d "$LOCAL_MODEL_DIR" ]; then
        echo "✓ Local model found - proceeding without HF_TOKEN"
        HF_TOKEN="dummy"
    else
        echo "✗ Local model NOT found and no HF_TOKEN to download it"
        echo ""
        read -p "Please enter your HuggingFace token (or press Enter to skip): " HF_TOKEN
        if [ -z "$HF_TOKEN" ]; then
            echo "WARNING: Proceeding without HF_TOKEN."
            HF_TOKEN="dummy"
        else
            echo "✓ HuggingFace token received"
        fi
```

⚠️ Potential issue | 🟠 Major

Avoid echoing HF tokens to the terminal

read -p echoes the token; use silent input to avoid leaking credentials in terminal history.

🔒 Suggested fix
-        read -p "Please enter your HuggingFace token (or press Enter to skip): " HF_TOKEN
+        read -s -p "Please enter your HuggingFace token (or press Enter to skip): " HF_TOKEN
+        echo

Comment on lines +53 to +55
```bash
# Calculate number of workers based on available GPUs and TP size
NUM_WORKERS=$((NUM_GPUS / TP_SIZE))
```

⚠️ Potential issue | 🟠 Major

Guard against zero or non‑divisible GPU/TP configurations.

If TP_SIZE is greater than NUM_GPUS or not divisible, NUM_WORKERS becomes 0 (or truncates), and WORKER_PIDS[0] is accessed later. Add validation before continuing.

✅ Suggested guard
 # Calculate number of workers based on available GPUs and TP size
 NUM_WORKERS=$((NUM_GPUS / TP_SIZE))
+if [ $((NUM_GPUS % TP_SIZE)) -ne 0 ]; then
+    echo "ERROR: NUM_GPUS ($NUM_GPUS) must be divisible by TP_SIZE ($TP_SIZE)"
+    exit 1
+fi
+if [ "$NUM_WORKERS" -lt 1 ]; then
+    echo "ERROR: No workers can be started (NUM_GPUS=$NUM_GPUS, TP_SIZE=$TP_SIZE)"
+    exit 1
+fi

Comment on lines +393 to +410
```python
    # Generate the prefix ID once
    unique_id = uuid.uuid4().hex[:16]
    if prefix_template:
        prefix_id = prefix_template.format(uuid=unique_id)
    else:
        prefix_id = f"nat-dynamo-{unique_id}"

    logger.debug("Created Dynamo client with prefix ID: %s", prefix_id)

    # Create a base transport and wrap it with our custom transport
    base_transport = httpx.AsyncHTTPTransport()
    dynamo_transport = _DynamoTransport(
        transport=base_transport,
        prefix_id=prefix_id,
        total_requests=total_requests,
        osl=osl,
        iat=iat,
    )
```

⚠️ Potential issue | 🟠 Major

prefix_template=None still injects hints; disable injection as documented.

The docstring says prefix_template=None disables prefix hints, but the current logic always generates a prefix ID. This makes the “disable” path ineffective.

🐛 Proposed fix
-    # Generate the prefix ID once
-    unique_id = uuid.uuid4().hex[:16]
-    if prefix_template:
-        prefix_id = prefix_template.format(uuid=unique_id)
-    else:
-        prefix_id = f"nat-dynamo-{unique_id}"
+    # If hints are disabled, return a plain client
+    if prefix_template is None:
+        return httpx.AsyncClient(timeout=httpx.Timeout(timeout))
+
+    # Generate the prefix ID once
+    unique_id = uuid.uuid4().hex[:16]
+    prefix_id = prefix_template.format(uuid=unique_id)

Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
@coderabbitai bot left a comment

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
.pre-commit-config.yaml (1)

37-42: Add jupyter to additional_dependencies—the hook will fail without it.

The shell script is properly executable with correct shebang, but it calls jupyter nbconvert on line 25. Since language: python creates an isolated venv, only nbconvert is installed from additional_dependencies: ["nbconvert"]. The jupyter command won't be available, causing the hook to fail at runtime. Add "jupyter" to additional_dependencies. (Note: nbconvert is already pinned to 7.16.6 in uv.lock, so the version pinning concern is addressed.)

external/dynamo/start_dynamo_disagg.sh (1)

27-27: Stale comment: port reference is outdated.

The comment states "Frontend: Port 8099" but the default HTTP port has been changed to 8000 on line 39. Update the comment to reflect the new default.

-# Frontend: Port 8099 (HTTP API)
+# Frontend: Port 8000 (HTTP API, configurable via DYNAMO_HTTP_PORT)
external/dynamo/start_dynamo_unified_thompson_hints.sh (1)

28-28: Stale comment: port reference is outdated.

The comment states "Frontend: Port 8099" but the default HTTP port has been changed to 8000 on line 44. Update the comment for consistency.

-# Frontend: Port 8099 (HTTP API with prefix hint headers)
+# Frontend: Port 8000 (HTTP API with prefix hint headers)
🤖 Fix all issues with AI agents
In `@external/dynamo/monitoring/docker-compose.yml`:
- Around line 30-43: The Grafana service is exposed insecurely via network_mode:
host with anonymous admin access and login disabled; update the grafana service
to bind only to localhost (remove network_mode: host and use a ports mapping
like 127.0.0.1:3000:3000) and harden the environment defaults by setting
GF_AUTH_ANONYMOUS_ENABLED=false and GF_AUTH_DISABLE_LOGIN_FORM=false while
keeping GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD present; make
these defaults overrideable via environment or an .env file so developers can
re-enable anonymous/local-only access when explicitly needed (refer to the
grafana service, network_mode, GF_AUTH_ANONYMOUS_ENABLED,
GF_AUTH_DISABLE_LOGIN_FORM, GF_AUTH_ANONYMOUS_ORG_ROLE).

In `@external/dynamo/monitoring/README.md`:
- Around line 222-224: Replace inanimate possessive phrasing like "SGLang's
native metric" with non-possessive alternatives such as "the native SGLang
metric" or "the native metric from SGLang" in the README text (specifically the
sentence starting "SGLang computes cache hit rate internally..."). Update any
other occurrences of similar inanimate possessives in this section (e.g.,
"SGLang's", "the processor's `thompson_kve_*` counters") to use equivalent
non-possessive constructions to match the documentation style guidelines.
- Around line 229-233: Update the header paragraph "KV Cache Metrics Status" to
hyphenate the compound adjective by replacing the phrase "KV cache related
metrics" with "KV cache-related metrics" (preferably using a non-breaking hyphen
U+2011 if you want to prevent line-wrapping); locate the sentence in the same
paragraph that mentions the `${backend}` template variable and make the
single-word change there to ensure grammatical correctness.
- Around line 571-573: The README example uses a user-specific absolute path;
update the example invocation to use a generic placeholder instead (e.g.,
/path/to/repo or $REPO_ROOT) so it’s portable; modify the example line that runs
./collect_metrics.sh ./metrics_output 30 to show changing directory to a generic
repo root (instead of
/localhome/local-bbednarski/NeMo-Agent-Toolkit/external/dynamo) and keep the
collect_metrics.sh command unchanged to preserve the usage example.

In `@external/dynamo/monitoring/scripts/cache_experiment.sh`:
- Around line 1-14: The script cache_experiment.sh currently has an abbreviated
header; replace it with the full Apache-2.0 license header block (the standard
boilerplate used in the repo) including the SPDX-License-Identifier: Apache-2.0
line, the full copyright notice (Copyright (c) 2025-2026, NVIDIA CORPORATION &
AFFILIATES. All rights reserved.), and the complete Apache License 2.0 notice
text (permissions, conditions, and disclaimer) so the file contains the
canonical license template at the top.

In `@external/dynamo/monitoring/scripts/kv_event_observer.py`:
- Line 85: The parameter parent_hash in function record_stored is unused and
triggers a lint warning; either rename it to _parent_hash to indicate
intentional unusedness or actually use it (e.g., log or store ancestry info).
Edit the record_stored signature (record_stored) to rename parent_hash to
_parent_hash if you don't need it, or add a single-line usage inside
record_stored (such as including parent_hash in the emitted metric/log or
appending it to an ancestry list) so the linter no longer flags it.
- Around line 1-35: Add the required SPDX Apache-2.0 license header immediately
after the shebang line in kv_event_observer.py; place the standard multi-line
SPDX/Apache-2.0 header (including SPDX-License-Identifier: Apache-2.0 and the
copyright/notice block per repo policy) right after "#!/usr/bin/env python3" so
the file header is compliant before the module docstring and any definitions (no
changes to functions or logic like the KVEventBatch parsing or metrics polling).

In `@external/dynamo/monitoring/scripts/README.md`:
- Around line 1-5: Add the SPDX license header "SPDX-License-Identifier:
Apache-2.0" to the very top of the README.md for the "KV Cache Event Observer"
(place the header before the first "# KV Cache Event Observer" line) so the file
includes the required Apache-2.0 license identifier per project guidelines.

In `@external/dynamo/optimized/ARCHITECTURE.md`:
- Around line 9-14: Update the inanimate possessive phrasing in ARCHITECTURE.md:
replace occurrences like "Frontend's built-in router" with "the built-in router
of the frontend" (and similar rephrasings) throughout the document, including
the other instances noted around the section discussing DYN_ROUTER_MODE,
dynamo.backend.generate, and dynamo.worker.generate (lines referenced in the
review); preserve meaning and references to DYN_ROUTER_MODE,
dynamo.backend.generate, and dynamo.worker.generate while only adjusting
possessive wording to match the style guide.
- Around line 15-24: The fenced code blocks containing the "Frontend (built-in
router: round-robin) → routes to dynamo.backend.generate → OUR PROCESSOR ..."
snippet (and the other similar blocks around the ASCII box starting with
"┌─────────────────────────────────────────────────────────────────────────────────┐")
are missing language identifiers and there is a blank line inside a blockquote;
add an appropriate language tag (e.g., ```text, ```bash or ```json) to each
fenced code block and remove the empty line inside the blockquote so the block
is contiguous, ensuring all occurrences (including the other blocks referenced
in the comment) follow the same fixes.

In `@external/dynamo/start_dynamo_optimized_thompson_hints_vllm.sh`:
- Around line 309-330: The prompt currently uses read -p which echoes the
HF_TOKEN to the terminal; change the interactive prompt in the HF_TOKEN check
(the block referencing HF_TOKEN and LOCAL_MODEL_DIR and the read call) to use
silent input by replacing read -p with read -s -p and then emit an echo newline
after the read so the terminal prompt formatting remains correct, preserving the
existing logic that checks for an empty value and sets HF_TOKEN="dummy" or
prints confirmation.
- Around line 86-92: After computing NUM_GPUS, CONTAINER_GPU_INDICES and
NUM_WORKERS, add a guard that validates TP_SIZE against NUM_GPUS: if TP_SIZE is
zero, greater than NUM_GPUS, or NUM_GPUS is not evenly divisible by TP_SIZE
(NUM_WORKERS equals 0 or NUM_GPUS % TP_SIZE != 0), immediately print a clear
error and exit with non‑zero status. Reference the variables NUM_GPUS, TP_SIZE,
NUM_WORKERS and CONTAINER_GPU_INDICES when implementing the check so the script
fails fast before any worker loops or indexing occur.

In `@external/dynamo/start_dynamo_unified.sh`:
- Line 481: The separator banner echo contains a stray "ca" that should be
removed; locate the echo call that prints
"=====================================ca====================" (in
start_dynamo_unified.sh) and update the string to a clean separator without the
"ca" characters so the output shows a continuous line of "=" characters.
🧹 Nitpick comments (4)
external/dynamo/monitoring/scripts/cache_experiment.sh (2)

188-198: Dependency on bc may cause failures on minimal systems.

The bc utility is used for arithmetic calculations but may not be installed by default. Consider adding a dependency check or using bash arithmetic where possible.

Alternative using awk (commonly available)
-echo -e "  Cache hits:    ${final_hits} (delta: +$(echo "$final_hits - $initial_hits" | bc))"
-echo -e "  Cache queries: ${final_queries} (delta: +$(echo "$final_queries - $initial_queries" | bc))"
+delta_hits_calc=$(awk "BEGIN {print $final_hits - $initial_hits}")
+delta_queries_calc=$(awk "BEGIN {print $final_queries - $initial_queries}")
+echo -e "  Cache hits:    ${final_hits} (delta: +${delta_hits_calc})"
+echo -e "  Cache queries: ${final_queries} (delta: +${delta_queries_calc})"
 echo ""
 
 # Calculate hit rate for this experiment
-delta_hits=$(echo "$final_hits - $initial_hits" | bc)
-delta_queries=$(echo "$final_queries - $initial_queries" | bc)
+delta_hits=$delta_hits_calc
+delta_queries=$delta_queries_calc
 if [ "$delta_queries" != "0" ]; then
-    hit_rate=$(echo "scale=1; $delta_hits * 100 / $delta_queries" | bc)
+    hit_rate=$(awk "BEGIN {printf \"%.1f\", $delta_hits * 100 / $delta_queries}")
     echo -e "${GREEN}Experiment hit rate: ${hit_rate}%${NC}"
 fi

173-176: Process cleanup could leave orphaned process on signal.

If the script receives a signal (e.g., Ctrl+C) before reaching line 174, the observer process will be orphaned. Consider adding a trap for cleanup.

Suggested fix

Add after line 144:

# Cleanup trap
trap 'kill $OBSERVER_PID 2>/dev/null; rm -f "$OBSERVER_LOG"' EXIT INT TERM

Then the explicit cleanup at lines 174-176 and 201 can be simplified or removed since the trap handles it.

external/dynamo/stop_dynamo.sh (1)

142-147: Missing error handling for volume removal.

If the Prometheus data volume doesn't exist, docker volume rm will output an error message. Consider suppressing the error or checking existence first.

Suggested fix
     if [ "$CLEAR_METRICS" = true ]; then
         echo ""
         echo "Clearing Prometheus data volume..."
-        docker volume rm monitoring_prometheus_data && echo "✓ Prometheus data volume removed (old metrics cleared)"
+        if docker volume rm monitoring_prometheus_data 2>/dev/null; then
+            echo "✓ Prometheus data volume removed (old metrics cleared)"
+        else
+            echo "  (Prometheus data volume not found or already removed)"
+        fi
     fi
external/dynamo/.env.example (1)

59-62: Align export usage in the .env example.

This file mixes export with bare assignments. If it’s intended to be sourced in a shell, DYNAMO_HTTP_PORT should be exported like the others; if it’s for Docker Compose .env, remove export keywords or add a note to avoid confusion.

🛠️ Consistency option (shell-source)
- DYNAMO_HTTP_PORT=8000
+ export DYNAMO_HTTP_PORT=8000

Comment on lines +30 to +43
  grafana:
    image: grafana/grafana:10.2.2
    container_name: dynamo-grafana
    network_mode: host
    environment:
      - GF_SERVER_HTTP_PORT=3000
      # Disable authentication for local development
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_AUTH_DISABLE_LOGIN_FORM=true
      # Keep these for if you re-enable login later
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false

⚠️ Potential issue | 🟠 Major

Lock down Grafana exposure by default.

With network_mode: host, anonymous admin access and disabled login form exposes Grafana openly on the host interface. This is a high‑risk default outside a strictly local sandbox. Recommend binding to localhost and disabling anonymous access by default, while allowing overrides via env vars.

🔒 Proposed hardening (safer defaults)
     environment:
-      - GF_SERVER_HTTP_PORT=3000
-      # Disable authentication for local development
-      - GF_AUTH_ANONYMOUS_ENABLED=true
-      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
-      - GF_AUTH_DISABLE_LOGIN_FORM=true
-      # Keep these for if you re-enable login later
-      - GF_SECURITY_ADMIN_USER=admin
-      - GF_SECURITY_ADMIN_PASSWORD=admin
+      - GF_SERVER_HTTP_PORT=${GF_SERVER_HTTP_PORT:-3000}
+      - GF_SERVER_HTTP_ADDR=${GF_SERVER_HTTP_ADDR:-127.0.0.1}
+      # Enable anonymous auth only when explicitly requested
+      - GF_AUTH_ANONYMOUS_ENABLED=${GF_AUTH_ANONYMOUS_ENABLED:-false}
+      - GF_AUTH_ANONYMOUS_ORG_ROLE=${GF_AUTH_ANONYMOUS_ORG_ROLE:-Viewer}
+      - GF_AUTH_DISABLE_LOGIN_FORM=${GF_AUTH_DISABLE_LOGIN_FORM:-false}
+      # Admin credentials (override via env)
+      - GF_SECURITY_ADMIN_USER=${GF_SECURITY_ADMIN_USER:-admin}
+      - GF_SECURITY_ADMIN_PASSWORD=${GF_SECURITY_ADMIN_PASSWORD:-admin}
       - GF_USERS_ALLOW_SIGN_UP=false
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

  grafana:
    image: grafana/grafana:10.2.2
    container_name: dynamo-grafana
    network_mode: host
    environment:
      - GF_SERVER_HTTP_PORT=3000
      # Disable authentication for local development
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_AUTH_DISABLE_LOGIN_FORM=true
      # Keep these for if you re-enable login later
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false

After:

  grafana:
    image: grafana/grafana:10.2.2
    container_name: dynamo-grafana
    network_mode: host
    environment:
      - GF_SERVER_HTTP_PORT=${GF_SERVER_HTTP_PORT:-3000}
      - GF_SERVER_HTTP_ADDR=${GF_SERVER_HTTP_ADDR:-127.0.0.1}
      # Enable anonymous auth only when explicitly requested
      - GF_AUTH_ANONYMOUS_ENABLED=${GF_AUTH_ANONYMOUS_ENABLED:-false}
      - GF_AUTH_ANONYMOUS_ORG_ROLE=${GF_AUTH_ANONYMOUS_ORG_ROLE:-Viewer}
      - GF_AUTH_DISABLE_LOGIN_FORM=${GF_AUTH_DISABLE_LOGIN_FORM:-false}
      # Admin credentials (override via env)
      - GF_SECURITY_ADMIN_USER=${GF_SECURITY_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GF_SECURITY_ADMIN_PASSWORD:-admin}
      - GF_USERS_ALLOW_SIGN_UP=false
🤖 Prompt for AI Agents
In `@external/dynamo/monitoring/docker-compose.yml` around lines 30 - 43, The
Grafana service is exposed insecurely via network_mode: host with anonymous
admin access and login disabled; update the grafana service to bind only to
localhost (remove network_mode: host and use a ports mapping like
127.0.0.1:3000:3000) and harden the environment defaults by setting
GF_AUTH_ANONYMOUS_ENABLED=false and GF_AUTH_DISABLE_LOGIN_FORM=false while
keeping GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD present; make
these defaults overrideable via environment or an .env file so developers can
re-enable anonymous/local-only access when explicitly needed (refer to the
grafana service, network_mode, GF_AUTH_ANONYMOUS_ENABLED,
GF_AUTH_DISABLE_LOGIN_FORM, GF_AUTH_ANONYMOUS_ORG_ROLE).

Comment on lines +222 to +224
> **Why use SGLang's native metric?** SGLang computes cache hit rate internally but doesn't include
> `cached_tokens` in its API responses. The processor's `thompson_kve_*` counters will show 0
> unless the underlying engine provides `usage.prompt_tokens_details.cached_tokens`.

⚠️ Potential issue | 🟡 Minor

Avoid inanimate possessive wording.

The phrasing “SGLang's” and similar possessives conflict with the documentation style rules. Prefer “the native metric from SGLang” or “the native SGLang metric.”

As per coding guidelines, please adjust inanimate possessives in this document.

🤖 Prompt for AI Agents
In `@external/dynamo/monitoring/README.md` around lines 222 - 224, Replace
inanimate possessive phrasing like "SGLang's native metric" with non-possessive
alternatives such as "the native SGLang metric" or "the native metric from
SGLang" in the README text (specifically the sentence starting "SGLang computes
cache hit rate internally..."). Update any other occurrences of similar
inanimate possessives in this section (e.g., "SGLang's", "the processor's
`thompson_kve_*` counters") to use equivalent non-possessive constructions to
match the documentation style guidelines.

Comment on lines +229 to +233
## KV Cache Metrics Status

This section documents the working status of all KV cache related metrics across the Dynamo stack.

**Backend Selection**: The Grafana dashboard uses a `${backend}` template variable. Select `sglang` or `vllm` from the dropdown to switch all backend-specific queries.

⚠️ Potential issue | 🟡 Minor

Minor grammar: add hyphen in “KV cache-related”.

“KV cache related metrics” should be “KV cache‑related metrics.”

🧰 Tools
🪛 LanguageTool

[grammar] ~231-~231: Use a hyphen to join words.
Context: ...ments the working status of all KV cache related metrics across the Dynamo stack....

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents
In `@external/dynamo/monitoring/README.md` around lines 229 - 233, Update the
header paragraph "KV Cache Metrics Status" to hyphenate the compound adjective
by replacing the phrase "KV cache related metrics" with "KV cache-related
metrics" (preferably using a non-breaking hyphen U+2011 if you want to prevent
line-wrapping); locate the sentence in the same paragraph that mentions the
`${backend}` template variable and make the single-word change there to ensure
grammatical correctness.

Comment on lines 1 to 14
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# KV Cache Experiment Script
#
# Demonstrates prefix cache behavior with queries A → B → C → A
# Shows: STORED events, REMOVED (eviction) events, and CACHE HITs
#
# Usage:
# ./cache_experiment.sh # Run with defaults
# ./cache_experiment.sh --flush # Flush cache first by filling it
# ./cache_experiment.sh --verbose # Show full curl responses


⚠️ Potential issue | 🟡 Minor

Missing full Apache 2.0 license header.

Per coding guidelines, all source files must include the complete SPDX Apache-2.0 header template. The current header is missing the full license text block.

Suggested fix
 #!/bin/bash
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 #
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
 # KV Cache Experiment Script
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# KV Cache Experiment Script
#
# Demonstrates prefix cache behavior with queries A → B → C → A
# Shows: STORED events, REMOVED (eviction) events, and CACHE HITs
#
# Usage:
# ./cache_experiment.sh # Run with defaults
# ./cache_experiment.sh --flush # Flush cache first by filling it
# ./cache_experiment.sh --verbose # Show full curl responses

After:

#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# KV Cache Experiment Script
#
# Demonstrates prefix cache behavior with queries A → B → C → A
# Shows: STORED events, REMOVED (eviction) events, and CACHE HITs
#
# Usage:
# ./cache_experiment.sh # Run with defaults
# ./cache_experiment.sh --flush # Flush cache first by filling it
# ./cache_experiment.sh --verbose # Show full curl responses
🤖 Prompt for AI Agents
In `@external/dynamo/monitoring/scripts/cache_experiment.sh` around lines 1 - 14,
The script cache_experiment.sh currently has an abbreviated header; replace it
with the full Apache-2.0 license header block (the standard boilerplate used in
the repo) including the SPDX-License-Identifier: Apache-2.0 line, the full
copyright notice (Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All
rights reserved.), and the complete Apache License 2.0 notice text (permissions,
conditions, and disclaimer) so the file contains the canonical license template
at the top.

Comment on lines +9 to +14
**Key insight**: The default Dynamo frontend has its own built-in router (`DYN_ROUTER_MODE`) and routes directly to `dynamo.backend.generate`. To intercept requests and apply custom Thompson Sampling routing:

1. **Processor registers as `dynamo.backend.generate`** - The frontend discovers our processor as the "backend"
2. **SGLang Worker registers as `dynamo.worker.generate`** - Our processor forwards to actual workers after routing
3. **Frontend's built-in router becomes irrelevant** - The frontend routes to `dynamo.backend.generate` which is our processor


⚠️ Potential issue | 🟡 Minor

Avoid inanimate possessive wording in documentation.

Phrases like “Frontend's built-in router” conflict with the documentation style rules. Rephrase to “the built‑in router of the frontend” or similar.

As per coding guidelines, please adjust inanimate possessives in this document.

Also applies to: 228-233

🤖 Prompt for AI Agents
In `@external/dynamo/optimized/ARCHITECTURE.md` around lines 9 - 14, Update the
inanimate possessive phrasing in ARCHITECTURE.md: replace occurrences like
"Frontend's built-in router" with "the built-in router of the frontend" (and
similar rephrasings) throughout the document, including the other instances
noted around the section discussing DYN_ROUTER_MODE, dynamo.backend.generate,
and dynamo.worker.generate (lines referenced in the review); preserve meaning
and references to DYN_ROUTER_MODE, dynamo.backend.generate, and
dynamo.worker.generate while only adjusting possessive wording to match the
style guide.

Comment on lines +15 to +24
```
Frontend (built-in router: round-robin)
→ routes to dynamo.backend.generate
→ OUR PROCESSOR (intercepts!)
→ queries Thompson Sampling router
→ forwards to dynamo.worker.generate (actual SGLang workers)
```

```
┌─────────────────────────────────────────────────────────────────────────────────┐

⚠️ Potential issue | 🟡 Minor

Add languages to fenced code blocks and fix the blockquote lint issue.

markdownlint flags missing language identifiers (MD040) and a blank line inside a blockquote (MD028). Add language tags (e.g., text, bash, json) and remove the blank line inside the blockquote to satisfy lint rules.

🧹 Example fixes
-```
+```text
 Frontend (built-in router: round-robin)
     → routes to dynamo.backend.generate
     → OUR PROCESSOR (intercepts!)
         → queries Thompson Sampling router
         → forwards to dynamo.worker.generate (actual SGLang workers)

-> **Note on `DYN_ROUTER_MODE`**: The startup script passes `--router-mode round-robin` to the
-> default frontend, but this is **irrelevant** in our architecture. The frontend's built-in
-> router routes to `dynamo.backend.generate`, which is our Processor (not a real backend).
-> Our Processor intercepts the request and uses our custom Thompson Sampling router instead.
+> **Note on `DYN_ROUTER_MODE`**: The startup script passes `--router-mode round-robin` to the
+> default frontend, but this is **irrelevant** in our architecture. The frontend built‑in
+> router routes to `dynamo.backend.generate`, which is our Processor (not a real backend).
+> Our Processor intercepts the request and uses our custom Thompson Sampling router instead.

Also applies to: 297-307, 366-373

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

15-15: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


23-23: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@external/dynamo/optimized/ARCHITECTURE.md` around lines 15 - 24, The fenced
code blocks containing the "Frontend (built-in router: round-robin) → routes to
dynamo.backend.generate → OUR PROCESSOR ..." snippet (and the other similar
blocks around the ASCII box starting with
"┌─────────────────────────────────────────────────────────────────────────────────┐")
are missing language identifiers and there is a blank line inside a blockquote;
add an appropriate language tag (e.g., ```text, ```bash or ```json) to each
fenced code block and remove the empty line inside the blockquote so the block
is contiguous, ensuring all occurrences (including the other blocks referenced
in the comment) follow the same fixes.

Comment on lines +86 to +92
# Compute container-internal GPU indices (GPUs are renumbered 0,1,2,... inside the container)
NUM_GPUS=$(echo "$WORKER_GPUS" | tr ',' '\n' | wc -l)
CONTAINER_GPU_INDICES=$(seq -s, 0 $((NUM_GPUS - 1)))

# Calculate number of workers based on available GPUs and TP size
NUM_WORKERS=$((NUM_GPUS / TP_SIZE))


⚠️ Potential issue | 🟠 Major

Validate GPU/TP sizing before computing workers.

If NUM_GPUS < TP_SIZE or not divisible, NUM_WORKERS becomes 0 and later loops/indexes break. Add a fast‑fail check right after computing NUM_WORKERS.

🛠️ Suggested guard
 NUM_WORKERS=$((NUM_GPUS / TP_SIZE))
+if (( NUM_GPUS % TP_SIZE != 0 )); then
+    echo "ERROR: NUM_GPUS ($NUM_GPUS) must be divisible by TP_SIZE ($TP_SIZE)"
+    exit 1
+fi
+if (( NUM_WORKERS < 1 )); then
+    echo "ERROR: NUM_WORKERS computed as $NUM_WORKERS; check DYNAMO_GPU_DEVICES and DYNAMO_TP_SIZE"
+    exit 1
+fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

# Compute container-internal GPU indices (GPUs are renumbered 0,1,2,... inside the container)
NUM_GPUS=$(echo "$WORKER_GPUS" | tr ',' '\n' | wc -l)
CONTAINER_GPU_INDICES=$(seq -s, 0 $((NUM_GPUS - 1)))

# Calculate number of workers based on available GPUs and TP size
NUM_WORKERS=$((NUM_GPUS / TP_SIZE))

After:

# Compute container-internal GPU indices (GPUs are renumbered 0,1,2,... inside the container)
NUM_GPUS=$(echo "$WORKER_GPUS" | tr ',' '\n' | wc -l)
CONTAINER_GPU_INDICES=$(seq -s, 0 $((NUM_GPUS - 1)))

# Calculate number of workers based on available GPUs and TP size
NUM_WORKERS=$((NUM_GPUS / TP_SIZE))
if (( NUM_GPUS % TP_SIZE != 0 )); then
    echo "ERROR: NUM_GPUS ($NUM_GPUS) must be divisible by TP_SIZE ($TP_SIZE)"
    exit 1
fi
if (( NUM_WORKERS < 1 )); then
    echo "ERROR: NUM_WORKERS computed as $NUM_WORKERS; check DYNAMO_GPU_DEVICES and DYNAMO_TP_SIZE"
    exit 1
fi
🤖 Prompt for AI Agents
In `@external/dynamo/start_dynamo_optimized_thompson_hints_vllm.sh` around lines
86 - 92, After computing NUM_GPUS, CONTAINER_GPU_INDICES and NUM_WORKERS, add a
guard that validates TP_SIZE against NUM_GPUS: if TP_SIZE is zero, greater than
NUM_GPUS, or NUM_GPUS is not evenly divisible by TP_SIZE (NUM_WORKERS equals 0
or NUM_GPUS % TP_SIZE != 0), immediately print a clear error and exit with
non‑zero status. Reference the variables NUM_GPUS, TP_SIZE, NUM_WORKERS and
CONTAINER_GPU_INDICES when implementing the check so the script fails fast
before any worker loops or indexing occur.

Comment on lines +309 to +330
# Verify HF_TOKEN is set
if [ -z "${HF_TOKEN:-}" ]; then
    echo ""
    echo "⚠ HF_TOKEN environment variable is not set."
    echo ""
    if [ -d "$LOCAL_MODEL_DIR" ]; then
        echo "✓ Local model found - proceeding without HF_TOKEN"
        HF_TOKEN="dummy"
    else
        echo "✗ Local model NOT found and no HF_TOKEN to download it"
        echo ""
        read -p "Please enter your HuggingFace token (or press Enter to skip): " HF_TOKEN
        if [ -z "$HF_TOKEN" ]; then
            echo "WARNING: Proceeding without HF_TOKEN."
            HF_TOKEN="dummy"
        else
            echo "✓ HuggingFace token received"
        fi
    fi
else
    echo "✓ HuggingFace token is set"
fi

⚠️ Potential issue | 🟠 Major

Prompt for HF_TOKEN silently to avoid secret echo.

The current read -p echoes the token to the terminal. Use silent input to prevent accidental leakage.

🔒 Safer prompt
-        read -p "Please enter your HuggingFace token (or press Enter to skip): " HF_TOKEN
+        read -s -p "Please enter your HuggingFace token (or press Enter to skip): " HF_TOKEN
+        echo
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

# Verify HF_TOKEN is set
if [ -z "${HF_TOKEN:-}" ]; then
    echo ""
    echo "⚠ HF_TOKEN environment variable is not set."
    echo ""
    if [ -d "$LOCAL_MODEL_DIR" ]; then
        echo "✓ Local model found - proceeding without HF_TOKEN"
        HF_TOKEN="dummy"
    else
        echo "✗ Local model NOT found and no HF_TOKEN to download it"
        echo ""
        read -p "Please enter your HuggingFace token (or press Enter to skip): " HF_TOKEN
        if [ -z "$HF_TOKEN" ]; then
            echo "WARNING: Proceeding without HF_TOKEN."
            HF_TOKEN="dummy"
        else
            echo "✓ HuggingFace token received"
        fi
    fi
else
    echo "✓ HuggingFace token is set"
fi

After:

# Verify HF_TOKEN is set
if [ -z "${HF_TOKEN:-}" ]; then
    echo ""
    echo "⚠ HF_TOKEN environment variable is not set."
    echo ""
    if [ -d "$LOCAL_MODEL_DIR" ]; then
        echo "✓ Local model found - proceeding without HF_TOKEN"
        HF_TOKEN="dummy"
    else
        echo "✗ Local model NOT found and no HF_TOKEN to download it"
        echo ""
        read -s -p "Please enter your HuggingFace token (or press Enter to skip): " HF_TOKEN
        echo
        if [ -z "$HF_TOKEN" ]; then
            echo "WARNING: Proceeding without HF_TOKEN."
            HF_TOKEN="dummy"
        else
            echo "✓ HuggingFace token received"
        fi
    fi
else
    echo "✓ HuggingFace token is set"
fi
🤖 Prompt for AI Agents
In `@external/dynamo/start_dynamo_optimized_thompson_hints_vllm.sh` around lines
309 - 330, The prompt currently uses read -p which echoes the HF_TOKEN to the
terminal; change the interactive prompt in the HF_TOKEN check (the block
referencing HF_TOKEN and LOCAL_MODEL_DIR and the read call) to use silent input
by replacing read -p with read -s -p and then emit an echo newline after the
read so the terminal prompt formatting remains correct, preserving the existing
logic that checks for an empty value and sets HF_TOKEN="dummy" or prints
confirmation.

echo "========================================================="
echo "Test Request:"
echo "========================================================="
echo "=====================================ca===================="

⚠️ Potential issue | 🟡 Minor

Fix banner typo in output.

There is a stray “ca” in the separator line.

✏️ Proposed fix
-    echo "=====================================ca===================="
+    echo "========================================================="
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    echo "=====================================ca===================="
+    echo "========================================================="
🤖 Prompt for AI Agents
In `@external/dynamo/start_dynamo_unified.sh` at line 481, The separator banner
echo contains a stray "ca" that should be removed; locate the echo call that
prints "=====================================ca====================" (in
start_dynamo_unified.sh) and update the string to a clean separator without the
"ca" characters so the output shows a continuous line of "=" characters.

Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
external/dynamo/generalized/processor.py (1)

428-433: ⚠️ Potential issue | 🔴 Critical

Update code to match deployed Dynamo 0.7.1 or add version compatibility layer.

Lines 428-433 assume Dynamo ≥0.8.x (where create_service() was removed), but all deployment scripts in external/dynamo/ are pinned to Dynamo 0.7.1, where create_service() is still required. This code will fail at runtime with the current deployment version. Either restore the create_service() call for 0.7.1 compatibility, update deployment scripts to Dynamo 0.8.x+, or add conditional logic to detect the runtime version and handle both APIs.

This same issue affects multiple files: generalized/router.py, optimized/router_multilru.py, and optimized/processor_multilru.py.
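
A compatibility shim along these lines could bridge both versions (a sketch, assuming the 0.8.x component object simply no longer exposes the method):

```python
async def make_router_component(runtime, namespace: str = "dynamo", name: str = "router"):
    """Create a Dynamo component, tolerating both the 0.7.x and 0.8.x APIs."""
    component = runtime.namespace(namespace).component(name)
    # Dynamo 0.7.x requires an explicit create_service() call before endpoints
    # can register; 0.8.x removed the method and registers the service when
    # the first endpoint is created.
    if hasattr(component, "create_service"):
        await component.create_service()
    return component
```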

🤖 Fix all issues with AI agents
In `@external/dynamo/.env.example`:
- Around line 16-18: The HF_TOKEN placeholder in the .env example is active and
may be accidentally sourced; change the HF_TOKEN export to be commented out
(e.g., prefix the HF_TOKEN line with a comment marker) so users must explicitly
set or export HF_TOKEN themselves; ensure HF_HOME remains exported and leave a
commented example or instruction for users to add their own HF_TOKEN instead of
exporting a dummy value.
- Around line 59-62: Update the .env.example so all variables are consistently
exported and remove unnecessary quotes: add export to variables missing it
(e.g., DYNAMO_HTTP_PORT and the other non-exported entries referenced in the
comment) so they propagate to child processes, and remove the surrounding quotes
from the image URL on the quoted line so the value follows dotenv conventions;
ensure other exported lines (those already using export) remain unchanged and
keep naming identical to existing keys.

In `@external/dynamo/build_multi_lru_image.sh`:
- Around line 193-207: The script currently calls ./container/build.sh while
shell option set -e may cause the script to exit before BUILD_EXIT_CODE is
captured; modify the invocation around the build step (the block that calls
./container/build.sh and assigns BUILD_EXIT_CODE) to temporarily disable set -e
(e.g., set +e), run ./container/build.sh with the existing flags (framework
VLLM, --target "$BUILD_TARGET", --tag "$IMAGE_TAG", --enable-kvbm, build-arg and
vllm-max-jobs, $NO_CACHE, $DRY_RUN), capture the exit status into
BUILD_EXIT_CODE, then re-enable set -e (set -e) so subsequent error handling
works reliably; alternatively implement the &&/|| pattern to set BUILD_EXIT_CODE
explicitly after the command.
- Around line 122-128: When updating an existing clone (KVBM_NEXT_DIR) the
current git sequence uses git checkout "$BRANCH" which can fail if a local
branch named BRANCH doesn't exist; replace that step with logic that ensures a
local branch is created from the remote if missing: after git fetch origin,
check if the local branch exists (e.g., git show-ref --verify --quiet
refs/heads/"$BRANCH"), and if it does git checkout "$BRANCH", otherwise create
it from origin with git checkout -b "$BRANCH" "origin/$BRANCH" (or git branch
--track "$BRANCH" "origin/$BRANCH" then checkout) before running git pull and
git submodule update --init --recursive.

In `@external/dynamo/generalized/router.py`:
- Around line 1069-1071: The comment about create_service() being removed is
inconsistent with other router implementations that still call await
component.create_service(); inspect which Dynamo version the project targets
(pyproject.toml) and either (A) make removal intentional by removing
create_service() everywhere and update the comment near
runtime.namespace("dynamo").component("router") / WorkloadAwareRouter to
explicitly state “create_service removed in Dynamo >=0.8; router endpoints
register automatically” or (B) preserve compatibility by calling create_service
conditionally—e.g., check for hasattr(component, "create_service") or a
runtime/version flag and call await component.create_service() when present—so
that the behavior matches the optimized router implementation (which currently
calls create_service), keeping the comment aligned with the chosen approach.

In `@external/dynamo/optimized/processor_multilru.py`:
- Around line 507-510: Replace the logger.error call in the except block in
external/dynamo/optimized/processor_multilru.py with logger.exception so the
full stack trace is recorded when an exception is caught (keep the existing
message string and the subsequent self._metrics.router_errors_total.inc() and
return None, None intact); locate the except block that currently does
logger.error("Failed to pick worker: %s", e) and change it to
logger.exception(...) to capture the traceback for debugging.
- Around line 389-394: The _extract_hints function's signature and docstring are
out of sync with its actual return values: update the return type annotation
from tuple[str, int, str, str] to include the fifth value (likely a bool for
use_frequency_backend), e.g. tuple[str, int, str, str, bool], and revise the
docstring "Returns:" line to list all five returned items in the correct order
with brief types/names (prefix_id, total_requests, osl, iat,
use_frequency_backend) so documentation matches the implementation.
- Around line 661-664: The fire-and-forget call using
asyncio.create_task(self._update_kve_metrics_async(kve_data)) can let the Task
be garbage-collected; add a class-level set (e.g., self._background_tasks) in
the class __init__ and store the created Task in it, e.g., task =
asyncio.create_task(self._update_kve_metrics_async(kve_data));
self._background_tasks.add(task), and ensure completed tasks are removed (attach
a done callback or await/cleanup in shutdown) so tasks are tracked and not
prematurely collected.
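
A minimal sketch of the task-tracking pattern described in the last item above (class and method names are placeholders, not the actual processor API):

```python
import asyncio


class MetricsPublisher:
    """Keeps strong references to fire-and-forget tasks so they are not GC'd."""

    def __init__(self) -> None:
        self._background_tasks: set[asyncio.Task] = set()

    def publish(self, kve_data: dict) -> None:
        task = asyncio.create_task(self._update_kve_metrics_async(kve_data))
        self._background_tasks.add(task)
        # Drop the reference once the task completes.
        task.add_done_callback(self._background_tasks.discard)

    async def _update_kve_metrics_async(self, kve_data: dict) -> None:
        ...  # push kve_data into the Prometheus counters

    async def shutdown(self) -> None:
        # Let in-flight metric updates finish before exiting.
        await asyncio.gather(*self._background_tasks, return_exceptions=True)
```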
🧹 Nitpick comments (9)
external/dynamo/demo_priority_eviction.sh (1)

53-56: Consider adding defensive handling for the metrics fetch.

With set -euo pipefail, if the metrics endpoint doesn't contain prefix_cache_hits_total{, grep returns exit code 1 and terminates the script. Additionally, if the container isn't running or curl fails, downstream arithmetic on empty values will error.

🛠️ Proposed fix to add fallback for robustness
 get_hits() {
-  docker exec dynamo-vllm curl -s http://localhost:18081/metrics 2>/dev/null | \
-    grep "prefix_cache_hits_total{" | grep -v external | awk '{print $NF}'
+  local hits
+  hits=$(docker exec dynamo-vllm curl -s http://localhost:18081/metrics 2>/dev/null | \
+    grep "prefix_cache_hits_total{" | grep -v external | awk '{print $NF}' || true)
+  echo "${hits:-0}"
 }
external/dynamo/optimized/router_multilru.py (7)

79-81: Add noqa comment for intentionally unused arguments.

The fallback implementation intentionally ignores tokens and min_overlap to maintain interface compatibility. Add noqa to suppress the static analysis warning and document the intent.

Suggested fix
-        async def find_matches_for_request(self, tokens: list[int], min_overlap: int) -> OverlapScores:
+        async def find_matches_for_request(self, tokens: list[int], min_overlap: int) -> OverlapScores:  # noqa: ARG002
             """Find overlap scores for each worker. Returns empty scores (round-robin fallback)."""
             return OverlapScores({})

171-171: Consider using tempfile.gettempdir() for secure temporary directory handling.

Hardcoding /tmp can be insecure on multi-user systems. The path is configurable via config, but the default should use the system's secure temp directory.

Suggested fix
+import tempfile
+
 def get_builtin_defaults() -> dict[str, Any]:
     """Return built-in default configuration (matches config.yaml)."""
     return {
         # ... other config ...
         "debug": {
             "traces_enabled": False,
-            "trace_dir": "/tmp/dynamo_router_traces",
+            "trace_dir": os.path.join(tempfile.gettempdir(), "dynamo_router_traces"),
             "buffer_size": 2000,
         },
     }

378-408: Consider extracting shared Pydantic models to avoid duplication.

These models (RouterRequest, RouterResponse, FeedbackRequest, FeedbackAck) are duplicated in external/dynamo/generalized/router.py (lines 80-108) and processor_multilru.py. Consider extracting to a shared models.py module to follow the DRY principle.
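
If extracted, the shared module could be as small as the following sketch (field names are placeholders; the authoritative schemas are the duplicated definitions in router.py lines 80-108):

```python
# models.py - shared request/response schemas (illustrative fields only)
from pydantic import BaseModel


class RouterRequest(BaseModel):
    prefix_id: str | None = None
    tokens: list[int] = []


class RouterResponse(BaseModel):
    worker_id: int
    score: float = 0.0


class FeedbackRequest(BaseModel):
    worker_id: int
    latency_ms: float


class FeedbackAck(BaseModel):
    ok: bool = True
```

Both router.py and processor_multilru.py would then import these classes rather than redefining them.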


502-503: Log a warning when lin_forget is clamped to valid range.

The clamping silently modifies user-provided values. Consider logging when the value is adjusted to help with debugging.

Suggested fix
         self.lin_forget = float(lints_forget)
+        original_forget = self.lin_forget
         self.lin_forget = max(1e-6, min(self.lin_forget, 0.999999))
+        if self.lin_forget != original_forget:
+            logger.warning("lin_forget clamped from %s to %s", original_forget, self.lin_forget)

568-574: Consider logging exception details instead of just the message.

The blind Exception catch is acceptable here since trace writing should never crash the router, but logging only %s loses the exception type. Use logger.debug(..., exc_info=True) or at least include the exception type.

Suggested fix
         try:
             path = os.path.join(self.debug_trace_dir, "router_traces.jsonl")
             with open(path, "a", encoding="utf-8") as f:
                 f.write(json.dumps(item, separators=(",", ":")) + "\n")
-        except Exception as e:
-            logger.debug("Trace write failed: %s", e)
+        except Exception:
+            logger.debug("Trace write failed", exc_info=True)

1075-1075: Prefix unused variable with underscore.

chosen_ctx is unpacked but never used. Prefix with underscore to indicate intentional discard and suppress the linter warning.

Suggested fix
-        chosen, chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)
+        chosen, _chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)

1257-1264: Add return type hint to _get_underloaded.

Per coding guidelines, public/internal APIs should have type hints. This method returns a tuple that should be annotated.

Suggested fix
-    def _get_underloaded(self, metrics: dict[str, Any] | None):
+    def _get_underloaded(self, metrics: dict[str, Any] | None) -> tuple[int, float]:
         if not metrics or not metrics.get("endpoints"):
             wid = int(random.choice(list(self.engine_client.instance_ids())))
             return wid, 0.0
external/dynamo/optimized/processor_multilru.py (1)

157-157: Sort __slots__ alphabetically for consistency.

Static analysis recommends natural sorting of __slots__. This is a minor style convention.

Suggested fix
-    __slots__ = ("prompt_tokens", "cached_tokens", "device_blocks", "host_blocks", "disk_blocks")
+    __slots__ = ("cached_tokens", "device_blocks", "disk_blocks", "host_blocks", "prompt_tokens")

Comment on lines +16 to +18
export HF_HOME=/path/to/local/storage/.cache/huggingface

export HF_TOKEN=my_huggingface_read_token

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists
find . -path "*/external/dynamo/.env.example" -type f 2>/dev/null | head -20

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 99


🏁 Script executed:

# Get file size to determine how to read it
wc -l ./external/dynamo/.env.example 2>/dev/null || echo "File not found or error"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 103


🏁 Script executed:

# Read lines 14-20 to see context around lines 16-18
sed -n '14,20p' ./external/dynamo/.env.example

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 264


🏁 Script executed:

# Also check the full content to understand the file structure
cat -n ./external/dynamo/.env.example

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 6685


Comment out HF_TOKEN placeholder to prevent accidental use.

Exporting a dummy token as an active value can lead to confusing authentication failures if this file is sourced or used in automation. Commenting it out forces users to explicitly set the token themselves.

Suggested fix
-export HF_TOKEN=my_huggingface_read_token
+# export HF_TOKEN=your_huggingface_read_token
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

export HF_HOME=/path/to/local/storage/.cache/huggingface

export HF_TOKEN=my_huggingface_read_token

After:

export HF_HOME=/path/to/local/storage/.cache/huggingface

# export HF_TOKEN=your_huggingface_read_token
🤖 Prompt for AI Agents
In `@external/dynamo/.env.example` around lines 16 - 18, The HF_TOKEN placeholder
in the .env example is active and may be accidentally sourced; change the
HF_TOKEN export to be commented out (e.g., prefix the HF_TOKEN line with a
comment marker) so users must explicitly set or export HF_TOKEN themselves;
ensure HF_HOME remains exported and leave a commented example or instruction for
users to add their own HF_TOKEN instead of exporting a dummy value.

Comment on lines 59 to 62
# HTTP port for Dynamo frontend API
# Default: 8099
# DYNAMO_HTTP_PORT="8099"
# Default: 8000
DYNAMO_HTTP_PORT=8000


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, find the file and check if it exists
fd -type f -name ".env.example" external/dynamo/

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 239


🏁 Script executed:

# Read the file to inspect the specific lines mentioned
head -130 external/dynamo/.env.example | tail -80

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 2835


🏁 Script executed:

# Get line count and read the full file to see export patterns
wc -l external/dynamo/.env.example
cat -n external/dynamo/.env.example

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 6718


🏁 Script executed:

# Check for dotenv-linter configuration or references in the repo
fd -name ".dotenv*" -o -name "*dotenv*" | grep -E "(config|rc|yml|yaml)"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 299


🏁 Script executed:

# Search for dotenv-linter usage or configuration
rg "dotenv-linter" --type yaml --type json --type toml

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 51


🏁 Script executed:

# Check if dotenv-linter is installed and test it on the file
which dotenv-linter || echo "dotenv-linter not found in PATH"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 161


Add missing export statements and remove quotes for consistency with dotenv conventions.

The file uses export on some variables (lines 16, 18, 30, 33, 41) but omits it on others (lines 61, 110, 125-126). When sourced, variables without export won't propagate to child processes. Additionally, line 126 has quotes around the image URL, which is non-standard for .env files and can cause parsing issues.

Suggested fix
-DYNAMO_HTTP_PORT=8000
+export DYNAMO_HTTP_PORT=8000
...
-DYNAMO_KV_BLOCK_SIZE=64
+export DYNAMO_KV_BLOCK_SIZE=64
...
-DYNAMO_USE_MULTILRU=false
-DYNAMO_VLLM_IMAGE="nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.1"
+export DYNAMO_USE_MULTILRU=false
+export DYNAMO_VLLM_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.1
🤖 Prompt for AI Agents
In `@external/dynamo/.env.example` around lines 59 - 62, Update the .env.example
so all variables are consistently exported and remove unnecessary quotes: add
export to variables missing it (e.g., DYNAMO_HTTP_PORT and the other
non-exported entries referenced in the comment) so they propagate to child
processes, and remove the surrounding quotes from the image URL on the quoted
line so the value follows dotenv conventions; ensure other exported lines (those
already using export) remain unchanged and keep naming identical to existing
keys.

Comment on lines +122 to +128
if [ -d "$KVBM_NEXT_DIR" ]; then
echo "Updating existing $BRANCH branch..."
cd "$KVBM_NEXT_DIR"
git fetch origin
git checkout "$BRANCH"
git pull origin "$BRANCH"
git submodule update --init --recursive

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n external/dynamo/build_multi_lru_image.sh | head -150 | tail -40

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1745


🏁 Script executed:

# Test: Does git checkout <BRANCH> fail when local branch doesn't exist?
# This requires understanding git checkout behavior

# The concern is: if a repo was cloned on a different branch and BRANCH doesn't exist locally,
# does 'git checkout "$BRANCH"' fail?
# Let's verify the git behavior

git --version
# Check git documentation behavior
git checkout --help | grep -A 5 "If -b is not given"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 202


🏁 Script executed:

# Verify git checkout -B syntax
git checkout --help | grep -A 10 "\-B"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 166


🏁 Script executed:

# Test git checkout behavior when local branch doesn't exist
# Create a test repo to verify the behavior

mkdir -p /tmp/test-git
cd /tmp/test-git

# Initialize a bare repo to act as "origin"
mkdir -p origin.git
cd origin.git
git init --bare
cd ..

# Clone with depth=1 (simulating the actual script behavior)
git clone --depth 1 origin.git local_clone
cd local_clone

# Create a test branch on origin to simulate the scenario
cd ../origin.git
git config --add uploadpack.allowAnySHA1InWant true
cd ..

# Add content to the local clone and push
cd local_clone
echo "test" > file.txt
git config user.email "test@test.com"
git config user.name "Test"
git add file.txt
git commit -m "initial"

# Create a new branch locally and push to origin
git checkout -b testbranch
echo "feature" > feature.txt
git add feature.txt
git commit -m "feature"
git push -u origin testbranch

# Now simulate the problem: fetch and try checkout on a fresh scenario
# Create a new clone from origin to simulate a repo checked out on a different branch
cd ..
rm -rf fresh_clone
git clone --depth 1 --branch main origin.git fresh_clone
cd fresh_clone

# Now try: git fetch origin followed by git checkout testbranch
git fetch origin testbranch

# Test 1: Does git checkout testbranch work without -B?
echo "Test 1: git checkout testbranch (should fail or work depending on git version)"
git checkout testbranch 2>&1 || echo "CHECKOUT FAILED"

# Test 2: Does git checkout -B work?
echo ""
echo "Test 2: git checkout -B testbranch origin/testbranch (should succeed)"
git checkout -B testbranch origin/testbranch 2>&1 && echo "CHECKOUT -B SUCCEEDED"

cd /
rm -rf /tmp/test-git

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1794


🌐 Web query:

git checkout behavior when local branch doesn't exist but remote tracking branch exists

💡 Result:

  • If you run git checkout <name> (or git switch <name>) and there is no local branch <name>, but there is a remote-tracking branch like origin/<name> on exactly one remote, Git’s default “DWIM/guess” behavior is to create a new local branch <name> and set it up to track that remote branch (equivalent to git checkout -b <name> --track <remote>/<name>). [1][2]

  • If <name> exists on multiple remotes (e.g., both origin/<name> and upstream/<name> exist), Git won’t guess which one you meant and will require you to be explicit—unless you set checkout.defaultRemote (e.g., origin) to break ties. [1][3]

  • If you run git checkout origin/<name>, you are checking out a remote-tracking branch name directly, which typically leaves you in detached HEAD. To get a local branch, use git checkout <name> (DWIM) or git checkout -b <name> --track origin/<name>. [2]

  • You can disable/enable the guessing behavior with --no-guess / --guess (and configure it via checkout.guess). [1][3]

Sources: Git documentation for git-checkout (guessing rules, ambiguity, checkout.defaultRemote, --guess/--no-guess) [1][3], and examples/discussion of the DWIM behavior and detached-HEAD when checking out origin/<branch> [2].


Handle existing clones that don't yet have a local branch.

git checkout "$BRANCH" may fail when updating repositories cloned on a different branch. While modern Git versions (2.39+) can auto-create local branches via DWIM behavior, this relies on guessing logic that can be disabled or fail with non-standard configurations. The suggested fix explicitly creates the local branch from the remote tracking branch, ensuring reliable updates across all Git configurations and versions.

Suggested fix
-        git fetch origin
-        git checkout "$BRANCH"
-        git pull origin "$BRANCH"
+        git fetch origin "$BRANCH"
+        git checkout -B "$BRANCH" "origin/$BRANCH"
+        git pull --ff-only origin "$BRANCH"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

if [ -d "$KVBM_NEXT_DIR" ]; then
    echo "Updating existing $BRANCH branch..."
    cd "$KVBM_NEXT_DIR"
    git fetch origin
    git checkout "$BRANCH"
    git pull origin "$BRANCH"
    git submodule update --init --recursive

After:

if [ -d "$KVBM_NEXT_DIR" ]; then
    echo "Updating existing $BRANCH branch..."
    cd "$KVBM_NEXT_DIR"
    git fetch origin "$BRANCH"
    git checkout -B "$BRANCH" "origin/$BRANCH"
    git pull --ff-only origin "$BRANCH"
    git submodule update --init --recursive
🤖 Prompt for AI Agents
In `@external/dynamo/build_multi_lru_image.sh` around lines 122 - 128, When
updating an existing clone (KVBM_NEXT_DIR) the current git sequence uses git
checkout "$BRANCH" which can fail if a local branch named BRANCH doesn't exist;
replace that step with logic that ensures a local branch is created from the
remote if missing: after git fetch origin, check if the local branch exists
(e.g., git show-ref --verify --quiet refs/heads/"$BRANCH"), and if it does git
checkout "$BRANCH", otherwise create it from origin with git checkout -b
"$BRANCH" "origin/$BRANCH" (or git branch --track "$BRANCH" "origin/$BRANCH"
then checkout) before running git pull and git submodule update --init
--recursive.

Comment on lines +193 to +207
# Run the build
# Note: --enable-kvbm is automatically set for VLLM framework, but we set it explicitly for clarity
./container/build.sh \
    --framework VLLM \
    --target "$BUILD_TARGET" \
    --tag "$IMAGE_TAG" \
    --enable-kvbm \
    --build-arg "CARGO_BUILD_JOBS=$CARGO_BUILD_JOBS" \
    --vllm-max-jobs "$MAX_JOBS" \
    $NO_CACHE \
    $DRY_RUN

BUILD_EXIT_CODE=$?

if [ $BUILD_EXIT_CODE -eq 0 ]; then

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "build_multi_lru_image.sh" | head -20

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 111


🏁 Script executed:

wc -l external/dynamo/build_multi_lru_image.sh

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 113


🏁 Script executed:

head -50 external/dynamo/build_multi_lru_image.sh

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1887


🏁 Script executed:

sed -n '180,220p' external/dynamo/build_multi_lru_image.sh

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1295


🏁 Script executed:

sed -n '1,70p' external/dynamo/build_multi_lru_image.sh | grep -A 5 "set "

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 379


🏁 Script executed:

sed -n '200,230p' external/dynamo/build_multi_lru_image.sh

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1084


Build failure handling is bypassed by set -e.
With set -e enabled, a non‑zero exit from build.sh exits the script before BUILD_EXIT_CODE is set. Disable set -e temporarily, run the build, capture the exit code, then re-enable it:

set +e
./container/build.sh \
    --framework VLLM \
    --target "$BUILD_TARGET" \
    --tag "$IMAGE_TAG" \
    --enable-kvbm \
    --build-arg "CARGO_BUILD_JOBS=$CARGO_BUILD_JOBS" \
    --vllm-max-jobs "$MAX_JOBS" \
    $NO_CACHE \
    $DRY_RUN
BUILD_EXIT_CODE=$?
set -e

Alternatively, use an explicit pattern: ./container/build.sh ... && BUILD_EXIT_CODE=0 || BUILD_EXIT_CODE=$?

🤖 Prompt for AI Agents
In `@external/dynamo/build_multi_lru_image.sh` around lines 193 - 207, The script
currently calls ./container/build.sh while shell option set -e may cause the
script to exit before BUILD_EXIT_CODE is captured; modify the invocation around
the build step (the block that calls ./container/build.sh and assigns
BUILD_EXIT_CODE) to temporarily disable set -e (e.g., set +e), run
./container/build.sh with the existing flags (framework VLLM, --target
"$BUILD_TARGET", --tag "$IMAGE_TAG", --enable-kvbm, build-arg and vllm-max-jobs,
$NO_CACHE, $DRY_RUN), capture the exit status into BUILD_EXIT_CODE, then
re-enable set -e (set -e) so subsequent error handling works reliably;
alternatively implement the &&/|| pattern to set BUILD_EXIT_CODE explicitly
after the command.

Comment on lines 1069 to 1071
component = runtime.namespace("dynamo").component("router")
await component.create_service()
# NOTE: create_service() was removed in Dynamo 0.8.x - endpoint creation handles registration
logger.info("Initializing WorkloadAwareRouter (LinTS + feedback + timeout + traces)")

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check pinned Dynamo runtime versions across common dependency files
fd -a -E .git -E node_modules -E dist -E build -E .venv 'pyproject\.toml|requirements.*|poetry\.lock|uv\.lock|Pipfile|setup\.cfg' \
  -x rg -n "dynamo" {}

🏁 Script executed:

#!/bin/bash
# Examine the file context around the flagged lines
cat -n external/dynamo/generalized/router.py | sed -n '1060,1080p'

🏁 Script executed:

#!/bin/bash
# Search for version guards or Dynamo version handling in the router file
rg -n "version|0\.8|create_service" external/dynamo/generalized/router.py

🏁 Script executed:

#!/bin/bash
# Look for Dynamo version references across the codebase
rg -n "dynamo.*version|0\.8\.|create_service" --type py -A 2 -B 2 | head -100

🏁 Script executed:

#!/bin/bash
# Check context around create_service() calls in optimized files
rg -B5 -A5 "create_service()" external/dynamo/optimized/ | head -50

🏁 Script executed:

#!/bin/bash
# Look for any version guards or conditional Dynamo handling
rg -n "version|0\.8|0\.9|dynamo_worker|DistributedRuntime" external/dynamo/ -A 2 -B 2 | grep -E "(external/dynamo/(optimized|generalized)/(processor|router)\.py|version|0\.[0-9])" | head -80

🏁 Script executed:

#!/bin/bash
# Check if there's documentation or setup.py defining Dynamo version requirements
fd -E .git -E node_modules 'setup.py|pyproject.toml|README.md|CHANGELOG.md' -x rg -l "dynamo" {} \; | head -10

🏁 Script executed:

#!/bin/bash
# Get more context on the actual endpoint registration mentioned in the comment
rg -n "endpoint.*creat|creat.*endpoint|register" external/dynamo/generalized/router.py -B 3 -A 3 | head -60

🏁 Script executed:

#!/bin/bash
# Search for endpoint creation in the router file after the logger.info call
sed -n '1070,1120p' external/dynamo/generalized/router.py | cat -n

🏁 Script executed:

#!/bin/bash
# Check what differences exist between optimized/router.py and optimized/router_multilru.py
rg -B10 -A10 "create_service|NOTE: create" external/dynamo/optimized/router.py | head -40

🏁 Script executed:

#!/bin/bash
# Look for pyproject.toml to check Dynamo version constraints
cat pyproject.toml | grep -A 10 -B 2 "dynamo"


Resolve inconsistency: create_service() still used in other router implementations.

The comment states create_service() was removed in Dynamo 0.8.x, but external/dynamo/optimized/router.py (line 1346) still explicitly calls await component.create_service(). Meanwhile, this file and external/dynamo/optimized/router_multilru.py remove the call entirely. Without version pinning in pyproject.toml, it's unclear which Dynamo version this code targets. Verify whether the removal is intentional for different deployment modes (generalized vs. optimized) or a partial migration that needs completion. If intentional, update the comment to clarify when the difference applies.

🤖 Prompt for AI Agents
In `@external/dynamo/generalized/router.py` around lines 1069-1071: The comment
about create_service() being removed is inconsistent with other router
implementations that still call await component.create_service(); inspect which
Dynamo version the project targets (pyproject.toml) and either (A) make removal
intentional by removing create_service() everywhere and update the comment near
runtime.namespace("dynamo").component("router") / WorkloadAwareRouter to
explicitly state “create_service removed in Dynamo >=0.8; router endpoints
register automatically” or (B) preserve compatibility by calling create_service
conditionally—e.g., check for hasattr(component, "create_service") or a
runtime/version flag and call await component.create_service() when present—so
that the behavior matches the optimized router implementation (which currently
calls create_service), keeping the comment aligned with the chosen approach.
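A minimal sketch of option B from the prompt above, reusing the runtime and component objects quoted in the diff (the init_router wrapper is a hypothetical stand-in for the existing setup code):

async def init_router(runtime) -> None:
    component = runtime.namespace("dynamo").component("router")
    # Dynamo releases that still expose create_service() get the explicit call;
    # newer releases fall through and rely on endpoint creation to register.
    create_service = getattr(component, "create_service", None)
    if create_service is not None:
        await create_service()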

Comment on lines +389 to +394
def _extract_hints(self, request: dict[str, Any]) -> tuple[str, int, str, str]:
    """
    Extract routing hints from PreprocessedRequest annotations.

    Returns: (prefix_id, total_requests, osl, iat, use_frequency_backend)
    """

⚠️ Potential issue | 🟡 Minor

Update docstring to match actual return values.

The docstring's Returns line names five values, but the return type annotation tuple[str, int, str, str] declares only four. The actual implementation returns five values, including use_frequency_backend.

Suggested fix
     def _extract_hints(self, request: dict[str, Any]) -> tuple[str, int, str, str]:
         """
         Extract routing hints from PreprocessedRequest annotations.

-        Returns: (prefix_id, total_requests, osl, iat, use_frequency_backend)
+        Returns:
+            Tuple of (prefix_id, total_requests, osl, iat, use_frequency_backend).
         """

Also update the return type annotation:

-    def _extract_hints(self, request: dict[str, Any]) -> tuple[str, int, str, str]:
+    def _extract_hints(self, request: dict[str, Any]) -> tuple[str, int, str, str, bool]:
🤖 Prompt for AI Agents
In `@external/dynamo/optimized/processor_multilru.py` around lines 389-394: The
_extract_hints function's signature and docstring are out of sync with its
actual return values: update the return type annotation from tuple[str, int,
str, str] to include the fifth value (likely a bool for use_frequency_backend),
e.g. tuple[str, int, str, str, bool], and revise the docstring "Returns:" line
to list all five returned items in the correct order with brief types/names
(prefix_id, total_requests, osl, iat, use_frequency_backend) so documentation
matches the implementation.
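Putting the two suggested edits together, the signature and docstring would read as follows (a sketch; the method body is elided as in the diff):

from typing import Any

def _extract_hints(self, request: dict[str, Any]) -> tuple[str, int, str, str, bool]:
    """
    Extract routing hints from PreprocessedRequest annotations.

    Returns:
        Tuple of (prefix_id, total_requests, osl, iat, use_frequency_backend).
    """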

Comment on lines +507 to +510
except Exception as e:
    logger.error("Failed to pick worker: %s", e)
    self._metrics.router_errors_total.inc()
    return None, None

⚠️ Potential issue | 🟡 Minor

Use logger.exception() when catching and not re-raising.

Per coding guidelines, when catching and logging exceptions without re-raising, use logger.exception() to capture the full stack trace.

Suggested fix
         except Exception as e:
-            logger.error("Failed to pick worker: %s", e)
+            logger.exception("Failed to pick worker")
             self._metrics.router_errors_total.inc()
             return None, None
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 507-507: Do not catch blind exception: Exception

(BLE001)


[warning] 508-508: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
In `@external/dynamo/optimized/processor_multilru.py` around lines 507-510:
Replace the logger.error call in the except block in
external/dynamo/optimized/processor_multilru.py with logger.exception so the
full stack trace is recorded when an exception is caught (keep the existing
message string and the subsequent self._metrics.router_errors_total.inc() and
return None, None intact); locate the except block that currently does
logger.error("Failed to pick worker: %s", e) and change it to
logger.exception(...) to capture the traceback for debugging.
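For context, a standalone illustration of the difference (a demo script, not project code):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("router")

try:
    raise RuntimeError("no healthy workers")  # stand-in failure
except Exception:
    # logger.error(...) would record only the message; logger.exception()
    # records the message plus the full traceback of the active exception.
    logger.exception("Failed to pick worker")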

Comment on lines +661 to +664
# Fire-and-forget KVE metric update (async, non-blocking)
# This ensures KVE computation has ZERO impact on routing throughput
if kve_data is not None:
    asyncio.create_task(self._update_kve_metrics_async(kve_data))

⚠️ Potential issue | 🟡 Minor

Store reference to asyncio.create_task() to prevent garbage collection.

Fire-and-forget tasks created with asyncio.create_task() can be garbage collected before completion if no reference is held. Store the task reference or add it to a background task set.

Suggested fix

Add a class-level set to track background tasks in __init__:

self._background_tasks: set[asyncio.Task] = set()

Then modify the create_task call:

                     if kve_data is not None:
-                        asyncio.create_task(self._update_kve_metrics_async(kve_data))
+                        task = asyncio.create_task(self._update_kve_metrics_async(kve_data))
+                        self._background_tasks.add(task)
+                        task.add_done_callback(self._background_tasks.discard)
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 664-664: Store a reference to the return value of asyncio.create_task

(RUF006)

🤖 Prompt for AI Agents
In `@external/dynamo/optimized/processor_multilru.py` around lines 661-664: The
fire-and-forget call using
asyncio.create_task(self._update_kve_metrics_async(kve_data)) can let the Task
be garbage-collected; add a class-level set (e.g., self._background_tasks) in
the class __init__ and store the created Task in it, e.g., task =
asyncio.create_task(self._update_kve_metrics_async(kve_data));
self._background_tasks.add(task), and ensure completed tasks are removed (attach
a done callback or await/cleanup in shutdown) so tasks are tracked and not
prematurely collected.
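A self-contained sketch of the tracked fire-and-forget pattern described above (KveMetricsMixin and the placeholder body are hypothetical; _update_kve_metrics_async is the method named in the diff):

import asyncio

class KveMetricsMixin:
    def __init__(self) -> None:
        # Strong references keep fire-and-forget tasks alive until completion.
        self._background_tasks: set[asyncio.Task] = set()

    def _spawn_kve_update(self, kve_data: dict) -> None:
        task = asyncio.create_task(self._update_kve_metrics_async(kve_data))
        self._background_tasks.add(task)
        # Drop the strong reference once the task finishes.
        task.add_done_callback(self._background_tasks.discard)

    async def _update_kve_metrics_async(self, kve_data: dict) -> None:
        await asyncio.sleep(0)  # placeholder for the real metric update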

Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/eval_config_no_rethinking_minimal_test.yml (1)

103-111: ⚠️ Potential issue | 🟠 Major

Update eval_llm to use port 8000 for consistency.

The eval_llm uses port 8099 while dynamo_llm in the same file uses port 8000. The profile_rethinking_full_test.yml config shows that eval_llm works correctly with port 8000, and most other eval configs use the same port for both LLMs. Update line 106 to base_url: http://localhost:8000/v1.

🤖 Fix all issues with AI agents
In `@external/dynamo/optimized/processor_multilru.py`:
- Around lines 742-769: Update parse_args() to declare and return an
argparse.Namespace return type and add a new CLI flag for kv_cache_block_size
(e.g., parser.add_argument("--kv-cache-block-size", type=int, default=1,
help="...")), then propagate that parsed value into register_llm where
kv_cache_block_size is currently hardcoded to 1 so the function uses
args.kv_cache_block_size; ensure register_llm signature or call sites accept the
new parameter if needed and update any references in the file (including the
other occurrence around lines 809-816) to use the parsed value.

In `@external/dynamo/optimized/processor.py`:
- Around lines 497-500: Replace the plain error log with an exception log so the
full stack trace is captured: in the exception handler that currently calls
logger.error("Failed to pick worker: %s", e) (the block that then increments
self._metrics.router_errors_total and returns None, None), use
logger.exception(...) instead (preserving the message text) so the stack trace
is recorded when the method fails to pick a worker.
- Line 654: The create_task call that fires-and-forgets
self._update_kve_metrics_async(kve_data) can be garbage-collected; add a
persistent container (e.g., self._background_tasks: set[asyncio.Task]) in the
class __init__ and replace the bare asyncio.create_task(...) with creating the
task, adding it to self._background_tasks, and attaching a done callback to
remove the task from the set when finished; reference the existing call to
self._update_kve_metrics_async and the class initializer to locate where to add
the set and where to store/remove the task.

In `@external/dynamo/optimized/router_multilru.py`:
- Around lines 1343-1345: The multilru router/processor files are inconsistent
with router.py/processor.py regarding component.create_service(); update
router_multilru.py and processor_multilru.py to match the non-multilru files by
conditionally invoking create_service when present: after obtaining component
via runtime.namespace("dynamo").component("router") (and similarly for
processor), check for the existence of create_service on the component and await
it if available (e.g., use a guard like hasattr or getattr to avoid failing on
Dynamo 0.8.x), and update/remove the outdated comment so both variants behave
consistently across Dynamo versions.
🧹 Nitpick comments (13)
external/dynamo/monitoring/scripts/kv_event_observer.py (2)

171-206: Consider validating the host input for defensive coding.

The urlopen call constructs URLs from the host and metrics_port parameters. While this is an operator-run monitoring script, rejecting a malformed host before building the URL would be more defensive against misconfiguration. (Checking the constructed URL with startswith("http://") would always pass, since the scheme is hardcoded.)

🛡️ Optional: Add host validation
     def _poll_metrics(self):
         """Background thread to poll Prometheus metrics for cache hits."""
+        if "://" in self.host or "/" in self.host:
+            raise ValueError(f"Invalid metrics host: {self.host!r}")
         metrics_url = f"http://{self.host}:{self.metrics_port}/metrics"
 
         while self.running:

507-508: Consider prefixing unused signal handler arguments.

The signal handler signature requires (signum, frame) arguments but they're unused. Prefixing with _ improves clarity and silences linter warnings.

♻️ Optional: Prefix unused args
-    signal.signal(signal.SIGINT, lambda s, f: setattr(observer, 'running', False))
-    signal.signal(signal.SIGTERM, lambda s, f: setattr(observer, 'running', False))
+    signal.signal(signal.SIGINT, lambda _s, _f: setattr(observer, 'running', False))
+    signal.signal(signal.SIGTERM, lambda _s, _f: setattr(observer, 'running', False))
external/dynamo/optimized/router.py (5)

79-81: Unused parameters in fallback KvIndexer.find_matches_for_request.

The tokens and min_overlap parameters are intentionally unused in this fallback stub implementation. Consider prefixing them with underscores to indicate they are deliberately ignored.

Suggested fix
-        async def find_matches_for_request(self, tokens: list[int], min_overlap: int) -> OverlapScores:
+        async def find_matches_for_request(self, _tokens: list[int], _min_overlap: int) -> OverlapScores:
             """Find overlap scores for each worker. Returns empty scores (round-robin fallback)."""
             return OverlapScores({})

171-171: Hardcoded /tmp path for debug traces.

Using /tmp as the default trace directory can be a security concern in shared environments. Consider using a user-specific temp directory or requiring explicit configuration.

Suggested fix using tempfile
+import tempfile
+
 def get_builtin_defaults() -> dict[str, Any]:
     """Return built-in default configuration (matches config.yaml)."""
     return {
         ...
         "debug": {
             "traces_enabled": False,
-            "trace_dir": "/tmp/dynamo_router_traces",
+            "trace_dir": os.path.join(tempfile.gettempdir(), "dynamo_router_traces"),
             "buffer_size": 2000,
         },
     }

Apply a similar change to the constructor default at line 473.

Also applies to: 473-473


568-573: Use logger.debug with exception info for trace write failures.

The current implementation catches a blind Exception and logs with %s. Consider using logger.debug(..., exc_info=True) to capture the full context for debugging trace write issues.

Suggested fix
         try:
             path = os.path.join(self.debug_trace_dir, "router_traces.jsonl")
             with open(path, "a", encoding="utf-8") as f:
                 f.write(json.dumps(item, separators=(",", ":")) + "\n")
-        except Exception as e:
-            logger.debug("Trace write failed: %s", e)
+        except OSError:
+            logger.debug("Trace write failed", exc_info=True)

1073-1073: Unused variable chosen_ctx should be prefixed with underscore.

The unpacked chosen_ctx variable is not used. Prefix it with an underscore to indicate intentional discard.

Suggested fix
-        chosen, chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)
+        chosen, _chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)

1266-1333: Add return type annotation to parse_args().

Per coding guidelines, public functions require type hints on return values.

Suggested fix
-def parse_args():
+def parse_args() -> argparse.Namespace:
     """Parse minimal CLI arguments.
external/dynamo/optimized/processor.py (1)

157-157: Sort __slots__ tuple for consistency.

Ruff suggests sorting __slots__ alphabetically for maintainability.

Suggested fix
-    __slots__ = ("prompt_tokens", "cached_tokens", "device_blocks", "host_blocks", "disk_blocks")
+    __slots__ = ("cached_tokens", "device_blocks", "disk_blocks", "host_blocks", "prompt_tokens")
external/dynamo/optimized/processor_multilru.py (1)

157-157: Sort __slots__ tuple for consistency.

Same issue as in processor.py - consider sorting alphabetically.

Suggested fix
-    __slots__ = ("prompt_tokens", "cached_tokens", "device_blocks", "host_blocks", "disk_blocks")
+    __slots__ = ("cached_tokens", "device_blocks", "disk_blocks", "host_blocks", "prompt_tokens")
external/dynamo/optimized/router_multilru.py (4)

79-81: Unused parameters in fallback KvIndexer.find_matches_for_request.

Same issue as router.py - prefix unused parameters with underscores.

Suggested fix
-        async def find_matches_for_request(self, tokens: list[int], min_overlap: int) -> OverlapScores:
+        async def find_matches_for_request(self, _tokens: list[int], _min_overlap: int) -> OverlapScores:
             """Find overlap scores for each worker. Returns empty scores (round-robin fallback)."""
             return OverlapScores({})

171-171: Hardcoded /tmp path for debug traces.

Same security concern as router.py - consider using tempfile.gettempdir() or requiring explicit configuration.

Also applies to: 473-473, 1388-1388


1073-1073: Unused variable chosen_ctx should be prefixed with underscore.

Same issue as router.py.

Suggested fix
-        chosen, chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)
+        chosen, _chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)

1266-1333: Add return type annotation to parse_args().

Same issue as router.py - public functions require return type hints.

Suggested fix
-def parse_args():
+def parse_args() -> argparse.Namespace:
     """Parse minimal CLI arguments.

Comment on lines +742 to +769
def parse_args():
    """Parse command-line arguments for the processor."""
    parser = argparse.ArgumentParser(description="Optimized Thompson Sampling Processor")
    parser.add_argument(
        "--enable-router",
        action="store_true",
        default=True,
        help="Enable Thompson Sampling router integration",
    )
    parser.add_argument(
        "--no-router",
        action="store_false",
        dest="enable_router",
        help="Disable router (use engine load balancing only)",
    )
    parser.add_argument(
        "--model-path",
        type=str,
        required=True,
        help="Path to the model directory (for loading tokenizer and model card)",
    )
    parser.add_argument(
        "--model-name",
        type=str,
        required=True,
        help="Served model name (must match frontend's --model-name)",
    )
    return parser.parse_args()

⚠️ Potential issue | 🟡 Minor

Add return type to parse_args() and thread kv_cache_block_size through CLI.

Same issue as processor.py - parse_args() lacks return type and kv_cache_block_size is hardcoded to 1.

Suggested fix
-def parse_args():
+def parse_args() -> argparse.Namespace:
     """Parse command-line arguments for the processor."""
     parser = argparse.ArgumentParser(description="Optimized Thompson Sampling Processor")
     ...
+    parser.add_argument(
+        "--kv-cache-block-size",
+        type=int,
+        default=int(os.environ.get("DYNAMO_KV_BLOCK_SIZE", "64")),
+        help="KV cache block size in tokens (must match workers' --page-size)",
+    )
     return parser.parse_args()

Then update register_llm:

     await register_llm(
         ...
-        kv_cache_block_size=1,
+        kv_cache_block_size=args.kv_cache_block_size,
     )

Also applies to: 809-816

🤖 Prompt for AI Agents
In `@external/dynamo/optimized/processor_multilru.py` around lines 742-769:
Update parse_args() to declare and return an argparse.Namespace return type and
add a new CLI flag for kv_cache_block_size (e.g.,
parser.add_argument("--kv-cache-block-size", type=int, default=1, help="...")),
then propagate that parsed value into register_llm where kv_cache_block_size is
currently hardcoded to 1 so the function uses args.kv_cache_block_size; ensure
register_llm signature or call sites accept the new parameter if needed and
update any references in the file (including the other occurrence around lines
809-816) to use the parsed value.
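Combined, a minimal sketch of the flag and its propagation (the DYNAMO_KV_BLOCK_SIZE default comes from the suggestion above; register_llm's other arguments are elided):

import argparse
import os

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Optimized Thompson Sampling Processor")
    parser.add_argument(
        "--kv-cache-block-size",
        type=int,
        default=int(os.environ.get("DYNAMO_KV_BLOCK_SIZE", "64")),
        help="KV cache block size in tokens (must match workers' --page-size)",
    )
    return parser.parse_args()

# Later, when registering the model:
#     await register_llm(..., kv_cache_block_size=args.kv_cache_block_size)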

Comment on lines +497 to +500
except Exception as e:
    logger.error("Failed to pick worker: %s", e)
    self._metrics.router_errors_total.inc()
    return None, None

⚠️ Potential issue | 🟡 Minor

Use logger.exception() when catching exceptions without re-raising.

Per coding guidelines, when catching and logging exceptions without re-raising, use logger.exception() to capture full stack trace information.

Suggested fix
         except Exception as e:
-            logger.error("Failed to pick worker: %s", e)
+            logger.exception("Failed to pick worker")
             self._metrics.router_errors_total.inc()
             return None, None
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 497-497: Do not catch blind exception: Exception

(BLE001)


[warning] 498-498: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
In `@external/dynamo/optimized/processor.py` around lines 497-500: Replace the
plain error log with an exception log so the full stack trace is captured: in
the exception handler that currently calls logger.error("Failed to pick worker:
%s", e) (the block that then increments self._metrics.router_errors_total and
returns None, None), use logger.exception(...) instead (preserving the message
text) so the stack trace is recorded when the method fails to pick a worker.

# Fire-and-forget KVE metric update (async, non-blocking)
# This ensures KVE computation has ZERO impact on routing throughput
if kve_data is not None:
    asyncio.create_task(self._update_kve_metrics_async(kve_data))

⚠️ Potential issue | 🟡 Minor

Store reference to asyncio.create_task() to prevent garbage collection.

Fire-and-forget tasks can be garbage collected before completion. Store the task reference to ensure it completes.

Suggested fix

Add a class-level set in __init__:

self._background_tasks: set[asyncio.Task] = set()

Then modify the create_task call:

                     if kve_data is not None:
-                        asyncio.create_task(self._update_kve_metrics_async(kve_data))
+                        task = asyncio.create_task(self._update_kve_metrics_async(kve_data))
+                        self._background_tasks.add(task)
+                        task.add_done_callback(self._background_tasks.discard)
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 654-654: Store a reference to the return value of asyncio.create_task

(RUF006)

🤖 Prompt for AI Agents
In `@external/dynamo/optimized/processor.py` at line 654: The create_task call
that fires-and-forgets self._update_kve_metrics_async(kve_data) can be
garbage-collected; add a persistent container (e.g., self._background_tasks:
set[asyncio.Task]) in the class __init__ and replace the bare
asyncio.create_task(...) with creating the task, adding it to
self._background_tasks, and attaching a done callback to remove the task from
the set when finished; reference the existing call to
self._update_kve_metrics_async and the class initializer to locate where to add
the set and where to store/remove the task.

Comment on lines +1343 to +1345
component = runtime.namespace("dynamo").component("router")
# NOTE: create_service() was removed in Dynamo 0.8.x - endpoint creation handles registration
logger.info("Initializing Optimized Thompson Sampling Router (Prometheus metrics)")

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

rg -n "create_service" external/dynamo/optimized/

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 600


🏁 Script executed:

cd external/dynamo/optimized && \
echo "=== router.py around line 1344 ===" && \
sed -n '1340,1350p' router.py && \
echo "" && \
echo "=== router_multilru.py around line 1343-1345 ===" && \
sed -n '1340,1350p' router_multilru.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1012


🏁 Script executed:

# Also check if there are other router files
ls -la external/dynamo/optimized/router*.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 249


Resolve create_service() inconsistency between router.py and router_multilru.py.

router.py calls await component.create_service() at line 1344, while router_multilru.py omits it with a note that the function was removed in Dynamo 0.8.x. The same pattern appears in processor.py vs processor_multilru.py. Clarify whether the multilru variants target a different Dynamo version or if the comment is outdated.

🤖 Prompt for AI Agents
In `@external/dynamo/optimized/router_multilru.py` around lines 1343-1345: The
multilru router/processor files are inconsistent with router.py/processor.py
regarding component.create_service(); update router_multilru.py and
processor_multilru.py to match the non-multilru files by conditionally invoking
create_service when present: after obtaining component via
runtime.namespace("dynamo").component("router") (and similarly for processor),
check for the existence of create_service on the component and await it if
available (e.g., use a guard like hasattr or getattr to avoid failing on Dynamo
0.8.x), and update/remove the outdated comment so both variants behave
consistently across Dynamo versions.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@external/dynamo/monitoring/README.md`:
- Line 3: Replace the inanimate possessive phrase "ai-dynamo's Prometheus API"
in the README with a non-possessive form such as "the ai-dynamo Prometheus API"
or "the Prometheus API from ai-dynamo" so the sentence reads e.g. "Metrics are
collected at 2-second resolution directly from the ai-dynamo Prometheus API for
per-request granularity."; locate the exact phrase "ai-dynamo's Prometheus API"
in the README and update it to one of the suggested alternatives.

@@ -0,0 +1,982 @@
# Dynamo Monitoring Stack

This directory contains a Prometheus + Grafana monitoring setup for the Dynamo LLM inference stack with Thompson Sampling router. Metrics are collected at **2-second resolution** directly from ai-dynamo's Prometheus API for per-request granularity.

⚠️ Potential issue | 🟡 Minor

Avoid inanimate possessive.

The phrasing "ai-dynamo's Prometheus API" uses an inanimate possessive. Per documentation style guidelines, prefer "the Prometheus API from ai-dynamo" or "the ai-dynamo Prometheus API."

✏️ Suggested change
-This directory contains a Prometheus + Grafana monitoring setup for the Dynamo LLM inference stack with Thompson Sampling router. Metrics are collected at **2-second resolution** directly from ai-dynamo's Prometheus API for per-request granularity.
+This directory contains a Prometheus + Grafana monitoring setup for the Dynamo LLM inference stack with Thompson Sampling router. Metrics are collected at **2-second resolution** directly from the ai-dynamo Prometheus API for per-request granularity.
🤖 Prompt for AI Agents
In `@external/dynamo/monitoring/README.md` at line 3: Replace the inanimate
possessive phrase "ai-dynamo's Prometheus API" in the README with a
non-possessive form such as "the ai-dynamo Prometheus API" or "the Prometheus
API from ai-dynamo" so the sentence reads e.g. "Metrics are collected at
2-second resolution directly from the ai-dynamo Prometheus API for per-request
granularity."; locate the exact phrase "ai-dynamo's Prometheus API" in the
README and update it to one of the suggested alternatives.

Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
Signed-off-by: bbednarski9 <bbednarski@nvidia.com>

Labels

improvement Improvement to existing functionality non-breaking Non-breaking change
