feature: dynamo integration with nat profiler and prometheus/grafana dashboard #1486
base: develop
Conversation
Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
Walkthrough

Adds an optimized Thompson Sampling router and processors for Dynamo with dynamic discovery; extensive Prometheus/Grafana monitoring, dashboards, and metric-collection tooling; multi-worker startup/stop/orchestration scripts and image-build tooling; NAT profiler integration for Dynamo metrics; transport-level LLM hint injection; and numerous docs and demo utilities.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Frontend
    participant Processor
    participant Router
    participant Worker
    participant Prometheus
    Client->>Frontend: HTTP generate request (headers / nvext.annotations)
    Frontend->>Processor: route to backend.generate
    Processor->>Processor: extract hints (prefix_id, reuse_budget, osl, iat)
    Processor->>Router: find_worker(tokens, prefix_id, reuse_budget)
    Router->>Prometheus: record decision metric
    Router-->>Processor: worker_id, decision_id
    Processor->>Worker: forward/stream generate request
    Worker-->>Processor: stream response chunks
    Processor->>Prometheus: update latency/token metrics
    Processor->>Prometheus: async update KVE metrics
    Processor-->>Frontend: stream response back
    Frontend-->>Client: deliver response
    Processor->>Router: feedback(decision_id, latency, success)
    Router->>Prometheus: record feedback / update baselines
```
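The `find_worker`/`feedback` loop in the diagram is the classic Thompson Sampling pattern. The PR's actual `router.py` is not reproduced in this review, so the following is only a minimal Beta-Bernoulli sketch of that loop (all names hypothetical):

```python
import random
import uuid
from dataclasses import dataclass, field


@dataclass
class ThompsonRouter:
    """Sketch: sample a Beta posterior per worker and route to the best draw."""

    worker_ids: list[str]
    posteriors: dict[str, list[float]] = field(default_factory=dict)
    pending: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Beta(1, 1) uniform prior for every worker.
        self.posteriors = {w: [1.0, 1.0] for w in self.worker_ids}

    def find_worker(self) -> tuple[str, str]:
        # One posterior draw per worker; the highest draw wins this request.
        samples = {w: random.betavariate(a, b) for w, (a, b) in self.posteriors.items()}
        worker_id = max(samples, key=samples.get)
        decision_id = uuid.uuid4().hex
        self.pending[decision_id] = worker_id
        return worker_id, decision_id

    def feedback(self, decision_id: str, success: bool) -> None:
        # Reward updates sharpen the served worker's posterior over time.
        worker_id = self.pending.pop(decision_id)
        self.posteriors[worker_id][0 if success else 1] += 1.0
```

Each feedback call updates the served worker's Beta posterior, so workers that keep succeeding win more draws over time.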
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Important: Pre-merge checks failed. Please resolve all errors before merging; addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)
✅ Passed checks (1 passed)

✨ Finishing touches: 🧪 Generate unit tests (beta)
Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
Actionable comments posted: 11
Note: Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
external/dynamo/start_dynamo_unified.sh (1)
17-25: Update the header comment to match the new multi-worker defaults. The header still mentions GPUs 0-3 with TP=4, but the defaults are now GPUs 0-7 with TP=2 per worker and multiple workers. Please refresh the description. As per coding guidelines.

📝 Example update

```diff
-# - Unified Worker (GPUs 0,1,2,3, TP=4, no disaggregation)
+# - Unified Workers (default GPUs 0-7, TP=2 per worker, no disaggregation)
```
🤖 Fix all issues with AI agents
In `@external/dynamo/collect_metrics.sh`:
- Around line 27-41: The curl calls that write to "$frontend_file" and
"$worker_file" should use --fail and --max-time (e.g., --max-time 10) and check
curl's exit status so failures don't create empty/partial files; fetch into temp
files, verify each curl succeeded, log an error or warning to stderr with
context ("frontend metrics" / "worker metrics") and either exit 1 or skip
inclusion into "$combined_file" based on desired behavior, then atomically move
the temp files to "$frontend_file" and "$worker_file" before building
"$combined_file".
In
`@external/dynamo/monitoring/grafana/provisioning/dashboards/json/dynamo-overview.json`:
- Around line 1-4: Add the SPDX Apache-2.0 header object according to the repo’s
JSON convention immediately before the existing dashboard content (i.e., before
the top-level "annotations" key); update the JSON to include the standard SPDX
fields (e.g., "spdx": { "license": "Apache-2.0", "copyright": "...", ... })
following the repository’s header schema so the file begins with the SPDX header
object then the existing dashboard object with "annotations".
In `@external/dynamo/monitoring/README.md`:
- Around line 1-3: Add the SPDX Apache-2.0 license header as the very first line
of the README before the "# Dynamo Monitoring Stack" title; update the README.md
for the "Dynamo Monitoring Stack" document to include the exact SPDX identifier
"SPDX-License-Identifier: Apache-2.0" on its own line at the top of the file.
In `@external/dynamo/optimized/ARCHITECTURE.md`:
- Around line 1-2: The ARCHITECTURE.md file titled "Optimized Thompson Sampling
Router Architecture" is missing the required SPDX Apache-2.0 header; add the
standard SPDX header comment (e.g., "SPDX-License-Identifier: Apache-2.0") at
the very top of the file before any content so the document complies with
licensing guidelines and linter checks.
In `@external/dynamo/optimized/PARAMETERS.md`:
- Around line 1-3: Add the standard SPDX Apache-2.0 header to the top of
PARAMETERS.md to comply with licensing guidelines; insert the SPDX identifier
line "SPDX-License-Identifier: Apache-2.0" (and optional year/owner comment
above or beside it per project convention) as the first lines of the file so the
new Markdown doc describing WorkloadAwareRouter in router.py includes the
required license header.
In `@external/dynamo/optimized/processor.py`:
- Around line 718-745: Add a return type annotation to parse_args() (->
argparse.Namespace) and add a new CLI argument "--kv-cache-block-size"
(type=int) that defaults from the DYNAMO_KV_BLOCK_SIZE environment variable or
64 when unset; ensure parse_args returns the Namespace with this value and then
thread that value into the call to register_llm(...) (or wherever the model/card
is registered) so register_llm receives kv_cache_block_size instead of the
hard-coded 1, keeping the argument name kv_cache_block_size consistent with
existing worker/frontend usage.
In `@external/dynamo/optimized/README.md`:
- Around line 1-3: Add the SPDX license header to the top of the README so the
file begins with the standard SPDX identifier before the existing title line "#
Optimized Thompson Sampling Router Architecture"; insert a single-line header
"SPDX-License-Identifier: Apache-2.0" (with no other characters) as the first
line of the file so the README follows the repository's Markdown licensing
guideline.
In `@external/dynamo/start_dynamo_optimized_thompson_hints.sh`:
- Around line 284-301: The script currently uses read -p to prompt for HF_TOKEN
which can echo the secret to the terminal; change the interactive prompt to use
a silent read (e.g., read -s -r HF_TOKEN) so the token is not displayed, add a
newline/confirmation echo after the silent read to keep UX clean, and ensure the
later checks that set HF_TOKEN="dummy" or echo "✓ HuggingFace token received"
remain unchanged; update the prompt logic around the HF_TOKEN and
LOCAL_MODEL_DIR checks to use the silent read and not print the token.
- Around line 77-83: Validate GPU/TP sizing after computing NUM_GPUS and before
using NUM_WORKERS: check that TP_SIZE is a positive integer and that NUM_GPUS >=
TP_SIZE and that NUM_GPUS is divisible by TP_SIZE so NUM_WORKERS computed from
NUM_GPUS/TP_SIZE is > 0; if the check fails, print a clear error using
WORKER_GPUS, TP_SIZE and NUM_GPUS context and exit non‑zero to avoid later
indexing in wait_for_worker or using CONTAINER_GPU_INDICES when NUM_WORKERS ==
0. Ensure the validation runs right after NUM_WORKERS is assigned and prevents
further execution on bad configs.
In `@external/dynamo/start_dynamo_unified.sh`:
- Around line 53-55: NUM_WORKERS calculation can yield 0 or incorrect values
when TP_SIZE > NUM_GPUS or when NUM_GPUS is not divisible by TP_SIZE; add
validation after computing NUM_WORKERS (and before accessing WORKER_PIDS) to
check that NUM_GPUS and TP_SIZE are positive integers, that TP_SIZE <= NUM_GPUS,
and that NUM_GPUS % TP_SIZE == 0, and if any check fails print a clear error and
exit (use the same logging/exit pattern as the script). Also consider computing
NUM_WORKERS as an integer division only after validation to avoid truncation
surprises and reference NUM_WORKERS, NUM_GPUS, TP_SIZE and WORKER_PIDS in the
error message so the failure is easy to debug.
In `@src/nat/llm/dynamo_llm.py`:
- Around line 393-410: The code always generates a prefix ID even when
prefix_template is intended to disable hints; change the logic in the block that
sets unique_id/prefix_id so that if prefix_template is None you set prefix_id =
None (do not generate unique_id or call uuid.uuid4()), otherwise generate
unique_id and compute prefix_id = prefix_template.format(uuid=unique_id) (or
fallback f"nat-dynamo-{unique_id}" when prefix_template is truthy but empty),
update the logger to only log the created prefix when prefix_id is not None, and
pass that prefix_id (possibly None) into _DynamoTransport so the transport can
skip injecting hints when prefix_id is None.
🟡 Minor comments (19)
examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/profile_rethinking_full_test.yml-243-246 (1)
243-246: Update the Prometheus scrape interval reference to match the actual config. The comment states Prometheus scrapes every 5s, but the Prometheus configuration in `external/dynamo/monitoring/prometheus.yml` defines `scrape_interval: 2s`. Update the minimum range calculation accordingly.

Proposed fix

```diff
- # Minimum: 15s (Prometheus scrapes every 5s, need ≥3 points for reliable rates)
+ # Minimum: 6s (Prometheus scrapes every 2s, need ≥3 points for reliable rates)
```

external/dynamo/optimized/ARCHITECTURE.md-13-14 (1)
13-14: Replace inanimate possessive 's in Markdown. Phrases like "Frontend's built-in router" and "Dynamo's `--model-name`/`--model-path`" violate the doc style rule. Please rephrase (e.g., "the built-in router of the frontend", "the Dynamo `--model-name`/`--model-path`"). As per coding guidelines.

✍️ Example rephrases

```diff
-3. **Frontend's built-in router becomes irrelevant**
+3. **The built-in router of the frontend becomes irrelevant**
-| Model Mapping | Custom `FRONTEND_MODEL_MAPPING` | Dynamo's `--model-name`/`--model-path` |
+| Model Mapping | Custom `FRONTEND_MODEL_MAPPING` | the Dynamo `--model-name`/`--model-path` |
-> default frontend, but this is **irrelevant** in our architecture. The frontend's built-in
+> default frontend, but this is **irrelevant** in our architecture. The built-in router of the frontend
```

Also applies to: 216-224, 360-363
external/dynamo/optimized/ARCHITECTURE.md-15-21 (1)
15-21: Add language identifiers to fenced code blocks (MD040). Markdownlint flags these fenced blocks for missing language tags. Please add `text`, `bash`, or `json` as appropriate.

🧩 Example fix

````diff
-```
+```text
 Frontend (built-in router: round-robin)
   → routes to dynamo.backend.generate → OUR PROCESSOR (intercepts!)
   → queries Thompson Sampling router
   → forwards to dynamo.worker.generate (actual SGLang workers)
-```
+```
````

Also applies to: 23-24, 297-304, 307-313
src/nat/llm/dynamo_llm.py-22-23 (1)
22-23: Docstring still references event hooks; update to transport-based injection. The module now uses a custom transport, but the docstring still describes event hooks.

📝 Suggested update

```diff
-The implementation uses httpx event hooks to inject hints at the HTTP transport level,
+The implementation uses a custom httpx transport to inject hints at the HTTP level,
```

external/dynamo/monitoring/README.md-383-392 (1)
383-392: Wrap bare URLs in inline code or links (MD034). Bare URLs are flagged by markdownlint; please wrap them in inline code or convert to links (e.g., `http://localhost:3000`).

🔗 Example fix

````diff
-Then open http://localhost:3000 in your browser.
+Then open `http://localhost:3000` in your browser.
 ...
 Example queries:
 ```promql
 # Request rate (requests/second)
 rate(dynamo_frontend_requests_total[1m])
````

Also applies to: 414-488
external/dynamo/optimized/README.md-7-30 (1)
7-30: Specify languages for the untyped fenced blocks (MD040). Add a language such as `text` to the architecture diagram and expected-output snippet to satisfy markdownlint.

🧹 Example fix

````diff
-```
+```text
 ┌─────────────────────────────────────────────────────────────────────────┐
 │ Client Request (with nvext.annotations)                                 │
 ...
 └─────────────────────────────────────────────────────────────────────────┘
-```
+```
 ...
-```
+```text
 Step 3: Starting Custom Processor (Registers as backend.generate)...
 ...
 INFO processor.generate: Routing decision: worker=... decision=...
-```
+```
````

Also applies to: 157-174
external/dynamo/optimized/README.md-13-48 (1)
13-48: Avoid possessive 's for inanimate objects in docs. Rephrase instances like "Processor's model card", "Frontend's ModelWatcher", and "frontend's expected model" to non-possessive forms for doc style compliance. As per coding guidelines.

✍️ Example wording updates

```diff
-│   ↓ finds Processor's model card! │
+│   ↓ finds the model card for the processor │
 ...
-3. **Frontend's ModelWatcher** discovers the processor's model card
+3. **Frontend ModelWatcher** discovers the model card for the processor
 ...
-- `--model-name`: Served model name (must match frontend's expected model)
+- `--model-name`: Served model name (must match the expected model for the frontend)
```

Also applies to: 86-88
external/dynamo/monitoring/README.md-24-55 (1)
24-55: Add language identifiers to fenced blocks flagged by markdownlint. Several code fences (diagram, KVES formula, file tree, metric lists) are missing a language tag. Add `text` (or a more specific language) to satisfy MD040.

🧹 Example fixes

````diff
-```
+```text
 ┌──────────────────────────────────────────────────────────────────────────────┐
 │ Dynamo Stack                                                                 │
 ...
 └──────────────────────────────────────────────────────────────────────────────┘
-```
+```
 ...
-```
+```text
 KVES = (TotalWork - ActualWork) / TotalWork ∈ [0,1]
 ...
 w_hit = (w_gpu_hit, w_cpu_hit, w_disk_hit)  # weights per hit source
-```
+```
````

Also applies to: 165-173, 295-308, 531-545
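As an aside on the KVES formula shown in that fence: it is simply the fraction of prompt work avoided by weighted cache hits. A toy computation, with invented tier weights and token counts (not values from the monitoring README), might look like:

```python
# Toy KVES (KV-cache efficiency score) calculation. ActualWork is modeled as
# TotalWork minus the weighted tokens served from each cache tier.
total_tokens = 10_000                              # TotalWork (no-cache prompt tokens)
gpu_hit, cpu_hit, disk_hit = 6_000, 1_000, 500     # tokens served per tier
w_gpu_hit, w_cpu_hit, w_disk_hit = 1.0, 0.5, 0.2   # hypothetical per-tier weights

saved = w_gpu_hit * gpu_hit + w_cpu_hit * cpu_hit + w_disk_hit * disk_hit
actual_work = total_tokens - saved
kves = (total_tokens - actual_work) / total_tokens  # equivalently: saved / total_tokens
print(f"KVES = {kves:.3f}")                         # -> KVES = 0.660, 66% of work avoided
```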
src/nat/profiler/profile_runner.py-230-251 (1)
230-251: Preserve stack traces and satisfy BLE001 when Dynamo metrics collection fails. The broad catch logs with `logger.warning`, which drops the stack trace and still triggers BLE001. If the broad catch is intentional, switch to `logger.exception` and annotate with `# noqa: BLE001` (or narrow the exception list). As per coding guidelines.

🛠️ Suggested change

```diff
-        except Exception as e:
-            logger.warning("Failed to collect Dynamo metrics: %s", e)
+        except Exception:  # noqa: BLE001
+            logger.exception("Failed to collect Dynamo metrics")
```

external/dynamo/start_dynamo_unified.sh-480-482 (1)
480-482: Fix the separator typo in the output banner. The separator line prints `...ca...`; this looks like an accidental keystroke.

🧹 Proposed fix

```diff
-    echo "=====================================ca===================="
+    echo "========================================================="
```

external/dynamo/monitoring/README.md-185-195 (1)
185-195: Doc style tweaks: avoid inanimate possessives and hyphenate compounds. Rephrase "SGLang's/Dynamo's …" and change "KV cache related" → "KV cache-related"; also remove the blank line inside the blockquote to satisfy MD028. As per coding guidelines.

✍️ Example wording updates

```diff
-> **Why use SGLang's native metric?** SGLang computes cache hit rate internally but doesn't include
+> **Why use the native metric from SGLang?** SGLang computes cache hit rate internally but doesn't include
 > `cached_tokens` in its API responses. The processor's `thompson_kve_*` counters will show 0
 > unless the underlying engine provides `usage.prompt_tokens_details.cached_tokens`.
-
-> **Note on Full KVES**: To implement the full KVES equation with CPU/disk hit weights, you would need
+> **Note on the full KVES**: To implement the full KVES equation with CPU/disk hit weights, you would need
 ...
-This section documents the working status of all KV cache related metrics across the Dynamo stack.
+This section documents the working status of all KV cache-related metrics across the Dynamo stack.
```

external/dynamo/monitoring/grafana/provisioning/dashboards/json/dynamo-overview.json-96-104 (1)
96-104: Align the Requests panel query with the "req/min" unit. The panel is labeled "Requests (1m)" with unit "reqpm", but the query uses a 5s window. Use a 1m increase window to match the declared per-minute metric.

Suggested query change

```diff
- "expr": "sum(increase(dynamo_frontend_requests_total[5s]))",
+ "expr": "sum(increase(dynamo_frontend_requests_total[1m]))",
```

external/dynamo/optimized/router.py-1054-1062 (1)
1054-1062: Remove TODO in production code. The TODO for metrics querying violates the no-TODOs guideline. Either implement a stubbed query with an explicit config gate or track it in an issue. As per coding guidelines, avoid TODOs in committed code/comments.
Do you want me to draft a minimal metrics client stub (or open an issue template) for this block?
src/nat/profiler/inference_optimization/dynamo_metrics.py-1059-1062 (1)
1059-1062: Docstring typo: `kv_cache_efficiency` doesn't exist. The field is `kv_efficiency`. This example will mislead users. As per coding guidelines, keep documentation accurate.

✅ Suggested fix

```diff
- print(f"KV Efficiency: {core.kv_cache_efficiency:.2%}")
+ print(f"KV Efficiency: {core.kv_efficiency:.2%}")
```

external/dynamo/.env.example-108-115 (1)
108-115: Export KV tuning vars so child processes see them. Without `export`, these values won't reach the startup scripts and defaults will be used.

✅ Suggested fix

```diff
-DYNAMO_KV_BLOCK_SIZE=64
+export DYNAMO_KV_BLOCK_SIZE=64
@@
-DYNAMO_MEM_FRACTION_STATIC=0.9
+export DYNAMO_MEM_FRACTION_STATIC=0.9
```

external/dynamo/optimized/processor.py-232-234 (1)
232-234: Remove hard-coded Dynamo version in docstring. Version numbers shouldn't be embedded in `.py` docs. As per coding guidelines, keep docs version-agnostic.

✅ Suggested fix

```diff
- # doesn't expose custom bucket configuration in Dynamo 0.7.1)
+ # doesn't expose custom bucket configuration in the current Dynamo release)
```

external/dynamo/optimized/router.py-1259-1328 (1)
1259-1328: Add return type hint for the CLI parser. Public functions require return type hints. Add `-> argparse.Namespace` to `parse_args()`.

✅ Suggested fix

```diff
-def parse_args():
+def parse_args() -> argparse.Namespace:
```

external/dynamo/optimized/processor.py-488-491 (1)
488-491: Use `logger.exception()` to capture stack trace information. This exception handler catches and logs without re-raising, so it should use `logger.exception()` to preserve the full stack trace. Also, when using `logger.exception()`, omit the exception object from the format arguments—it's automatically captured:

Suggested fix

```diff
         except Exception as e:
-            logger.error("Failed to pick worker: %s", e)
+            logger.exception("Failed to pick worker")
             self._metrics.router_errors_total.inc()
             return None, None
```

Consider narrowing the exception type if possible based on what the router client can raise.
src/nat/profiler/inference_optimization/dynamo_metrics.py-717-720 (1)
717-720: Use `logger.exception` when swallowing errors without re-raising. These exception handlers catch exceptions without re-raising and don't capture stack traces. According to coding guidelines, use `logger.exception()` (or `logger.error(exc_info=True)`) to preserve full diagnostic information for exceptions caught without re-raising.

✅ Example adjustment for lines 717-720

```diff
         except Exception as e:
             error_msg = f"Failed to collect {metric_name}: {e}"
-            logger.warning(error_msg)
+            logger.exception(error_msg)
             result.errors.append(error_msg)
```

Also applies to:

- Lines 972-974: Replace `logger.debug()` with `logger.exception()`
- Lines 1001-1002 and 1008-1009: Add `logger.exception()` calls to capture stack traces before appending to the errors list
🧹 Nitpick comments (4)
external/dynamo/monitoring/docker-compose.yml (1)
33-42: Follow the environment variable pattern for Grafana credentials and use safer defaults. This file is documented for local development only, but hardcoded admin credentials and anonymous Admin access should still follow environment-variable defaults for consistency with best practices. Additionally, fix the file ending: the file currently ends with 2 blank lines instead of 1.

Recommended changes

```diff
-      - GF_AUTH_ANONYMOUS_ENABLED=true
-      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
-      - GF_AUTH_DISABLE_LOGIN_FORM=true
-      - GF_SECURITY_ADMIN_USER=admin
-      - GF_SECURITY_ADMIN_PASSWORD=admin
+      - GF_AUTH_ANONYMOUS_ENABLED=${GRAFANA_ANON_ENABLED:-false}
+      - GF_AUTH_ANONYMOUS_ORG_ROLE=${GRAFANA_ANON_ROLE:-Viewer}
+      - GF_AUTH_DISABLE_LOGIN_FORM=${GRAFANA_DISABLE_LOGIN_FORM:-false}
+      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
+      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}
```

Remove the trailing blank lines at the end of the file so it ends with a single newline.
src/nat/data_models/profiler.py (1)
44-105: Trim the `DynamoMetricsConfig` docstring to Google-style format. Guidelines require Google-style docstrings for public classes; consider a concise summary plus an `Attributes:` section, and move the long how-to content to external docs. As per coding guidelines.
572-584: Replace hard-coded worker instance queries with dynamic per-instance aggregation.The current queries for refId D and E pin specific instances (
localhost:18081andlocalhost:18082), which breaks the panel if the worker count changes. Usesum by (instance)to automatically show all worker instances:♻️ Suggested query change
- { - "expr": "rate(dynamo_component_requests_total{dynamo_namespace=\"workers\",dynamo_component=\"worker\",instance=\"localhost:18081\"}[5s])", - "legendFormat": "4. Worker 0 (18081) [rate(dynamo_component_requests_total{worker:18081}[5s])]", - "refId": "D" - }, - { - "expr": "rate(dynamo_component_requests_total{dynamo_namespace=\"workers\",dynamo_component=\"worker\",instance=\"localhost:18082\"}[5s])", - "legendFormat": "4. Worker 1 (18082) [rate(dynamo_component_requests_total{worker:18082}[5s])]", - "refId": "E" - }, + { + "expr": "sum by (instance) (rate(dynamo_component_requests_total{dynamo_namespace=\"workers\",dynamo_component=\"worker\"}[5s]))", + "legendFormat": "4. Worker ({{instance}}) [rate(dynamo_component_requests_total{worker}[5s])]", + "refId": "D" + },external/dynamo/optimized/router.py (1)
472-557: Harden debug trace storage locationDefaulting to
/tmpis vulnerable to symlink races on shared systems. Consider a user-scoped cache dir and restrictive permissions.♻️ Suggested hardening
- debug_trace_dir: str = "/tmp/dynamo_router_traces", + debug_trace_dir: str = "~/.cache/dynamo_router_traces", @@ - self.debug_trace_dir = str(debug_trace_dir) + self.debug_trace_dir = os.path.expanduser(str(debug_trace_dir)) @@ - os.makedirs(self.debug_trace_dir, exist_ok=True) + os.makedirs(self.debug_trace_dir, mode=0o700, exist_ok=True)
```bash
# Collect frontend metrics
curl -s http://localhost:8000/metrics > "$frontend_file" 2>/dev/null

# Collect worker metrics
curl -s http://localhost:8081/metrics > "$worker_file" 2>/dev/null

# Create combined file with headers
{
    echo "# Collected at: $(date -Iseconds)"
    echo "# === FRONTEND METRICS ==="
    cat "$frontend_file"
    echo ""
    echo "# === WORKER METRICS ==="
    cat "$worker_file"
} > "$combined_file"
```
🧩 Analysis chain

🏁 Script executed:

```bash
find . -name "collect_metrics.sh" -type f
```

Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 105

🏁 Script executed:

```bash
cat -n ./external/dynamo/collect_metrics.sh
```

Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 2456
Add timeout and failure handling for metric collection.
The curl commands lack timeouts and error detection. If endpoints are unavailable, curl can hang indefinitely or write empty files, which silently corrupts the combined and rolling metrics with misleading data. Add --fail, --max-time, and proper error handling.
🔧 Suggested approach
```diff
-  # Collect frontend metrics
-  curl -s http://localhost:8000/metrics > "$frontend_file" 2>/dev/null
+  # Collect frontend metrics
+  if ! curl -fsS --max-time 5 http://localhost:8000/metrics > "$frontend_file"; then
+      echo "  Warning: frontend metrics unavailable"
+      return 1
+  fi
-  # Collect worker metrics
-  curl -s http://localhost:8081/metrics > "$worker_file" 2>/dev/null
+  # Collect worker metrics
+  if ! curl -fsS --max-time 5 http://localhost:8081/metrics > "$worker_file"; then
+      echo "  Warning: worker metrics unavailable"
+      return 1
+  fi
```

Note: If returning early is too strict for a continuous collector, consider logging warnings instead and continuing, or use exit 1 to stop the script entirely on failures.
🤖 Prompt for AI Agents
In `@external/dynamo/collect_metrics.sh` around lines 27 - 41, The curl calls that
write to "$frontend_file" and "$worker_file" should use --fail and --max-time
(e.g., --max-time 10) and check curl's exit status so failures don't create
empty/partial files; fetch into temp files, verify each curl succeeded, log an
error or warning to stderr with context ("frontend metrics" / "worker metrics")
and either exit 1 or skip inclusion into "$combined_file" based on desired
behavior, then atomically move the temp files to "$frontend_file" and
"$worker_file" before building "$combined_file".
```json
{
  "annotations": {
    "list": []
  },
```
Add SPDX Apache-2.0 header to the dashboard JSON.
Repository policy requires SPDX headers for .json files; please add the standard header using the repo’s JSON convention before the dashboard content. As per coding guidelines.
📝 Example header fields
```diff
 {
+  "__comment_spdx": "SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.",
+  "__comment_license": "SPDX-License-Identifier: Apache-2.0",
   "annotations": {
     "list": []
   },
```

🤖 Prompt for AI Agents
In
`@external/dynamo/monitoring/grafana/provisioning/dashboards/json/dynamo-overview.json`
around lines 1 - 4, Add the SPDX Apache-2.0 header object according to the
repo’s JSON convention immediately before the existing dashboard content (i.e.,
before the top-level "annotations" key); update the JSON to include the standard
SPDX fields (e.g., "spdx": { "license": "Apache-2.0", "copyright": "...", ... })
following the repository’s header schema so the file begins with the SPDX header
object then the existing dashboard object with "annotations".
external/dynamo/monitoring/README.md (Outdated)
```markdown
# Dynamo Monitoring Stack

This directory contains a Prometheus + Grafana monitoring setup for the Dynamo LLM inference stack with Thompson Sampling router.
```
Add SPDX Apache-2.0 header at the top of this README.
The repository requires the SPDX header at the beginning of every .md file. Please add it before the title. As per coding guidelines.
📝 Proposed header
```diff
+<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
+<!-- SPDX-License-Identifier: Apache-2.0 -->
+
 # Dynamo Monitoring Stack
```
🤖 Prompt for AI Agents
In `@external/dynamo/monitoring/README.md` around lines 1 - 3, Add the SPDX
Apache-2.0 license header as the very first line of the README before the "#
Dynamo Monitoring Stack" title; update the README.md for the "Dynamo Monitoring
Stack" document to include the exact SPDX identifier "SPDX-License-Identifier:
Apache-2.0" on its own line at the top of the file.
```markdown
# Optimized Thompson Sampling Router Architecture
```
Add SPDX Apache-2.0 header to the new architecture doc.
This Markdown file is missing the required SPDX header comment at the top. As per coding guidelines, ...
🧾 Proposed header addition
```diff
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
 # Optimized Thompson Sampling Router Architecture
```
🤖 Prompt for AI Agents
In `@external/dynamo/optimized/ARCHITECTURE.md` around lines 1 - 2, The
ARCHITECTURE.md file titled "Optimized Thompson Sampling Router Architecture" is
missing the required SPDX Apache-2.0 header; add the standard SPDX header
comment (e.g., "SPDX-License-Identifier: Apache-2.0") at the very top of the
file before any content so the document complies with licensing guidelines and
linter checks.
```markdown
# Thompson Sampling Router Parameters

This document describes all configurable parameters for the `WorkloadAwareRouter` in `router.py`.
```
Add SPDX Apache-2.0 header to the new Markdown doc.
This file is missing the required SPDX header at the top. Please add the standard header comment. As per coding guidelines, ...
🧾 Proposed header addition
```diff
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
 # Thompson Sampling Router Parameters
```
🤖 Prompt for AI Agents
In `@external/dynamo/optimized/PARAMETERS.md` around lines 1 - 3, Add the standard
SPDX Apache-2.0 header to the top of PARAMETERS.md to comply with licensing
guidelines; insert the SPDX identifier line "SPDX-License-Identifier:
Apache-2.0" (and optional year/owner comment above or beside it per project
convention) as the first lines of the file so the new Markdown doc describing
WorkloadAwareRouter in router.py includes the required license header.
```markdown
# Optimized Thompson Sampling Router Architecture

This directory contains the optimized implementation of the Thompson Sampling router for Dynamo, using the "Processor-as-Backend" pattern with **Dynamic Discovery** to intercept requests from the default Dynamo frontend.
```
Add SPDX Apache-2.0 header to this README.
Markdown files must begin with the SPDX header; this file currently starts with the title. Please add the standard header comment before the heading. As per coding guidelines.
📝 Proposed header
```diff
+<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
+<!-- SPDX-License-Identifier: Apache-2.0 -->
+
 # Optimized Thompson Sampling Router Architecture
```
🤖 Prompt for AI Agents
In `@external/dynamo/optimized/README.md` around lines 1 - 3, Add the SPDX license
header to the top of the README so the file begins with the standard SPDX
identifier before the existing title line "# Optimized Thompson Sampling Router
Architecture"; insert a single-line header "SPDX-License-Identifier: Apache-2.0"
(with no other characters) as the first line of the file so the README follows
the repository's Markdown licensing guideline.
```bash
# Compute container-internal GPU indices (GPUs are renumbered 0,1,2,... inside the container)
NUM_GPUS=$(echo "$WORKER_GPUS" | tr ',' '\n' | wc -l)
CONTAINER_GPU_INDICES=$(seq -s, 0 $((NUM_GPUS - 1)))

# Calculate number of workers based on available GPUs and TP size
NUM_WORKERS=$((NUM_GPUS / TP_SIZE))
```
Guard against invalid GPU/TP_SIZE combos that yield zero workers
If NUM_GPUS < TP_SIZE or not divisible, NUM_WORKERS becomes 0 and later wait_for_worker indexes an empty array. Add a fast-fail validation here.
🐛 Suggested fix
```diff
 NUM_WORKERS=$((NUM_GPUS / TP_SIZE))
+if (( NUM_GPUS % TP_SIZE != 0 )); then
+  echo "ERROR: NUM_GPUS ($NUM_GPUS) must be divisible by TP_SIZE ($TP_SIZE)"
+  exit 1
+fi
+if (( NUM_WORKERS < 1 )); then
+  echo "ERROR: NUM_WORKERS computed as $NUM_WORKERS; check DYNAMO_GPU_DEVICES and DYNAMO_TP_SIZE"
+  exit 1
+fi
```
🤖 Prompt for AI Agents
In `@external/dynamo/start_dynamo_optimized_thompson_hints.sh` around lines 77 -
83, Validate GPU/TP sizing after computing NUM_GPUS and before using
NUM_WORKERS: check that TP_SIZE is a positive integer and that NUM_GPUS >=
TP_SIZE and that NUM_GPUS is divisible by TP_SIZE so NUM_WORKERS computed from
NUM_GPUS/TP_SIZE is > 0; if the check fails, print a clear error using
WORKER_GPUS, TP_SIZE and NUM_GPUS context and exit non‑zero to avoid later
indexing in wait_for_worker or using CONTAINER_GPU_INDICES when NUM_WORKERS ==
0. Ensure the validation runs right after NUM_WORKERS is assigned and prevents
further execution on bad configs.
```bash
# Verify HF_TOKEN is set
if [ -z "${HF_TOKEN:-}" ]; then
    echo ""
    echo "⚠ HF_TOKEN environment variable is not set."
    echo ""
    if [ -d "$LOCAL_MODEL_DIR" ]; then
        echo "✓ Local model found - proceeding without HF_TOKEN"
        HF_TOKEN="dummy"
    else
        echo "✗ Local model NOT found and no HF_TOKEN to download it"
        echo ""
        read -p "Please enter your HuggingFace token (or press Enter to skip): " HF_TOKEN
        if [ -z "$HF_TOKEN" ]; then
            echo "WARNING: Proceeding without HF_TOKEN."
            HF_TOKEN="dummy"
        else
            echo "✓ HuggingFace token received"
        fi
```
Avoid echoing HF tokens to the terminal
`read -p` echoes the token; use silent input to avoid leaking credentials into the terminal and its history.
🔒 Suggested fix
```diff
-        read -p "Please enter your HuggingFace token (or press Enter to skip): " HF_TOKEN
+        read -s -p "Please enter your HuggingFace token (or press Enter to skip): " HF_TOKEN
+        echo
```
🤖 Prompt for AI Agents
In `@external/dynamo/start_dynamo_optimized_thompson_hints.sh` around lines 284 -
301, The script currently uses read -p to prompt for HF_TOKEN which can echo the
secret to the terminal; change the interactive prompt to use a silent read
(e.g., read -s -r HF_TOKEN) so the token is not displayed, add a
newline/confirmation echo after the silent read to keep UX clean, and ensure the
later checks that set HF_TOKEN="dummy" or echo "✓ HuggingFace token received"
remain unchanged; update the prompt logic around the HF_TOKEN and
LOCAL_MODEL_DIR checks to use the silent read and not print the token.
```bash
# Calculate number of workers based on available GPUs and TP size
NUM_WORKERS=$((NUM_GPUS / TP_SIZE))
```
Guard against zero or non‑divisible GPU/TP configurations.
If TP_SIZE is greater than NUM_GPUS or not divisible, NUM_WORKERS becomes 0 (or truncates), and WORKER_PIDS[0] is accessed later. Add validation before continuing.
✅ Suggested guard
```diff
 # Calculate number of workers based on available GPUs and TP size
 NUM_WORKERS=$((NUM_GPUS / TP_SIZE))
+if [ $((NUM_GPUS % TP_SIZE)) -ne 0 ]; then
+  echo "ERROR: NUM_GPUS ($NUM_GPUS) must be divisible by TP_SIZE ($TP_SIZE)"
+  exit 1
+fi
+if [ "$NUM_WORKERS" -lt 1 ]; then
+  echo "ERROR: No workers can be started (NUM_GPUS=$NUM_GPUS, TP_SIZE=$TP_SIZE)"
+  exit 1
+fi
```
🤖 Prompt for AI Agents
In `@external/dynamo/start_dynamo_unified.sh` around lines 53 - 55, NUM_WORKERS
calculation can yield 0 or incorrect values when TP_SIZE > NUM_GPUS or when
NUM_GPUS is not divisible by TP_SIZE; add validation after computing NUM_WORKERS
(and before accessing WORKER_PIDS) to check that NUM_GPUS and TP_SIZE are
positive integers, that TP_SIZE <= NUM_GPUS, and that NUM_GPUS % TP_SIZE == 0,
and if any check fails print a clear error and exit (use the same logging/exit
pattern as the script). Also consider computing NUM_WORKERS as an integer
division only after validation to avoid truncation surprises and reference
NUM_WORKERS, NUM_GPUS, TP_SIZE and WORKER_PIDS in the error message so the
failure is easy to debug.
```python
# Generate the prefix ID once
unique_id = uuid.uuid4().hex[:16]
if prefix_template:
    prefix_id = prefix_template.format(uuid=unique_id)
else:
    prefix_id = f"nat-dynamo-{unique_id}"

logger.debug("Created Dynamo client with prefix ID: %s", prefix_id)

# Create a base transport and wrap it with our custom transport
base_transport = httpx.AsyncHTTPTransport()
dynamo_transport = _DynamoTransport(
    transport=base_transport,
    prefix_id=prefix_id,
    total_requests=total_requests,
    osl=osl,
    iat=iat,
)
```
prefix_template=None still injects hints; disable injection as documented.
The docstring says prefix_template=None disables prefix hints, but the current logic always generates a prefix ID. This makes the “disable” path ineffective.
🐛 Proposed fix
```diff
-    # Generate the prefix ID once
-    unique_id = uuid.uuid4().hex[:16]
-    if prefix_template:
-        prefix_id = prefix_template.format(uuid=unique_id)
-    else:
-        prefix_id = f"nat-dynamo-{unique_id}"
+    # If hints are disabled, return a plain client
+    if prefix_template is None:
+        return httpx.AsyncClient(timeout=httpx.Timeout(timeout))
+
+    # Generate the prefix ID once
+    unique_id = uuid.uuid4().hex[:16]
+    prefix_id = prefix_template.format(uuid=unique_id)
```

🤖 Prompt for AI Agents
In `@src/nat/llm/dynamo_llm.py` around lines 393 - 410, The code always generates
a prefix ID even when prefix_template is intended to disable hints; change the
logic in the block that sets unique_id/prefix_id so that if prefix_template is
None you set prefix_id = None (do not generate unique_id or call uuid.uuid4()),
otherwise generate unique_id and compute prefix_id =
prefix_template.format(uuid=unique_id) (or fallback f"nat-dynamo-{unique_id}"
when prefix_template is truthy but empty), update the logger to only log the
created prefix when prefix_id is not None, and pass that prefix_id (possibly
None) into _DynamoTransport so the transport can skip injecting hints when
prefix_id is None.
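For orientation, the change this prompt describes can be pictured as a transport wrapper that becomes a no-op when `prefix_id` is None. This is only a sketch — not the actual `_DynamoTransport` from this PR — and the `x-prefix-id` header name is an assumption:

```python
import httpx


class HintInjectingTransport(httpx.AsyncBaseTransport):
    """Sketch: wrap a transport; inject a routing hint only when enabled."""

    def __init__(self, transport: httpx.AsyncBaseTransport, prefix_id: str | None):
        self._transport = transport
        self._prefix_id = prefix_id  # None disables hint injection entirely

    async def handle_async_request(self, request: httpx.Request) -> httpx.Response:
        if self._prefix_id is not None:
            # Hypothetical header name; the PR's real hint format is not shown here.
            request.headers["x-prefix-id"] = self._prefix_id
        return await self._transport.handle_async_request(request)
```

With `prefix_id=None` the wrapper forwards requests untouched, which is the disabled path the review asks for.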
Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
Actionable comments posted: 13
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
.pre-commit-config.yaml (1)
37-42: Add `jupyter` to `additional_dependencies` — the hook will fail without it. The shell script is properly executable with a correct shebang, but it calls `jupyter nbconvert` on line 25. Since `language: python` creates an isolated venv, only `nbconvert` is installed from `additional_dependencies: ["nbconvert"]`. The `jupyter` command won't be available, causing the hook to fail at runtime. Add `"jupyter"` to additional_dependencies. (Note: `nbconvert` is already pinned to 7.16.6 in `uv.lock`, so the version pinning concern is addressed.)

external/dynamo/start_dynamo_disagg.sh (1)
27-27: Stale comment: port reference is outdated. The comment states "Frontend: Port 8099" but the default HTTP port has been changed to 8000 on line 39. Update the comment to reflect the new default.

```diff
-# Frontend: Port 8099 (HTTP API)
+# Frontend: Port 8000 (HTTP API, configurable via DYNAMO_HTTP_PORT)
```

external/dynamo/start_dynamo_unified_thompson_hints.sh (1)
28-28: Stale comment: port reference is outdated. The comment states "Frontend: Port 8099" but the default HTTP port has been changed to 8000 on line 44. Update the comment for consistency.

```diff
-# Frontend: Port 8099 (HTTP API)
+# Frontend: Port 8000 (HTTP API with prefix hint headers)
```
🤖 Fix all issues with AI agents
In `@external/dynamo/monitoring/docker-compose.yml`:
- Around line 30-43: The Grafana service is exposed insecurely via network_mode:
host with anonymous admin access and login disabled; update the grafana service
to bind only to localhost (remove network_mode: host and use a ports mapping
like 127.0.0.1:3000:3000) and harden the environment defaults by setting
GF_AUTH_ANONYMOUS_ENABLED=false and GF_AUTH_DISABLE_LOGIN_FORM=false while
keeping GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD present; make
these defaults overrideable via environment or an .env file so developers can
re-enable anonymous/local-only access when explicitly needed (refer to the
grafana service, network_mode, GF_AUTH_ANONYMOUS_ENABLED,
GF_AUTH_DISABLE_LOGIN_FORM, GF_AUTH_ANONYMOUS_ORG_ROLE).
In `@external/dynamo/monitoring/README.md`:
- Around line 222-224: Replace inanimate possessive phrasing like "SGLang's
native metric" with non-possessive alternatives such as "the native SGLang
metric" or "the native metric from SGLang" in the README text (specifically the
sentence starting "SGLang computes cache hit rate internally..."). Update any
other occurrences of similar inanimate possessives in this section (e.g.,
"SGLang's", "the processor's `thompson_kve_*` counters") to use equivalent
non-possessive constructions to match the documentation style guidelines.
- Around line 229-233: Update the header paragraph "KV Cache Metrics Status" to
hyphenate the compound adjective by replacing the phrase "KV cache related
metrics" with "KV cache-related metrics" (preferably using a non-breaking hyphen
U+2011 if you want to prevent line-wrapping); locate the sentence in the same
paragraph that mentions the `${backend}` template variable and make the
single-word change there to ensure grammatical correctness.
- Around line 571-573: The README example uses a user-specific absolute path;
update the example invocation to use a generic placeholder instead (e.g.,
/path/to/repo or $REPO_ROOT) so it’s portable; modify the example line that runs
./collect_metrics.sh ./metrics_output 30 to show changing directory to a generic
repo root (instead of
/localhome/local-bbednarski/NeMo-Agent-Toolkit/external/dynamo) and keep the
collect_metrics.sh command unchanged to preserve the usage example.
In `@external/dynamo/monitoring/scripts/cache_experiment.sh`:
- Around line 1-14: The script cache_experiment.sh currently has an abbreviated
header; replace it with the full Apache-2.0 license header block (the standard
boilerplate used in the repo) including the SPDX-License-Identifier: Apache-2.0
line, the full copyright notice (Copyright (c) 2025-2026, NVIDIA CORPORATION &
AFFILIATES. All rights reserved.), and the complete Apache License 2.0 notice
text (permissions, conditions, and disclaimer) so the file contains the
canonical license template at the top.
In `@external/dynamo/monitoring/scripts/kv_event_observer.py`:
- Line 85: The parameter parent_hash in function record_stored is unused and
triggers a lint warning; either rename it to _parent_hash to indicate
intentional unusedness or actually use it (e.g., log or store ancestry info).
Edit the record_stored signature (record_stored) to rename parent_hash to
_parent_hash if you don't need it, or add a single-line usage inside
record_stored (such as including parent_hash in the emitted metric/log or
appending it to an ancestry list) so the linter no longer flags it.
- Around line 1-35: Add the required SPDX Apache-2.0 license header immediately
after the shebang line in kv_event_observer.py; place the standard multi-line
SPDX/Apache-2.0 header (including SPDX-License-Identifier: Apache-2.0 and the
copyright/notice block per repo policy) right after "#!/usr/bin/env python3" so
the file header is compliant before the module docstring and any definitions (no
changes to functions or logic like the KVEventBatch parsing or metrics polling).
In `@external/dynamo/monitoring/scripts/README.md`:
- Around line 1-5: Add the SPDX license header "SPDX-License-Identifier:
Apache-2.0" to the very top of the README.md for the "KV Cache Event Observer"
(place the header before the first "# KV Cache Event Observer" line) so the file
includes the required Apache-2.0 license identifier per project guidelines.
In `@external/dynamo/optimized/ARCHITECTURE.md`:
- Around line 9-14: Update the inanimate possessive phrasing in ARCHITECTURE.md:
replace occurrences like "Frontend's built-in router" with "the built-in router
of the frontend" (and similar rephrasings) throughout the document, including
the other instances noted around the section discussing DYN_ROUTER_MODE,
dynamo.backend.generate, and dynamo.worker.generate (lines referenced in the
review); preserve meaning and references to DYN_ROUTER_MODE,
dynamo.backend.generate, and dynamo.worker.generate while only adjusting
possessive wording to match the style guide.
- Around line 15-24: The fenced code blocks containing the "Frontend (built-in
router: round-robin) → routes to dynamo.backend.generate → OUR PROCESSOR ..."
snippet (and the other similar blocks around the ASCII box starting with
"┌─────────────────────────────────────────────────────────────────────────────────┐")
are missing language identifiers and there is a blank line inside a blockquote;
add an appropriate language tag (e.g., ```text, ```bash or ```json) to each
fenced code block and remove the empty line inside the blockquote so the block
is contiguous, ensuring all occurrences (including the other blocks referenced
in the comment) follow the same fixes.
In `@external/dynamo/start_dynamo_optimized_thompson_hints_vllm.sh`:
- Around line 309-330: The prompt currently uses read -p which echoes the
HF_TOKEN to the terminal; change the interactive prompt in the HF_TOKEN check
(the block referencing HF_TOKEN and LOCAL_MODEL_DIR and the read call) to use
silent input by replacing read -p with read -s -p and then emit an echo newline
after the read so the terminal prompt formatting remains correct, preserving the
existing logic that checks for an empty value and sets HF_TOKEN="dummy" or
prints confirmation.
- Around line 86-92: After computing NUM_GPUS, CONTAINER_GPU_INDICES and
NUM_WORKERS, add a guard that validates TP_SIZE against NUM_GPUS: if TP_SIZE is
zero, greater than NUM_GPUS, or NUM_GPUS is not evenly divisible by TP_SIZE
(NUM_WORKERS equals 0 or NUM_GPUS % TP_SIZE != 0), immediately print a clear
error and exit with non‑zero status. Reference the variables NUM_GPUS, TP_SIZE,
NUM_WORKERS and CONTAINER_GPU_INDICES when implementing the check so the script
fails fast before any worker loops or indexing occur.
In `@external/dynamo/start_dynamo_unified.sh`:
- Line 481: The separator banner echo contains a stray "ca" that should be
removed; locate the echo call that prints
"=====================================ca====================" (in
start_dynamo_unified.sh) and update the string to a clean separator without the
"ca" characters so the output shows a continuous line of "=" characters.
🧹 Nitpick comments (4)
external/dynamo/monitoring/scripts/cache_experiment.sh (2)
188-198: Dependency on `bc` may cause failures on minimal systems. The `bc` utility is used for arithmetic calculations but may not be installed by default. Consider adding a dependency check or using bash arithmetic where possible.

Alternative using awk (commonly available)
```diff
-echo -e "  Cache hits:    ${final_hits} (delta: +$(echo "$final_hits - $initial_hits" | bc))"
-echo -e "  Cache queries: ${final_queries} (delta: +$(echo "$final_queries - $initial_queries" | bc))"
+delta_hits_calc=$(awk "BEGIN {print $final_hits - $initial_hits}")
+delta_queries_calc=$(awk "BEGIN {print $final_queries - $initial_queries}")
+echo -e "  Cache hits:    ${final_hits} (delta: +${delta_hits_calc})"
+echo -e "  Cache queries: ${final_queries} (delta: +${delta_queries_calc})"
 echo ""
 # Calculate hit rate for this experiment
-delta_hits=$(echo "$final_hits - $initial_hits" | bc)
-delta_queries=$(echo "$final_queries - $initial_queries" | bc)
+delta_hits=$delta_hits_calc
+delta_queries=$delta_queries_calc
 if [ "$delta_queries" != "0" ]; then
-    hit_rate=$(echo "scale=1; $delta_hits * 100 / $delta_queries" | bc)
+    hit_rate=$(awk "BEGIN {printf \"%.1f\", $delta_hits * 100 / $delta_queries}")
     echo -e "${GREEN}Experiment hit rate: ${hit_rate}%${NC}"
 fi
```
173-176: Process cleanup could leave an orphaned process on signal. If the script receives a signal (e.g., Ctrl+C) before reaching line 174, the observer process will be orphaned. Consider adding a trap for cleanup.

Suggested fix
Suggested fix
Add after line 144:
# Cleanup trap trap 'kill $OBSERVER_PID 2>/dev/null; rm -f "$OBSERVER_LOG"' EXIT INT TERMThen the explicit cleanup at lines 174-176 and 201 can be simplified or removed since the trap handles it.
external/dynamo/stop_dynamo.sh (1)
142-147: Missing error handling for volume removal.If the Prometheus data volume doesn't exist,
docker volume rmwill output an error message. Consider suppressing the error or checking existence first.Suggested fix
if [ "$CLEAR_METRICS" = true ]; then echo "" echo "Clearing Prometheus data volume..." - docker volume rm monitoring_prometheus_data && echo "✓ Prometheus data volume removed (old metrics cleared)" + if docker volume rm monitoring_prometheus_data 2>/dev/null; then + echo "✓ Prometheus data volume removed (old metrics cleared)" + else + echo " (Prometheus data volume not found or already removed)" + fi fiexternal/dynamo/.env.example (1)
59-62: Align `export` usage in the `.env` example. This file mixes `export` with bare assignments. If it's intended to be sourced in a shell, `DYNAMO_HTTP_PORT` should be exported like the others; if it's for a Docker Compose `.env`, remove the `export` keywords or add a note to avoid confusion.

🛠️ Consistency option (shell-source)

```diff
-DYNAMO_HTTP_PORT=8000
+export DYNAMO_HTTP_PORT=8000
```
```yaml
  grafana:
    image: grafana/grafana:10.2.2
    container_name: dynamo-grafana
    network_mode: host
    environment:
      - GF_SERVER_HTTP_PORT=3000
      # Disable authentication for local development
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_AUTH_DISABLE_LOGIN_FORM=true
      # Keep these for if you re-enable login later
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
```
Lock down Grafana exposure by default.
With `network_mode: host`, anonymous admin access and a disabled login form expose Grafana openly on the host interface. This is a high‑risk default outside a strictly local sandbox. Recommend binding to localhost and disabling anonymous access by default, while allowing overrides via env vars.
🔒 Proposed hardening (safer defaults)
```diff
     environment:
-      - GF_SERVER_HTTP_PORT=3000
-      # Disable authentication for local development
-      - GF_AUTH_ANONYMOUS_ENABLED=true
-      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
-      - GF_AUTH_DISABLE_LOGIN_FORM=true
-      # Keep these for if you re-enable login later
-      - GF_SECURITY_ADMIN_USER=admin
-      - GF_SECURITY_ADMIN_PASSWORD=admin
+      - GF_SERVER_HTTP_PORT=${GF_SERVER_HTTP_PORT:-3000}
+      - GF_SERVER_HTTP_ADDR=${GF_SERVER_HTTP_ADDR:-127.0.0.1}
+      # Enable anonymous auth only when explicitly requested
+      - GF_AUTH_ANONYMOUS_ENABLED=${GF_AUTH_ANONYMOUS_ENABLED:-false}
+      - GF_AUTH_ANONYMOUS_ORG_ROLE=${GF_AUTH_ANONYMOUS_ORG_ROLE:-Viewer}
+      - GF_AUTH_DISABLE_LOGIN_FORM=${GF_AUTH_DISABLE_LOGIN_FORM:-false}
+      # Admin credentials (override via env)
+      - GF_SECURITY_ADMIN_USER=${GF_SECURITY_ADMIN_USER:-admin}
+      - GF_SECURITY_ADMIN_PASSWORD=${GF_SECURITY_ADMIN_PASSWORD:-admin}
       - GF_USERS_ALLOW_SIGN_UP=false
```
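If host networking isn't required, a loopback-only port mapping is a simpler way to get the same effect; a minimal sketch, assuming the rest of the service definition stays as-is:

```yaml
# Sketch: replace network_mode: host with a loopback-only port mapping.
grafana:
  image: grafana/grafana:10.2.2
  container_name: dynamo-grafana
  ports:
    - "127.0.0.1:3000:3000"   # reachable only from the host itself
  environment:
    - GF_AUTH_ANONYMOUS_ENABLED=false
    - GF_AUTH_DISABLE_LOGIN_FORM=false
```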
🤖 Prompt for AI Agents
In `@external/dynamo/monitoring/docker-compose.yml` around lines 30 - 43, The
Grafana service is exposed insecurely via network_mode: host with anonymous
admin access and login disabled; update the grafana service to bind only to
localhost (remove network_mode: host and use a ports mapping like
127.0.0.1:3000:3000) and harden the environment defaults by setting
GF_AUTH_ANONYMOUS_ENABLED=false and GF_AUTH_DISABLE_LOGIN_FORM=false while
keeping GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD present; make
these defaults overrideable via environment or an .env file so developers can
re-enable anonymous/local-only access when explicitly needed (refer to the
grafana service, network_mode, GF_AUTH_ANONYMOUS_ENABLED,
GF_AUTH_DISABLE_LOGIN_FORM, GF_AUTH_ANONYMOUS_ORG_ROLE).
```markdown
> **Why use SGLang's native metric?** SGLang computes cache hit rate internally but doesn't include
> `cached_tokens` in its API responses. The processor's `thompson_kve_*` counters will show 0
> unless the underlying engine provides `usage.prompt_tokens_details.cached_tokens`.
```
Avoid inanimate possessive wording.
The phrasing “SGLang's” and similar possessives conflict with the documentation style rules. Prefer “the native metric from SGLang” or “the native SGLang metric.”
As per coding guidelines, please adjust inanimate possessives in this document.
🤖 Prompt for AI Agents
In `@external/dynamo/monitoring/README.md` around lines 222 - 224, Replace
inanimate possessive phrasing like "SGLang's native metric" with non-possessive
alternatives such as "the native SGLang metric" or "the native metric from
SGLang" in the README text (specifically the sentence starting "SGLang computes
cache hit rate internally..."). Update any other occurrences of similar
inanimate possessives in this section (e.g., "SGLang's", "the processor's
`thompson_kve_*` counters") to use equivalent non-possessive constructions to
match the documentation style guidelines.
```markdown
## KV Cache Metrics Status

This section documents the working status of all KV cache related metrics across the Dynamo stack.

**Backend Selection**: The Grafana dashboard uses a `${backend}` template variable. Select `sglang` or `vllm` from the dropdown to switch all backend-specific queries.
```
Minor grammar: add hyphen in “KV cache-related”.
“KV cache related metrics” should be “KV cache‑related metrics.”
🧰 Tools
🪛 LanguageTool
[grammar] ~231-~231: Use a hyphen to join words.
Context: ...ments the working status of all KV cache related metrics across the Dynamo stack....
(QB_NEW_EN_HYPHEN)
🤖 Prompt for AI Agents
In `@external/dynamo/monitoring/README.md` around lines 229 - 233, Update the
header paragraph "KV Cache Metrics Status" to hyphenate the compound adjective
by replacing the phrase "KV cache related metrics" with "KV cache-related
metrics" (preferably using a non-breaking hyphen U+2011 if you want to prevent
line-wrapping); locate the sentence in the same paragraph that mentions the
`${backend}` template variable and make the single-word change there to ensure
grammatical correctness.
```bash
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# KV Cache Experiment Script
#
# Demonstrates prefix cache behavior with queries A → B → C → A
# Shows: STORED events, REMOVED (eviction) events, and CACHE HITs
#
# Usage:
#   ./cache_experiment.sh            # Run with defaults
#   ./cache_experiment.sh --flush    # Flush cache first by filling it
#   ./cache_experiment.sh --verbose  # Show full curl responses
```
|
Missing full Apache 2.0 license header.
Per coding guidelines, all source files must include the complete SPDX Apache-2.0 header template. The current header is missing the full license text block.
Suggested fix
```diff
 #!/bin/bash
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 #
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
 # KV Cache Experiment Script
```
🤖 Prompt for AI Agents
In `@external/dynamo/monitoring/scripts/cache_experiment.sh` around lines 1 - 14,
The script cache_experiment.sh currently has an abbreviated header; replace it
with the full Apache-2.0 license header block (the standard boilerplate used in
the repo) including the SPDX-License-Identifier: Apache-2.0 line, the full
copyright notice (Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All
rights reserved.), and the complete Apache License 2.0 notice text (permissions,
conditions, and disclaimer) so the file contains the canonical license template
at the top.
```markdown
**Key insight**: The default Dynamo frontend has its own built-in router (`DYN_ROUTER_MODE`) and routes directly to `dynamo.backend.generate`. To intercept requests and apply custom Thompson Sampling routing:

1. **Processor registers as `dynamo.backend.generate`** - The frontend discovers our processor as the "backend"
2. **SGLang Worker registers as `dynamo.worker.generate`** - Our processor forwards to actual workers after routing
3. **Frontend's built-in router becomes irrelevant** - The frontend routes to `dynamo.backend.generate` which is our processor
```
Avoid inanimate possessive wording in documentation.
Phrases like “Frontend's built-in router” conflict with the documentation style rules. Rephrase to “the built‑in router of the frontend” or similar.
As per coding guidelines, please adjust inanimate possessives in this document.
Also applies to: 228-233
🤖 Prompt for AI Agents
In `@external/dynamo/optimized/ARCHITECTURE.md` around lines 9 - 14, Update the
inanimate possessive phrasing in ARCHITECTURE.md: replace occurrences like
"Frontend's built-in router" with "the built-in router of the frontend" (and
similar rephrasings) throughout the document, including the other instances
noted around the section discussing DYN_ROUTER_MODE, dynamo.backend.generate,
and dynamo.worker.generate (lines referenced in the review); preserve meaning
and references to DYN_ROUTER_MODE, dynamo.backend.generate, and
dynamo.worker.generate while only adjusting possessive wording to match the
style guide.
```text
Frontend (built-in router: round-robin)
→ routes to dynamo.backend.generate
→ OUR PROCESSOR (intercepts!)
→ queries Thompson Sampling router
→ forwards to dynamo.worker.generate (actual SGLang workers)
```

```text
┌─────────────────────────────────────────────────────────────────────────────────┐
```
Add languages to fenced code blocks and fix the blockquote lint issue.
markdownlint flags missing language identifiers (MD040) and a blank line inside a blockquote (MD028). Add language tags (e.g., text, bash, json) and remove the blank line inside the blockquote to satisfy lint rules.
🧹 Example fixes
````diff
-```
+```text
 Frontend (built-in router: round-robin)
 → routes to dynamo.backend.generate
 → OUR PROCESSOR (intercepts!)
 → queries Thompson Sampling router
 → forwards to dynamo.worker.generate (actual SGLang workers)
 ```
````

```diff
-> **Note on `DYN_ROUTER_MODE`**: The startup script passes `--router-mode round-robin` to the
-> default frontend, but this is **irrelevant** in our architecture. The frontend's built-in
-> router routes to `dynamo.backend.generate`, which is our Processor (not a real backend).
-> Our Processor intercepts the request and uses our custom Thompson Sampling router instead.
+> **Note on `DYN_ROUTER_MODE`**: The startup script passes `--router-mode round-robin` to the
+> default frontend, but this is **irrelevant** in our architecture. The frontend built‑in
+> router routes to `dynamo.backend.generate`, which is our Processor (not a real backend).
+> Our Processor intercepts the request and uses our custom Thompson Sampling router instead.
```
Also applies to: 297-307, 366-373
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
15-15: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
23-23: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In `@external/dynamo/optimized/ARCHITECTURE.md` around lines 15 - 24, The fenced
code blocks containing the "Frontend (built-in router: round-robin) → routes to
dynamo.backend.generate → OUR PROCESSOR ..." snippet (and the other similar
blocks around the ASCII box starting with
"┌─────────────────────────────────────────────────────────────────────────────────┐")
are missing language identifiers and there is a blank line inside a blockquote;
add an appropriate language tag (e.g., ```text, ```bash or ```json) to each
fenced code block and remove the empty line inside the blockquote so the block
is contiguous, ensuring all occurrences (including the other blocks referenced
in the comment) follow the same fixes.
```bash
# Compute container-internal GPU indices (GPUs are renumbered 0,1,2,... inside the container)
NUM_GPUS=$(echo "$WORKER_GPUS" | tr ',' '\n' | wc -l)
CONTAINER_GPU_INDICES=$(seq -s, 0 $((NUM_GPUS - 1)))

# Calculate number of workers based on available GPUs and TP size
NUM_WORKERS=$((NUM_GPUS / TP_SIZE))
```
Validate GPU/TP sizing before computing workers.
If NUM_GPUS < TP_SIZE or not divisible, NUM_WORKERS becomes 0 and later loops/indexes break. Add a fast‑fail check right after computing NUM_WORKERS.
🛠️ Suggested guard

```diff
 NUM_WORKERS=$((NUM_GPUS / TP_SIZE))
+if (( NUM_GPUS % TP_SIZE != 0 )); then
+  echo "ERROR: NUM_GPUS ($NUM_GPUS) must be divisible by TP_SIZE ($TP_SIZE)"
+  exit 1
+fi
+if (( NUM_WORKERS < 1 )); then
+  echo "ERROR: NUM_WORKERS computed as $NUM_WORKERS; check DYNAMO_GPU_DEVICES and DYNAMO_TP_SIZE"
+  exit 1
+fi
```
🤖 Prompt for AI Agents
In `@external/dynamo/start_dynamo_optimized_thompson_hints_vllm.sh` around lines
86 - 92, After computing NUM_GPUS, CONTAINER_GPU_INDICES and NUM_WORKERS, add a
guard that validates TP_SIZE against NUM_GPUS: if TP_SIZE is zero, greater than
NUM_GPUS, or NUM_GPUS is not evenly divisible by TP_SIZE (NUM_WORKERS equals 0
or NUM_GPUS % TP_SIZE != 0), immediately print a clear error and exit with
non‑zero status. Reference the variables NUM_GPUS, TP_SIZE, NUM_WORKERS and
CONTAINER_GPU_INDICES when implementing the check so the script fails fast
before any worker loops or indexing occur.
```bash
# Verify HF_TOKEN is set
if [ -z "${HF_TOKEN:-}" ]; then
    echo ""
    echo "⚠ HF_TOKEN environment variable is not set."
    echo ""
    if [ -d "$LOCAL_MODEL_DIR" ]; then
        echo "✓ Local model found - proceeding without HF_TOKEN"
        HF_TOKEN="dummy"
    else
        echo "✗ Local model NOT found and no HF_TOKEN to download it"
        echo ""
        read -p "Please enter your HuggingFace token (or press Enter to skip): " HF_TOKEN
        if [ -z "$HF_TOKEN" ]; then
            echo "WARNING: Proceeding without HF_TOKEN."
            HF_TOKEN="dummy"
        else
            echo "✓ HuggingFace token received"
        fi
    fi
else
    echo "✓ HuggingFace token is set"
fi
```
Prompt for HF_TOKEN silently to avoid secret echo.
The current read -p echoes the token to the terminal. Use silent input to prevent accidental leakage.
🔒 Safer prompt

```diff
-        read -p "Please enter your HuggingFace token (or press Enter to skip): " HF_TOKEN
+        read -s -p "Please enter your HuggingFace token (or press Enter to skip): " HF_TOKEN
+        echo
```
🤖 Prompt for AI Agents
In `@external/dynamo/start_dynamo_optimized_thompson_hints_vllm.sh` around lines
309 - 330, The prompt currently uses read -p which echoes the HF_TOKEN to the
terminal; change the interactive prompt in the HF_TOKEN check (the block
referencing HF_TOKEN and LOCAL_MODEL_DIR and the read call) to use silent input
by replacing read -p with read -s -p and then emit an echo newline after the
read so the terminal prompt formatting remains correct, preserving the existing
logic that checks for an empty value and sets HF_TOKEN="dummy" or prints
confirmation.
| echo "=========================================================" | ||
| echo "Test Request:" | ||
| echo "=========================================================" | ||
| echo "=====================================ca====================" |
Fix banner typo in output.
There is a stray “ca” in the separator line.
✏️ Proposed fix

```diff
-  echo "=====================================ca===================="
+  echo "========================================================="
```
🤖 Prompt for AI Agents
In `@external/dynamo/start_dynamo_unified.sh` at line 481, The separator banner
echo contains a stray "ca" that should be removed; locate the echo call that
prints "=====================================ca====================" (in
start_dynamo_unified.sh) and update the string to a clean separator without the
"ca" characters so the output shows a continuous line of "=" characters.
Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
external/dynamo/generalized/processor.py (1)
428-433: ⚠️ Potential issue | 🔴 Critical

Update code to match deployed Dynamo 0.7.1 or add a version compatibility layer.

Lines 428-433 assume Dynamo ≥0.8.x (where `create_service()` was removed), but all deployment scripts in `external/dynamo/` are pinned to Dynamo 0.7.1, where `create_service()` is still required. This code will fail at runtime with the current deployment version. Either restore the `create_service()` call for 0.7.1 compatibility, update the deployment scripts to Dynamo 0.8.x+, or add conditional logic that detects the runtime version and handles both APIs.

This same issue affects multiple files: `generalized/router.py`, `optimized/router_multilru.py`, and `optimized/processor_multilru.py`.
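One way to support both versions is feature detection rather than version parsing; a minimal sketch, assuming `create_service` exists as an attribute on 0.7.x components and is absent on 0.8.x (the helper name is hypothetical):

```python
# Hypothetical compatibility helper - not part of the PR.
async def ensure_service(component) -> None:
    create = getattr(component, "create_service", None)
    if create is not None:
        # Dynamo 0.7.x path: explicit service creation is still required.
        await create()
    # On 0.8.x the attribute is gone and endpoint creation registers the service.


# Usage at startup, in place of the unconditional call:
#   component = runtime.namespace("dynamo").component("router")
#   await ensure_service(component)
```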
🤖 Fix all issues with AI agents
In `@external/dynamo/.env.example`:
- Around line 16-18: The HF_TOKEN placeholder in the .env example is active and
may be accidentally sourced; change the HF_TOKEN export to be commented out
(e.g., prefix the HF_TOKEN line with a comment marker) so users must explicitly
set or export HF_TOKEN themselves; ensure HF_HOME remains exported and leave a
commented example or instruction for users to add their own HF_TOKEN instead of
exporting a dummy value.
- Around line 59-62: Update the .env.example so all variables are consistently
exported and remove unnecessary quotes: add export to variables missing it
(e.g., DYNAMO_HTTP_PORT and the other non-exported entries referenced in the
comment) so they propagate to child processes, and remove the surrounding quotes
from the image URL on the quoted line so the value follows dotenv conventions;
ensure other exported lines (those already using export) remain unchanged and
keep naming identical to existing keys.
In `@external/dynamo/build_multi_lru_image.sh`:
- Around line 193-207: The script currently calls ./container/build.sh while
shell option set -e may cause the script to exit before BUILD_EXIT_CODE is
captured; modify the invocation around the build step (the block that calls
./container/build.sh and assigns BUILD_EXIT_CODE) to temporarily disable set -e
(e.g., set +e), run ./container/build.sh with the existing flags (framework
VLLM, --target "$BUILD_TARGET", --tag "$IMAGE_TAG", --enable-kvbm, build-arg and
vllm-max-jobs, $NO_CACHE, $DRY_RUN), capture the exit status into
BUILD_EXIT_CODE, then re-enable set -e (set -e) so subsequent error handling
works reliably; alternatively implement the &&/|| pattern to set BUILD_EXIT_CODE
explicitly after the command.
- Around line 122-128: When updating an existing clone (KVBM_NEXT_DIR) the
current git sequence uses git checkout "$BRANCH" which can fail if a local
branch named BRANCH doesn't exist; replace that step with logic that ensures a
local branch is created from the remote if missing: after git fetch origin,
check if the local branch exists (e.g., git show-ref --verify --quiet
refs/heads/"$BRANCH"), and if it does git checkout "$BRANCH", otherwise create
it from origin with git checkout -b "$BRANCH" "origin/$BRANCH" (or git branch
--track "$BRANCH" "origin/$BRANCH" then checkout) before running git pull and
git submodule update --init --recursive.
In `@external/dynamo/generalized/router.py`:
- Around line 1069-1071: The comment about create_service() being removed is
inconsistent with other router implementations that still call await
component.create_service(); inspect which Dynamo version the project targets
(pyproject.toml) and either (A) make removal intentional by removing
create_service() everywhere and update the comment near
runtime.namespace("dynamo").component("router") / WorkloadAwareRouter to
explicitly state “create_service removed in Dynamo >=0.8; router endpoints
register automatically” or (B) preserve compatibility by calling create_service
conditionally—e.g., check for hasattr(component, "create_service") or a
runtime/version flag and call await component.create_service() when present—so
that the behavior matches the optimized router implementation (which currently
calls create_service), keeping the comment aligned with the chosen approach.
In `@external/dynamo/optimized/processor_multilru.py`:
- Around line 507-510: Replace the logger.error call in the except block in
external/dynamo/optimized/processor_multilru.py with logger.exception so the
full stack trace is recorded when an exception is caught (keep the existing
message string and the subsequent self._metrics.router_errors_total.inc() and
return None, None intact); locate the except block that currently does
logger.error("Failed to pick worker: %s", e) and change it to
logger.exception(...) to capture the traceback for debugging.
- Around line 389-394: The _extract_hints function's signature and docstring are
out of sync with its actual return values: update the return type annotation
from tuple[str, int, str, str] to include the fifth value (likely a bool for
use_frequency_backend), e.g. tuple[str, int, str, str, bool], and revise the
docstring "Returns:" line to list all five returned items in the correct order
with brief types/names (prefix_id, total_requests, osl, iat,
use_frequency_backend) so documentation matches the implementation.
- Around line 661-664: The fire-and-forget call using
asyncio.create_task(self._update_kve_metrics_async(kve_data)) can let the Task
be garbage-collected; add a class-level set (e.g., self._background_tasks) in
the class __init__ and store the created Task in it, e.g., task =
asyncio.create_task(self._update_kve_metrics_async(kve_data));
self._background_tasks.add(task), and ensure completed tasks are removed (attach
a done callback or await/cleanup in shutdown) so tasks are tracked and not
prematurely collected.
🧹 Nitpick comments (9)
external/dynamo/demo_priority_eviction.sh (1)
53-56: Consider adding defensive handling for the metrics fetch. With `set -euo pipefail`, if the metrics endpoint doesn't contain `prefix_cache_hits_total{`, `grep` returns exit code 1 and terminates the script. Additionally, if the container isn't running or curl fails, downstream arithmetic on empty values will error.

🛠️ Proposed fix to add fallback for robustness

```diff
 get_hits() {
-    docker exec dynamo-vllm curl -s http://localhost:18081/metrics 2>/dev/null | \
-        grep "prefix_cache_hits_total{" | grep -v external | awk '{print $NF}'
+    local hits
+    hits=$(docker exec dynamo-vllm curl -s http://localhost:18081/metrics 2>/dev/null | \
+        grep "prefix_cache_hits_total{" | grep -v external | awk '{print $NF}' || true)
+    echo "${hits:-0}"
 }
```

external/dynamo/optimized/router_multilru.py (7)
79-81: Add a `noqa` comment for intentionally unused arguments. The fallback implementation intentionally ignores `tokens` and `min_overlap` to maintain interface compatibility. Add a noqa to suppress the static analysis warning and document the intent.

Suggested fix

```diff
-    async def find_matches_for_request(self, tokens: list[int], min_overlap: int) -> OverlapScores:
+    async def find_matches_for_request(self, tokens: list[int], min_overlap: int) -> OverlapScores:  # noqa: ARG002
         """Find overlap scores for each worker. Returns empty scores (round-robin fallback)."""
         return OverlapScores({})
```
171-171: Consider using `tempfile.gettempdir()` for secure temporary directory handling. Hardcoding `/tmp` can be insecure on multi-user systems. The path is configurable via config, but the default should use the system's secure temp directory.

Suggested fix

```diff
+import tempfile
+
 def get_builtin_defaults() -> dict[str, Any]:
     """Return built-in default configuration (matches config.yaml)."""
     return {
         # ... other config ...
         "debug": {
             "traces_enabled": False,
-            "trace_dir": "/tmp/dynamo_router_traces",
+            "trace_dir": os.path.join(tempfile.gettempdir(), "dynamo_router_traces"),
             "buffer_size": 2000,
         },
     }
```
378-408: Consider extracting shared Pydantic models to avoid duplication. These models (`RouterRequest`, `RouterResponse`, `FeedbackRequest`, `FeedbackAck`) are duplicated in `external/dynamo/generalized/router.py` (lines 80-108) and `processor_multilru.py`. Consider extracting them to a shared `models.py` module to follow the DRY principle.
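A minimal sketch of the extraction; the field definitions below are placeholders, not the real schemas from `router.py`:

```python
# models.py - shared request/response schemas (sketch; every field here
# is a hypothetical stand-in for the real definitions).
from pydantic import BaseModel


class RouterRequest(BaseModel):
    tokens: list[int] = []        # hypothetical field
    prefix_id: str | None = None  # hypothetical field


class RouterResponse(BaseModel):
    worker_id: int    # hypothetical field
    decision_id: str  # hypothetical field


class FeedbackRequest(BaseModel):
    decision_id: str   # hypothetical field
    latency_ms: float  # hypothetical field


class FeedbackAck(BaseModel):
    ok: bool = True  # hypothetical field
```

The router and the processor would then share one import (`from models import RouterRequest, ...`) instead of redefining the classes.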
502-503: Log a warning when `lin_forget` is clamped to the valid range. The clamping silently modifies user-provided values. Consider logging when the value is adjusted to help with debugging.

Suggested fix

```diff
         self.lin_forget = float(lints_forget)
+        original_forget = self.lin_forget
         self.lin_forget = max(1e-6, min(self.lin_forget, 0.999999))
+        if self.lin_forget != original_forget:
+            logger.warning("lin_forget clamped from %s to %s", original_forget, self.lin_forget)
```
568-574: Consider logging exception details instead of just the message. The blind `Exception` catch is acceptable here since trace writing should never crash the router, but logging only `%s` loses the exception type. Use `logger.debug(..., exc_info=True)` or at least include the exception type.

Suggested fix

```diff
         try:
             path = os.path.join(self.debug_trace_dir, "router_traces.jsonl")
             with open(path, "a", encoding="utf-8") as f:
                 f.write(json.dumps(item, separators=(",", ":")) + "\n")
-        except Exception as e:
-            logger.debug("Trace write failed: %s", e)
+        except Exception:
+            logger.debug("Trace write failed", exc_info=True)
```
1075-1075: Prefix the unused variable with an underscore. `chosen_ctx` is unpacked but never used. Prefix it with an underscore to indicate the intentional discard and suppress the linter warning.

Suggested fix

```diff
-        chosen, chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)
+        chosen, _chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)
```
1257-1264: Add a return type hint to `_get_underloaded`. Per coding guidelines, public/internal APIs should have type hints. This method returns a tuple that should be annotated.

Suggested fix

```diff
-    def _get_underloaded(self, metrics: dict[str, Any] | None):
+    def _get_underloaded(self, metrics: dict[str, Any] | None) -> tuple[int, float]:
         if not metrics or not metrics.get("endpoints"):
             wid = int(random.choice(list(self.engine_client.instance_ids())))
             return wid, 0.0
```

external/dynamo/optimized/processor_multilru.py (1)
157-157: Sort `__slots__` alphabetically for consistency. Static analysis recommends natural sorting of `__slots__`. This is a minor style convention.

Suggested fix

```diff
-    __slots__ = ("prompt_tokens", "cached_tokens", "device_blocks", "host_blocks", "disk_blocks")
+    __slots__ = ("cached_tokens", "device_blocks", "disk_blocks", "host_blocks", "prompt_tokens")
```
```bash
export HF_HOME=/path/to/local/storage/.cache/huggingface

export HF_TOKEN=my_huggingface_read_token
```
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists
find . -path "*/external/dynamo/.env.example" -type f 2>/dev/null | head -20Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 99
🏁 Script executed:
# Get file size to determine how to read it
wc -l ./external/dynamo/.env.example 2>/dev/null || echo "File not found or error"Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 103
🏁 Script executed:
# Read lines 14-20 to see context around lines 16-18
sed -n '14,20p' ./external/dynamo/.env.exampleRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 264
🏁 Script executed:
# Also check the full content to understand the file structure
cat -n ./external/dynamo/.env.exampleRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 6685
Comment out HF_TOKEN placeholder to prevent accidental use.
Exporting a dummy token as an active value can lead to confusing authentication failures if this file is sourced or used in automation. Commenting it out forces users to explicitly set the token themselves.
Suggested fix
```diff
-export HF_TOKEN=my_huggingface_read_token
+# export HF_TOKEN=your_huggingface_read_token
```
🤖 Prompt for AI Agents
In `@external/dynamo/.env.example` around lines 16 - 18, The HF_TOKEN placeholder
in the .env example is active and may be accidentally sourced; change the
HF_TOKEN export to be commented out (e.g., prefix the HF_TOKEN line with a
comment marker) so users must explicitly set or export HF_TOKEN themselves;
ensure HF_HOME remains exported and leave a commented example or instruction for
users to add their own HF_TOKEN instead of exporting a dummy value.
```bash
# HTTP port for Dynamo frontend API
# Default: 8099
# DYNAMO_HTTP_PORT="8099"
# Default: 8000
DYNAMO_HTTP_PORT=8000
```
🧩 Analysis chain
🏁 Script executed:
# First, find the file and check if it exists
fd -type f -name ".env.example" external/dynamo/Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 239
🏁 Script executed:
# Read the file to inspect the specific lines mentioned
head -130 external/dynamo/.env.example | tail -80Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 2835
🏁 Script executed:
# Get line count and read the full file to see export patterns
wc -l external/dynamo/.env.example
cat -n external/dynamo/.env.exampleRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 6718
🏁 Script executed:
# Check for dotenv-linter configuration or references in the repo
fd -name ".dotenv*" -o -name "*dotenv*" | grep -E "(config|rc|yml|yaml)"Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 299
🏁 Script executed:
# Search for dotenv-linter usage or configuration
rg "dotenv-linter" --type yaml --type json --type tomlRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 51
🏁 Script executed:
# Check if dotenv-linter is installed and test it on the file
which dotenv-linter || echo "dotenv-linter not found in PATH"Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 161
Add missing export statements and remove quotes for consistency with dotenv conventions.
The file uses export on some variables (lines 16, 18, 30, 33, 41) but omits it on others (lines 61, 110, 125-126). When sourced, variables without export won't propagate to child processes. Additionally, line 126 has quotes around the image URL, which is non-standard for .env files and can cause parsing issues.
Suggested fix

```diff
-DYNAMO_HTTP_PORT=8000
+export DYNAMO_HTTP_PORT=8000
 ...
-DYNAMO_KV_BLOCK_SIZE=64
+export DYNAMO_KV_BLOCK_SIZE=64
 ...
-DYNAMO_USE_MULTILRU=false
-DYNAMO_VLLM_IMAGE="nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.1"
+export DYNAMO_USE_MULTILRU=false
+export DYNAMO_VLLM_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.1
```

🤖 Prompt for AI Agents
In `@external/dynamo/.env.example` around lines 59 - 62, Update the .env.example
so all variables are consistently exported and remove unnecessary quotes: add
export to variables missing it (e.g., DYNAMO_HTTP_PORT and the other
non-exported entries referenced in the comment) so they propagate to child
processes, and remove the surrounding quotes from the image URL on the quoted
line so the value follows dotenv conventions; ensure other exported lines (those
already using export) remain unchanged and keep naming identical to existing
keys.
```bash
if [ -d "$KVBM_NEXT_DIR" ]; then
    echo "Updating existing $BRANCH branch..."
    cd "$KVBM_NEXT_DIR"
    git fetch origin
    git checkout "$BRANCH"
    git pull origin "$BRANCH"
    git submodule update --init --recursive
```
🧩 Analysis chain
🏁 Script executed:
cat -n external/dynamo/build_multi_lru_image.sh | head -150 | tail -40Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 1745
🏁 Script executed:
# Test: Does git checkout <BRANCH> fail when local branch doesn't exist?
# This requires understanding git checkout behavior
# The concern is: if a repo was cloned on a different branch and BRANCH doesn't exist locally,
# does 'git checkout "$BRANCH"' fail?
# Let's verify the git behavior
git --version
# Check git documentation behavior
git checkout --help | grep -A 5 "If -b is not given"Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 202
🏁 Script executed:
# Verify git checkout -B syntax
git checkout --help | grep -A 10 "\-B"Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 166
🏁 Script executed:
# Test git checkout behavior when local branch doesn't exist
# Create a test repo to verify the behavior
mkdir -p /tmp/test-git
cd /tmp/test-git
# Initialize a bare repo to act as "origin"
mkdir -p origin.git
cd origin.git
git init --bare
cd ..
# Clone with depth=1 (simulating the actual script behavior)
git clone --depth 1 origin.git local_clone
cd local_clone
# Create a test branch on origin to simulate the scenario
cd ../origin.git
git config --add uploadpack.allowAnySHA1InWant true
cd ..
# Add content to the local clone and push
cd local_clone
echo "test" > file.txt
git config user.email "test@test.com"
git config user.name "Test"
git add file.txt
git commit -m "initial"
# Create a new branch locally and push to origin
git checkout -b testbranch
echo "feature" > feature.txt
git add feature.txt
git commit -m "feature"
git push -u origin testbranch
# Now simulate the problem: fetch and try checkout on a fresh scenario
# Create a new clone from origin to simulate a repo checked out on a different branch
cd ..
rm -rf fresh_clone
git clone --depth 1 --branch main origin.git fresh_clone
cd fresh_clone
# Now try: git fetch origin followed by git checkout testbranch
git fetch origin testbranch
# Test 1: Does git checkout testbranch work without -B?
echo "Test 1: git checkout testbranch (should fail or work depending on git version)"
git checkout testbranch 2>&1 || echo "CHECKOUT FAILED"
# Test 2: Does git checkout -B work?
echo ""
echo "Test 2: git checkout -B testbranch origin/testbranch (should succeed)"
git checkout -B testbranch origin/testbranch 2>&1 && echo "CHECKOUT -B SUCCEEDED"
cd /
rm -rf /tmp/test-gitRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 1794
🌐 Web query:
git checkout behavior when local branch doesn't exist but remote tracking branch exists
💡 Result:
- If you run `git checkout <name>` (or `git switch <name>`) and there is no local branch `<name>`, but there is a remote-tracking branch like `origin/<name>` on exactly one remote, Git's default "DWIM/guess" behavior is to create a new local branch `<name>` and set it up to track that remote branch (equivalent to `git checkout -b <name> --track <remote>/<name>`). [1][2]
- If `<name>` exists on multiple remotes (e.g., both `origin/<name>` and `upstream/<name>` exist), Git won't guess which one you meant and will require you to be explicit, unless you set `checkout.defaultRemote` (e.g., `origin`) to break ties. [1][3]
- If you run `git checkout origin/<name>`, you are checking out a remote-tracking branch name directly, which typically leaves you in detached HEAD. To get a local branch, use `git checkout <name>` (DWIM) or `git checkout -b <name> --track origin/<name>`. [2]
- You can disable/enable the guessing behavior with `--no-guess`/`--guess` (and configure it via `checkout.guess`). [1][3]
Sources: Git documentation for git-checkout (guessing rules, ambiguity, checkout.defaultRemote, --guess/--no-guess) [1][3], and examples/discussion of the DWIM behavior and detached-HEAD when checking out origin/<branch> [2].
Handle existing clones that don't yet have a local branch.
git checkout "$BRANCH" may fail when updating repositories cloned on a different branch. While modern Git versions (2.39+) can auto-create local branches via DWIM behavior, this relies on guessing logic that can be disabled or fail with non-standard configurations. The suggested fix explicitly creates the local branch from the remote tracking branch, ensuring reliable updates across all Git configurations and versions.
Suggested fix
- git fetch origin
- git checkout "$BRANCH"
- git pull origin "$BRANCH"
+ git fetch origin "$BRANCH"
+ git checkout -B "$BRANCH" "origin/$BRANCH"
+ git pull --ff-only origin "$BRANCH"📝 Committable suggestion
🤖 Prompt for AI Agents
In `@external/dynamo/build_multi_lru_image.sh` around lines 122 - 128, When
updating an existing clone (KVBM_NEXT_DIR) the current git sequence uses git
checkout "$BRANCH" which can fail if a local branch named BRANCH doesn't exist;
replace that step with logic that ensures a local branch is created from the
remote if missing: after git fetch origin, check if the local branch exists
(e.g., git show-ref --verify --quiet refs/heads/"$BRANCH"), and if it does git
checkout "$BRANCH", otherwise create it from origin with git checkout -b
"$BRANCH" "origin/$BRANCH" (or git branch --track "$BRANCH" "origin/$BRANCH"
then checkout) before running git pull and git submodule update --init
--recursive.
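A sketch of the show-ref variant described in that prompt, assuming `BRANCH` and `KVBM_NEXT_DIR` are set as elsewhere in the script:

```bash
# Update an existing clone, creating the local branch from origin if needed.
cd "$KVBM_NEXT_DIR"
git fetch origin
if git show-ref --verify --quiet "refs/heads/$BRANCH"; then
    git checkout "$BRANCH"                       # local branch already exists
else
    git checkout -b "$BRANCH" "origin/$BRANCH"   # create a tracking branch
fi
git pull --ff-only origin "$BRANCH"
git submodule update --init --recursive
```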
```bash
# Run the build
# Note: --enable-kvbm is automatically set for VLLM framework, but we set it explicitly for clarity
./container/build.sh \
    --framework VLLM \
    --target "$BUILD_TARGET" \
    --tag "$IMAGE_TAG" \
    --enable-kvbm \
    --build-arg "CARGO_BUILD_JOBS=$CARGO_BUILD_JOBS" \
    --vllm-max-jobs "$MAX_JOBS" \
    $NO_CACHE \
    $DRY_RUN

BUILD_EXIT_CODE=$?

if [ $BUILD_EXIT_CODE -eq 0 ]; then
```
🧩 Analysis chain
🏁 Script executed:
find . -type f -name "build_multi_lru_image.sh" | head -20Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 111
🏁 Script executed:
wc -l external/dynamo/build_multi_lru_image.shRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 113
🏁 Script executed:
head -50 external/dynamo/build_multi_lru_image.shRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 1887
🏁 Script executed:
sed -n '180,220p' external/dynamo/build_multi_lru_image.shRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 1295
🏁 Script executed:
sed -n '1,70p' external/dynamo/build_multi_lru_image.sh | grep -A 5 "set "Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 379
🏁 Script executed:
sed -n '200,230p' external/dynamo/build_multi_lru_image.shRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 1084
Build failure handling is bypassed by `set -e`.

With `set -e` enabled, a non‑zero exit from build.sh exits the script before BUILD_EXIT_CODE is set. Disable `set -e` temporarily, run the build, capture the exit code, then re-enable it:

```bash
set +e
./container/build.sh \
    --framework VLLM \
    --target "$BUILD_TARGET" \
    --tag "$IMAGE_TAG" \
    --enable-kvbm \
    --build-arg "CARGO_BUILD_JOBS=$CARGO_BUILD_JOBS" \
    --vllm-max-jobs "$MAX_JOBS" \
    $NO_CACHE \
    $DRY_RUN
BUILD_EXIT_CODE=$?
set -e
```

Alternatively, use an explicit pattern: `./container/build.sh ... && BUILD_EXIT_CODE=0 || BUILD_EXIT_CODE=$?`
🤖 Prompt for AI Agents
In `@external/dynamo/build_multi_lru_image.sh` around lines 193 - 207, The script
currently calls ./container/build.sh while shell option set -e may cause the
script to exit before BUILD_EXIT_CODE is captured; modify the invocation around
the build step (the block that calls ./container/build.sh and assigns
BUILD_EXIT_CODE) to temporarily disable set -e (e.g., set +e), run
./container/build.sh with the existing flags (framework VLLM, --target
"$BUILD_TARGET", --tag "$IMAGE_TAG", --enable-kvbm, build-arg and vllm-max-jobs,
$NO_CACHE, $DRY_RUN), capture the exit status into BUILD_EXIT_CODE, then
re-enable set -e (set -e) so subsequent error handling works reliably;
alternatively implement the &&/|| pattern to set BUILD_EXIT_CODE explicitly
after the command.
```python
component = runtime.namespace("dynamo").component("router")
await component.create_service()
# NOTE: create_service() was removed in Dynamo 0.8.x - endpoint creation handles registration
logger.info("Initializing WorkloadAwareRouter (LinTS + feedback + timeout + traces)")
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check pinned Dynamo runtime versions across common dependency files
fd -a -E .git -E node_modules -E dist -E build -E .venv 'pyproject\.toml|requirements.*|poetry\.lock|uv\.lock|Pipfile|setup\.cfg' \
-x rg -n "dynamo" {}Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 406
🏁 Script executed:
#!/bin/bash
# Examine the file context around the flagged lines
cat -n external/dynamo/generalized/router.py | sed -n '1060,1080p'Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 999
🏁 Script executed:
#!/bin/bash
# Search for version guards or Dynamo version handling in the router file
rg -n "version|0\.8|create_service" external/dynamo/generalized/router.pyRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 171
🏁 Script executed:
#!/bin/bash
# Look for Dynamo version references across the codebase
rg -n "dynamo.*version|0\.8\.|create_service" --type py -A 2 -B 2 | head -100Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 2853
🏁 Script executed:
#!/bin/bash
# Check context around create_service() calls in optimized files
rg -B5 -A5 "create_service()" external/dynamo/optimized/ | head -50Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 3747
🏁 Script executed:
#!/bin/bash
# Look for any version guards or conditional Dynamo handling
rg -n "version|0\.8|0\.9|dynamo_worker|DistributedRuntime" external/dynamo/ -A 2 -B 2 | grep -E "(external/dynamo/(optimized|generalized)/(processor|router)\.py|version|0\.[0-9])" | head -80Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 7722
🏁 Script executed:
#!/bin/bash
# Check if there's documentation or setup.py defining Dynamo version requirements
fd -E .git -E node_modules 'setup.py|pyproject.toml|README.md|CHANGELOG.md' -x rg -l "dynamo" {} \; | head -10Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 413
🏁 Script executed:
#!/bin/bash
# Get more context on the actual endpoint registration mentioned in the comment
rg -n "endpoint.*creat|creat.*endpoint|register" external/dynamo/generalized/router.py -B 3 -A 3 | head -60Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 759
🏁 Script executed:
#!/bin/bash
# Search for endpoint creation in the router file after the logger.info call
sed -n '1070,1120p' external/dynamo/generalized/router.py | cat -nRepository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 2454
🏁 Script executed:
#!/bin/bash
# Check what differences exist between optimized/router.py and optimized/router_multilru.py
rg -B10 -A10 "create_service|NOTE: create" external/dynamo/optimized/router.py | head -40Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 854
🏁 Script executed:
#!/bin/bash
# Look for pyproject.toml to check Dynamo version constraints
cat pyproject.toml | grep -A 10 -B 2 "dynamo"Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 1301
Resolve inconsistency: create_service() still used in other router implementations.
The comment states `create_service()` was removed in Dynamo 0.8.x, but `external/dynamo/optimized/router.py` (line 1346) still explicitly calls `await component.create_service()`. Meanwhile, this file and `external/dynamo/optimized/router_multilru.py` remove the call entirely. Without version pinning in pyproject.toml, it's unclear which Dynamo version this code targets. Verify whether the removal is intentional for different deployment modes (generalized vs. optimized) or a partial migration that needs completion. If intentional, update the comment to clarify when the difference applies.
🤖 Prompt for AI Agents
In `@external/dynamo/generalized/router.py` around lines 1069 - 1071, The comment
about create_service() being removed is inconsistent with other router
implementations that still call await component.create_service(); inspect which
Dynamo version the project targets (pyproject.toml) and either (A) make removal
intentional by removing create_service() everywhere and update the comment near
runtime.namespace("dynamo").component("router") / WorkloadAwareRouter to
explicitly state “create_service removed in Dynamo >=0.8; router endpoints
register automatically” or (B) preserve compatibility by calling create_service
conditionally—e.g., check for hasattr(component, "create_service") or a
runtime/version flag and call await component.create_service() when present—so
that the behavior matches the optimized router implementation (which currently
calls create_service), keeping the comment aligned with the chosen approach.
```python
def _extract_hints(self, request: dict[str, Any]) -> tuple[str, int, str, str]:
    """
    Extract routing hints from PreprocessedRequest annotations.

    Returns: (prefix_id, total_requests, osl, iat, use_frequency_backend)
    """
```
Update docstring to match actual return values.
The docstring states "Returns: (prefix_id, total_requests, osl, iat, use_frequency_backend)" but the return annotation lists only 4 items. The actual implementation returns 5 values including use_frequency_backend.
Suggested fix

```diff
 def _extract_hints(self, request: dict[str, Any]) -> tuple[str, int, str, str]:
     """
     Extract routing hints from PreprocessedRequest annotations.

-    Returns: (prefix_id, total_requests, osl, iat, use_frequency_backend)
+    Returns:
+        Tuple of (prefix_id, total_requests, osl, iat, use_frequency_backend).
     """
```

Also update the return type annotation:

```diff
-    def _extract_hints(self, request: dict[str, Any]) -> tuple[str, int, str, str]:
+    def _extract_hints(self, request: dict[str, Any]) -> tuple[str, int, str, str, bool]:
```
In `@external/dynamo/optimized/processor_multilru.py` around lines 389 - 394, The
_extract_hints function's signature and docstring are out of sync with its
actual return values: update the return type annotation from tuple[str, int,
str, str] to include the fifth value (likely a bool for use_frequency_backend),
e.g. tuple[str, int, str, str, bool], and revise the docstring "Returns:" line
to list all five returned items in the correct order with brief types/names
(prefix_id, total_requests, osl, iat, use_frequency_backend) so documentation
matches the implementation.
```python
except Exception as e:
    logger.error("Failed to pick worker: %s", e)
    self._metrics.router_errors_total.inc()
    return None, None
```
Use logger.exception() when catching and not re-raising.
Per coding guidelines, when catching and logging exceptions without re-raising, use logger.exception() to capture the full stack trace.
Suggested fix
except Exception as e:
- logger.error("Failed to pick worker: %s", e)
+ logger.exception("Failed to pick worker")
self._metrics.router_errors_total.inc()
    return None, None
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 507-507: Do not catch blind exception: Exception
(BLE001)
[warning] 508-508: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🤖 Prompt for AI Agents
In `@external/dynamo/optimized/processor_multilru.py` around lines 507 - 510,
Replace the logger.error call in the except block in
external/dynamo/optimized/processor_multilru.py with logger.exception so the
full stack trace is recorded when an exception is caught (keep the existing
message string and the subsequent self._metrics.router_errors_total.inc() and
return None, None intact); locate the except block that currently does
logger.error("Failed to pick worker: %s", e) and change it to
logger.exception(...) to capture the traceback for debugging.
# Fire-and-forget KVE metric update (async, non-blocking)
# This ensures KVE computation has ZERO impact on routing throughput
if kve_data is not None:
    asyncio.create_task(self._update_kve_metrics_async(kve_data))
Store reference to asyncio.create_task() to prevent garbage collection.
Fire-and-forget tasks created with asyncio.create_task() can be garbage collected before completion if no reference is held. Store the task reference or add it to a background task set.
Suggested fix
Add a class-level set to track background tasks in __init__:
self._background_tasks: set[asyncio.Task] = set()

Then modify the create_task call:
if kve_data is not None:
- asyncio.create_task(self._update_kve_metrics_async(kve_data))
+ task = asyncio.create_task(self._update_kve_metrics_async(kve_data))
+ self._background_tasks.add(task)
+ task.add_done_callback(self._background_tasks.discard)🧰 Tools
🪛 Ruff (0.14.14)
[warning] 664-664: Store a reference to the return value of asyncio.create_task
(RUF006)
🤖 Prompt for AI Agents
In `@external/dynamo/optimized/processor_multilru.py` around lines 661 - 664, The
fire-and-forget call using
asyncio.create_task(self._update_kve_metrics_async(kve_data)) can let the Task
be garbage-collected; add a class-level set (e.g., self._background_tasks) in
the class __init__ and store the created Task in it, e.g., task =
asyncio.create_task(self._update_kve_metrics_async(kve_data));
self._background_tasks.add(task), and ensure completed tasks are removed (attach
a done callback or await/cleanup in shutdown) so tasks are tracked and not
prematurely collected.
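For reference, the tracked-task pattern in isolation — a runnable sketch with hypothetical names, independent of the processor class:

import asyncio

background_tasks: set[asyncio.Task] = set()

async def update_kve_metrics(data: dict) -> None:
    await asyncio.sleep(0)  # placeholder for the real async metric update

def fire_and_forget(data: dict) -> None:
    task = asyncio.create_task(update_kve_metrics(data))
    # Holding a strong reference until completion prevents the event loop's
    # weak reference from being the only one, so the task cannot be GC'd.
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)

async def main() -> None:
    fire_and_forget({"cached_tokens": 128})
    await asyncio.sleep(0.1)  # give the background task time to finish

asyncio.run(main())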
Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/dynamo_integration/react_benchmark_agent/src/react_benchmark_agent/configs/eval_config_no_rethinking_minimal_test.yml (1)
103-111: ⚠️ Potential issue | 🟠 Major

Update `eval_llm` to use port 8000 for consistency.

The `eval_llm` uses port 8099 while `dynamo_llm` uses port 8000 in the same file. This is inconsistent. The `profile_rethinking_full_test.yml` config shows that `eval_llm` works correctly with port 8000, and most other eval configs maintain the same port for both LLMs. Update line 106 to `base_url: http://localhost:8000/v1`.
🤖 Fix all issues with AI agents
In `@external/dynamo/optimized/processor_multilru.py`:
- Around line 742-769: Update parse_args() to declare and return an
argparse.Namespace return type and add a new CLI flag for kv_cache_block_size
(e.g., parser.add_argument("--kv-cache-block-size", type=int, default=1,
help="...")), then propagate that parsed value into register_llm where
kv_cache_block_size is currently hardcoded to 1 so the function uses
args.kv_cache_block_size; ensure register_llm signature or call sites accept the
new parameter if needed and update any references in the file (including the
other occurrence around lines 809-816) to use the parsed value.
In `@external/dynamo/optimized/processor.py`:
- Around line 497-500: Replace the plain error log with an exception log so the
full stack trace is captured: in the exception handler that currently calls
logger.error("Failed to pick worker: %s", e) (the block that then increments
self._metrics.router_errors_total and returns None, None), use
logger.exception(...) instead (preserving the message text) so the stack trace
is recorded when the method fails to pick a worker.
- Line 654: The create_task call that fires-and-forgets
self._update_kve_metrics_async(kve_data) can be garbage-collected; add a
persistent container (e.g., self._background_tasks: set[asyncio.Task]) in the
class __init__ and replace the bare asyncio.create_task(...) with creating the
task, adding it to self._background_tasks, and attaching a done callback to
remove the task from the set when finished; reference the existing call to
self._update_kve_metrics_async and the class initializer to locate where to add
the set and where to store/remove the task.
In `@external/dynamo/optimized/router_multilru.py`:
- Around line 1343-1345: The multilru router/processor files are inconsistent
with router.py/processor.py regarding component.create_service(); update
router_multilru.py and processor_multilru.py to match the non-multilru files by
conditionally invoking create_service when present: after obtaining component
via runtime.namespace("dynamo").component("router") (and similarly for
processor), check for the existence of create_service on the component and await
it if available (e.g., use a guard like hasattr or getattr to avoid failing on
Dynamo 0.8.x), and update/remove the outdated comment so both variants behave
consistently across Dynamo versions.
🧹 Nitpick comments (13)
external/dynamo/monitoring/scripts/kv_event_observer.py (2)
171-206: Consider validating URL scheme for defensive coding.

The `urlopen` call constructs URLs from `host` and `metrics_port` parameters. While this is an operator-run monitoring script, validating that the constructed URL uses an `http://` scheme would be more defensive against misconfiguration.

🛡️ Optional: Add scheme validation

  def _poll_metrics(self):
      """Background thread to poll Prometheus metrics for cache hits."""
      metrics_url = f"http://{self.host}:{self.metrics_port}/metrics"
+     if not metrics_url.startswith(("http://", "https://")):
+         raise ValueError(f"Invalid metrics URL scheme: {metrics_url}")
      while self.running:
507-508: Consider prefixing unused signal handler arguments.

The signal handler signature requires `(signum, frame)` arguments but they're unused. Prefixing with `_` improves clarity and silences linter warnings.

♻️ Optional: Prefix unused args

- signal.signal(signal.SIGINT, lambda s, f: setattr(observer, 'running', False))
- signal.signal(signal.SIGTERM, lambda s, f: setattr(observer, 'running', False))
+ signal.signal(signal.SIGINT, lambda _s, _f: setattr(observer, 'running', False))
+ signal.signal(signal.SIGTERM, lambda _s, _f: setattr(observer, 'running', False))

external/dynamo/optimized/router.py (5)
79-81: Unused parameters in fallback KvIndexer.find_matches_for_request.

The `tokens` and `min_overlap` parameters are intentionally unused in this fallback stub implementation. Consider prefixing them with underscores to indicate they are deliberately ignored.

Suggested fix

- async def find_matches_for_request(self, tokens: list[int], min_overlap: int) -> OverlapScores:
+ async def find_matches_for_request(self, _tokens: list[int], _min_overlap: int) -> OverlapScores:
      """Find overlap scores for each worker. Returns empty scores (round-robin fallback)."""
      return OverlapScores({})
171-171: Hardcoded /tmp path for debug traces.

Using `/tmp` as the default trace directory can be a security concern in shared environments. Consider using a user-specific temp directory or requiring explicit configuration.

Suggested fix using tempfile

+ import tempfile
+
  def get_builtin_defaults() -> dict[str, Any]:
      """Return built-in default configuration (matches config.yaml)."""
      return {
          ...
          "debug": {
              "traces_enabled": False,
-             "trace_dir": "/tmp/dynamo_router_traces",
+             "trace_dir": os.path.join(tempfile.gettempdir(), "dynamo_router_traces"),
              "buffer_size": 2000,
          },
      }

Apply similar change to the constructor default at line 473.
Also applies to: 473-473
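A short illustration of the portable default, assuming the same directory name as above:

import os
import tempfile

# Resolves to a user/system-appropriate temp root instead of a hardcoded /tmp
trace_dir = os.path.join(tempfile.gettempdir(), "dynamo_router_traces")
os.makedirs(trace_dir, exist_ok=True)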
568-573: Use logger.debug with exception info for trace write failures.

The current implementation catches a blind `Exception` and logs with `%s`. Consider using `logger.debug(..., exc_info=True)` to capture the full context for debugging trace write issues.

Suggested fix

  try:
      path = os.path.join(self.debug_trace_dir, "router_traces.jsonl")
      with open(path, "a", encoding="utf-8") as f:
          f.write(json.dumps(item, separators=(",", ":")) + "\n")
- except Exception as e:
-     logger.debug("Trace write failed: %s", e)
+ except OSError:
+     logger.debug("Trace write failed", exc_info=True)
1073-1073: Unused variable chosen_ctx should be prefixed with underscore.

The unpacked `chosen_ctx` variable is not used. Prefix it with an underscore to indicate intentional discard.

Suggested fix

- chosen, chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)
+ chosen, _chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)
1266-1333: Add return type annotation to parse_args().

Per coding guidelines, public functions require type hints on return values.

Suggested fix

- def parse_args():
+ def parse_args() -> argparse.Namespace:
      """Parse minimal CLI arguments.

external/dynamo/optimized/processor.py (1)
157-157: Sort __slots__ tuple for consistency.

Ruff suggests sorting `__slots__` alphabetically for maintainability.

Suggested fix

- __slots__ = ("prompt_tokens", "cached_tokens", "device_blocks", "host_blocks", "disk_blocks")
+ __slots__ = ("cached_tokens", "device_blocks", "disk_blocks", "host_blocks", "prompt_tokens")

external/dynamo/optimized/processor_multilru.py (1)
157-157: Sort __slots__ tuple for consistency.

Same issue as in processor.py - consider sorting alphabetically.

Suggested fix

- __slots__ = ("prompt_tokens", "cached_tokens", "device_blocks", "host_blocks", "disk_blocks")
+ __slots__ = ("cached_tokens", "device_blocks", "disk_blocks", "host_blocks", "prompt_tokens")

external/dynamo/optimized/router_multilru.py (4)
79-81: Unused parameters in fallback KvIndexer.find_matches_for_request.

Same issue as router.py - prefix unused parameters with underscores.

Suggested fix

- async def find_matches_for_request(self, tokens: list[int], min_overlap: int) -> OverlapScores:
+ async def find_matches_for_request(self, _tokens: list[int], _min_overlap: int) -> OverlapScores:
      """Find overlap scores for each worker. Returns empty scores (round-robin fallback)."""
      return OverlapScores({})
171-171: Hardcoded /tmp path for debug traces.

Same security concern as router.py - consider using `tempfile.gettempdir()` or requiring explicit configuration.

Also applies to: 473-473, 1388-1388
1073-1073: Unused variable chosen_ctx should be prefixed with underscore.

Same issue as router.py.

Suggested fix

- chosen, chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)
+ chosen, _chosen_ctx, all_ctx, raw_scores, probs = self._select_worker(worker_ids, req, metrics, scores)
1266-1333: Add return type annotation to parse_args().

Same issue as router.py - public functions require return type hints.

Suggested fix

- def parse_args():
+ def parse_args() -> argparse.Namespace:
      """Parse minimal CLI arguments.
def parse_args():
    """Parse command-line arguments for the processor."""
    parser = argparse.ArgumentParser(description="Optimized Thompson Sampling Processor")
    parser.add_argument(
        "--enable-router",
        action="store_true",
        default=True,
        help="Enable Thompson Sampling router integration",
    )
    parser.add_argument(
        "--no-router",
        action="store_false",
        dest="enable_router",
        help="Disable router (use engine load balancing only)",
    )
    parser.add_argument(
        "--model-path",
        type=str,
        required=True,
        help="Path to the model directory (for loading tokenizer and model card)",
    )
    parser.add_argument(
        "--model-name",
        type=str,
        required=True,
        help="Served model name (must match frontend's --model-name)",
    )
    return parser.parse_args()
Add return type to parse_args() and thread kv_cache_block_size through CLI.
Same issue as processor.py - parse_args() lacks return type and kv_cache_block_size is hardcoded to 1.
Suggested fix
-def parse_args():
+def parse_args() -> argparse.Namespace:
"""Parse command-line arguments for the processor."""
parser = argparse.ArgumentParser(description="Optimized Thompson Sampling Processor")
...
+ parser.add_argument(
+ "--kv-cache-block-size",
+ type=int,
+ default=int(os.environ.get("DYNAMO_KV_BLOCK_SIZE", "64")),
+ help="KV cache block size in tokens (must match workers' --page-size)",
+ )
    return parser.parse_args()

Then update register_llm:
await register_llm(
...
- kv_cache_block_size=1,
+ kv_cache_block_size=args.kv_cache_block_size,
  )

Also applies to: 809-816
🤖 Prompt for AI Agents
In `@external/dynamo/optimized/processor_multilru.py` around lines 742 - 769,
Update parse_args() to declare and return an argparse.Namespace return type and
add a new CLI flag for kv_cache_block_size (e.g.,
parser.add_argument("--kv-cache-block-size", type=int, default=1, help="...")),
then propagate that parsed value into register_llm where kv_cache_block_size is
currently hardcoded to 1 so the function uses args.kv_cache_block_size; ensure
register_llm signature or call sites accept the new parameter if needed and
update any references in the file (including the other occurrence around lines
809-816) to use the parsed value.
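As a self-contained sketch of the proposed flag (env-var name and default of 64 taken from the diff above; the register_llm call is shown only as a comment, since its real signature lives in Dynamo):

import argparse
import os

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="processor flags (sketch)")
    parser.add_argument(
        "--kv-cache-block-size",
        type=int,
        default=int(os.environ.get("DYNAMO_KV_BLOCK_SIZE", "64")),
        help="KV cache block size in tokens (must match workers' --page-size)",
    )
    return parser.parse_args()

args = parse_args()
# await register_llm(..., kv_cache_block_size=args.kv_cache_block_size)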
except Exception as e:
    logger.error("Failed to pick worker: %s", e)
    self._metrics.router_errors_total.inc()
    return None, None
Use logger.exception() when catching exceptions without re-raising.
Per coding guidelines, when catching and logging exceptions without re-raising, use logger.exception() to capture full stack trace information.
Suggested fix
except Exception as e:
- logger.error("Failed to pick worker: %s", e)
+ logger.exception("Failed to pick worker")
self._metrics.router_errors_total.inc()
    return None, None
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 497-497: Do not catch blind exception: Exception
(BLE001)
[warning] 498-498: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🤖 Prompt for AI Agents
In `@external/dynamo/optimized/processor.py` around lines 497 - 500, Replace the
plain error log with an exception log so the full stack trace is captured: in
the exception handler that currently calls logger.error("Failed to pick worker:
%s", e) (the block that then increments self._metrics.router_errors_total and
returns None, None), use logger.exception(...) instead (preserving the message
text) so the stack trace is recorded when the method fails to pick a worker.
# Fire-and-forget KVE metric update (async, non-blocking)
# This ensures KVE computation has ZERO impact on routing throughput
if kve_data is not None:
    asyncio.create_task(self._update_kve_metrics_async(kve_data))
Store reference to asyncio.create_task() to prevent garbage collection.
Fire-and-forget tasks can be garbage collected before completion. Store the task reference to ensure it completes.
Suggested fix
Add a class-level set in __init__:
self._background_tasks: set[asyncio.Task] = set()

Then modify the create_task call:
if kve_data is not None:
- asyncio.create_task(self._update_kve_metrics_async(kve_data))
+ task = asyncio.create_task(self._update_kve_metrics_async(kve_data))
+ self._background_tasks.add(task)
+ task.add_done_callback(self._background_tasks.discard)🧰 Tools
🪛 Ruff (0.14.14)
[warning] 654-654: Store a reference to the return value of asyncio.create_task
(RUF006)
🤖 Prompt for AI Agents
In `@external/dynamo/optimized/processor.py` at line 654, The create_task call
that fires-and-forgets self._update_kve_metrics_async(kve_data) can be
garbage-collected; add a persistent container (e.g., self._background_tasks:
set[asyncio.Task]) in the class __init__ and replace the bare
asyncio.create_task(...) with creating the task, adding it to
self._background_tasks, and attaching a done callback to remove the task from
the set when finished; reference the existing call to
self._update_kve_metrics_async and the class initializer to locate where to add
the set and where to store/remove the task.
component = runtime.namespace("dynamo").component("router")
# NOTE: create_service() was removed in Dynamo 0.8.x - endpoint creation handles registration
logger.info("Initializing Optimized Thompson Sampling Router (Prometheus metrics)")
🧩 Analysis chain
🏁 Script executed:
rg -n "create_service" external/dynamo/optimized/

Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 600
🏁 Script executed:
cd external/dynamo/optimized && \
echo "=== router.py around line 1344 ===" && \
sed -n '1340,1350p' router.py && \
echo "" && \
echo "=== router_multilru.py around line 1343-1345 ===" && \
sed -n '1340,1350p' router_multilru.py

Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 1012
🏁 Script executed:
# Also check if there are other router files
ls -la external/dynamo/optimized/router*.py

Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 249
Resolve create_service() inconsistency between router.py and router_multilru.py.
router.py calls await component.create_service() at line 1344, while router_multilru.py omits it with a note that the function was removed in Dynamo 0.8.x. The same pattern appears in processor.py vs processor_multilru.py. Clarify whether the multilru variants target a different Dynamo version or if the comment is outdated.
🤖 Prompt for AI Agents
In `@external/dynamo/optimized/router_multilru.py` around lines 1343 - 1345, The
multilru router/processor files are inconsistent with router.py/processor.py
regarding component.create_service(); update router_multilru.py and
processor_multilru.py to match the non-multilru files by conditionally invoking
create_service when present: after obtaining component via
runtime.namespace("dynamo").component("router") (and similarly for processor),
check for the existence of create_service on the component and await it if
available (e.g., use a guard like hasattr or getattr to avoid failing on Dynamo
0.8.x), and update/remove the outdated comment so both variants behave
consistently across Dynamo versions.
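A sketch of the version-tolerant guard suggested above, assuming it runs inside the existing async startup path where `runtime` is already in scope:

component = runtime.namespace("dynamo").component("router")

# Call create_service() only when the component still exposes it; on
# Dynamo versions where it was removed, endpoint creation registers us.
create_service = getattr(component, "create_service", None)
if create_service is not None:
    await create_service()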
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@external/dynamo/monitoring/README.md`:
- Line 3: Replace the inanimate possessive phrase "ai-dynamo's Prometheus API"
in the README with a non-possessive form such as "the ai-dynamo Prometheus API"
or "the Prometheus API from ai-dynamo" so the sentence reads e.g. "Metrics are
collected at 2-second resolution directly from the ai-dynamo Prometheus API for
per-request granularity."; locate the exact phrase "ai-dynamo's Prometheus API"
in the README and update it to one of the suggested alternatives.
@@ -0,0 +1,982 @@
# Dynamo Monitoring Stack

This directory contains a Prometheus + Grafana monitoring setup for the Dynamo LLM inference stack with Thompson Sampling router. Metrics are collected at **2-second resolution** directly from ai-dynamo's Prometheus API for per-request granularity.
Avoid inanimate possessive.
The phrasing "ai-dynamo's Prometheus API" uses an inanimate possessive. Per documentation style guidelines, prefer "the Prometheus API from ai-dynamo" or "the ai-dynamo Prometheus API."
✏️ Suggested change
-This directory contains a Prometheus + Grafana monitoring setup for the Dynamo LLM inference stack with Thompson Sampling router. Metrics are collected at **2-second resolution** directly from ai-dynamo's Prometheus API for per-request granularity.
+This directory contains a Prometheus + Grafana monitoring setup for the Dynamo LLM inference stack with Thompson Sampling router. Metrics are collected at **2-second resolution** directly from the ai-dynamo Prometheus API for per-request granularity.
🤖 Prompt for AI Agents
In `@external/dynamo/monitoring/README.md` at line 3, Replace the inanimate
possessive phrase "ai-dynamo's Prometheus API" in the README with a
non-possessive form such as "the ai-dynamo Prometheus API" or "the Prometheus
API from ai-dynamo" so the sentence reads e.g. "Metrics are collected at
2-second resolution directly from the ai-dynamo Prometheus API for per-request
granularity."; locate the exact phrase "ai-dynamo's Prometheus API" in the
README and update it to one of the suggested alternatives.
Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
Signed-off-by: bbednarski9 <bbednarski@nvidia.com>
Description
This PR brings the ai-dynamo Prometheus metrics API integration with Grafana dashboarding into NAT. This is a core feature of the Dynamo optimization pipeline for the NAT/Dynamo integration. We also integrate the Prometheus API with the NAT profiler, so that we can begin end-to-end agentic header-injection optimization.
Closes
By Submitting this PR I confirm:
Summary by CodeRabbit
New Features
Documentation
Chores
✏️ Tip: You can customize this high-level summary in your review settings.