[Feature][history server] Support arbitrary Ray Dashboard endpoint collection#4529

Open
Future-Outlier wants to merge 11 commits into ray-project:master from Future-Outlier:arbitrary-endpoints

@Future-Outlier (Member) commented Feb 23, 2026

issue link: #4530

Summary

Allow the collector to fetch and store arbitrary Ray Dashboard API endpoints (e.g. /api/v0/placement_groups),
not just hardcoded ones. The historyserver serves them back for dead clusters via a fallback handler.

End-to-End Lifecycle

  ┌─────────────────────────────────────────────────────────────────────┐
  │  raycluster.yaml                                                    │
  │  RAY_COLLECTOR_ADDITIONAL_ENDPOINTS = "/api/v0/placement_groups"    │
  │  RAY_COLLECTOR_POLL_INTERVAL        = "30s"                         │
  └──────────────────────────┬──────────────────────────────────────────┘
                             │
            ┌────────────────▼────────────────┐
            │  Collector (sidecar on head pod) │
            │                                  │
            │  1. FetchAndStoreClusterMetadata  │──── one-time on startup
            │     GET /api/v0/cluster_metadata  │
            │                                  │
            │  2. PollAdditionalEndpoints       │──── every 30s
            │     GET /api/v0/placement_groups  │
            └────────────────┬─────────────────┘
                             │ WriteFile()
                             ▼
  ┌──────────────────────────────────────────────────────────────────────┐
  │  S3 Storage                                                          │
  │                                                                      │
  │  {cluster}_{id}/                                                     │
  │    {session}/                                                        │
  │      logs/                          ← unchanged                      │
  │      fetched_endpoints/             ← NEW (was meta/{session}/)      │
  │        restful__api__v0__cluster_metadata                            │
  │        restful__api__v0__placement_groups                            │
  └──────────────────────────┬───────────────────────────────────────────┘
                             │ GetContent()
                             ▼
            ┌─────────────────────────────────────┐
            │  Historyserver                       │
            │                                      │
            │  /api/v0/cluster_metadata             │─── dedicated handler
            │  /api/v0/placement_groups             │─── fallback handler
            │  /api/v0/{any_polled_endpoint}        │    (/{subpath:*})
            │                                      │
            │  live session  → proxy to dashboard   │
            │  dead session  → read from S3         │
            └─────────────────────────────────────┘
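
The object names in the storage box above suggest a simple mapping from a dashboard path to a flat file name. The following is a minimal sketch of that mapping, inferred only from the two names shown in the diagram; `endpointStorageKey`, `endpointObjectPath`, and the sample cluster and session names are hypothetical and may not match the helpers in this PR.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// endpointStorageKey flattens a dashboard path into a single object name.
// Assumption: the "restful" prefix and "__" separator are inferred from the
// diagram (/api/v0/placement_groups -> restful__api__v0__placement_groups).
func endpointStorageKey(endpoint string) string {
	trimmed := strings.Trim(endpoint, "/")
	return "restful__" + strings.ReplaceAll(trimmed, "/", "__")
}

// endpointObjectPath builds {cluster}_{id}/{session}/fetched_endpoints/{key}.
// clusterDir stands in for the "{cluster}_{id}" directory in the diagram.
func endpointObjectPath(clusterDir, session, endpoint string) string {
	return path.Join(clusterDir, session, "fetched_endpoints", endpointStorageKey(endpoint))
}

func main() {
	// Sample names are made up for illustration.
	fmt.Println(endpointObjectPath("mycluster_abc", "session_x", "/api/v0/placement_groups"))
}
```

Keeping the key flat means every polled endpoint lands as one object under fetched_endpoints/, so the fallback handler can presumably derive the object name from the request path alone.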

Key Changes

  • Collector: meta.go (one-time fetch) + poll.go (periodic poll) for arbitrary endpoints
  • Storage path: {cluster}/meta/{session}/ → {cluster}/{session}/fetched_endpoints/ (consistent with {cluster}/{session}/logs/)
  • Router: getAdditionalEndpoint fallback handler serves any polled endpoint from S3
  • Config: RAY_COLLECTOR_ADDITIONAL_ENDPOINTS env var (comma-separated endpoint paths)
  • RayJob: Creates a detached placement group to validate end-to-end flow
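
For reviewers who want to reason about the config surface, here is a rough sketch of how the two env vars could be parsed. The variable names come from this PR; the function names, the whitespace and duplicate handling, the leading-slash normalization, and the 30s fallback are assumptions for illustration, not the PR's actual behavior.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// parseAdditionalEndpoints splits a comma-separated list of endpoint paths.
// Assumptions: blanks are skipped, duplicates are dropped, and a missing
// leading slash is added.
func parseAdditionalEndpoints(raw string) []string {
	var endpoints []string
	seen := make(map[string]bool)
	for _, part := range strings.Split(raw, ",") {
		ep := strings.TrimSpace(part)
		if ep == "" {
			continue
		}
		if !strings.HasPrefix(ep, "/") {
			ep = "/" + ep
		}
		if seen[ep] {
			continue
		}
		seen[ep] = true
		endpoints = append(endpoints, ep)
	}
	return endpoints
}

// parsePollInterval parses a Go duration string such as "30s" and returns
// fallback when the value is empty, malformed, or not positive.
func parsePollInterval(raw string, fallback time.Duration) time.Duration {
	d, err := time.ParseDuration(strings.TrimSpace(raw))
	if err != nil || d <= 0 {
		return fallback
	}
	return d
}

func main() {
	endpoints := parseAdditionalEndpoints(os.Getenv("RAY_COLLECTOR_ADDITIONAL_ENDPOINTS"))
	interval := parsePollInterval(os.Getenv("RAY_COLLECTOR_POLL_INTERVAL"), 30*time.Second)
	fmt.Println(endpoints, interval)
}
```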

alimaazamat and others added 9 commits February 19, 2026 16:23
Signed-off-by: Future-Outlier <eric901201@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 305a13fd09

@Future-Outlier changed the title from "[history server] Support Addtional Endpoints" to "[Feature][history server] Support arbitrary Ray Dashboard endpoint collection" on Feb 23, 2026
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Signed-off-by: Future-Outlier <eric901201@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ddb8045d48

Comment on lines +111 to +112
resp, err := r.HttpClient.Do(req)
cancel()

[P1] Keep poll request context alive through body read

pollSingleEndpoint cancels the request context immediately after HttpClient.Do returns, but in Go the request context covers reading the response body too; canceling it before io.ReadAll can turn normal/chunked responses into context canceled read errors and skip writes. This makes additional endpoint snapshots flaky or consistently missing for slower/larger endpoints, so cancel() should run only after body read/close completes (for example via defer).
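
A minimal sketch of the suggested fix, assuming a standalone helper rather than the actual pollSingleEndpoint method: `fetchBody` and `runDemo` are hypothetical names, and the test server that sends its body in two delayed chunks exists only to exercise the body read after Do returns.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchBody issues a GET and reads the full body while the request context
// is still live. cancel is deferred, so it runs only after the body has been
// read and closed (deferred calls run in reverse order).
func fetchBody(client *http.Client, url string, timeout time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d from %s", resp.StatusCode, url)
	}
	return io.ReadAll(resp.Body)
}

// runDemo serves a response in two chunks with a pause in between, so part of
// the body arrives after client.Do has already returned.
func runDemo() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, `{"result":`)
		if f, ok := w.(http.Flusher); ok {
			f.Flush()
		}
		time.Sleep(50 * time.Millisecond)
		fmt.Fprint(w, `true}`)
	}))
	defer srv.Close()

	body, err := fetchBody(srv.Client(), srv.URL, 5*time.Second)
	return string(body), err
}

func main() {
	body, err := runDemo()
	fmt.Println(body, err)
}
```

With cancel() called right after Do instead of deferred, the second chunk in this demo would be at risk of failing with a context canceled read error, which is the flakiness described above.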

Comment on lines +63 to +64
resp, err := r.HttpClient.Do(req)
cancel()

[P1] Keep metadata request context alive through body read

FetchAndStoreClusterMetadata has the same early-cancel pattern: it calls cancel() right after HttpClient.Do and then reads resp.Body. Because the context governs body reads, this can produce context canceled while reading metadata and force retries (or never persist metadata on slower responses). Defer cancellation until after response body handling is finished.

Comment on lines +35 to +37
// Resolve the session name first so we can store metadata under the correct session path.
sessionName, err := r.resolveSessionName()
if err != nil {

[P2] Re-resolve session before persisting cluster metadata

The metadata path is bound to sessionName resolved once at startup, so if Ray rotates session_latest while the sidecar keeps running (a case already handled elsewhere via WatchSessionLatestLoops), metadata continues targeting the old session and the new session can return 404 for /api/v0/cluster_metadata. Resolve the current session when writing (or react to symlink changes) instead of pinning it once before the retry loop.
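
One way to read this suggestion is to resolve the session at write time instead of once before the retry loop. The sketch below assumes the session is identified by a session_latest symlink under a Ray root directory; `currentSessionName` and `runRotationDemo` are hypothetical names, and the real resolveSessionName in this PR may work differently.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// currentSessionName reads the session_latest symlink on every call, so a
// rotated session is picked up by the next write.
func currentSessionName(rayRoot string) (string, error) {
	target, err := os.Readlink(filepath.Join(rayRoot, "session_latest"))
	if err != nil {
		return "", fmt.Errorf("resolve session_latest: %w", err)
	}
	return filepath.Base(target), nil
}

// runRotationDemo repoints session_latest twice in a temp directory and
// records the session name seen after each rotation.
func runRotationDemo() ([]string, error) {
	root, err := os.MkdirTemp("", "ray-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)

	link := filepath.Join(root, "session_latest")
	var seen []string
	for _, session := range []string{"session_a", "session_b"} {
		_ = os.Remove(link) // no link exists on the first iteration
		if err := os.Symlink(filepath.Join(root, session), link); err != nil {
			return nil, err
		}
		name, err := currentSessionName(root)
		if err != nil {
			return nil, err
		}
		seen = append(seen, name)
	}
	return seen, nil
}

func main() {
	fmt.Println(runRotationDemo())
}
```

Calling the resolver inside the retry loop, right before the write, keeps the metadata path aligned with whatever session is current at that moment.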
