[Feature][history server] Support arbitrary Ray Dashboard endpoint collection#4529
Future-Outlier wants to merge 11 commits into ray-project:master
Signed-off-by: Future-Outlier <eric901201@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 305a13fd09
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
historyserver/pkg/collector/logcollector/runtime/logcollector/collector.go
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ddb8045d48
    resp, err := r.HttpClient.Do(req)
    cancel()
Keep poll request context alive through body read
pollSingleEndpoint cancels the request context immediately after HttpClient.Do returns, but in Go the request context covers reading the response body too; canceling it before io.ReadAll can turn normal/chunked responses into context canceled read errors and skip writes. This makes additional endpoint snapshots flaky or consistently missing for slower/larger endpoints, so cancel() should run only after body read/close completes (for example via defer).
    resp, err := r.HttpClient.Do(req)
    cancel()
Keep metadata request context alive through body read
FetchAndStoreClusterMetadata has the same early-cancel pattern: it calls cancel() right after HttpClient.Do and then reads resp.Body. Because the context governs body reads, this can produce context canceled while reading metadata and force retries (or never persist metadata on slower responses). Defer cancellation until after response body handling is finished.
    // Resolve the session name first so we can store metadata under the correct session path.
    sessionName, err := r.resolveSessionName()
    if err != nil {
Re-resolve session before persisting cluster metadata
The metadata path is bound to sessionName resolved once at startup, so if Ray rotates session_latest while the sidecar keeps running (a case already handled elsewhere via WatchSessionLatestLoops), metadata continues targeting the old session and the new session can return 404 for /api/v0/cluster_metadata. Resolve the current session when writing (or react to symlink changes) instead of pinning it once before the retry loop.
issue link: #4530
Summary
Allow the collector to fetch and store arbitrary Ray Dashboard API endpoints (e.g. /api/v0/placement_groups), not just hardcoded ones. The historyserver serves them back for dead clusters via a fallback handler.
End-to-End Lifecycle
Key Changes
- meta.go (one-time fetch) + poll.go (periodic poll) for arbitrary endpoints
- {cluster}/meta/{session}/ → {cluster}/{session}/fetched_endpoints/ (consistent with {cluster}/{session}/logs/)
- getAdditionalEndpoint fallback handler serves any polled endpoint from S3
- RAY_COLLECTOR_ADDITIONAL_ENDPOINTS env var (comma-separated endpoint paths)