[Test][HistoryServer] E2E test for dead cluster actor endpoint #4461

Merged
rueian merged 11 commits into ray-project:master from fangyinc:issues4379
Feb 23, 2026

fangyinc (Contributor) commented Jan 29, 2026

Why are these changes needed?

Add an E2E test to verify that the history server can fetch actors from dead clusters (that is, after the RayCluster has been deleted).

Dead cluster actors test:

  • Verifies /logical/actors endpoint returns actors from S3 for dead clusters
  • Verifies /logical/actors/{actor_id} endpoint returns single actor details
  • Verifies non-existent actor queries return appropriate error responses

Related issue number

Closes #4379

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
Cursor Bugbot reviewed your changes and found no issues for commit 4f3229b

Future-Outlier (Member) left a comment

high quality, thank you!

Future-Outlier (Member) commented

Thanks to @JiangJiaWei1103 for helping with the review.

Future-Outlier (Member) commented

Hi @fangyinc, would you mind helping me resolve the merge conflict?

fangyinc (Contributor, Author) commented Feb 9, 2026

Quoting @Future-Outlier: "Hi @fangyinc, would you mind helping me resolve the merge conflict?"

Done.

Signed-off-by: Future-Outlier <eric901201@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
andrewsykim (Member) commented

Please fix the merge conflicts.

g.Eventually(func() error {
	_, err := GetRayCluster(test, namespace.Name, rayCluster.Name)
	return err
}, TestTimeoutMedium).Should(WithTransform(k8serrors.IsNotFound, BeTrue()))

Inlined deletion duplicates existing DeleteRayClusterAndWait helper

Low Severity

The new testLogicalActorsEndpointDeadCluster function manually inlines the RayCluster deletion and wait logic (delete, expect no error, log, Eventually wait for IsNotFound), which is exactly what the existing DeleteRayClusterAndWait helper in historyserver.go already does. The adjacent testLogFileEndpointDeadCluster test uses DeleteRayClusterAndWait for the same purpose, making this inconsistency more noticeable. Duplicating this logic increases maintenance burden and risks divergence if the deletion flow is updated.


rueian (Collaborator) commented Feb 13, 2026

Hi @fangyinc, please help to resolve the conflict.

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.



Comment on lines +218 to +228
// Delete RayCluster to trigger log upload to S3
err := test.Client().Ray().RayV1().RayClusters(namespace.Name).Delete(test.Ctx(), rayCluster.Name, metav1.DeleteOptions{})
g.Expect(err).NotTo(HaveOccurred())
LogWithTimestamp(test.T(), "Deleted RayCluster %s/%s", namespace.Name, rayCluster.Name)

// Wait for cluster to be fully deleted (ensures logs are uploaded to S3 and events are processed)
g.Eventually(func() error {
	_, err := GetRayCluster(test, namespace.Name, rayCluster.Name)
	return err
}, TestTimeoutMedium).Should(WithTransform(k8serrors.IsNotFound, BeTrue()))

Copilot AI commented Feb 14, 2026

The cluster deletion logic is duplicated here instead of using the existing DeleteRayClusterAndWait helper function. This is inconsistent with testLogFileEndpointDeadCluster (line 171) and testDeadClusterTasks which use the helper. Using the helper function improves code maintainability and ensures consistent deletion behavior across tests.

Suggested change
- // Delete RayCluster to trigger log upload to S3
- err := test.Client().Ray().RayV1().RayClusters(namespace.Name).Delete(test.Ctx(), rayCluster.Name, metav1.DeleteOptions{})
- g.Expect(err).NotTo(HaveOccurred())
- LogWithTimestamp(test.T(), "Deleted RayCluster %s/%s", namespace.Name, rayCluster.Name)
- // Wait for cluster to be fully deleted (ensures logs are uploaded to S3 and events are processed)
- g.Eventually(func() error {
- 	_, err := GetRayCluster(test, namespace.Name, rayCluster.Name)
- 	return err
- }, TestTimeoutMedium).Should(WithTransform(k8serrors.IsNotFound, BeTrue()))
+ // Delete RayCluster to trigger log upload to S3 and wait for full deletion
+ DeleteRayClusterAndWait(test, g, namespace, rayCluster)

// 3. Delete RayCluster to trigger log upload to S3 (and event processing)
// 4. Apply History Server and get its URL
// 5. Verify that the history server returns actors via /logical/actors endpoint
// 6. Verify that the history server returns a single actor via /logical/actors/{actor_id} endpoint
Copilot AI commented Feb 14, 2026

The test documentation lists 7 steps but misses the third sub-test in the step list. Step 6 should be expanded to include all three actor endpoint tests: (a) fetching all actors, (b) fetching a single actor by ID, and (c) handling non-existent actor queries. Consider updating the documentation to: "6. Verify that the history server returns actors via /logical/actors endpoint, returns a single actor via /logical/actors/{actor_id} endpoint, and handles non-existent actor queries appropriately"

Suggested change
- // 6. Verify that the history server returns a single actor via /logical/actors/{actor_id} endpoint
+ // 6. Verify that the history server returns actors via /logical/actors endpoint, returns a single actor via /logical/actors/{actor_id} endpoint, and handles non-existent actor queries appropriately

Signed-off-by: Future-Outlier <eric901201@gmail.com>
cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Signed-off-by: Future-Outlier <eric901201@gmail.com>
@rueian rueian merged commit 7ae0657 into ray-project:master Feb 23, 2026
30 of 31 checks passed
@github-project-automation github-project-automation bot moved this from To Review to Done in My Kuberay & Ray Feb 23, 2026
@github-project-automation github-project-automation bot moved this from can be merged to Done in @Future-Outlier's kuberay project Feb 23, 2026

Linked issue: [Feature][history server] E2E test for dead cluster actor endpoint

6 participants