[Test][HistoryServer] E2E test for dead cluster actor endpoint#4461
[Test][HistoryServer] E2E test for dead cluster actor endpoint#4461rueian merged 11 commits intoray-project:masterfrom
Conversation
Future-Outlier
left a comment
There was a problem hiding this comment.
high quality, thank you!
|
thank @JiangJiaWei1103 for helping the review |
|
Hi, @fangyinc do you mind help me solve the merge conflict? |
Done. |
Signed-off-by: Future-Outlier <eric901201@gmail.com>
|
Please fix merge conflicts |
| g.Eventually(func() error { | ||
| _, err := GetRayCluster(test, namespace.Name, rayCluster.Name) | ||
| return err | ||
| }, TestTimeoutMedium).Should(WithTransform(k8serrors.IsNotFound, BeTrue())) |
There was a problem hiding this comment.
Inlined deletion duplicates existing DeleteRayClusterAndWait helper
Low Severity
The new testLogicalActorsEndpointDeadCluster function manually inlines the RayCluster deletion and wait logic (delete, expect no error, log, Eventually wait for IsNotFound), which is exactly what the existing DeleteRayClusterAndWait helper in historyserver.go already does. The adjacent testLogFileEndpointDeadCluster test uses DeleteRayClusterAndWait for the same purpose, making this inconsistency more noticeable. Duplicating this logic increases maintenance burden and risks divergence if the deletion flow is updated.
|
Hi @fangyinc, please help to resolve the conflict. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Delete RayCluster to trigger log upload to S3 | ||
| err := test.Client().Ray().RayV1().RayClusters(namespace.Name).Delete(test.Ctx(), rayCluster.Name, metav1.DeleteOptions{}) | ||
| g.Expect(err).NotTo(HaveOccurred()) | ||
| LogWithTimestamp(test.T(), "Deleted RayCluster %s/%s", namespace.Name, rayCluster.Name) | ||
|
|
||
| // Wait for cluster to be fully deleted (ensures logs are uploaded to S3 and events are processed) | ||
| g.Eventually(func() error { | ||
| _, err := GetRayCluster(test, namespace.Name, rayCluster.Name) | ||
| return err | ||
| }, TestTimeoutMedium).Should(WithTransform(k8serrors.IsNotFound, BeTrue())) | ||
|
|
There was a problem hiding this comment.
The cluster deletion logic is duplicated here instead of using the existing DeleteRayClusterAndWait helper function. This is inconsistent with testLogFileEndpointDeadCluster (line 171) and testDeadClusterTasks which use the helper. Using the helper function improves code maintainability and ensures consistent deletion behavior across tests.
| // Delete RayCluster to trigger log upload to S3 | |
| err := test.Client().Ray().RayV1().RayClusters(namespace.Name).Delete(test.Ctx(), rayCluster.Name, metav1.DeleteOptions{}) | |
| g.Expect(err).NotTo(HaveOccurred()) | |
| LogWithTimestamp(test.T(), "Deleted RayCluster %s/%s", namespace.Name, rayCluster.Name) | |
| // Wait for cluster to be fully deleted (ensures logs are uploaded to S3 and events are processed) | |
| g.Eventually(func() error { | |
| _, err := GetRayCluster(test, namespace.Name, rayCluster.Name) | |
| return err | |
| }, TestTimeoutMedium).Should(WithTransform(k8serrors.IsNotFound, BeTrue())) | |
| // Delete RayCluster to trigger log upload to S3 and wait for full deletion | |
| DeleteRayClusterAndWait(test, g, namespace, rayCluster) |
| // 3. Delete RayCluster to trigger log upload to S3 (and event processing) | ||
| // 4. Apply History Server and get its URL | ||
| // 5. Verify that the history server returns actors via /logical/actors endpoint | ||
| // 6. Verify that the history server returns a single actor via /logical/actors/{actor_id} endpoint |
There was a problem hiding this comment.
The test documentation lists 7 steps but misses the third sub-test in the step list. Step 6 should be expanded to include all three actor endpoint tests: (a) fetching all actors, (b) fetching a single actor by ID, and (c) handling non-existent actor queries. Consider updating the documentation to: "6. Verify that the history server returns actors via /logical/actors endpoint, returns a single actor via /logical/actors/{actor_id} endpoint, and handles non-existent actor queries appropriately"
| // 6. Verify that the history server returns a single actor via /logical/actors/{actor_id} endpoint | |
| // 6. Verify that the history server returns actors via /logical/actors endpoint, returns a single actor via /logical/actors/{actor_id} endpoint, and handles non-existent actor queries appropriately |
Signed-off-by: Future-Outlier <eric901201@gmail.com>


Why are these changes needed?
Add E2E test to verify history server can fetch actors from dead clusters (after RayCluster deletion)
Dead cluster actors test:
Related issue number
Closes #4379
Checks
- I've made sure the tests are passing.
- Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested :(
Cursor Bugbot reviewed your changes and found no issues for commit 4f3229b