[History Server Tests] end-to-end Tests for Dead Cluster Tasks Endpoints#4470

Closed
sb-hakunamatata wants to merge 2 commits into ray-project:master from sb-hakunamatata:fix-history-server-dead-url-tests-4378

Conversation

@sb-hakunamatata

@sb-hakunamatata sb-hakunamatata commented Jan 31, 2026

Why are these changes needed?

This PR fixes RayJob submission failures in History Server E2E tests and adds coverage for History Server behavior when the Ray cluster is deleted (dead cluster scenario).

What changed

  • Added a new dead-cluster E2E test suite under tests/e2e to verify History Server endpoints return HTTP 200 and valid JSON after the live Ray cluster is deleted:

    • /api/v0/tasks
    • /api/v0/tasks?filter_keys=job_id...
    • /api/v0/tasks/summarize
    • /logical/actors
    • /logical/actors/{actor_id}
    • /nodes?view=summary
  • Updated the test helper ApplyRayJobAndWaitForCompletion to assign a unique RayJob name on every submission by appending a UUID.
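
The unique-naming change can be sketched in plain Go. This is only an illustration: `uniqueRayJobName` is a hypothetical helper name, and it uses a random hex suffix from the standard library rather than the UUID library the PR presumably uses inside `ApplyRayJobAndWaitForCompletion`:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// uniqueRayJobName appends a random hex suffix to a base name so that
// submitting the same manifest repeatedly does not collide on the
// Kubernetes resource name. Hypothetical sketch, not the PR's code.
func uniqueRayJobName(base string) string {
	buf := make([]byte, 4)
	if _, err := rand.Read(buf); err != nil {
		panic(err)
	}
	return fmt.Sprintf("%s-%s", base, hex.EncodeToString(buf))
}

func main() {
	// Two submissions of the same manifest get distinct resource names.
	fmt.Println(uniqueRayJobName("rayjob-sample"))
	fmt.Println(uniqueRayJobName("rayjob-sample"))
}
```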

Why

  • The same RayJob manifest is submitted multiple times within a single E2E test.

  • Reusing the same RayJob name caused submission failures due to Kubernetes resource name collisions.

  • Making the RayJob name unique ensures:

    • RayJob submissions succeed reliably
    • Tests remain deterministic and isolated
    • No interference between multiple job submissions in the same namespace

Testing

  • Ran History Server E2E tests locally.
  • Verified RayJob submission succeeds when the same manifest is applied multiple times.
  • Confirmed all History Server endpoints return 200 OK and non-empty JSON responses after the cluster is deleted.
  • Soft validation of the data returned by those endpoints. (Currently none of the endpoints returns exactly the same JSON as the live server, so those data validation assertions are set to false.)
  • Test results are present at https://pub.microbin.eu/p/eel-swan-gecko

Rollback plan

  • Revert this PR to restore the previous RayJob naming behavior and remove the new E2E coverage.

Related issue number

Fixes #4378

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@sb-hakunamatata
Author

@Future-Outlier please review.

@Future-Outlier Future-Outlier self-assigned this Feb 2, 2026
Copilot AI mentioned this pull request Feb 5, 2026
Member

@Future-Outlier Future-Outlier left a comment


Hi, thanks for the PR.

  1. can you do something like this PR? #4461
  2. don't add a new file, instead, add endpoints to historyserver/test/e2e/historyserver_test.go
  3. the goal of your PR is to add task endpoint, so plz don't include actor
  4. please change the title.

Contributor

@JiangJiaWei1103 JiangJiaWei1103 left a comment


I agree with @Future-Outlier. Please follow the pattern used in #4461 and keep this PR focused on task-related endpoints, without including actors or nodes. Thanks.

tests := []struct {
	name           string
	testFunc       func(Test, *WithT, *corev1.Namespace, *s3.S3, bool)
	dataValidation bool
Contributor


Could you clarify the rationale for introducing dataValidation while disabling it across all tests?

Author

@sb-hakunamatata sb-hakunamatata Feb 8, 2026


The current set of tests fails with dataValidation enabled, for all endpoints. The rationale is to have soft tests now; later, once we fix the data endpoints, we can simply enable the data validation so the tests verify them.

As I understood it, the idea of this task was to validate the endpoints by capturing the data from the live cluster, destroying it, and then checking that the History Server can replicate the responses, so I added those data validations.
However, upon test failures I learned that certain tasks are missing from them.
To allow iterative development I kept the validation behind a flag in the test; currently it just outputs the difference, e.g. what is missing from or additional in the endpoints. This info is visible in the test logs.

Once we fix those task endpoints and their data, we can enable the flag to verify them.
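
The flag-gated soft/strict behavior described above can be sketched in plain Go. This is illustrative only: `validateJSON` is a hypothetical stand-in for the PR's `compareJsons` helper, using stdlib structural comparison instead of Gomega:

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// validateJSON compares two JSON documents structurally. With strict
// set to false it only logs a mismatch (the "soft" mode above); with
// strict set to true a mismatch becomes a hard error. Hypothetical
// sketch, not the PR's actual compareJsons helper.
func validateJSON(want, got string, strict bool) error {
	var w, g interface{}
	if err := json.Unmarshal([]byte(want), &w); err != nil {
		return fmt.Errorf("bad expected JSON: %w", err)
	}
	if err := json.Unmarshal([]byte(got), &g); err != nil {
		return fmt.Errorf("bad actual JSON: %w", err)
	}
	if !reflect.DeepEqual(w, g) {
		if strict {
			return fmt.Errorf("JSON mismatch:\nwant: %s\ngot:  %s", want, got)
		}
		// Soft mode: surface the difference in the logs without failing.
		fmt.Printf("soft check: mismatch\nwant: %s\ngot:  %s\n", want, got)
	}
	return nil
}

func main() {
	// Same content with different key order passes even in strict mode.
	fmt.Println(validateJSON(`{"a":1,"b":2}`, `{"b":2,"a":1}`, true))
	// A real difference is only logged when strict is false.
	fmt.Println(validateJSON(`{"a":1}`, `{"a":2}`, false))
}
```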

@sb-hakunamatata sb-hakunamatata changed the title [History Server Tests] end-to-end Tests for Dead Cluster Endpoints [History Server Tests] end-to-end Tests for Dead Cluster Tasks Endpoints Feb 8, 2026
@sb-hakunamatata
Author

Hi, thanks for the PR.

  1. can you do something like this PR? [Test][HistoryServer] E2E test for dead cluster actor endpoint #4461
  2. don't add a new file, instead, add endpoints to historyserver/test/e2e/historyserver_test.go
  3. the goal of your PR is to add task endpoint, so plz don't include actor
  4. please change the title.

@Future-Outlier done, please check now: I moved the tests to the same file, removed the actor endpoints, and changed the title.

Comment on lines 420 to 423
val := compareJsons(test, gg, string(requiredBody), string(body))
if dataValidation {
	gg.Expect(val).To(BeTrue())
}
Collaborator


Could it be?

Suggested change
- val := compareJsons(test, gg, string(requiredBody), string(body))
- if dataValidation {
- 	gg.Expect(val).To(BeTrue())
- }
+ if dataValidation {
+ 	gg.Expect(body).To(MatchJSON(requiredBody))
+ }

Author


I wrote compareJsons to allow logging the diff between the JSONs when they mismatch. Can this be made a soft check?

Collaborator


Here is the output of MatchJSON, does that satisfy your need?

        Expected
            <string>: {
              "a": "aa"
            }
        to match JSON of
            <string>: {
              "b": "bb",
              "a": "aa"
            }

However, it is not recommended to use cmp.Diff to compare two JSONs. It is like a string comparison, which means the key ordering must be the same. With keys in a different order, it would output the following:

 string(
- 	`{"a": "aa", "b": "bb"}`,
+ 	`{"b": "bb", "a": "aa"}`,
)

If you still want a JSON diff, I guess we might need to introduce a library that ships JSON diffing.

Author


For now let me just gate it behind the flag: if dataValidation is set, use MatchJSON to validate; otherwise skip that check and do a basic sanity check of the available fields.

gg.Expect(err).NotTo(HaveOccurred())
gg.Expect(len(body)).To(BeNumerically(">", 0))

compareJsons(test, gg, jobResponses[job1], string(body))
Collaborator


Could it use MatchJSON?

gg.Expect(err).NotTo(HaveOccurred())
gg.Expect(len(body)).To(BeNumerically(">", 0))

_ = compareJsons(test, gg, jobResponses[job2], string(body))
Collaborator


ditto

Comment on lines 366 to 367
ApplyRayJobAndWaitForCompletion(test, g, namespace, rayCluster)
ApplyRayJobAndWaitForCompletion(test, g, namespace, rayCluster)
Collaborator


Why do we need two RayJobs? The following code seems to work with one RayJob.

Author


I added this to validate that the tasks actually got merged for the /tasks endpoint, ensuring live and history give the same responses when multiple jobs are present. But yes, it would work with just one RayJob.

Comment on lines 482 to 484
if len(jobs) != 2 {
	return
}
Collaborator


This looks like redundant code. The assertion above already guarantees the length is equal to 2. Why check again?

Author


Yeah, I think during iteration I added this first and then the assertion above. Fixing it.

@sb-hakunamatata
Author

Somehow during the rebase the other commits got pulled in; fixing it.

@sb-hakunamatata sb-hakunamatata force-pushed the fix-history-server-dead-url-tests-4378 branch from f5d7f28 to 8f8820f on February 14, 2026 12:13

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

@sb-hakunamatata
Author

Closing this PR since the work was already done in #4436.


Development

Successfully merging this pull request may close these issues.

[Feature][history server] E2E test for dead cluster task endpoint
