[History Server Tests] end-to-end Tests for Dead Cluster Tasks Endpoints#4470

Closed
sb-hakunamatata wants to merge 2 commits into ray-project:master from sb-hakunamatata:fix-history-server-dead-url-tests-4378

Conversation

@sb-hakunamatata

@sb-hakunamatata sb-hakunamatata commented Jan 31, 2026

Why are these changes needed?

This PR fixes RayJob submission failures in History Server E2E tests and adds coverage for History Server behavior when the Ray cluster is deleted (dead cluster scenario).

What changed

  • Added a new dead-cluster E2E test suite under tests/e2e to verify History Server endpoints return HTTP 200 and valid JSON after the live Ray cluster is deleted:

    • /api/v0/tasks
    • /api/v0/tasks?filter_keys=job_id...
    • /api/v0/tasks/summarize
    • /logical/actors
    • /logical/actors/{actor_id}
    • /nodes?view=summary
  • Updated the test helper ApplyRayJobAndWaitForCompletion to assign a unique RayJob name on every submission by appending a UUID.
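
The unique-naming change can be sketched in plain Go. This is only an illustration: `uniqueRayJobName` is a hypothetical helper name, and it uses a random hex suffix from the standard library rather than the UUID library the PR presumably uses inside `ApplyRayJobAndWaitForCompletion`:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// uniqueRayJobName appends a random hex suffix to a base name so that
// submitting the same manifest repeatedly does not collide on the
// Kubernetes resource name. Hypothetical sketch, not the PR's code.
func uniqueRayJobName(base string) string {
	buf := make([]byte, 4)
	if _, err := rand.Read(buf); err != nil {
		panic(err)
	}
	return fmt.Sprintf("%s-%s", base, hex.EncodeToString(buf))
}

func main() {
	// Two submissions of the same manifest get distinct resource names.
	fmt.Println(uniqueRayJobName("rayjob-sample"))
	fmt.Println(uniqueRayJobName("rayjob-sample"))
}
```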

Why

  • The same RayJob manifest is submitted multiple times within a single E2E test.

  • Reusing the same RayJob name caused submission failures due to Kubernetes resource name collisions.

  • Making the RayJob name unique ensures:

    • RayJob submissions succeed reliably
    • Tests remain deterministic and isolated
    • No interference between multiple job submissions in the same namespace

Testing

  • Ran History Server E2E tests locally.
  • Verified RayJob submission succeeds when the same manifest is applied multiple times.
  • Confirmed all History Server endpoints return 200 OK and non-empty JSON responses after the cluster is deleted.
  • Soft validation of the data returned by those endpoints. (Currently none of the endpoints returns exactly the same JSON as the live server, so those data validation assertions are set to false.)
  • Test results are present at https://pub.microbin.eu/p/eel-swan-gecko

Rollback plan

  • Revert this PR to restore the previous RayJob naming behavior and remove the new E2E coverage.

Related issue number

Fixes #4378

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@sb-hakunamatata
Author

@Future-Outlier please review.

@Future-Outlier Future-Outlier self-assigned this Feb 2, 2026
Copilot AI mentioned this pull request Feb 5, 2026
Member

@Future-Outlier Future-Outlier left a comment


Hi, thanks for the PR.

  1. can you do something like this PR? #4461
  2. don't add a new file, instead, add endpoints to historyserver/test/e2e/historyserver_test.go
  3. the goal of your PR is to add task endpoint, so plz don't include actor
  4. please change the title.

Contributor

@JiangJiaWei1103 JiangJiaWei1103 left a comment


I agree with @Future-Outlier. Please follow the pattern used in #4461 and keep this PR focused on task-related endpoints, without including actors or nodes. Thanks.

tests := []struct {
	name           string
	testFunc       func(Test, *WithT, *corev1.Namespace, *s3.S3, bool)
	dataValidation bool
Contributor


Could you clarify the rationale for introducing dataValidation while disabling it across all tests?

Author

@sb-hakunamatata sb-hakunamatata Feb 8, 2026


The current set of tests fails with dataValidation enabled, for all endpoints. The rationale is to have soft tests now; later, once we fix the data endpoints, we can simply enable the data validation so the tests verify them.

As I understood it, the idea of this task was to validate the endpoints by capturing the data from the live cluster, destroying it, and then checking that the History Server can replicate the responses, so I added those data validations.
However, upon test failures I learned that certain tasks are missing from them.
To allow iterative development I kept the validation behind a flag in the test; currently it just outputs the difference, e.g. what is missing from or additional in the endpoints. This info is visible in the test logs.

Once we fix those task endpoints and their data, we can enable the flag to verify them.
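
The flag-gated soft/strict behavior described above can be sketched in plain Go. This is illustrative only: `validateJSON` is a hypothetical stand-in for the PR's `compareJsons` helper, using stdlib structural comparison instead of Gomega:

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// validateJSON compares two JSON documents structurally. With strict
// set to false it only logs a mismatch (the "soft" mode above); with
// strict set to true a mismatch becomes a hard error. Hypothetical
// sketch, not the PR's actual compareJsons helper.
func validateJSON(want, got string, strict bool) error {
	var w, g interface{}
	if err := json.Unmarshal([]byte(want), &w); err != nil {
		return fmt.Errorf("bad expected JSON: %w", err)
	}
	if err := json.Unmarshal([]byte(got), &g); err != nil {
		return fmt.Errorf("bad actual JSON: %w", err)
	}
	if !reflect.DeepEqual(w, g) {
		if strict {
			return fmt.Errorf("JSON mismatch:\nwant: %s\ngot:  %s", want, got)
		}
		// Soft mode: surface the difference in the logs without failing.
		fmt.Printf("soft check: mismatch\nwant: %s\ngot:  %s\n", want, got)
	}
	return nil
}

func main() {
	// Same content with different key order passes even in strict mode.
	fmt.Println(validateJSON(`{"a":1,"b":2}`, `{"b":2,"a":1}`, true))
	// A real difference is only logged when strict is false.
	fmt.Println(validateJSON(`{"a":1}`, `{"a":2}`, false))
}
```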

@sb-hakunamatata sb-hakunamatata changed the title [History Server Tests] end-to-end Tests for Dead Cluster Endpoints [History Server Tests] end-to-end Tests for Dead Cluster Tasks Endpoints Feb 8, 2026
@sb-hakunamatata
Author

Hi, thanks for the PR.

  1. can you do something like this PR? [Test][HistoryServer] E2E test for dead cluster actor endpoint #4461
  2. don't add a new file, instead, add endpoints to historyserver/test/e2e/historyserver_test.go
  3. the goal of your PR is to add task endpoint, so plz don't include actor
  4. please change the title.

@Future-Outlier done, please check now: I moved the tests to the same file, removed the actor endpoints, and changed the title.

Comment on lines 420 to 423
val := compareJsons(test, gg, string(requiredBody), string(body))
if dataValidation {
	gg.Expect(val).To(BeTrue())
}
Collaborator


Could it be?

Suggested change
- val := compareJsons(test, gg, string(requiredBody), string(body))
- if dataValidation {
- 	gg.Expect(val).To(BeTrue())
- }
+ if dataValidation {
+ 	gg.Expect(body).To(MatchJSON(requiredBody))
+ }

Author


I wrote compareJsons to allow logging the diff between the JSONs when they mismatch. Can this be made a soft check?

Collaborator


Here is the output of MatchJSON, does that satisfy your need?

        Expected
            <string>: {
              "a": "aa"
            }
        to match JSON of
            <string>: {
              "b": "bb",
              "a": "aa"
            }

However, it is not recommended to use cmp.Diff to compare two JSONs. It is like a string comparison, which means the key ordering must be the same. With keys in a different order, it would output the following:

 string(
- 	`{"a": "aa", "b": "bb"}`,
+ 	`{"b": "bb", "a": "aa"}`,
)

If you still want a JSON diff, I guess we might need to introduce a library that ships JSON diffing.

Author


For now let me just gate it behind the flag: if dataValidation is set, use MatchJSON to validate; otherwise skip that check and do a basic sanity check of the available fields.

gg.Expect(err).NotTo(HaveOccurred())
gg.Expect(len(body)).To(BeNumerically(">", 0))

compareJsons(test, gg, jobResponses[job1], string(body))
Collaborator


Could it use MatchJSON?

gg.Expect(err).NotTo(HaveOccurred())
gg.Expect(len(body)).To(BeNumerically(">", 0))

_ = compareJsons(test, gg, jobResponses[job2], string(body))
Collaborator


ditto

Comment on lines 366 to 367
ApplyRayJobAndWaitForCompletion(test, g, namespace, rayCluster)
ApplyRayJobAndWaitForCompletion(test, g, namespace, rayCluster)
Collaborator


Why do we need two RayJobs? The following code seems to work with one RayJob.

Author


I added this to validate that the tasks actually got merged for the /tasks endpoint, ensuring live and history give the same responses when multiple jobs are present. But yes, it would work with just one RayJob.

Comment on lines 482 to 484
if len(jobs) != 2 {
	return
}
Collaborator


This looks like redundant code. The assertion above already guarantees the length is equal to 2. Why check again?

Author


Yeah, I think during iteration I added this first and then the assertion above. Fixing it.

@sb-hakunamatata
Author

Somehow during the rebase the other commits got pulled in; fixing it.

@sb-hakunamatata sb-hakunamatata force-pushed the fix-history-server-dead-url-tests-4378 branch from f5d7f28 to 8f8820f on February 14, 2026 12:13

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

@sb-hakunamatata
Author

Closing this PR since the work was already done in #4436.


Development

Successfully merging this pull request may close these issues.

[Feature][history server] E2E test for dead cluster task endpoint
