[History Server Tests] end-to-end Tests for Dead Cluster Tasks Endpoints#4470
sb-hakunamatata wants to merge 2 commits into ray-project:master
Conversation
@Future-Outlier please review.
Hi, thanks for the PR.
- Can you do something like this PR? #4461
- Don't add a new file; instead, add the endpoints to historyserver/test/e2e/historyserver_test.go
- The goal of your PR is to add task endpoints, so please don't include actors. Please change the title.
JiangJiaWei1103 left a comment
I agree with @Future-Outlier. Please follow the pattern used in #4461 and keep this PR focused on task-related endpoints, without including actors or nodes. Thanks.
```go
tests := []struct {
	name           string
	testFunc       func(Test, *WithT, *corev1.Namespace, *s3.S3, bool)
	dataValidation bool
```
Could you clarify the rationale for introducing dataValidation while disabling it across all tests?
The current set of tests fails with dataValidation enabled for all endpoints. The rationale is to keep the tests soft for now; once we fix the data endpoints, we can simply enable data validation and have the tests verify them.
The idea of this task, as I understood it, was to validate the endpoints by capturing the data from a live cluster, destroying the cluster, and then checking that the History Server can replicate the responses, so I added those data validations.
However, the test failures showed that certain tasks are missing from the responses.
To allow iterative development, I kept the validation behind a flag in the test. Currently it just outputs the difference, e.g. what is missing from or additional in the endpoints; this info is visible in the test logs.
Once we fix those endpoints around the tasks and data, we can enable the flag to verify.
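The flag-gated "soft test" pattern described above can be sketched as follows. compareAndMaybeAssert and the inline string comparison are hypothetical stand-ins for the PR's actual helpers, shown only to illustrate the log-always, fail-only-when-strict behavior:

```go
package main

import "fmt"

// compareAndMaybeAssert always logs a mismatch between expected and
// actual, but only reports failure when dataValidation is enabled.
// The string equality here is a stand-in for a real JSON comparison.
func compareAndMaybeAssert(expected, actual string, dataValidation bool) bool {
	match := expected == actual
	if !match {
		fmt.Printf("response mismatch:\n  expected: %s\n  actual:   %s\n", expected, actual)
	}
	if dataValidation {
		return match // strict mode: caller asserts on this result
	}
	return true // soft mode: log only, never fail the test
}

func main() {
	fmt.Println(compareAndMaybeAssert(`{"a":1}`, `{"a":2}`, false))
	fmt.Println(compareAndMaybeAssert(`{"a":1}`, `{"a":2}`, true))
}
```

Flipping the flag later turns the logged differences into hard assertions without touching the test bodies.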
@Future-Outlier Done, please check. I've moved the tests to the same file, removed the actor endpoints, and changed the title.
```go
val := compareJsons(test, gg, string(requiredBody), string(body))
if dataValidation {
	gg.Expect(val).To(BeTrue())
}
```
Could it be this instead?

```diff
-val := compareJsons(test, gg, string(requiredBody), string(body))
-if dataValidation {
-	gg.Expect(val).To(BeTrue())
-}
+if dataValidation {
+	gg.Expect(body).To(MatchJSON(requiredBody))
+}
```
I wrote compareJsons to log the diff between the JSONs when they mismatch. Can MatchJSON be made a soft check?
Here is the output of MatchJSON; does that satisfy your need?
```
Expected
    <string>: {
        "a": "aa"
    }
to match JSON of
    <string>: {
        "b": "bb",
        "a": "aa"
    }
```
However, it is not recommended to use cmp.Diff to compare two JSON documents. It works like a string comparison, which means the key ordering must match. With the keys in a different order, it would output the following:
```
  string(
-	`{"a": "aa", "b": "bb"}`,
+	`{"b": "bb", "a": "aa"}`,
  )
```
If you still want a diff of the JSON, I guess we would need to introduce a library that supports JSON diffing.
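The order-sensitivity problem above can be avoided without any extra dependency by comparing the JSON semantically rather than as strings. A minimal sketch using only the standard library (jsonEqual is a hypothetical helper, not something from this PR): unmarshalling both documents into generic values before comparing makes map key order irrelevant.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// jsonEqual compares two JSON documents semantically: both are
// unmarshalled into interface{} values (maps, slices, primitives),
// so the textual key order no longer matters.
func jsonEqual(a, b string) (bool, error) {
	var va, vb interface{}
	if err := json.Unmarshal([]byte(a), &va); err != nil {
		return false, err
	}
	if err := json.Unmarshal([]byte(b), &vb); err != nil {
		return false, err
	}
	return reflect.DeepEqual(va, vb), nil
}

func main() {
	// Same keys in a different order: a raw string diff reports a
	// mismatch, but a semantic comparison treats these as equal.
	eq, _ := jsonEqual(`{"a": "aa", "b": "bb"}`, `{"b": "bb", "a": "aa"}`)
	fmt.Println(eq) // true
}
```

This gives an equality check but not a readable diff; for a field-by-field diff of the mismatched keys, a dedicated JSON-diff library would still be needed, as noted above.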
For now let me just put it behind the flag: if dataValidation is set, use MatchJSON to validate; otherwise skip that check and do a basic sanity check of the available fields.
```go
gg.Expect(err).NotTo(HaveOccurred())
gg.Expect(len(body)).To(BeNumerically(">", 0))

compareJsons(test, gg, jobResponses[job1], string(body))
```
```go
gg.Expect(err).NotTo(HaveOccurred())
gg.Expect(len(body)).To(BeNumerically(">", 0))

_ = compareJsons(test, gg, jobResponses[job2], string(body))
```
```go
ApplyRayJobAndWaitForCompletion(test, g, namespace, rayCluster)
ApplyRayJobAndWaitForCompletion(test, g, namespace, rayCluster)
```
Why do we need two RayJobs? The following code seems to work with one RayJob.
I added this to validate that the tasks actually get merged into the single /tasks endpoint, ensuring live and history give the same responses when multiple jobs are present. But yes, it would work with just one RayJob.
```go
if len(jobs) != 2 {
	return
}
```
This looks like redundant code. The assertion above already guarantees the length equals 2. Why check again?
Yeah, I think during iteration I added this first and then the assertion above. Fixing it.
Somehow during the rebase the other commits got pulled in; fixing it.
Force-pushed from f5d7f28 to 8f8820f
Closing this PR since the work was already done in #4436.
Why are these changes needed?
This PR fixes RayJob submission failures in History Server E2E tests and adds coverage for History Server behavior when the Ray cluster is deleted (dead cluster scenario).
What changed
Added a new dead-cluster E2E test suite under `tests/e2e` to verify History Server endpoints return HTTP 200 and valid JSON after the live Ray cluster is deleted:

- `/api/v0/tasks`
- `/api/v0/tasks?filter_keys=job_id...`
- `/api/v0/tasks/summarize`
- `/logical/actors`
- `/logical/actors/{actor_id}`
- `/nodes?view=summary`

Updated the test helper `ApplyRayJobAndWaitForCompletion` to assign a unique RayJob name on every submission by appending a UUID.

Why
The same RayJob manifest is submitted multiple times within a single E2E test.
Reusing the same RayJob name caused submission failures due to Kubernetes resource name collisions.
Making the RayJob name unique ensures each submission creates a distinct Kubernetes resource, so repeated submissions no longer collide.
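The unique-naming idea can be sketched as below. This uses a random hex suffix from the standard library rather than a UUID dependency, and uniqueRayJobName with its base name are illustrative, not the helper's actual implementation:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// uniqueRayJobName appends a random suffix to the manifest's base name
// so that repeated submissions of the same RayJob manifest never
// collide on the Kubernetes resource name.
func uniqueRayJobName(base string) string {
	b := make([]byte, 4)
	_, _ = rand.Read(b) // crypto/rand: suffix is effectively unique
	return fmt.Sprintf("%s-%s", base, hex.EncodeToString(b))
}

func main() {
	// Two submissions of the same manifest get distinct names.
	fmt.Println(uniqueRayJobName("rayjob-e2e"))
	fmt.Println(uniqueRayJobName("rayjob-e2e"))
}
```

Any scheme works as long as the suffix is unique per submission and the result stays within Kubernetes name length and character limits.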
Testing
Rollback plan
Related issue number
Fixes #4378
Checks