-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Update Nexus error model to use Temporal failures #8799
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
bergundy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I will have time to fully review this before I go on PTO but the overall direction looks great.
When you plan on marking this ready for review, I would suggest adding the following tests:
- Old server caller is compatible with a new server handler for start, cancel and callback requests
- Encoded attributes returned from the SDK are passed through
bergundy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took about a half hour reviewing this and I still don't think I caught everything. Before we can merge this we need:
- Definitions and tests for what happens at each boundary for new and old components on either side.
- End to end tests that cover the behavior with old SDKs.
- End to end tests that cover the behavior with a new SDK after we have an implementation for the new paths.
| } | ||
| apiFailure.FailureInfo = &failurepb.Failure_ApplicationFailureInfo{ | ||
| ApplicationFailureInfo: &failurepb.ApplicationFailureInfo{ | ||
| // Make up a type here, it's not part of the Nexus Failure spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have NexusSDKFailureErrorFailureInfo in the API PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NexusSDKFailureErrorFailureInfo doesn't have a Details field. We only get to this point if we get an unexpected error type, so I thought it was better to capture the full information.
common/nexus/failure.go
Outdated
| } | ||
|
|
||
| func OperationErrorToTemporalFailure(opErr *nexus.OperationError) (*failurepb.Failure, error) { | ||
| func OperationErrorToTemporalFailure(opErr *nexus.OperationError, retryable bool) (*failurepb.Failure, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't need the retryable flag here, operation errors are non-retryable by definition and the resulting failure object should be a NexusOperationFailureInfo. This function doesn't seem necessary anymore, you should already have the original nexus failure on the operation error so all you need to do is convert to a temporal failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also thought it would be unnecessary but I had some trouble removing it as this is the only place where we have the specific handling for CanceledFailureInfo.
| } | ||
|
|
||
| func (c *HTTPClient) bestEffortHandlerErrorFromResponse(response *http.Response, body []byte) error { | ||
| func httpStatusCodeToHandlerErrorType(response *http.Response) (nexus.HandlerErrorType, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please double check that we cover all of the error types in the nexus SDK?
| } | ||
| return &nexus.HandlerError{ | ||
| Type: errorType, | ||
| Message: response.Status, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This includes the HTTP status code, we need to trim that.
| RetryBehavior: retryBehavior, | ||
| }, | ||
| }, | ||
| nf, err := nexus.DefaultFailureConverter().ErrorToFailure(handlerErr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should use the original failure here and then convert it to a temporal failure.
| require.Equal(t, enumsspb.NEXUS_OPERATION_STATE_BACKING_OFF, op.State()) | ||
| require.NotNil(t, op.LastAttemptFailure.GetNexusHandlerFailureInfo()) | ||
| require.Equal(t, "handler error (INTERNAL): internal server error", op.LastAttemptFailure.Message) | ||
| require.Equal(t, "internal server error", op.LastAttemptFailure.Message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also check the handler error type in all of these tests.
| var failureErr *nexus.FailureError | ||
| var operationErr *nexus.OperationError | ||
| switch { | ||
| case errors.As(r.Error, &failureErr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should always get an operation error here.
| switch t := response.GetOutcome().(type) { | ||
| case *matchingservice.DispatchNexusTaskResponse_Failure: | ||
| oc.metricsHandler = oc.metricsHandler.WithTags(metrics.OutcomeTag("handler_error:" + t.Failure.GetNexusHandlerFailureInfo().GetType())) | ||
| nf, err := commonnexus.APIFailureToNexusFailure(t.Failure) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You want to create a properly structured nexus failure here with metadata type set to nexus.HandlerError or nexus.OperationError and make sure the cause chain is populated correctly.
What changed?
Many changes to support the Temporal SDK sending Temporal
Failures instead of errors.Depends on temporalio/api#682 and nexus-rpc/sdk-go#69
Why?
Consistency with other APIs and more rich information.
How did you test it?
Potential risks
Still need to validate older SDK + newer server and vice versa still work correctly.