Fix: Correctly handle malformed dynamic workflows to avoid 'failed + succeeded + running' Schroedinger state #6854
+205
−118
Tracking issue
Fixes #4466
This is truly my arch nemesis in terms of Flyte issues. I reported this problem more than 2 years ago and it was discussed multiple times in the contributors' sync. Our platform users have run into this problem dozens of times, always leaving them profoundly confused and frustrated.
Why are the changes needed?
When a dynamic workflow is malformed in a way that is not detected on the flytekit level but only when compiling the workflow in flytepropeller, such as ...
... the user is presented with this confusing view in flyteconsole, showing a succeeded attempt, failed node, and running workflow:
Only ~1.5h later, the workflow fails with
RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: 0: [User] malformed dynamic workflow ...
What changes were proposed in this pull request?
The discussions in the issue and in the contributors' sync always revolved around changing flyteconsole to show the underlying error to the user in the UI, see e.g. here.
I realized that this is actually not a UI issue but rather a bug in the dynamic handler logic:
- The Abort and Finalize methods are called, but both try to build the dynamic workflow another time, which again fails.
- Abort and Finalize return an error.
- Eventually the workflow fails with: max number of system retry attempts [51/50] exhausted ... malformed dynamic workflow ...
This PR modifies Abort and Finalize to not return an error when the dynamic workflow is malformed. This way, the workflow can be aborted and finalized immediately and the user is presented with the underlying user error right away.
How was this patch tested?
Ran propeller with the proposed changes and the example above. The workflow immediately fails and shows the underlying error to the user:
Added unit tests for the fixed behaviour.
Check all the applicable boxes