-
Notifications
You must be signed in to change notification settings - Fork 4.4k
Description
Description of the bug:
We are observing random build breakages when our deployment of Build Barn is suffering from unstability.
To further diagnose this, I've been trying to inject manual failures into our Build Barn (by forcibly terminating individual server processes while a build is ongoing) to see how Bazel reacts. In the vast majority of the cases, Bazel correctly detects the failure and falls back to local execution (provided --remote_local_fallback is enabled) / retries the failed action.
However, there seems to be one case from which Bazel cannot recover from. If I terminate a remote Build Barn worker while a long-running action is being executed on it, Bazel will immediately fail the build with:
ERROR: .../test/BUILD:1:8: Executing genrule //test:foo failed: (Exit -1): bash failed: error executing command (from target //test:foo) /bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh; sleep 120; touch bazel-out/aarch64-fastbuild/bin/test/foo.txt'
Action details (uncached result): blobs/sha256/historical_execute_response/50faea922bc3495d9e18c3928b471949651b5adc66243a8e3f2790c02d72c434-814/
Neither remote retries nor the local fallback kick in, failing the build for what should be a retryable error.
I'm not sure if this is something that ought to be fixed at the Bazel level or at the Build Barn level. I suspect this is a Build Barn issue based on what I'll explain below, but I'm not completely sure if that's the case or if Bazel could do better -- hence why I'm starting here. cc @EdSchouten
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
- Create a genrule that sleeps for 2 minutes (
sleep 120 ; touch $@). - Configure Bazel to use Build Barn and set
--remote_local_fallback. - Build the genrule.
- Kill the worker process while the genrule is executing.
Observe Bazel immediately abort the build as described above.
Which operating system are you running Bazel on?
N/A
What is the output of bazel info release?
bazel-6.1.1
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
One detail that caught my attention in the above error message is Exit -1. Process exit codes are 8 bits and should not be negative. And if we try to run a genrule that does exit -1, Bazel will (correctly) claim (Exit 255) in the failure.
But Bazel (and the RBE protocol) use 32-bit signed integers to propagate exit codes. So I suspect the -1 is coming from Build Barn itself or from the Bazel remote execution code – and if that’s true, we could potentially use this to discern between real action failures and infrastructure failures, and implement retries or local execution fallback.
Bazel’s remote execution module seems to propagate the ExitCode of the remote worker verbatim in the SpawnResult instances, so I do not think Bazel is generating this fictitious -1. It has to come from Build Barn.
The only place where I see this -1 originating from is in the pkg/runner/local_runner.go file in bb-remote-execution where the result of cmd.ProcessState.ExitCode() is propagated as the exit code of the action result. Go has the following to say about this function:
ExitCode returns the exit code of the exited process, or -1 if the process hasn't exited or was terminated by a signal.
So, there is the -1 right there.
However… this is in the “local runner” code, which I understand runs within the worker. Forcibly terminating the worker process wouldn’t give a chance to the worker code to process and return the -1, unless the worker caught a graceful termination signal and tried to do something with it. But the bb-remote-execution code does not seem to catch any signal… so this is very strange.
This all means that we cannot use -1 as a discriminating factor for worker deaths. A genrule that does kill -9 $$ is also reported as Exit -1 from the Build Farm due to the above explanation from Go. (But, interestingly, running this same genrule locally reports Killed instead of Exit -1 because, well, the code in Bazel is doing the right thing regarding exit codes—unlike Go’s simplistic view of the world.)