-
Notifications
You must be signed in to change notification settings - Fork 44
WIP: feat(deploy): Add liveliness check, Manual/Automated Rollbacks #112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This introduces a liveliness check that can be run after a deployment to ensure the new generation is healthy. If the check fails, the deployment is automatically rolled back to the previous generation. The liveliness check is configured via the `liveliness_check_command` option. In addition, this commit also includes: - Updates to Go dependencies. - A version bump to 0.8.1. - The addition of `.envrc` and `.gitignore` files.
This commit adds documentation for the new `livelinessCheckCommand` option in `docs/generated-module-options.md` and `docs/howtos.md`.
ce04038 to
9699f75
Compare
This commit fixes a bug where comin would fail to build derivations when the path returned by `nix derivation show` was relative. The `showDerivation` function now normalizes the derivation and output paths to ensure they are absolute.
This commit fixes a bug where comin would fail to build derivations when the path returned by `nix derivation show` was relative. The executor now determines the Nix store directory at startup and uses it to normalize the derivation and output paths to ensure they are absolute.
This commit refactors the Nix executor and related tests to improve robustness and remove dependencies on the local Nix environment during testing. The following changes are included: - The `NewNixExecutor` function in `internal/executor/nix.go` is updated to accept an optional `storeDir` argument. This allows tests to provide a mock store path, avoiding the need to call `GetNixStoreDir` and interact with the actual Nix environment. - The `init()` function in `internal/executor/utils.go` is removed. This function was causing test failures by calling `nix` before the tests could mock the environment. - The `showDerivation` function in `internal/executor/utils.go` is updated to accept the `storeDir` as an argument, ensuring that it uses the correct store path. - All tests in `internal/executor/executor_test.go` and `internal/manager/manager_test.go` are updated to use the `NewNixExecutor` function with a mock store path. This ensures that the tests are hermetic and do not rely on the presence of a Nix environment.
This commit introduces a liveliness check after a deployment. If the liveliness check fails, the deployment is rolled back to the previous version. The following changes are included: - The `deployer.Run` function is updated to perform a liveliness check after a deployment. If the check fails, the deployment is marked as failed, and a rollback is performed. - A new test file `internal/deployer/liveliness_check_test.go` is added to test the new liveliness check and rollback logic. - The `store.DeploymentFinished` function is updated to correctly handle the rollback and update the deployment status.
The `vendorHash` in `nix/package.nix` was updated to reflect the addition of a new dependency. The `vendor/` directory was also added to `.gitignore` to avoid committing it to the repository.
On Linux, run 'systemctl daemon-reload' before exiting to ensure that systemd picks up any changes to the service unit file.
The liveliness check was logging the command output instead of the error when the command failed. This made it difficult to debug failures. This commit fixes the logging to log the error, which will provide more useful information when the liveliness check fails.
The previous commit introduced a 'declared and not used' error for the 'output' variable in the liveliness check. This commit fixes the error by replacing the 'output' variable with a blank identifier.
This commit introduces a new option 'random_delay' to the poller configuration. This allows to specify a maximum random delay in seconds before fetching the remote, which helps to avoid the 'thundering herd' problem when managing a large fleet of devices. The documentation and NixOS module options have been updated accordingly.
|
I will be adding more tho this PR. There were some bug fixes and small QOL features also to be added. . |
|
Hello, Being able to rollback is really useful and this is in definitely in the wish list. First, i would like to allow a user to manually rollback via the CLI. Something such as a After a rollback. we need to define what is the behavior on next deployments, especially if several deployments fail. I think we should preserve the list of deployments in order to provide the full deployment history to the user (currently available with Once we have a clean
In conclusion, i think we should start by implementing a manual rollback. The main advantage is to force us to expose the feature and the behavior to users. |
|
All great points, I have some thoughts on how to architect this (though you covered most of them already). For now I have converted this to a draft. On the current branch I am working on, I am kinda implementing some of those checks. I also have a few bugs that are getting fixed:
I will try and look through what you have so far with the CLI options, because that was a question I was going to ask regarding automatic generation tracking. I figured it would be: the rollback command could allow a user to pick from a list of known good generations, and the automated version would just stick with the last known valid deployment: i.e. successfully built, switched and passed liveliness checks. Otherwise it will sit on the last valid until a new commit comes along that is different than previous tried/failed generations. Lastly, adding another option for randomDelayed starts to the poller to prevent thundering herd, which could be a problem with a fleet of devices that all use the same config, therefore the same polling interval. Don't necessarily want a ton of devices rebuilding all at once. |
This reverts commit f76a8da.
This introduces a liveliness check that can be run after a deployment to ensure the new generation is healthy. If the check fails, the deployment is automatically rolled back to the previous generation. The liveliness check is configured via the `liveliness_check_command` option. In addition, this commit also includes: - Updates to Go dependencies. - A version bump to 0.8.1. - The addition of `.envrc` and `.gitignore` files.
This commit adds documentation for the new `livelinessCheckCommand` option in `docs/generated-module-options.md` and `docs/howtos.md`.
|
@nlewo Been cooking up some things. You were right on the large feature aspect, however I believe I have a working version now. Done some manual tests, but might be worth looking into for a next release. As of now, if there are failed deployments/health-checks, they get tagged in the list. For automated rollbacks it will grab and rollback to the last deployment marked successful. |
In response to #59 an Idea. I will try and set this up and test sometime this weekend.
This would essentially allow one to add a shell command, binary path, script, nix wrapped package etc. That way if there are more complicated deployments where gaining physical access to a device might not be feasible, the end user can specify how complex or simply their liveliness checks can be.
This introduces a liveliness check that can be run after a deployment to ensure the new generation is healthy. If the check fails, the deployment is automatically rolled back to the previous generation.
The liveliness check is configured via the
liveliness_check_commandoption.In addition, this commit also includes:
.envrcand.gitignorefiles.