WIP: feat(deploy): Add liveliness check, Manual/Automated Rollbacks #112

ProjectInitiative · 2025-10-25T05:00:57Z

In response to #59 an Idea. I will try and set this up and test sometime this weekend.

This would essentially allow one to add a shell command, binary path, script, nix wrapped package etc. That way if there are more complicated deployments where gaining physical access to a device might not be feasible, the end user can specify how complex or simply their liveliness checks can be.

This introduces a liveliness check that can be run after a deployment to ensure the new generation is healthy. If the check fails, the deployment is automatically rolled back to the previous generation.

The liveliness check is configured via the liveliness_check_command option.

In addition, this commit also includes:

Updates to Go dependencies.
A version bump to 0.9.0.
The addition of .envrc and .gitignore files.

Fixes nlewo#38

This introduces a liveliness check that can be run after a deployment to ensure the new generation is healthy. If the check fails, the deployment is automatically rolled back to the previous generation. The liveliness check is configured via the `liveliness_check_command` option. In addition, this commit also includes: - Updates to Go dependencies. - A version bump to 0.8.1. - The addition of `.envrc` and `.gitignore` files.

This commit adds documentation for the new `livelinessCheckCommand` option in `docs/generated-module-options.md` and `docs/howtos.md`.

This commit fixes a bug where comin would fail to build derivations when the path returned by `nix derivation show` was relative. The `showDerivation` function now normalizes the derivation and output paths to ensure they are absolute.

This commit fixes a bug where comin would fail to build derivations when the path returned by `nix derivation show` was relative. The executor now determines the Nix store directory at startup and uses it to normalize the derivation and output paths to ensure they are absolute.

This commit refactors the Nix executor and related tests to improve robustness and remove dependencies on the local Nix environment during testing. The following changes are included: - The `NewNixExecutor` function in `internal/executor/nix.go` is updated to accept an optional `storeDir` argument. This allows tests to provide a mock store path, avoiding the need to call `GetNixStoreDir` and interact with the actual Nix environment. - The `init()` function in `internal/executor/utils.go` is removed. This function was causing test failures by calling `nix` before the tests could mock the environment. - The `showDerivation` function in `internal/executor/utils.go` is updated to accept the `storeDir` as an argument, ensuring that it uses the correct store path. - All tests in `internal/executor/executor_test.go` and `internal/manager/manager_test.go` are updated to use the `NewNixExecutor` function with a mock store path. This ensures that the tests are hermetic and do not rely on the presence of a Nix environment.

This commit introduces a liveliness check after a deployment. If the liveliness check fails, the deployment is rolled back to the previous version. The following changes are included: - The `deployer.Run` function is updated to perform a liveliness check after a deployment. If the check fails, the deployment is marked as failed, and a rollback is performed. - A new test file `internal/deployer/liveliness_check_test.go` is added to test the new liveliness check and rollback logic. - The `store.DeploymentFinished` function is updated to correctly handle the rollback and update the deployment status.

The `vendorHash` in `nix/package.nix` was updated to reflect the addition of a new dependency. The `vendor/` directory was also added to `.gitignore` to avoid committing it to the repository.

On Linux, run 'systemctl daemon-reload' before exiting to ensure that systemd picks up any changes to the service unit file.

The liveliness check was logging the command output instead of the error when the command failed. This made it difficult to debug failures. This commit fixes the logging to log the error, which will provide more useful information when the liveliness check fails.

The previous commit introduced a 'declared and not used' error for the 'output' variable in the liveliness check. This commit fixes the error by replacing the 'output' variable with a blank identifier.

This commit introduces a new option 'random_delay' to the poller configuration. This allows to specify a maximum random delay in seconds before fetching the remote, which helps to avoid the 'thundering herd' problem when managing a large fleet of devices. The documentation and NixOS module options have been updated accordingly.

ProjectInitiative · 2025-10-27T20:01:29Z

I will be adding more tho this PR. There were some bug fixes and small QOL features also to be added. .

nlewo · 2025-10-27T21:51:56Z

Hello,
Thank you for your interest in comin 🎉

Being able to rollback is really useful and this is in definitely in the wish list.
However, this is not a small feature and I think this should be decomposed into, at least, two steps.

First, i would like to allow a user to manually rollback via the CLI. Something such as a comin rollback command. This should then be exposed via the comin status command to let the user understand the current comin state.
I'm trying to expose the comin internal state to the user, otherwise, it could become really hard to understand what comin is doing.

After a rollback. we need to define what is the behavior on next deployments, especially if several deployments fail. I think we should preserve the list of deployments in order to provide the full deployment history to the user (currently available with comin deployment list). In this situation, we don't want to switch to the previous deployment, but to the last successful deployment. Note the current implementation of the store doesn't allow to keep the last deployment, only the n last deployments. If n deployment fails, the last successful deployment is evicted.

Once we have a clean comin rollback command, we could then add the automatic rollback feature. There are several ways to implement it.

comin deploys and rollback if liveness probe failed (this is your implementation)
comin deploys, waits for an external confirmation (kind of webhook), rollback if it doesn't get notified
comin runs switch-to-configuration test, then runs the liveness probe. if it succeeds, it runs switch-to-configuration boot. This would allow a user to reboot the machine to rollback is comin/systemd/... is broken by the upgrade.

In conclusion, i think we should start by implementing a manual rollback. The main advantage is to force us to expose the feature and the behavior to users.

ProjectInitiative · 2025-10-28T16:20:05Z

All great points, I have some thoughts on how to architect this (though you covered most of them already). For now I have converted this to a draft. On the current branch I am working on, I am kinda implementing some of those checks. I also have a few bugs that are getting fixed:

There was an issue with systemd restarts of the service when config changes happened, there needs a systemctl daemon-reload before the service auto restarts. Just killing the app and having it restart didn't seem to pick up comin specific config changes in the nixos module.

I will try and look through what you have so far with the CLI options, because that was a question I was going to ask regarding automatic generation tracking. I figured it would be: the rollback command could allow a user to pick from a list of known good generations, and the automated version would just stick with the last known valid deployment: i.e. successfully built, switched and passed liveliness checks. Otherwise it will sit on the last valid until a new commit comes along that is different than previous tried/failed generations.

Lastly, adding another option for randomDelayed starts to the poller to prevent thundering herd, which could be a problem with a fleet of devices that all use the same config, therefore the same polling interval. Don't necessarily want a ton of devices rebuilding all at once.

This reverts commit f76a8da.

This reverts commit 45823bb, reversing changes made to 5d0f3c2.

This introduces a liveliness check that can be run after a deployment to ensure the new generation is healthy. If the check fails, the deployment is automatically rolled back to the previous generation. The liveliness check is configured via the `liveliness_check_command` option. In addition, this commit also includes: - Updates to Go dependencies. - A version bump to 0.8.1. - The addition of `.envrc` and `.gitignore` files.

This commit adds documentation for the new `livelinessCheckCommand` option in `docs/generated-module-options.md` and `docs/howtos.md`.

ProjectInitiative · 2025-11-12T04:27:03Z

@nlewo Been cooking up some things. You were right on the large feature aspect, however I believe I have a working version now. Done some manual tests, but might be worth looking into for a next release.

As of now, if there are failed deployments/health-checks, they get tagged in the list. For automated rollbacks it will grab and rollback to the last deployment marked successful.
For the manual command, no args provided will rollback to last successful tag, or a user can provide the generation-uuid, commit-id, or deployment-uuid to rollback to.

nlewo and others added 2 commits May 5, 2024 10:28

Expose Auth.Username for other forges

5efc0fd

Fixes nlewo#38

ProjectInitiative changed the title ~~wip: UNTESTED feat(deploy): Add liveliness check~~ WIP: UNTESTED feat(deploy): Add liveliness check Oct 25, 2025

docs: Add documentation for livelinessCheckCommand

9699f75

This commit adds documentation for the new `livelinessCheckCommand` option in `docs/generated-module-options.md` and `docs/howtos.md`.

ProjectInitiative force-pushed the main branch from ce04038 to 9699f75 Compare October 26, 2025 01:35

ProjectInitiative added 11 commits October 25, 2025 21:07

fix(executor): Normalize store paths

cac3b82

This commit fixes a bug where comin would fail to build derivations when the path returned by `nix derivation show` was relative. The `showDerivation` function now normalizes the derivation and output paths to ensure they are absolute.

removing sh dependency

dddb0c3

adding shell words to parse more dynamic input

deba8ca

fix: Update vendorHash for new dependency

f371334

The `vendorHash` in `nix/package.nix` was updated to reflect the addition of a new dependency. The `vendor/` directory was also added to `.gitignore` to avoid committing it to the repository.

fix(restart): run daemon-reload before restarting

d88323f

On Linux, run 'systemctl daemon-reload' before exiting to ensure that systemd picks up any changes to the service unit file.

fix(deployer): fix 'declared and not used' error

bbcb71a

The previous commit introduced a 'declared and not used' error for the 'output' variable in the liveliness check. This commit fixes the error by replacing the 'output' variable with a blank identifier.

ProjectInitiative marked this pull request as draft October 28, 2025 15:14

ProjectInitiative added 10 commits November 5, 2025 19:43

feat: add manual rollback command

0d3b3e8

Merge branch 'nlewo:main' into main

0f8a649

Merge branch 'main' into wip/liveliness-and-bugs

9dd92aa

Merge branch 'main' into wip/manual-rollback

79011b8

docs: add documentation for the rollback command

8a9f646

moving command to cmd file

5d0f3c2

Merge branch 'pr-42' into wip/manual-rollback

45823bb

adding logging

f76a8da

Revert "adding logging"

ff647d6

This reverts commit f76a8da.

Revert "Merge branch 'pr-42' into wip/manual-rollback"

89c89b6

This reverts commit 45823bb, reversing changes made to 5d0f3c2.

ProjectInitiative added 7 commits November 8, 2025 21:01

preventing rollbacks to a failed deployment

d25af0e

only rolling back to last successful deployment

966f734

changing how comin handles restarts

746785d

feat(rollback): Improve rollback command flexibility

9f89a42

docs: Add documentation for livelinessCheckCommand

bddbee0

This commit adds documentation for the new `livelinessCheckCommand` option in `docs/generated-module-options.md` and `docs/howtos.md`.

Merge branch 'wip/manual-rollback'

ecc34a5

ProjectInitiative changed the title ~~WIP: UNTESTED feat(deploy): Add liveliness check~~ WIP: feat(deploy): Add liveliness check, Manual/Automated Rollbacks Nov 12, 2025

removing old comment

9fd8da4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: feat(deploy): Add liveliness check, Manual/Automated Rollbacks #112

WIP: feat(deploy): Add liveliness check, Manual/Automated Rollbacks #112

Uh oh!

ProjectInitiative commented Oct 25, 2025 •

edited

Loading

Uh oh!

ProjectInitiative commented Oct 27, 2025

Uh oh!

nlewo commented Oct 27, 2025

Uh oh!

ProjectInitiative commented Oct 28, 2025 •

edited

Loading

Uh oh!

ProjectInitiative commented Nov 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

WIP: feat(deploy): Add liveliness check, Manual/Automated Rollbacks #112

Are you sure you want to change the base?

WIP: feat(deploy): Add liveliness check, Manual/Automated Rollbacks #112

Uh oh!

Conversation

ProjectInitiative commented Oct 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ProjectInitiative commented Oct 27, 2025

Uh oh!

nlewo commented Oct 27, 2025

Uh oh!

ProjectInitiative commented Oct 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ProjectInitiative commented Nov 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ProjectInitiative commented Oct 25, 2025 •

edited

Loading

ProjectInitiative commented Oct 28, 2025 •

edited

Loading