Skip to content

Conversation

@ProjectInitiative
Copy link

@ProjectInitiative ProjectInitiative commented Oct 25, 2025

In response to #59 an Idea. I will try and set this up and test sometime this weekend.

This would essentially allow one to add a shell command, binary path, script, nix wrapped package etc. That way if there are more complicated deployments where gaining physical access to a device might not be feasible, the end user can specify how complex or simply their liveliness checks can be.

This introduces a liveliness check that can be run after a deployment to ensure the new generation is healthy. If the check fails, the deployment is automatically rolled back to the previous generation.

The liveliness check is configured via the liveliness_check_command option.

In addition, this commit also includes:

  • Updates to Go dependencies.
  • A version bump to 0.9.0.
  • The addition of .envrc and .gitignore files.

nlewo and others added 2 commits May 5, 2024 10:28
This introduces a liveliness check that can be run after a deployment to
ensure the new generation is healthy. If the check fails, the deployment
is automatically rolled back to the previous generation.

The liveliness check is configured via the `liveliness_check_command`
option.

In addition, this commit also includes:

- Updates to Go dependencies.
- A version bump to 0.8.1.
- The addition of `.envrc` and `.gitignore` files.
@ProjectInitiative ProjectInitiative changed the title wip: UNTESTED feat(deploy): Add liveliness check WIP: UNTESTED feat(deploy): Add liveliness check Oct 25, 2025
This commit adds documentation for the new `livelinessCheckCommand`
option in `docs/generated-module-options.md` and `docs/howtos.md`.
This commit fixes a bug where comin would fail to build derivations when the path returned by `nix derivation show` was relative.

The `showDerivation` function now normalizes the derivation and output paths to ensure they are absolute.
This commit fixes a bug where comin would fail to build derivations when the path returned by `nix derivation show` was relative.

The executor now determines the Nix store directory at startup and uses it to normalize the derivation and output paths to ensure they are absolute.
This commit refactors the Nix executor and related tests to improve robustness and remove dependencies on the local Nix environment during testing.

The following changes are included:

- The `NewNixExecutor` function in `internal/executor/nix.go` is updated to accept an optional `storeDir` argument. This allows tests to provide a mock store path, avoiding the need to call `GetNixStoreDir` and interact with the actual Nix environment.

- The `init()` function in `internal/executor/utils.go` is removed. This function was causing test failures by calling `nix` before the tests could mock the environment.

- The `showDerivation` function in `internal/executor/utils.go` is updated to accept the `storeDir` as an argument, ensuring that it uses the correct store path.

- All tests in `internal/executor/executor_test.go` and `internal/manager/manager_test.go` are updated to use the `NewNixExecutor` function with a mock store path. This ensures that the tests are hermetic and do not rely on the presence of a Nix environment.
This commit introduces a liveliness check after a deployment. If the liveliness check fails, the deployment is rolled back to the previous version.

The following changes are included:

- The `deployer.Run` function is updated to perform a liveliness check after a deployment. If the check fails, the deployment is marked as failed, and a rollback is performed.

- A new test file `internal/deployer/liveliness_check_test.go` is added to test the new liveliness check and rollback logic.

- The `store.DeploymentFinished` function is updated to correctly handle the rollback and update the deployment status.
The `vendorHash` in `nix/package.nix` was updated to reflect the
addition of a new dependency.

The `vendor/` directory was also added to `.gitignore` to avoid
committing it to the repository.
On Linux, run 'systemctl daemon-reload' before exiting to ensure that
systemd picks up any changes to the service unit file.
The liveliness check was logging the command output instead of the
error when the command failed. This made it difficult to debug
failures.

This commit fixes the logging to log the error, which will provide
more useful information when the liveliness check fails.
The previous commit introduced a 'declared and not used' error for the
'output' variable in the liveliness check.

This commit fixes the error by replacing the 'output' variable with a
blank identifier.
This commit introduces a new option 'random_delay' to the poller
configuration. This allows to specify a maximum random delay in
seconds before fetching the remote, which helps to avoid the
'thundering herd' problem when managing a large fleet of devices.

The documentation and NixOS module options have been updated
accordingly.
@ProjectInitiative
Copy link
Author

I will be adding more tho this PR. There were some bug fixes and small QOL features also to be added. .

@nlewo
Copy link
Owner

nlewo commented Oct 27, 2025

Hello,
Thank you for your interest in comin 🎉

Being able to rollback is really useful and this is in definitely in the wish list.
However, this is not a small feature and I think this should be decomposed into, at least, two steps.

First, i would like to allow a user to manually rollback via the CLI. Something such as a comin rollback command. This should then be exposed via the comin status command to let the user understand the current comin state.
I'm trying to expose the comin internal state to the user, otherwise, it could become really hard to understand what comin is doing.

After a rollback. we need to define what is the behavior on next deployments, especially if several deployments fail. I think we should preserve the list of deployments in order to provide the full deployment history to the user (currently available with comin deployment list). In this situation, we don't want to switch to the previous deployment, but to the last successful deployment. Note the current implementation of the store doesn't allow to keep the last deployment, only the n last deployments. If n deployment fails, the last successful deployment is evicted.

Once we have a clean comin rollback command, we could then add the automatic rollback feature. There are several ways to implement it.

  • comin deploys and rollback if liveness probe failed (this is your implementation)
  • comin deploys, waits for an external confirmation (kind of webhook), rollback if it doesn't get notified
  • comin runs switch-to-configuration test, then runs the liveness probe. if it succeeds, it runs switch-to-configuration boot. This would allow a user to reboot the machine to rollback is comin/systemd/... is broken by the upgrade.

In conclusion, i think we should start by implementing a manual rollback. The main advantage is to force us to expose the feature and the behavior to users.

@ProjectInitiative ProjectInitiative marked this pull request as draft October 28, 2025 15:14
@ProjectInitiative
Copy link
Author

ProjectInitiative commented Oct 28, 2025

All great points, I have some thoughts on how to architect this (though you covered most of them already). For now I have converted this to a draft. On the current branch I am working on, I am kinda implementing some of those checks. I also have a few bugs that are getting fixed:

  • There was an issue with systemd restarts of the service when config changes happened, there needs a systemctl daemon-reload before the service auto restarts. Just killing the app and having it restart didn't seem to pick up comin specific config changes in the nixos module.

I will try and look through what you have so far with the CLI options, because that was a question I was going to ask regarding automatic generation tracking. I figured it would be: the rollback command could allow a user to pick from a list of known good generations, and the automated version would just stick with the last known valid deployment: i.e. successfully built, switched and passed liveliness checks. Otherwise it will sit on the last valid until a new commit comes along that is different than previous tried/failed generations.

Lastly, adding another option for randomDelayed starts to the poller to prevent thundering herd, which could be a problem with a fleet of devices that all use the same config, therefore the same polling interval. Don't necessarily want a ton of devices rebuilding all at once.

This introduces a liveliness check that can be run after a deployment to
ensure the new generation is healthy. If the check fails, the deployment
is automatically rolled back to the previous generation.

The liveliness check is configured via the `liveliness_check_command`
option.

In addition, this commit also includes:

- Updates to Go dependencies.
- A version bump to 0.8.1.
- The addition of `.envrc` and `.gitignore` files.
This commit adds documentation for the new `livelinessCheckCommand`
option in `docs/generated-module-options.md` and `docs/howtos.md`.
@ProjectInitiative ProjectInitiative changed the title WIP: UNTESTED feat(deploy): Add liveliness check WIP: feat(deploy): Add liveliness check, Manual/Automated Rollbacks Nov 12, 2025
@ProjectInitiative
Copy link
Author

@nlewo Been cooking up some things. You were right on the large feature aspect, however I believe I have a working version now. Done some manual tests, but might be worth looking into for a next release.

As of now, if there are failed deployments/health-checks, they get tagged in the list. For automated rollbacks it will grab and rollback to the last deployment marked successful.
For the manual command, no args provided will rollback to last successful tag, or a user can provide the generation-uuid, commit-id, or deployment-uuid to rollback to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants