
Conversation

@martindekov (Contributor) commented Oct 13, 2025

Validating disk removals in case the disk contains a volume's last healthy replica. A replica is deemed healthy when its FailedAt field is empty and its HealthyAt field is not empty, following examples directly from the Longhorn project:
https://github.com/longhorn/longhorn-manager/blob/master/datastore/longhorn.go#L2172-L2184
https://github.com/longhorn/longhorn-manager/blob/master/controller/volume_controller.go#L935-L951

The removal is also rejected in case the disk hosts the last
usable backing image. A "usable backing image"
is one which is in the ready state.
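
A minimal sketch of the two checks described above, using simplified stand-in structs rather than the real Longhorn CRD types (the field names mirror the Longhorn replica spec; the helper names are illustrative):

```go
package main

import "fmt"

// Simplified stand-ins for the Longhorn replica and backing image copy;
// only the fields relevant to the checks are shown.
type replica struct {
	FailedAt  string // non-empty once the replica has failed
	HealthyAt string // non-empty once the replica has become healthy
}

type backingImageCopy struct {
	State string // e.g. "ready", "in-progress", "failed"
}

// isHealthyReplica mirrors the condition referenced from longhorn-manager:
// a replica counts as healthy when it has never failed and has been marked healthy.
func isHealthyReplica(r replica) bool {
	return r.FailedAt == "" && r.HealthyAt != ""
}

// isUsableBackingImageCopy treats only copies in the ready state as usable.
func isUsableBackingImageCopy(c backingImageCopy) bool {
	return c.State == "ready"
}

func main() {
	fmt.Println(isHealthyReplica(replica{HealthyAt: "2025-10-13T00:00:00Z"})) // true
	fmt.Println(isUsableBackingImageCopy(backingImageCopy{State: "failed"}))  // false
}
```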

The change also renames the pipeline step
"Build the Image for the Integration Test"
to
"Build Integration Test Image and run Unit Tests"
to make it more obvious where the unit tests run
in the pipeline, since the scripts are called internally.

Problem:
If a disk contains the last healthy replica of a volume or the last usable backing image, we should reject removal of the disk, as we might otherwise lose data.

Solution:
Given the condition above, add webhook validation for the blockdevice resource so that, on an attempted removal, we verify the disk to be removed does not host the last healthy replica of any volume or the last usable backing image.
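
A rough sketch of that decision with simplified stand-in types (the real validator reads the Longhorn replica CRs from caches; all names below are illustrative):

```go
package main

import "fmt"

// lhReplica is a simplified stand-in for a Longhorn replica.
type lhReplica struct {
	VolumeName string
	DiskUUID   string
	FailedAt   string
	HealthyAt  string
}

func healthy(r lhReplica) bool { return r.FailedAt == "" && r.HealthyAt != "" }

// rejectDiskRemoval returns an error if the disk being removed hosts the only
// healthy replica of any volume; otherwise removal is allowed.
func rejectDiskRemoval(diskUUID string, replicas []lhReplica) error {
	// Count healthy replicas per volume, and remember which volumes
	// have a healthy replica on the disk that is being removed.
	healthyPerVolume := map[string]int{}
	onThisDisk := map[string]bool{}
	for _, r := range replicas {
		if !healthy(r) {
			continue
		}
		healthyPerVolume[r.VolumeName]++
		if r.DiskUUID == diskUUID {
			onThisDisk[r.VolumeName] = true
		}
	}
	for volume := range onThisDisk {
		if healthyPerVolume[volume] <= 1 {
			return fmt.Errorf("disk %s holds the last healthy replica of volume %s", diskUUID, volume)
		}
	}
	return nil
}

func main() {
	replicas := []lhReplica{
		{VolumeName: "pvc-1", DiskUUID: "disk-a", HealthyAt: "t1"},
		{VolumeName: "pvc-2", DiskUUID: "disk-a", HealthyAt: "t1"},
		{VolumeName: "pvc-2", DiskUUID: "disk-b", HealthyAt: "t2"},
	}
	// pvc-1 would lose its last healthy replica, so removal of disk-a is rejected.
	fmt.Println(rejectDiskRemoval("disk-a", replicas))
}
```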

Related Issue:
harvester/harvester#3344

Original PR with single node validation: #244

Test plan:

Added unit tests in the second commit

Negative scenarios

Reject volume removal:

  1. Create a multi-node Harvester cluster
  2. Attach SATA disks before installation
  3. Once installed, add the disks through the Hosts -> Edit Config -> Storage -> Add Disk dropdown on both nodes
  4. Tag the added disk with a new tag, for example "new" (make sure to use a unique tag on each node)
  5. Create a storage class with 1 replica and attach the disk via Advanced -> Storage Classes -> Create -> Disk Selector -> new (the disk tag from above)
  6. Create a virtual machine and attach a SATA volume that uses the new storage class
  7. Since the replica count is 1, the volume will have a single healthy replica
  8. Try to delete the disk - the removal is rejected

Reject backing image removal:

  1. Follow steps 1 to 4 above
  2. In the Images tab, click the Create button to create a new image
  3. While creating, pick a file image, click the Storage Class tab, and pick the storage class created above - "new"
  4. Try deleting the disk from the Hosts page - an error is shown as there is only 1 healthy backing image

Note: make sure the virtualmachineimage uses the block device backend before testing, so that when the image is created a corresponding blockdevice is created as well.

Positive scenarios

Follow the negative scenarios from above, but when creating the storage class, match the replica count to the number of nodes in the Harvester cluster. That way both backing images and volumes will have multiple valid replicas, and deletion of the disk will be allowed.

Screenshot: removing a disk that holds the last healthy volume replica - webhook rejection shown in the UI
Screenshot: removing a disk that holds the last usable backing image - webhook rejection shown in the UI

@WebberHuang1118 (Member) left a comment

@martindekov Please check the comments, thanks.

Validating disk removals in case the disk contains
a volume's last healthy replica. The healthy
replica is determined by FailedAt: in case this
field is empty, the replica is deemed healthy.

Also, rejection happens in case the disk hosts the last
usable backing image. A "usable backing image"
is one which is in the ready state.

Signed-off-by: Martin Dekov <martin.dekov@suse.com>
Adding unit tests for the blockdevice validation's Update
method. The change requires a very basic fake implementation
of the cache objects used.

The pipeline also gains an additional step with a very basic
go test invocation, which can be further optimized to output
the coverage to a file.

Signed-off-by: Martin Dekov <martin.dekov@suse.com>
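
A sketch of the fake-cache pattern that commit describes, with a hypothetical interface and test cases (the real tests exercise the validator's Update method against the generated Longhorn caches; everything named here is an assumption for illustration):

```go
package blockdevice_test

import "testing"

// lastReplicaChecker is a hypothetical stand-in for what the validator needs
// from the Longhorn caches; the real code uses the generated cache clients.
type lastReplicaChecker interface {
	HoldsLastHealthyReplica(diskUUID string) (bool, error)
}

// fakeChecker returns canned answers so the decision logic can be unit-tested
// without a running cluster.
type fakeChecker struct {
	lastOn map[string]bool
}

func (f *fakeChecker) HoldsLastHealthyReplica(diskUUID string) (bool, error) {
	return f.lastOn[diskUUID], nil
}

// allowRemoval is a stand-in for the Update validation: removal is allowed only
// when the disk does not hold the last healthy replica of any volume.
func allowRemoval(c lastReplicaChecker, diskUUID string) bool {
	last, err := c.HoldsLastHealthyReplica(diskUUID)
	return err == nil && !last
}

func TestAllowRemoval(t *testing.T) {
	fake := &fakeChecker{lastOn: map[string]bool{"disk-a": true}}
	tests := []struct {
		name     string
		diskUUID string
		want     bool
	}{
		{"disk holding the last healthy replica is protected", "disk-a", false},
		{"disk without a last healthy replica can be removed", "disk-b", true},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			if got := allowRemoval(fake, tc.diskUUID); got != tc.want {
				t.Errorf("allowRemoval(%q) = %v, want %v", tc.diskUUID, got, tc.want)
			}
		})
	}
}
```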
Refactoring code so that:
* validateLHDisk - is a series of checks which return early when
there is no match, reducing the previously complex if statement
* validateVolumes - extracted diskUUID retrieval into a standalone function
to be reused, and extracted each general step into standalone methods
* validateBackingImages - reused the diskUUID from the volumes and
extracted the logic for determining whether it is safe to delete a
backing image into a standalone method

Signed-off-by: Martin Dekov <martin.dekov@suse.com>
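
A skeleton of the decomposition described in that refactoring commit, with stub bodies and assumed signatures (the real methods hang off the blockdevice validator and take the cached Longhorn objects; only the early-return shape is the point here):

```go
package blockdevice

// validateLHDisk runs a series of checks, returning early when the device is not
// a Longhorn-managed disk on the node or carries nothing that would be lost.
func validateLHDisk(nodeName, devPath string) error {
	diskUUID, ok := validateDiskInNode(nodeName, devPath)
	if !ok {
		return nil // not a Longhorn-managed disk: nothing to protect
	}
	if err := validateVolumes(diskUUID); err != nil {
		return err
	}
	return validateBackingImages(diskUUID)
}

// validateDiskInNode resolves the disk UUID for the device on the node, if any.
func validateDiskInNode(nodeName, devPath string) (string, bool) {
	// Lookup against the Longhorn node CR would happen here.
	return "", false
}

// validateVolumes rejects removal when the disk hosts the last healthy replica of a volume.
func validateVolumes(diskUUID string) error {
	// Healthy-replica counting as sketched earlier.
	return nil
}

// validateBackingImages reuses the same diskUUID and rejects removal when the
// disk hosts the last ready copy of a backing image.
func validateBackingImages(diskUUID string) error {
	return nil
}
```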
Including HealthyAt in addition to the FailedAt field when deciding
whether a replica is failed or healthy. Updating the tests as well when
constructing replicas.

The condition follows the established approach of the Longhorn
team, which uses this condition as well:
https://github.com/longhorn/longhorn-manager/blob/master/datastore/longhorn.go#L2172-L2184
https://github.com/longhorn/longhorn-manager/blob/master/controller/volume_controller.go#L935-L951

Signed-off-by: Martin Dekov <martin.dekov@suse.com>
@Vicente-Cheng (Collaborator) left a comment

overall lgtm, just some questions.

Addressing feedback from Vicente, including the following:
* renamed getDiskUUID to validateDiskInNode and call it
directly in validateLHDisk to avoid duplication
* renamed some functions and variables to reduce length, due to
the new functions for counting backing images and healthy replicas
* added back trailing spaces which were previously removed
* prepended Longhorn-related object keys with lh, as Longhorn shares
many object names with k8s (volumes/replicas/nodes), which
can be confusing

Signed-off-by: Martin Dekov <martin.dekov@suse.com>
Renaming pipeline integration image build step from:
"Build the Image for the Integration Test"
to:
"Build Integration Test Image and run Unit Tests"

To make it obvious when the unit tests are being run, as
we don't mention this explicitly anywhere. Due to this,
also removed the `unit-test` script and step.

Signed-off-by: Martin Dekov <martin.dekov@suse.com>
@Vicente-Cheng (Collaborator) left a comment

LGTM, please remember to squash after all reviews are complete.

@WebberHuang1118 (Member) left a comment

LGTM, just a nit, the PR is great, thanks.

@martindekov martindekov merged commit 26cf4ce into harvester:master Oct 23, 2025
10 of 11 checks passed
@albinsun commented Oct 29, 2025

Hi @martindekov ,
I'm verifying the source issue and have a quick question; you mentioned:

Validating disk removals in case the disk contains the volume with last healthy replica.

Could we say that this enhancement is ONLY for "single replica" storage class scenarios?
What about "multi-replica" scenarios? In such cases, a volume down to its last replica will not be "healthy".

By the way, for a single-replica but Detached volume, what is the expected behavior?

cc. @Vicente-Cheng @WebberHuang1118

@martindekov (Contributor, Author) commented Oct 29, 2025

Hey @albinsun, this might sound misleading if read in isolation:

Validating disk removals in case the disk contains the volume with last healthy replica.

So on the question:

Could we say that this enhancement is only for "single replica" storage class scenarios?
Because for "multi-replica" scenarios, a volume down to its last replica will not be "healthy".

We don't evaluate the volume status when determining whether it's OK to remove a disk; we evaluate the replicas of the volume. If there is only 1 healthy replica of a volume and it's on the disk the user is trying to remove, we reject the disk removal operation.

So, in short, this covers all scenarios - whether a volume has one or more replicas.

By the way, for a single-replica but Detached volume, what is the expected behavior?

My understanding is that in order to get to a detached volume (I assume you mean detached from a virtual machine), it would have been attached to it at some point and replicated; in that case it would still be on a disk, so depending on the state of its replicas we reject or accept the disk's removal.

We are just cautious not to lose the last healthy replica of a volume or backing image unintentionally, so that in case we decide to recover it we have a healthy copy from which we can start. I think the test plan shows the intent better.
