
Conversation

@martindekov (Contributor) commented Oct 13, 2025

Validating disk removals in case the disk contains a volume's last healthy replica. A replica is deemed healthy when its FailedAt field is empty and its HealthyAt field is not empty, following examples directly from the Longhorn project:
https://github.com/longhorn/longhorn-manager/blob/master/datastore/longhorn.go#L2172-L2184
https://github.com/longhorn/longhorn-manager/blob/master/controller/volume_controller.go#L935-L951

The removal is also rejected in case the disk hosts the last
usable backing image. A "usable backing image"
is one which is in the ready state.
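
A minimal sketch of the two checks described above, using simplified stand-in structs rather than the real Longhorn CRD types (the field names mirror the Longhorn replica spec; the helper names are illustrative):

```go
package main

import "fmt"

// Simplified stand-ins for the Longhorn replica and backing image copy;
// only the fields relevant to the checks are shown.
type replica struct {
	FailedAt  string // non-empty once the replica has failed
	HealthyAt string // non-empty once the replica has become healthy
}

type backingImageCopy struct {
	State string // e.g. "ready", "in-progress", "failed"
}

// isHealthyReplica mirrors the condition referenced from longhorn-manager:
// a replica counts as healthy when it has never failed and has been marked healthy.
func isHealthyReplica(r replica) bool {
	return r.FailedAt == "" && r.HealthyAt != ""
}

// isUsableBackingImageCopy treats only copies in the ready state as usable.
func isUsableBackingImageCopy(c backingImageCopy) bool {
	return c.State == "ready"
}

func main() {
	fmt.Println(isHealthyReplica(replica{HealthyAt: "2025-10-13T00:00:00Z"})) // true
	fmt.Println(isUsableBackingImageCopy(backingImageCopy{State: "failed"}))  // false
}
```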

The change also renames the pipeline step
"Build the Image for the Integration Test"
to
"Build Integration Test Image and run Unit Tests"
to make it more obvious where the unit tests run
in the pipeline, since the scripts are called internally.

Problem:
If a disk contains the last healthy replica of a volume or the last usable backing image, we should reject removal of the disk, as we might otherwise lose data.

Solution:
Given the condition above, add webhook validation for the blockdevice resource so that, on an attempted removal, we verify the disk to be removed does not host the last healthy replica of any volume or the last usable backing image.
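
A rough sketch of that decision with simplified stand-in types (the real validator reads the Longhorn replica CRs from caches; all names below are illustrative):

```go
package main

import "fmt"

// lhReplica is a simplified stand-in for a Longhorn replica.
type lhReplica struct {
	VolumeName string
	DiskUUID   string
	FailedAt   string
	HealthyAt  string
}

func healthy(r lhReplica) bool { return r.FailedAt == "" && r.HealthyAt != "" }

// rejectDiskRemoval returns an error if the disk being removed hosts the only
// healthy replica of any volume; otherwise removal is allowed.
func rejectDiskRemoval(diskUUID string, replicas []lhReplica) error {
	// Count healthy replicas per volume, and remember which volumes
	// have a healthy replica on the disk that is being removed.
	healthyPerVolume := map[string]int{}
	onThisDisk := map[string]bool{}
	for _, r := range replicas {
		if !healthy(r) {
			continue
		}
		healthyPerVolume[r.VolumeName]++
		if r.DiskUUID == diskUUID {
			onThisDisk[r.VolumeName] = true
		}
	}
	for volume := range onThisDisk {
		if healthyPerVolume[volume] <= 1 {
			return fmt.Errorf("disk %s holds the last healthy replica of volume %s", diskUUID, volume)
		}
	}
	return nil
}

func main() {
	replicas := []lhReplica{
		{VolumeName: "pvc-1", DiskUUID: "disk-a", HealthyAt: "t1"},
		{VolumeName: "pvc-2", DiskUUID: "disk-a", HealthyAt: "t1"},
		{VolumeName: "pvc-2", DiskUUID: "disk-b", HealthyAt: "t2"},
	}
	// pvc-1 would lose its last healthy replica, so removal of disk-a is rejected.
	fmt.Println(rejectDiskRemoval("disk-a", replicas))
}
```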

Related Issue:
harvester/harvester#3344

Original PR with single node validation: #244

Test plan:

Added unit tests in the second commit

Negative scenarios

Reject volume removal:

  1. Create a multi-node Harvester cluster
  2. Attach SATA disks before installation
  3. Once installed, add the disks through the Hosts -> Edit Config -> Storage -> Add Disk dropdown on both nodes
  4. Tag the added disk with a new tag, for example "new" (make sure to use a unique tag on each node)
  5. Create a storage class with 1 replica and attach the disk via Advanced -> Storage Classes -> Create -> Disk Selector -> new (the disk tag from above)
  6. Create a virtual machine and attach a SATA volume that uses the new storage class
  7. Since the replica count is 1, the volume will have a single healthy replica
  8. Try to delete the disk - the removal is rejected

Reject backing image removal:

  1. Follow steps 1 to 4 above
  2. In the Images tab, click the Create button to create a new image
  3. While creating, pick a file image, click the Storage Class tab, and pick the storage class created above - "new"
  4. Try deleting the disk from the Hosts page - an error is shown as there is only 1 healthy backing image

Note: make sure the virtualmachineimage uses the block device backend before testing, so that when the image is created a corresponding blockdevice is created as well.

Positive scenarios

Follow the negative scenarios from above, but when creating the storage class, match the replica count to the number of nodes in the Harvester cluster. That way both backing images and volumes will have multiple valid replicas, and deletion of the disk will be allowed.

Screenshot: removing a disk that holds the last healthy volume replica - webhook rejection shown in the UI
Screenshot: removing a disk that holds the last usable backing image - webhook rejection shown in the UI

@WebberHuang1118 (Member) left a comment

@martindekov Please check the comments, thanks.

Validating disk removals in case the disk contains
a volume's last healthy replica. The healthy
replica is determined by FailedAt: in case this
field is empty, the replica is deemed healthy.

Also, rejection happens in case the disk hosts the last
usable backing image. A "usable backing image"
is one which is in the ready state.

Signed-off-by: Martin Dekov <martin.dekov@suse.com>
Adding unit tests for the blockdevice validation's Update
method. The change requires a very basic fake implementation
of the cache objects used.

The pipeline also gains an additional step with a very basic
go test invocation, which can be further optimized to output
the coverage to a file.

Signed-off-by: Martin Dekov <martin.dekov@suse.com>
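
A sketch of the fake-cache pattern that commit describes, with a hypothetical interface and test cases (the real tests exercise the validator's Update method against the generated Longhorn caches; everything named here is an assumption for illustration):

```go
package blockdevice_test

import "testing"

// lastReplicaChecker is a hypothetical stand-in for what the validator needs
// from the Longhorn caches; the real code uses the generated cache clients.
type lastReplicaChecker interface {
	HoldsLastHealthyReplica(diskUUID string) (bool, error)
}

// fakeChecker returns canned answers so the decision logic can be unit-tested
// without a running cluster.
type fakeChecker struct {
	lastOn map[string]bool
}

func (f *fakeChecker) HoldsLastHealthyReplica(diskUUID string) (bool, error) {
	return f.lastOn[diskUUID], nil
}

// allowRemoval is a stand-in for the Update validation: removal is allowed only
// when the disk does not hold the last healthy replica of any volume.
func allowRemoval(c lastReplicaChecker, diskUUID string) bool {
	last, err := c.HoldsLastHealthyReplica(diskUUID)
	return err == nil && !last
}

func TestAllowRemoval(t *testing.T) {
	fake := &fakeChecker{lastOn: map[string]bool{"disk-a": true}}
	tests := []struct {
		name     string
		diskUUID string
		want     bool
	}{
		{"disk holding the last healthy replica is protected", "disk-a", false},
		{"disk without a last healthy replica can be removed", "disk-b", true},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			if got := allowRemoval(fake, tc.diskUUID); got != tc.want {
				t.Errorf("allowRemoval(%q) = %v, want %v", tc.diskUUID, got, tc.want)
			}
		})
	}
}
```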
Refactoring code so that:
* validateLHDisk - is a series of checks which return early when
there is no match, reducing the previously complex if statement
* validateVolumes - extracted diskUUID retrieval into a standalone function
to be reused, and extracted each general step into standalone methods
* validateBackingImages - reused the diskUUID from the volumes and
extracted the logic for determining whether it is safe to delete a
backing image into a standalone method

Signed-off-by: Martin Dekov <martin.dekov@suse.com>
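
A skeleton of the decomposition described in that refactoring commit, with stub bodies and assumed signatures (the real methods hang off the blockdevice validator and take the cached Longhorn objects; only the early-return shape is the point here):

```go
package blockdevice

// validateLHDisk runs a series of checks, returning early when the device is not
// a Longhorn-managed disk on the node or carries nothing that would be lost.
func validateLHDisk(nodeName, devPath string) error {
	diskUUID, ok := validateDiskInNode(nodeName, devPath)
	if !ok {
		return nil // not a Longhorn-managed disk: nothing to protect
	}
	if err := validateVolumes(diskUUID); err != nil {
		return err
	}
	return validateBackingImages(diskUUID)
}

// validateDiskInNode resolves the disk UUID for the device on the node, if any.
func validateDiskInNode(nodeName, devPath string) (string, bool) {
	// Lookup against the Longhorn node CR would happen here.
	return "", false
}

// validateVolumes rejects removal when the disk hosts the last healthy replica of a volume.
func validateVolumes(diskUUID string) error {
	// Healthy-replica counting as sketched earlier.
	return nil
}

// validateBackingImages reuses the same diskUUID and rejects removal when the
// disk hosts the last ready copy of a backing image.
func validateBackingImages(diskUUID string) error {
	return nil
}
```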
Including HealthyAt in addition to the FailedAt field when deciding
whether a replica is failed or healthy. Updating the tests as well when
constructing replicas.

The condition follows the established approach of the Longhorn
team, which uses this condition as well:
https://github.com/longhorn/longhorn-manager/blob/master/datastore/longhorn.go#L2172-L2184
https://github.com/longhorn/longhorn-manager/blob/master/controller/volume_controller.go#L935-L951

Signed-off-by: Martin Dekov <martin.dekov@suse.com>
@Vicente-Cheng (Collaborator) left a comment

overall lgtm, just some questions.

Addressing feedback from Vicente, including the following:
* renamed getDiskUUID to validateDiskInNode and call it
directly in validateLHDisk to avoid duplication
* renamed some functions and variables to reduce length, due to
the new functions for counting backing images and healthy replicas
* added back trailing spaces which were previously removed
* prepended Longhorn-related object keys with lh, as Longhorn shares
many object names with k8s (volumes/replicas/nodes), which
can be confusing

Signed-off-by: Martin Dekov <martin.dekov@suse.com>
Renaming pipeline integration image build step from:
"Build the Image for the Integration Test"
to:
"Build Integration Test Image and run Unit Tests"

To make it obvious when the unit tests are being run, as
we don't mention this explicitly anywhere. Due to this,
also removed the `unit-test` script and step.

Signed-off-by: Martin Dekov <martin.dekov@suse.com>
@Vicente-Cheng (Collaborator) left a comment

LGTM, please remember to squash after all reviews are complete.

@WebberHuang1118 (Member) left a comment

LGTM, just a nit, the PR is great, thanks.

@martindekov martindekov merged commit 26cf4ce into harvester:master Oct 23, 2025
10 of 11 checks passed
@albinsun commented Oct 29, 2025

Hi @martindekov ,
I'm verifying the source issue and have a quick question; you mentioned:

Validating disk removals in case the disk contains the volume with last healthy replica.

Could we say that this enhancement is ONLY for "single replica" storage class scenarios?
What about "multi-replica" scenarios? In such cases, a volume down to its last replica will not be "healthy".

By the way, for a single-replica but Detached volume, what is the expected behavior?

cc. @Vicente-Cheng @WebberHuang1118

@martindekov (Contributor, Author) commented Oct 29, 2025

Hey @albinsun, this might sound misleading if read in isolation:

Validating disk removals in case the disk contains the volume with last healthy replica.

So on the question:

Could we say that this enhancement is only for "single replica" storage class scenarios?
Because for "multi-replica" scenarios, a volume down to its last replica will not be "healthy".

We don't evaluate the volume status when determining whether it's OK to remove a disk; we evaluate the replicas of the volume. If there is only 1 healthy replica of a volume and it's on the disk the user is trying to remove, we reject the disk removal operation.

So, in short, this covers all scenarios - whether a volume has one or more replicas.

By the way, for a single-replica but Detached volume, what is the expected behavior?

My understanding is that in order to get to a detached volume (I assume you mean detached from a virtual machine), it would have been attached to it at some point and replicated; in that case it would still be on a disk, so depending on the state of its replicas we reject or accept the disk's removal.

We are just cautious not to lose the last healthy replica of a volume or backing image unintentionally, so that in case we decide to recover it we have a healthy copy from which we can start. I think the test plan shows the intent better.
