Validate single node disk removals #244
Conversation
w13915984028
left a comment
FYI.
Force-pushed d0a8ce6 to 0cc1d7c
Force-pushed 0deeccb to a62db7c
Addressing feedback from Moritz including the following:
* adding longhorn volume resource to the ndm clusterrole
* adding virtualmachineimage resource to the clusterrole
* removing redundant usage of slice after logic was changed
Signed-off-by: Martin Dekov <martin.dekov@suse.com>
Signed-off-by: Martin Dekov <martin.dekov@suse.com>
Force-pushed 38e105b to 00c5d9c
Adding fake implementations and a unit test skeleton for the pipeline runs. For now there is only one test with minimal coverage, for the purpose of validating that the tests work as expected and cover the basic part of the code. Test cases will be added in an upcoming commit. Signed-off-by: Martin Dekov <martin.dekov@suse.com>
Force-pushed 00c5d9c to d1725ba
Adding unit test cases covering mostly the new logic. Removed the slice allocation size since it produced empty values, which would have required new logic to avoid them, so for simplicity removal was the best choice. Also augmented the instantiation of the objects to allow empty slices and to extend test coverage when the caches are empty. Signed-off-by: Martin Dekov <martin.dekov@suse.com>
m-ildefons
left a comment
LGTM.
cd $(dirname $0)/..

echo Running unit tests
go test -v -cover ./pkg/...
Just a nit, you don't have to correct this:
If you want coverage output from the containerized build (i.e. using make unit-test), then you need to do some additional things:
- specify a coverage output file:
  go test -v -cover -coverprofile=coverage.out ./pkg/...
- do the conversion to the final output format (e.g. HTML) inside the container:
  go test [...]
  go tool cover -html=coverage.out -o coverage.html
- add the final coverage output file in Dockerfile.dapper where the environment variable DAPPER_OUTPUT is defined, e.g.:
  [...]
  ENV DAPPER_OUTPUT ./bin ./manifests ./pkg ./coverage.html

Then you can do make unit-test, and the coverage report will be generated and placed outside of the build container so you can analyze it.
Thanks Moritz, I will leave this open. I intentionally made it simple; I didn't want to overcrowd this change as it already contains too many different things. Augmenting the pipeline to output coverage in the right way is work on its own. Also, I think there were badges to show coverage on the front page, etc.
We can open a separate issue for this in this repo I think? WDYT?
That's completely fair, in fact I think it would be better to have a separate PR for that. I just assumed that you added the -cover flag to get coverage data.
Feel free to tag me as reviewer when you open that PR.
WebberHuang1118
left a comment
Please review the discussion related to the Longhorn backing image, thanks.
Replacing validation of blockdevice deletion from virtualmachineimage to backingimage. The logic rejects disk removals in case there are backingimages in a non-ready state attached to the blockdevice. We index the caches for both longhorn nodes and backingimages to reach all backingimages related to blockdevices through the longhorn nodes.
Additional changes:
* Replaces fakes and testing for the new objects
* Fixed some RBAC permissions for the new objects
* Registered longhorn factory as caches didn't actually return anything
* Removed unused objects and prints
* Fixed selection based on Spec.Tags to Status.Tags as in a real environment tags are stored in status, not in the spec
Signed-off-by: Martin Dekov <martin.dekov@suse.com>
Force-pushed 924f071 to 42148b6
Some of the logic for checking backing images seems incorrect to me; please see my comment.
Also suggest reviewing the Backing Image LEP for the relevant data structures.
Thanks
var failedBackingImages []string
for _, backingImage := range backingImages {
	if backingImage.Status.DiskFileStatusMap[uuid].State != lhv1.BackingImageStateReady {
		failedBackingImages = append(failedBackingImages, backingImage.Name)
	}
}
if len(failedBackingImages) > 0 {
	failedBackingImageList := strings.Join(failedBackingImages, ",")
	return fmt.Errorf("the following backingimages: %v attached to blockdevice: %v are in a %s state; make sure state is fixed before disk deletion",
		failedBackingImageList, old.Name, lhv1.BackingImageStateFailed)
}
If I understand correctly, the current logic is: for the disk that’s about to be removed, check whether any backing image has a copy on that disk, and whether the copy is not in the ready state.
However, what we actually need to check is: for the disk that’s about to be removed, determine if it holds the only valid copy of certain backing images.
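For illustration, a minimal sketch of that suggested check against the Longhorn v1beta2 types already used in this PR; the import path and the helper name diskHoldsOnlyReadyCopy are assumptions made for the sketch, not code from this PR:

```go
package backingimage

import (
	lhv1 "github.com/longhorn/longhorn-manager/k8s/pkg/apis/longhorn/v1beta2"
)

// diskHoldsOnlyReadyCopy reports whether the disk identified by diskUUID holds
// the only ready copy of the given backing image, i.e. whether removing that
// disk would lose the image entirely.
func diskHoldsOnlyReadyCopy(backingImage *lhv1.BackingImage, diskUUID string) bool {
	fileStatus, hasCopyOnDisk := backingImage.Status.DiskFileStatusMap[diskUUID]
	if !hasCopyOnDisk || fileStatus == nil || fileStatus.State != lhv1.BackingImageStateReady {
		// No ready copy on this disk, so removing it cannot lose the last ready copy.
		return false
	}
	for uuid, status := range backingImage.Status.DiskFileStatusMap {
		if uuid != diskUUID && status != nil && status.State == lhv1.BackingImageStateReady {
			// Another disk still holds a ready copy, so this removal is safe.
			return false
		}
	}
	return true
}
```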
I think that this:
determine if it holds the only valid copy of certain backing images.
is part of the UI warning change in that issue: harvester/harvester#9218. There is no PR for it yet.
The current PR refers to the case where we have a single Harvester node and we have bad volumes/backingimages, and all storage classes have replica count 1 when Harvester consists of a single node, covering this requirement by @Vicente-Cheng:
- block by webhook - number of replica = 1
IMO, if we have a single node, those disks will effectively be attached to that single node whatever you do; the moment you add more nodes, we don't reach this logic here at all. As per my current understanding, in multi-node setups replication of data becomes more than 1, and then we go to the warning before deletion but don't reject it through the webhook, though this is to be discussed I think.
Why does this PR only target a single-node environment? If I understand correctly, the original issue #3344 should also cover multi-node environments.
As I see it, this PR is attempting to address #3344, so this PR should not be limited to single-node setups. From my perspective, #3344 and #9218 appear to be duplicates.
Regarding:
The current PR refers to the case where we have a single harvester node and we have bad volumes/backingimages. And all storage classes have replica 1 when harvester consists of a single node covering this requirement by @Vicente-Cheng:
block by webhook - number of replica = 1
Even in a single-node environment, users can attach multiple disks so that a storageClass targets more than one replica.
If I understand correctly, what @Vicente-Cheng means is that the webhook should block node removal if a volume has only one healthy replica on the disk that’s about to be deleted. We should trace this through the route volume → engine → replicas, rather than relying on the selector, since the selector is part of the volume spec (the desired state) and not the actual state we need to verify.
The same logic should also apply to backing images.
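For illustration, a hedged sketch of the replica-based check described above, written against the Longhorn v1beta2 types; the helper name and the exact replica fields (Spec.VolumeName, Spec.DiskID, Spec.FailedAt) are assumptions and may differ between Longhorn API versions:

```go
package volume

import (
	lhv1 "github.com/longhorn/longhorn-manager/k8s/pkg/apis/longhorn/v1beta2"
)

// volumeWouldLoseLastHealthyReplica reports whether the disk identified by
// diskUUID holds the only healthy replica of the given volume. It inspects the
// actual replica objects (volume -> replicas) rather than the volume's
// diskSelector, which only describes the desired state.
func volumeWouldLoseLastHealthyReplica(replicas []*lhv1.Replica, volumeName, diskUUID string) bool {
	healthyOnDisk := 0
	healthyElsewhere := 0
	for _, r := range replicas {
		if r.Spec.VolumeName != volumeName {
			continue
		}
		// A replica with FailedAt set is no longer a healthy copy of the data.
		if r.Spec.FailedAt != "" {
			continue
		}
		if r.Spec.DiskID == diskUUID {
			healthyOnDisk++
		} else {
			healthyElsewhere++
		}
	}
	return healthyOnDisk > 0 && healthyElsewhere == 0
}
```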
I captured our offline discussions in a comment on the issue:
harvester/harvester#3344 (comment)
That is:
- Prevent the user, through the webhook, from deleting a disk on the hosts page in a single-node Harvester in case a volume is in Degraded state and is attached to the disk through a storage class
- In all cases warn the user that when a disk is deleted from the hosts page in Harvester you should be cautious and evict it first - 1st click warn, second click allow to pass
So the implementation reflects that.
func (v *Validator) validateDegradedVolumes(old *diskv1.BlockDevice) error {
	volumeList, err := v.volumeCache.List(utils.LonghornSystemNamespaceName, labels.Everything())
	if err != nil {
		return err
	}
	if len(volumeList) == 0 {
		return nil
	}
	degradedVolumes := []string{}
	for _, vol := range volumeList {
		if vol.Status.Robustness == lhv1.VolumeRobustnessDegraded {
			degradedVolumes = append(degradedVolumes, vol.Name)
		}
	}
	if len(degradedVolumes) == 0 {
		return nil
	}
	selectorDegradedVol := make(map[string][]string)
	for _, name := range degradedVolumes {
		pv, err := v.pvCache.Get(name)
		if err != nil {
			return err
		}
		diskSelector := ""
		if pv.Spec.CSI != nil {
			diskSelector = pv.Spec.CSI.VolumeAttributes[utils.DiskSelectorKey]
		}
		if len(diskSelector) != 0 {
			selectorDegradedVol[diskSelector] = append(selectorDegradedVol[diskSelector], pv.Name)
		}
	}
	degradedVolString := ""
	for _, diskTag := range old.Status.Tags {
		if val, ok := selectorDegradedVol[diskTag]; ok {
			degradedVolString += fmt.Sprintf(" %s: %v", diskTag, val)
		}
	}
	if len(degradedVolString) > 0 {
		return fmt.Errorf("the following tags with volumes:%s attached to disk: %s are in degraded state; evict disk before proceeding",
			degradedVolString, old.Spec.DevPath)
	}
	return nil
}
May I know why we need to care about the selector? If I understand correctly, the only check required is whether the disk slated for removal contains the sole valid replica of any volume.
The selector is how we trace which volumes are attached to the disk we are currently trying to remove, so:
- Tag a disk
- Use that tag to create a storage class
- Use that storage class to create volumes/VMs
This tag is used as the selector across those components. That way, if I delete the disk, I trace back the volumes created through it via the disk -> tag -> sc -> volumes chain.
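For illustration, a StorageClass in that chain might look roughly like the sketch below; the object name is made up, while numberOfReplicas and diskSelector are the Longhorn parameters being referred to:

```go
package example

import (
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// A disk tagged "new" is targeted by a StorageClass whose diskSelector is "new".
// Volumes provisioned from this class land on disks carrying that tag, which is
// what lets the webhook trace disk -> tag -> storage class -> volumes.
var singleDiskStorageClass = storagev1.StorageClass{
	ObjectMeta:  metav1.ObjectMeta{Name: "single-disk-sc"},
	Provisioner: "driver.longhorn.io",
	Parameters: map[string]string{
		"numberOfReplicas": "1",
		"diskSelector":     "new", // same value as the blockdevice's disk tag
	},
}
```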
Also, both this logic and the backingimages logic above are triggered only on single-node setups, so replication can only be 1 in that case.
Even in a single-node environment, users can attach multiple disks so that a storageClass targets more than one replica, so we cannot make that assumption.
@WebberHuang1118 thanks for pointing out the HEP, I will definitely go through it! Generally, the current change handles bad backingimages and volumes when Harvester is a single node, meaning replication is 1 as well. Any bad backingimage or volume is actually the last valid one in that single-node setup.
Indexing backing image by disk UUID taken from the status, not from the spec, as per feedback from Webber
Signed-off-by: Martin Dekov <martin.dekov@suse.com>
Hi @martindekov
The original issue #3344 should also cover multi-node environments. As I see it, this PR is attempting to address #3344, so the PR shouldn't be limited to single-node setups. From my perspective, #3344 and #9218 appear to be duplicates.
IMO, we should design a unified solution that applies to both single-node and multi-node environments in this PR. There's no need to separate it into another issue (#9218) and another PR.
To reiterate, generally we agreed that the solution for all node setups is a warning, not webhook rejection. Webhook rejection is for the corner case where we cannot really have replicated storage even if we set the affinity rule, as per the comment. The issue harvester/harvester#9218 refers to the warning/UI part, so the work is split between the issues: one for the UI and one for the backend. If we decide all the warning work should go to the webhook backend and leave out the front-end warning aspect, we also need to find ways to force delete disks in a bad state, as the deletion will be rejected at the API level. This design would further extend the scope of the issue.
Not every detail becomes clear until the PR is created. IMO, for reviewers, changes or updates to the original design or proposal are sometimes necessary.
I agree that #3344 targets the backend part only. However, based on my code reading, the current implementation only includes a webhook to reject disk removal for preventing data loss in a single-node cluster. This is not sufficient for #3344 for the following reasons:
TL;DR: Meanwhile, #3344 should focus on the original context: if removing a disk would cause any volume to lose its only healthy replica or any backing image to lose its only ready copy, the webhook should strongly block the operation to prevent data loss, regardless of how many nodes are in the cluster. If we continue with the current PR design, I believe the context and scope of #3344 should at least be reduced, for example to
Ping @Vicente-Cheng @ihcsim for your thoughts on this topic.
Closing and giving priority to the multi-node implementation here: #248
Won't update this PR as it became overcrowded with comments and hard to review.
Rejecting disk removals when there are attached volumes in Degraded state on single-node Harvester setups. We also reject removals when backing images on the block device are in a failed state.
The logic is split in two parts: for degraded volumes we compare the storage class selectors against the disk tags, and for backing images we reach them through the Longhorn node via indexing.
In both scenarios, in case there is a match while the disk is being disabled, we list for the user which volumes and which images need to be fixed.
Signed-off-by: Martin Dekov <martin.dekov@suse.com>
Problem:
Users can directly delete disks without any warning, as explained in the issue:
harvester/harvester#3344
Solution:
Reject deletion of disks when there are degraded volumes and the Harvester cluster consists of only a single node
Related Issue:
harvester/harvester#3344
Test plan:
Unit testing added
Replicate Degraded Volume Scenario:
newtag for example
new (same as disk tag above)
Replicate Failed Backing Image scenario: