docs: virtual machine feature proposal #51
Conversation
Co-authored-by: Caleb Tallquist <55416214+tallquist10@users.noreply.github.com>
- **No self healing**: The current approach of deploying `VirtualMachines` leaves the platform unable to attempt to resolve problems without outside intervention.

### **3. Proposed Solution**

The proposed solution is a new `VirtualMachine` CR and controller, along with an expansion of the capabilities of the `VMDiskImage` controller.
Are you saying that there would be two separate controllers, one for VirtualMachine related resources and one for VMDiskImage resources?
We already have a VMDiskImage controller. We just expand its feature set to count references to VMDIs from VMs when a VMDI goes through the reconcile loop.
I guess my question is more are we expanding the VMDI controller to handle VM stuff, or making a separate controller for the new VM CR?
We will have two controllers if we implement that change: the one that already exists, which handles VMDIs, and a new one that will handle VMs. There will be some required expansion of the VMDI controller to track references from VMs, but the controller won't need to watch them; it just checks when we send a VMDI through the loop.
The `VirtualMachine` CR will act as a thin wrapper around the team's existing VM solution. This gives OT its own interface for representing a virtual machine, decoupling us from direct references to the underlying resources that actually spin up virtual machines in the cluster. Paired with the controller, this CR also lets the platform interact with the creation lifecycle of those underlying resources. We can use this to ensure that we always have the required backing resources for virtual machines, allowing the platform to self heal.

To address the second pain point of resource pruning, the team can expand the `VMDiskImage` controller to also record the number of `VirtualMachines` referencing each `VMDiskImage`. We can prevent deletion of `VMDiskImages` while they are referenced by VMs, and delete them if no referencing VMs are created within a given time period.
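A minimal sketch of this reference counting, assuming hypothetical `vm` and `countRefs` names (the real controller would list `VirtualMachines` through the Kubernetes API rather than iterate an in-memory slice):

```go
package main

import "fmt"

// vm is a pared-down stand-in for the VirtualMachine CR; the field
// names here are assumptions for illustration, not the real schema.
type vm struct {
	name    string
	vmdiRef string // namespace/name of the VMDiskImage the VM uses
}

// countRefs is the kind of check the expanded VMDiskImage controller
// could run during reconcile: count the VMs referencing a given VMDI.
func countRefs(vms []vm, vmdi string) int {
	n := 0
	for _, v := range vms {
		if v.vmdiRef == vmdi {
			n++
		}
	}
	return n
}

// deletable reports whether a VMDI with the given reference count is
// eligible for cleanup.
func deletable(refs int) bool {
	return refs == 0
}

func main() {
	vms := []vm{
		{name: "lab-1", vmdiRef: "vmdi-farm/demo-vmdi"},
		{name: "lab-2", vmdiRef: "vmdi-farm/demo-vmdi"},
	}
	refs := countRefs(vms, "vmdi-farm/demo-vmdi")
	fmt.Printf("demo-vmdi refs=%d deletable=%v\n", refs, deletable(refs))
}
```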
Can you elaborate on the deletion process for a VMDiskImage in this new route? Does every VM try to delete its respective VMDiskImage when it's done, and if so, are VMDiskImage resources going to be in a semi-constant state of Terminated?
No, VMs won't directly try to clean up VMDIs. I imagine that VMs and VMDIs will not be in the same namespace. To account for this, removing a VM won't immediately trigger removal of a VMDI in this approach. Instead, we will expand the VMDI controller to track references from VMs on VMDIs. We can use these references to determine whether a VMDI can be removed: remove it automatically when its references drop to zero, or remove it after a certain amount of time. This keeps management of the VMDI in its own controller, and the only thing a VM needs to do is create one if it doesn't exist.
Got it, that's good to know. I wonder if instead of having them automatically deleted, we allow them to be manually deleted, but we use these references as the means for determining that it's safe to delete them. Using UKI as an example, they will want to get rid of an old version of a VMDiskImage, so they will try. If it's being referenced by running labs, it will stick around until those labs go away, and then once they all go away and no new ones are using it, it will be deleted. If we try to automatically delete them when there is no reference, then I could see scenarios where VMDIs are being recreated unnecessarily because there's some downtime between uses of it in a lab, which doesn't necessarily feel like it's the right move.
In short, I think that once a VMDI is created, it should stick around until explicitly deleted, and then it should be in the terminated state until all of its references are gone, and then it should be removed.
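The lifecycle suggested above (stick around until explicitly deleted, then drain references) can be sketched as a small state function. The state names are assumptions for illustration; in practice this mirrors how Kubernetes finalizers hold a resource in a terminating state until cleanup conditions are met:

```go
package main

import "fmt"

// vmdiState models the lifecycle of a VMDiskImage under this scheme;
// the names are illustrative, not a committed API.
type vmdiState string

const (
	active      vmdiState = "Active"      // in use or awaiting reuse
	terminating vmdiState = "Terminating" // delete requested, refs draining
	removed     vmdiState = "Removed"     // safe to actually delete
)

// nextState applies the rule: a VMDI sticks around until explicitly
// deleted, then stays Terminating until all of its references are gone.
func nextState(deleteRequested bool, refs int) vmdiState {
	switch {
	case !deleteRequested:
		return active
	case refs > 0:
		return terminating
	default:
		return removed
	}
}

func main() {
	fmt.Println(nextState(false, 3)) // no delete requested: stays Active
	fmt.Println(nextState(true, 2))  // waiting for running labs to finish
	fmt.Println(nextState(true, 0))  // all references gone: safe to remove
}
```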
Yeah, we can run it like that too if we want. The real point here is that reference counting resolves the issue of determining whether a resource is eligible for deletion. Once we can track references, we can devise all sorts of schemes to clean up resources with no references: delete them right away, delete them after some time, or just know they are eligible for deletion if we want to clean them up manually.
```yaml
vmDiskImageRef:
  name: demo-vmdi
  namespace: vmdi-farm
vmDiskImageTemplate:
  storageClass: "gp3"
  snapshotClass: "ebs-snapshot"
  secretRef: "foo-bar"
  name: "harrison-vm"
  url: "https://s3.us-gov-west-1.amazonaws.com/vm-images/images/harrison-vm/1.0.0/vm.qcow2"
  sourceType: "s3"
  diskSize: "24Gi"
```
Can you elaborate on what we're looking at here? Is the idea that you're creating demo-vmdi with the contents of vmDiskImageTemplate, assuming it doesn't already exist? And what if it does?
The idea is that we are attempting to reference a VMDI named demo-vmdi. If it exists, we just use whatever it specifies. If it does not, we create one from the template. In my mind, VMDI existence trumps the template, so when we reference something that already exists we ignore the template, but we could log an event on the VM resource to specify that there was a mismatch.
We could get explicit about it and add a mode field to the spec or something. The modes could be CreateIfNotExist and FailIfNotExist; this way you can tell how the VM will handle a VMDI just by looking at the spec.
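A sketch of how that mode field might read in the spec. The API group, version, and field placement here are assumptions, since no mode field exists in the current proposal:

```yaml
# Hypothetical: mode is not part of the current spec.
apiVersion: ot.example.com/v1alpha1   # group/version are assumptions
kind: VirtualMachine
metadata:
  name: harrison-vm
spec:
  vmDiskImageRef:
    name: demo-vmdi
    namespace: vmdi-farm
    # CreateIfNotExist: build demo-vmdi from vmDiskImageTemplate when missing.
    # FailIfNotExist: surface an error instead of creating it.
    mode: CreateIfNotExist
```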
Overview
Howdy, howdy gang. This PR contains the proposal for the next phase of work on the Operator to enable it to handle 2 pain points common to OT.
By baking this into the operator we can also do stuff like emit metrics during this workflow and setup alerting on said metrics.
I'd love to use this pull request as a living document for any and all discussion so please feel free to leave comments, concerns, expressions of rage, anything really here.