
Conversation

@hmbill694 hmbill694 commented Dec 2, 2025

Overview

Howdy, howdy gang. This PR contains the proposal for the next phase of work on the Operator to enable it to handle two pain points common to OT:

  • Virtual Machine creation before backing data is ready
  • Cleaning up Virtual Machine data when we don't need it anymore

By baking this into the operator we can also do stuff like emit metrics during this workflow and set up alerting on said metrics.

I'd love to use this pull request as a living document for any and all discussion so please feel free to leave comments, concerns, expressions of rage, anything really here.

hmbill694 and others added 2 commits December 3, 2025 08:42
Co-authored-by: Caleb Tallquist <55416214+tallquist10@users.noreply.github.com>
Co-authored-by: Caleb Tallquist <55416214+tallquist10@users.noreply.github.com>
@hmbill694 hmbill694 changed the title from "docs: workspace feature proposal" to "docs: virtual machine feature proposal" Dec 4, 2025
@hmbill694 hmbill694 force-pushed the chore/workspace-feature-proposal branch from 5b7f47d to 7ef34ca on December 4, 2025 05:04
- **No self-healing**: With the current approach to deploying `VirtualMachines`, the platform cannot attempt to resolve problems without outside intervention.

### **3. Proposed Solution**
The proposed solution is a new `VirtualMachine` CR and controller, along with expanded capabilities for the existing `VMDiskImage` controller.
Contributor

Are you saying that there would be two separate controllers, one for VirtualMachine related resources and one for VMDiskImage resources?

Collaborator Author
@hmbill694 hmbill694 Dec 4, 2025

We already have a VMDiskImage controller. We just expand its feature set to count references to VMDIs from VMs when a VMDI goes through the reconcile loop.

Contributor

I guess my question is more: are we expanding the VMDI controller to handle VM stuff, or making a separate controller for the new VM CR?

Collaborator Author

We will have two controllers if we implement that change: the one that already exists, which handles VMDIs, and a new one that will handle VMs. The VMDI controller will need some expansion to track references from VMs, but it won't need to watch them; it just looks them up whenever we send a VMDI through the loop.


The `VirtualMachine` CR will act as a thin wrapper around the team's existing VM solution. This gives OT its own interface for representing a virtual machine, decoupling us from direct references to the underlying resources that actually spin up virtual machines in the cluster. Paired with its controller, this CR also lets the platform participate in the creation lifecycle of those underlying resources. We can use this to ensure we always have the required backing resources for virtual machines, allowing the platform to self-heal.

To address the second pain point, resource pruning, the team can expand the `VMDiskImage` controller to also record the number of referencing `VirtualMachines` on each `VMDiskImage`. We can block deletion of a `VMDiskImage` while it is still referenced by VMs and delete it once no referencing VMs have been created within a given time period.
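
As a rough illustration (the API group, status fields, and referencing VM names below are placeholders, not a settled design), the reference tracking could surface on the `VMDiskImage` status like so:

```yaml
apiVersion: ot.example.com/v1alpha1   # hypothetical group/version
kind: VMDiskImage
metadata:
  name: demo-vmdi
  namespace: vmdi-farm
spec:
  sourceType: "s3"
  diskSize: "24Gi"
status:
  # Updated by the VMDiskImage controller each time this VMDI is reconciled,
  # by listing the VirtualMachines that reference it.
  referenceCount: 2
  referencedBy:
    - namespace: labs
      name: harrison-vm
    - namespace: labs
      name: demo-vm
  lastReferencedAt: "2025-12-04T05:04:00Z"
```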
Contributor

Can you elaborate on the deletion process for a VMDiskImage in this new route? Does every VM try to delete its respective VMDiskImage when it's done, and if so, are VMDiskImage resources going to be in a semi-constant state of Terminated?

Collaborator Author
@hmbill694 hmbill694 Dec 4, 2025

No, VMs won't directly try to clean up VMDIs. I imagine VMs and VMDIs will not be in the same namespace, so in this approach removing a VM won't immediately trigger removal of a VMDI. Instead we will expand the VMDI controller to track references from VMs on VMDIs. We can use these references to determine whether a VMDI can be removed, either automatically when its references go to zero or after a certain amount of time. This keeps management of the VMDI in its own controller, and the only thing a VM would need to do is create one if it doesn't exist.

Contributor

Got it, that's good to know. I wonder if instead of having them automatically deleted, we allow them to be manually deleted, but we use these references as the means for determining that it's safe to delete them. Using UKI as an example, they will want to get rid of an old version of a VMDiskImage, so they will try. If it's being referenced by running labs, it will stick around until those labs go away, and then once they all go away and no new ones are using it, it will be deleted. If we try to automatically delete them when there is no reference, then I could see scenarios where VMDIs are being recreated unnecessarily because there's some downtime between uses of it in a lab, which doesn't necessarily feel like it's the right move.

In short, I think that once a VMDI is created, it should stick around until explicitly deleted, and then it should be in the terminated state until all of its references are gone, and then it should be removed.
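
One hedged sketch of that behavior, assuming we lean on a finalizer to block removal while references remain (the finalizer name and status fields are invented for illustration):

```yaml
apiVersion: ot.example.com/v1alpha1      # hypothetical group/version
kind: VMDiskImage
metadata:
  name: demo-vmdi
  namespace: vmdi-farm
  finalizers:
    - ot.example.com/vmdi-in-use         # hypothetical finalizer
  # Set by the API server when the VMDI is explicitly deleted; the object
  # sticks around in a terminating state until the finalizer is removed.
  deletionTimestamp: "2025-12-04T05:04:00Z"
status:
  phase: Terminating
  referenceCount: 1   # controller clears the finalizer once this reaches zero
```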

Collaborator Author

Yeah, we can run it like that too if we want. The real point here is that reference counting resolves the issue of determining whether a resource is eligible for deletion. Once we can track references we can devise all sorts of schemes for cleaning up resources with no references: delete them right away, delete them after some time, or just mark them as eligible for deletion and clean them up manually.
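
Purely as an illustration of those schemes (every field name here is made up), the choice could even be a knob on the VMDI spec:

```yaml
spec:
  # Hypothetical cleanup policy:
  #   Retain                 - keep until explicitly deleted
  #   DeleteWhenUnreferenced - remove as soon as the reference count hits zero
  #   DeleteAfterTTL         - remove once unreferenced for ttlAfterUnreferenced
  reclaimPolicy: DeleteAfterTTL
  ttlAfterUnreferenced: "24h"
```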

Comment on lines +59 to +69
```yaml
vmDiskImageRef:
  name: demo-vmdi
  namespace: vmdi-farm
vmDiskImageTemplate:
  storageClass: "gp3"
  snapshotClass: "ebs-snapshot"
  secretRef: "foo-bar"
  name: "harrison-vm"
  url: "https://s3.us-gov-west-1.amazonaws.com/vm-images/images/harrison-vm/1.0.0/vm.qcow2"
  sourceType: "s3"
  diskSize: "24Gi"
```
Contributor

Can you elaborate on what we're looking at here? Is the idea that you're creating demo-vmdi with the contents of vmDiskImageTemplate, assuming it doesn't already exist? And what if it does?

Collaborator Author

The idea is that we are attempting to reference a VMDI named demo-vmdi. If it exists, we just use whatever it specifies. If it does not, we create one from the template below. In my mind an existing VMDI trumps the template, so when we reference something that already exists we ignore the template, though we could log an event on the VM resource to specify that there was a mismatch.

We could get explicit about it and add a mode field to the spec or something. The modes could be CreateIfNotExist and FailIfNotExist; that way you can tell how the VM will handle a VMDI just by looking at the spec.
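
A hedged sketch of how that might read on the spec, reusing the fields from the snippet above (where the mode field lives and its exact values are still up for discussion):

```yaml
apiVersion: ot.example.com/v1alpha1   # hypothetical group/version
kind: VirtualMachine
metadata:
  name: harrison-vm
spec:
  vmDiskImageRef:
    name: demo-vmdi
    namespace: vmdi-farm
    # CreateIfNotExist: build the VMDI from vmDiskImageTemplate if it's missing
    # FailIfNotExist:   report an error instead of creating one
    mode: CreateIfNotExist
  vmDiskImageTemplate:
    storageClass: "gp3"
    snapshotClass: "ebs-snapshot"
    secretRef: "foo-bar"
    url: "https://s3.us-gov-west-1.amazonaws.com/vm-images/images/harrison-vm/1.0.0/vm.qcow2"
    sourceType: "s3"
    diskSize: "24Gi"
```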
