-
Notifications
You must be signed in to change notification settings - Fork 63
Add blueprint/sled agent details #9524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
e13289c to
044523e
Compare
andrewjstone
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good! 🚀
| )] | ||
| #[serde(tag = "type", rename_all = "snake_case")] | ||
| pub enum BlueprintMeasurementSetDesiredContents { | ||
| /// This measurement source is whatever happens to be on the sled's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this comment is copy-pasta and could use an update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I re-read this and it looks fine. I can't tell if I already fixed it and github is being weird or if this was in reference to another comment
| } | ||
|
|
||
| impl IdOrdItem for ReconciledSingleMeasurement { | ||
| type Key<'a> = String; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can change this to be type Key<'a> = &'a str and then return &self.file_name from key(&self) and you could get rid of the clone.
4d5b5be to
4acf358
Compare
davepacheco
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't made it very far yet but wanted to post some early questions.
| path on boot disk: /fake/path/install/zones.json | ||
| boot disk inventory: | ||
| manifest generated by installinator (mupdate ID: 00000000-0000-0000-0000-000000000000) | ||
| no artifacts in install dataset (this should only be seen in simulated systems) | ||
| no non-boot disks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it right that this duplicates the block above it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should at least change the path for testing but this is going to look very similar since they are being generated in very similar circumstances. The mupdate-id does get duplicated but I found it difficult to pull the mupdate ID up a level.
| b957d6cf-f7b2-4bee-9928-c5fde8c59e04 crucible install-dataset | ||
| e246f5e3-0650-4afc-860f-ee7114d309c5 crucible install-dataset | ||
| measurements: | ||
| install dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean? That the current set of measurements that are in-use are coming from the install dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, see the comment in https://rfd.shared.oxide.computer/rfd/0512#_measurements_in_reconfigurator about install dataset
| reference measurements: | ||
| (measurement set is empty) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's "reference measurements" here? Is this the set of measurements that Nexus has told Sled Agent are allowed right now?
I think more generally I'm confused about the difference between these three sections here ("measurements", "measurement manifest", and "reference measurements").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's "reference measurements" here? Is this the set of measurements that Nexus has told Sled Agent are allowed right now?
Correct
I think more generally I'm confused about the difference between these three sections here ("measurements", "measurement manifest", and "reference measurements").
"measurement manifest" is the set of measurements on the install dataset that get placed during a MUPdate.
the "measurements" above are either a list of hashes for reconfigurator based update or a directive to use the install dataset.
the "reference measurements" are the resolved paths we should actually pass to sprockets
| ALTER COLUMN measurement_manifest_boot_disk_path DROP default, | ||
| ALTER COLUMN measurement_manifest_source DROP default, | ||
| ALTER COLUMN measurement_manifest_mupdate_id DROP default, | ||
| ALTER COLUMN measurement_manifest_boot_disk_error DROP default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done in other cases when we want to add a new column without a default, but we need to supply a default temporarily when we add the column in order to migrate existing data:
https://github.com/oxidecomputer/omicron/tree/main/schema/crdb#211-adding-a-new-column-without-a-default-value
| -- similar to `usable_hardware_threads` and friends above. | ||
| cpu_family omicron.public.sled_cpu_family NOT NULL, | ||
|
|
||
| -- The path to the boot disk image file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the "boot disk image file"? (It doesn't seem to be a disk image.)
| -- The path to the boot disk image file. | ||
| measurement_manifest_boot_disk_path TEXT NOT NULL, | ||
| -- The source of the zone manifest on the boot disk: from installinator or | ||
| -- sled-agent (synthetic). NULL means there is an error reading the zone manifest. | ||
| measurement_manifest_source omicron.public.inv_zone_manifest_source, | ||
| -- The mupdate ID that created the zone manifest if this is from installinator. If | ||
| -- this is NULL, then either the zone manifest is synthetic or there was an | ||
| -- error reading the zone manifest. | ||
| measurement_manifest_mupdate_id UUID, | ||
| -- Message describing the status of the zone manifest on the boot disk. If | ||
| -- this is NULL, then the zone manifest was successfully read, and the | ||
| -- inv_zone_manifest_zone table has entries corresponding to the zone | ||
| -- manifest. | ||
| measurement_manifest_boot_disk_error TEXT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of the comments here seem to be copy/paste of the zone manifest comments. What's different about these fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are tracking the measurement manifest specifically. I missed updating the comments because it does look a lot like the zone manifest.
The measurement manifest exists on the boot disk and the non-boot disk and can have a separate error (e.g. the measurement file doesn't exist). As noted before, it also tracks the same mupdate override because trying to pull that to a higher level was a pretty bit mess.
| host_phase_2_desired_slot_b STRING(64), | ||
|
|
||
| -- measurement contents | ||
| measurements STRING(64)[], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this, exactly? How big is the array? What do the elements correspond to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll clarify this comment
davepacheco
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone through the blueprint-related types here. I have not yet gone through the inventory or the db-model/db-queries stuff but I feel like I'm getting a feel for the structure of things now!
| } | ||
| } | ||
| } | ||
| #[derive( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #[derive( | |
| /// Describes the set of software measurements that should be trusted by this sled (for trust quorum) | |
| #[derive( |
(or something -- this seems like an important type and could use a doc comment)
| )] | ||
| #[serde(rename_all = "snake_case")] | ||
| pub struct BlueprintMeasurementsDesiredContents { | ||
| pub measurements: BTreeSet<BlueprintSingleMeasurement>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal but maybe contributing to some confusion: it seems like an extra level of indirection here (sled.measurements.measurements is the set of measurements)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this is a hold over from a very early draft where we were explicitly modeling "old measurements" and "new measurements". That turned out to not work well/feel redundant. I went back and forth on collapsing it back in. I kind of like keeping the abstraction in case we need it later but I also realize it's probably going to be just as much work. Let me see if I can remove the extra layer.
| } | ||
| } | ||
|
|
||
| #[derive( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #[derive( | |
| /// Identifies a specific TUF repo artifact containing measurement data | |
| #[derive( |
?
| pub version: BlueprintArtifactVersion, | ||
| pub hash: ArtifactHash, | ||
| pub prune: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(sorry, rust-analyzer is failing me here)
I expected this to just be an ArtifactHash. Am I understanding correctly that version is just here for nicer human-readable output?
What's prune for? I can't find a use of it besides serialization and display. Maybe it's used in the bigger PR for decision making? If so, maybe add a comment here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes the version is mostly for debugging/extra checks.
prune definitely deserves a bigger comment here. This is to decide what measurements get included in the active set of measurements. https://rfd.shared.oxide.computer/rfd/0512#_reconfigurator_implementation_details has an example sequence and measurements that will be dropped next time will be marked as prune
(I saw there's been some discussion about word choosing for similar(?) behavior with other parts of the system. I'm open to name changes to match)
| write!(f, "{} prune: {}", self.hash, self.prune) | ||
| } | ||
| } | ||
| /// Where the measurement source is located |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// Where the measurement source is located | |
| /// For a particular sled, which measurements should be used for trusting other members of the cluster |
| } | ||
|
|
||
| impl BlueprintMeasurementsDesiredContents { | ||
| pub fn default_contents() -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm always worried about people interpreting "default" differently. In particular, "default" seems like something that should be safe if you don't know or care what you're doing, but this is not a safe value because it would cause a sled to trust no measurements at all. (Right?)
This seems to be mostly used in tests. How about calling this empty_for_tests()?
The other user is blueprint_read(). We could instead load the per-sled measurement data first and then construct a fully-formed BlueprintMeasurementsDesiredContents with what we read... I like this approach much better but I see that blueprint_read() is already not doing this with zones and datasets so maybe that's not worth it. In that case though maybe have this called none() so it at least doesn't suggest to people that it's a safe default?
Edit: after reading more code, I think I was wrong in interpreting the empty set of measurements as "trusting nothing" because when we convert this to an OmicronMeasurements we translate an empty set to InstallDataset. That's counter intuitive to me -- I'd strongly suggest making that more explicit with an enum with two variants ... which brings me back to feeling like: could we delete BlueprintMeasurementsDesiredContents as it currently exists and rename BlueprintMeasurementSetDesiredContents to BlueprintMeasurementsDesiredContents? Or even just BlueprintSledMeasurements? ("desired" is implied for everything in the blueprint)
| } | ||
| } | ||
|
|
||
| writeln!(indented, "reference measurements:")?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can be slightly more specific so that people who aren't steeped in all the measurement work would know which measurements these are. Would it be accurate to call this "computed set of measurements acceptable from other sleds"?
| host_phase_2: | ||
| BlueprintHostPhase2DesiredSlots::current_contents(), | ||
| remove_mupdate_override: None, | ||
| measurements: BlueprintMeasurementsDesiredContents::default_contents(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem right. This means "trust nothing", but after RSS, the sled should be trusting what's in the install dataset, right? That would also make it analogous to how this works for zone images. There, the per-sled value in the blueprint is more analogous to OmicronMeasurementSetDesiredContents, which lets you say InstallDataset. I think we might want to do that here (but I might be missing why OmicronMeasurementsDesiredContents exists).
Edit: I see now that an empty set here does mean InstallDataset and I've commented elsewhere that I think we should make that more explicit.
| // TODO this will come in a subsequent PR | ||
| measurements: | ||
| BlueprintMeasurementsDesiredContents::default_contents(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could better track these call sites with a separate constructor, like placeholder()? Instead of relying on these TODOs?
(I just suggested something similar to Karen on another PR)
| .finalize(), | ||
| host_phase_2: self.host_phase_2.finalize(), | ||
| // TODO this will come in a subsequent PR | ||
| measurements: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particular case, I feel like a safer default would be to use whatever was in the BlueprintSledConfig that we started with in creating this ActiveSledEditor. I feel like you should be able to grab that in new.
No description provided.