Skip to content

Conversation

@labbott
Copy link
Contributor

@labbott labbott commented Dec 16, 2025

No description provided.

@labbott labbott marked this pull request as draft December 16, 2025 14:14
@labbott labbott force-pushed the measurement_inventory_blueprints branch from e13289c to 044523e Compare December 16, 2025 20:46
Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good! 🚀

)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum BlueprintMeasurementSetDesiredContents {
/// This measurement source is whatever happens to be on the sled's
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this comment is copy-pasta and could use an update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-read this and it looks fine. I can't tell if I already fixed it and github is being weird or if this was in reference to another comment

}

impl IdOrdItem for ReconciledSingleMeasurement {
type Key<'a> = String;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can change this to be type Key<'a> = &'a str and then return &self.file_name from key(&self) and you could get rid of the clone.

@labbott labbott force-pushed the measurement_inventory_blueprints branch 3 times, most recently from 4d5b5be to 4acf358 Compare December 17, 2025 18:47
@labbott labbott marked this pull request as ready for review December 18, 2025 15:44
Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't made it very far yet but wanted to post some early questions.

Comment on lines +1620 to +1624
path on boot disk: /fake/path/install/zones.json
boot disk inventory:
manifest generated by installinator (mupdate ID: 00000000-0000-0000-0000-000000000000)
no artifacts in install dataset (this should only be seen in simulated systems)
no non-boot disks
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it right that this duplicates the block above it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should at least change the path for testing but this is going to look very similar since they are being generated in very similar circumstances. The mupdate-id does get duplicated but I found it difficult to pull the mupdate ID up a level.

b957d6cf-f7b2-4bee-9928-c5fde8c59e04 crucible install-dataset
e246f5e3-0650-4afc-860f-ee7114d309c5 crucible install-dataset
measurements:
install dataset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean? That the current set of measurements that are in-use are coming from the install dataset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +1644 to +1645
reference measurements:
(measurement set is empty)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's "reference measurements" here? Is this the set of measurements that Nexus has told Sled Agent are allowed right now?

I think more generally I'm confused about the difference between these three sections here ("measurements", "measurement manifest", and "reference measurements").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's "reference measurements" here? Is this the set of measurements that Nexus has told Sled Agent are allowed right now?

Correct

I think more generally I'm confused about the difference between these three sections here ("measurements", "measurement manifest", and "reference measurements").

"measurement manifest" is the set of measurements on the install dataset that get placed during a MUPdate.

the "measurements" above are either a list of hashes for reconfigurator based update or a directive to use the install dataset.

the "reference measurements" are the resolved paths we should actually pass to sprockets

ALTER COLUMN measurement_manifest_boot_disk_path DROP default,
ALTER COLUMN measurement_manifest_source DROP default,
ALTER COLUMN measurement_manifest_mupdate_id DROP default,
ALTER COLUMN measurement_manifest_boot_disk_error DROP default;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done in other cases when we want to add a new column without a default, but we need to supply a default temporarily when we add the column in order to migrate existing data:
https://github.com/oxidecomputer/omicron/tree/main/schema/crdb#211-adding-a-new-column-without-a-default-value

-- similar to `usable_hardware_threads` and friends above.
cpu_family omicron.public.sled_cpu_family NOT NULL,

-- The path to the boot disk image file.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the "boot disk image file"? (It doesn't seem to be a disk image.)

Comment on lines +3947 to +3960
-- The path to the boot disk image file.
measurement_manifest_boot_disk_path TEXT NOT NULL,
-- The source of the zone manifest on the boot disk: from installinator or
-- sled-agent (synthetic). NULL means there is an error reading the zone manifest.
measurement_manifest_source omicron.public.inv_zone_manifest_source,
-- The mupdate ID that created the zone manifest if this is from installinator. If
-- this is NULL, then either the zone manifest is synthetic or there was an
-- error reading the zone manifest.
measurement_manifest_mupdate_id UUID,
-- Message describing the status of the zone manifest on the boot disk. If
-- this is NULL, then the zone manifest was successfully read, and the
-- inv_zone_manifest_zone table has entries corresponding to the zone
-- manifest.
measurement_manifest_boot_disk_error TEXT,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of the comments here seem to be copy/paste of the zone manifest comments. What's different about these fields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are tracking the measurement manifest specifically. I missed updating the comments because it does look a lot like the zone manifest.

The measurement manifest exists on the boot disk and the non-boot disk and can have a separate error (e.g. the measurement file doesn't exist). As noted before, it also tracks the same mupdate override because trying to pull that to a higher level was a pretty bit mess.

host_phase_2_desired_slot_b STRING(64),

-- measurement contents
measurements STRING(64)[],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this, exactly? How big is the array? What do the elements correspond to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll clarify this comment

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone through the blueprint-related types here. I have not yet gone through the inventory or the db-model/db-queries stuff but I feel like I'm getting a feel for the structure of things now!

}
}
}
#[derive(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[derive(
/// Describes the set of software measurements that should be trusted by this sled (for trust quorum)
#[derive(

(or something -- this seems like an important type and could use a doc comment)

)]
#[serde(rename_all = "snake_case")]
pub struct BlueprintMeasurementsDesiredContents {
pub measurements: BTreeSet<BlueprintSingleMeasurement>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal but maybe contributing to some confusion: it seems like an extra level of indirection here (sled.measurements.measurements is the set of measurements)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this is a hold over from a very early draft where we were explicitly modeling "old measurements" and "new measurements". That turned out to not work well/feel redundant. I went back and forth on collapsing it back in. I kind of like keeping the abstraction in case we need it later but I also realize it's probably going to be just as much work. Let me see if I can remove the extra layer.

}
}

#[derive(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[derive(
/// Identifies a specific TUF repo artifact containing measurement data
#[derive(

?

Comment on lines +1587 to +1589
pub version: BlueprintArtifactVersion,
pub hash: ArtifactHash,
pub prune: bool,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(sorry, rust-analyzer is failing me here)

I expected this to just be an ArtifactHash. Am I understanding correctly that version is just here for nicer human-readable output?

What's prune for? I can't find a use of it besides serialization and display. Maybe it's used in the bigger PR for decision making? If so, maybe add a comment here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes the version is mostly for debugging/extra checks.

prune definitely deserves a bigger comment here. This is to decide what measurements get included in the active set of measurements. https://rfd.shared.oxide.computer/rfd/0512#_reconfigurator_implementation_details has an example sequence and measurements that will be dropped next time will be marked as prune

(I saw there's been some discussion about word choosing for similar(?) behavior with other parts of the system. I'm open to name changes to match)

write!(f, "{} prune: {}", self.hash, self.prune)
}
}
/// Where the measurement source is located
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Where the measurement source is located
/// For a particular sled, which measurements should be used for trusting other members of the cluster

}

impl BlueprintMeasurementsDesiredContents {
pub fn default_contents() -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm always worried about people interpreting "default" differently. In particular, "default" seems like something that should be safe if you don't know or care what you're doing, but this is not a safe value because it would cause a sled to trust no measurements at all. (Right?)

This seems to be mostly used in tests. How about calling this empty_for_tests()?

The other user is blueprint_read(). We could instead load the per-sled measurement data first and then construct a fully-formed BlueprintMeasurementsDesiredContents with what we read... I like this approach much better but I see that blueprint_read() is already not doing this with zones and datasets so maybe that's not worth it. In that case though maybe have this called none() so it at least doesn't suggest to people that it's a safe default?

Edit: after reading more code, I think I was wrong in interpreting the empty set of measurements as "trusting nothing" because when we convert this to an OmicronMeasurements we translate an empty set to InstallDataset. That's counter intuitive to me -- I'd strongly suggest making that more explicit with an enum with two variants ... which brings me back to feeling like: could we delete BlueprintMeasurementsDesiredContents as it currently exists and rename BlueprintMeasurementSetDesiredContents to BlueprintMeasurementsDesiredContents? Or even just BlueprintSledMeasurements? ("desired" is implied for everything in the blueprint)

}
}

writeln!(indented, "reference measurements:")?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can be slightly more specific so that people who aren't steeped in all the measurement work would know which measurements these are. Would it be accurate to call this "computed set of measurements acceptable from other sleds"?

host_phase_2:
BlueprintHostPhase2DesiredSlots::current_contents(),
remove_mupdate_override: None,
measurements: BlueprintMeasurementsDesiredContents::default_contents(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem right. This means "trust nothing", but after RSS, the sled should be trusting what's in the install dataset, right? That would also make it analogous to how this works for zone images. There, the per-sled value in the blueprint is more analogous to OmicronMeasurementSetDesiredContents, which lets you say InstallDataset. I think we might want to do that here (but I might be missing why OmicronMeasurementsDesiredContents exists).

Edit: I see now that an empty set here does mean InstallDataset and I've commented elsewhere that I think we should make that more explicit.

Comment on lines +591 to +593
// TODO this will come in a subsequent PR
measurements:
BlueprintMeasurementsDesiredContents::default_contents(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could better track these call sites with a separate constructor, like placeholder()? Instead of relying on these TODOs?

(I just suggested something similar to Karen on another PR)

.finalize(),
host_phase_2: self.host_phase_2.finalize(),
// TODO this will come in a subsequent PR
measurements:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case, I feel like a safer default would be to use whatever was in the BlueprintSledConfig that we started with in creating this ActiveSledEditor. I feel like you should be able to grab that in new.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants