Skip to content

Conversation

@labbott
Copy link
Contributor

@labbott labbott commented Dec 18, 2025

as will silently succed which is not what we want

@labbott labbott requested a review from andrewjstone December 18, 2025 17:01
Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks for the fix!

@labbott labbott enabled auto-merge (squash) December 18, 2025 17:56
@davepacheco
Copy link
Collaborator

Definitely like trading as for an explicit conversion, but should we not be handling the error? Why is it okay to unwrap() here?

@labbott labbott disabled auto-merge December 18, 2025 19:29
@labbott
Copy link
Contributor Author

labbott commented Dec 18, 2025

Definitely like trading as for an explicit conversion, but should we not be handling the error? Why is it okay to unwrap() here?

This came out of a comment from @andrewjstone #9524 (comment)

To actually hit this panic, we'd either have to have a zone manifest from a sled return a wildly large size from one part of the databse query or the other

// Pull zone manifest zones out of all sled agents.
let zone_manifest_zones: Vec<_> = collection
.sled_agents
.iter()
.filter_map(|sled_agent| {
sled_agent
.zone_image_resolver
.zone_manifest
.boot_inventory
.as_ref()
.ok()
.map(|artifacts| {
artifacts.artifacts.iter().map(|artifact| {
InvZoneManifestZone::new(
collection_id,
sled_agent.sled_id,
artifact,
)
})
})
})
.flatten()
.collect();
// Load zone_manifest_zone rows.
let mut zone_manifest_artifacts_by_sled_id = {
use nexus_db_schema::schema::inv_zone_manifest_zone::dsl;
let mut by_sled_id: BTreeMap<
SledUuid,
IdOrdMap<ZoneArtifactInventory>,
> = BTreeMap::new();
let mut paginator = Paginator::new(
batch_size,
dropshot::PaginationOrder::Ascending,
);
while let Some(p) = paginator.next() {
let batch = paginated_multicolumn(
dsl::inv_zone_manifest_zone,
(dsl::sled_id, dsl::zone_file_name),
&p.current_pagparams(),
)
.filter(dsl::inv_collection_id.eq(db_id))
.select(InvZoneManifestZone::as_select())
.load_async(&*conn)
.await
.map_err(|e| {
public_error_from_diesel(e, ErrorHandler::Server)
})?;
paginator = p.found_batch(&batch, &|row| {
(row.sled_id, row.zone_file_name.clone())
});
for row in batch {
by_sled_id
.entry(row.sled_id.into())
.or_default()
.insert_unique(row.into())
.expect("database ensures the row is unique");
}
}
by_sled_id
};

we could either say this is unlikely enough to happen or I could attempt to do something like

let error_values = collection
.errors
.iter()
.enumerate()
.map(|(i, message)| {
let index = u16::try_from(i).map_err(|e| {
Error::internal_error(&format!(
"failed to convert error index to u16 (too \
many errors in inventory collection?): {}",
e
))
})?;
Ok(InvCollectionError::new(
collection_id,
index,
message.clone(),
))
})
.collect::<Result<Vec<_>, Error>>()?;

and return an internal error?

@davepacheco
Copy link
Collaborator

That's what I would do (InternalError), but I realize that will require changing at least this function to be fallible and I don't know how big the blast radius is. But in general it doesn't seem great if reading bad data from the database can result in a panic. (Could uploading a TUF repo with bad metadata also trigger this?)

@labbott
Copy link
Contributor Author

labbott commented Dec 18, 2025

That's what I would do (InternalError), but I realize that will require changing at least this function to be fallible and I don't know how big the blast radius is. But in general it doesn't seem great if reading bad data from the database can result in a panic. (Could uploading a TUF repo with bad metadata also trigger this?)

I think you're correct that a bad manifest would potentially trigger this.

I think both of these functions are already fallible so I don't think that would be extra work. I am going to raise the question if we don't just want to unwrap if returning an internal error is better for our purposes than the current behavior we get with as.

@andrewjstone
Copy link
Contributor

That's what I would do (InternalError), but I realize that will require changing at least this function to be fallible and I don't know how big the blast radius is. But in general it doesn't seem great if reading bad data from the database can result in a panic. (Could uploading a TUF repo with bad metadata also trigger this?)

I think you're correct that a bad manifest would potentially trigger this.

I think both of these functions are already fallible so I don't think that would be extra work. I am going to raise the question if we don't just want to unwrap if returning an internal error is better for our purposes than the current behavior we get with as.

I think it's definitely better than using as. We could accidentally end up with negative sizes otherwise or extremely large ones depending upon conversion direction. We'd want to know if the values are incorrect.

I think in order of best to worst the preference should be:

return an error, panic, use as

And we should only panic if the data is untrusted, meaning it comes from code that generates it and not based on user input.

@andrewjstone
Copy link
Contributor

That's what I would do (InternalError), but I realize that will require changing at least this function to be fallible and I don't know how big the blast radius is. But in general it doesn't seem great if reading bad data from the database can result in a panic. (Could uploading a TUF repo with bad metadata also trigger this?)

I think you're correct that a bad manifest would potentially trigger this.
I think both of these functions are already fallible so I don't think that would be extra work. I am going to raise the question if we don't just want to unwrap if returning an internal error is better for our purposes than the current behavior we get with as.

I think it's definitely better than using as. We could accidentally end up with negative sizes otherwise or extremely large ones depending upon conversion direction. We'd want to know if the values are incorrect.

I think in order of best to worst the preference should be:

return an error, panic, use as

And we should only panic if the data is untrusted, meaning it comes from code that generates it and not based on user input.

Also, I had assumed, perhaps incorrectly that this was trusted data, which is why I figured panicking was fine. If not, I'd suggest returning an error if we can. We've definitely seen problems with sizes of things being incorrect and not discovering it for a while after it was silently causing issues. I think this was on disk sizes but I don't recall. For that reason, panicking is better than silently doing the wrong thing, but if we can not do the wrong thing and not crash, we are in a better place.

`as` will silently succed which is not what we want
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants