@plaidfinch
Contributor

Adds version 13 (ADD_TRUST_QUORUM) to the Sled Agent API with the following endpoints for trust quorum reconfiguration:

  • POST /trust-quorum/reconfigure - Initiate a reconfiguration
  • POST /trust-quorum/upgrade-from-lrtq - Upgrade from low-rent (legacy) trust quorum
  • POST /trust-quorum/commit - Commit a trust quorum configuration
  • GET /trust-quorum/coordinator-status - Get coordinator status
  • POST /trust-quorum/prepare-and-commit - Prepare and commit a configuration

Types are organized per RFD 619 (via feeding Claude the RFD):

  • API types defined in sled-agent-types-versions/src/add_trust_quorum/
  • Re-exported via latest.rs and sled-agent-types/src/trust_quorum.rs
  • API trait uses latest:: paths for all trust quorum types

Also exports EncryptedRackSecrets, Salt, and Sha3_256Digest from trust-quorum-protocol for use in the prepare_and_commit handler.

Co-authored by Claude Code

@plaidfinch plaidfinch force-pushed the sled-agent-trust-quorum-api branch from cfdce9d to 79e918f Compare December 21, 2025 00:51
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@plaidfinch plaidfinch force-pushed the sled-agent-trust-quorum-api branch from 79e918f to 5ccf78a Compare December 21, 2025 00:52
@plaidfinch
Contributor Author

This is still missing the proxy methods; these will be added in a separate commit. Planning to do a self-review of the PR to identify areas of uncertainty and cross-check for any LLM-introduced confabulations during the RFD 619 refactor.

TrustQuorumCommitResponse::Committed
}
trust_quorum::CommitStatus::Pending => {
TrustQuorumCommitResponse::Pending
Contributor Author

@andrewjstone You mentioned that this response is always a fatal error during reconfiguration of the TQ. Does that mean we should return an error response here, or should the error be generated higher up, in Nexus?

TrustQuorumCommitResponse::Committed
}
trust_quorum::CommitStatus::Pending => {
TrustQuorumCommitResponse::Pending
Contributor Author

Similarly here: should we ever return Pending from this level of the API?
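
One possible answer to the two questions above, sketched under stated assumptions: the handler could refuse to surface Pending as a successful response and turn it into an error instead. The types below are stand-ins written for this sketch (the real `trust_quorum::CommitStatus` and `TrustQuorumCommitResponse` may have other variants), and `CommitError` is hypothetical. Whether this mapping belongs in the sled agent or in Nexus remains the open question.

```rust
/// Stand-in for `trust_quorum::CommitStatus` (assumption: the real enum
/// may carry more variants or data than shown here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CommitStatus {
    Committed,
    Pending,
}

/// Hypothetical response type that has no `Pending` variant, so a
/// successful response always means "committed".
#[derive(Debug, PartialEq, Eq)]
pub enum CommitResponse {
    Committed,
}

/// Hypothetical error returned when the commit has not completed.
#[derive(Debug, PartialEq, Eq)]
pub enum CommitError {
    StillPending,
}

/// Map the protocol-level status onto the API-level result, treating
/// `Pending` as an error rather than as a success payload.
pub fn commit_response(
    status: CommitStatus,
) -> Result<CommitResponse, CommitError> {
    match status {
        CommitStatus::Committed => Ok(CommitResponse::Committed),
        CommitStatus::Pending => Err(CommitError::StillPending),
    }
}
```

With this shape, a caller cannot mistake a pending commit for a completed one, at the cost of deciding at this layer that Pending is always an error.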

) -> Result<HttpResponseOk<TrustQuorumCommitResponse>, HttpError> {
let sa = request_context.context();
let request = body.into_inner();

Contributor Author

This is a bunch of messy code to parse the hex-encoded parameters representing the binary salt and data for the encrypted rack secret. I don't like it and want to do this better, but there's a tension: the messy parsing here avoids tight coupling to the underlying TQ types, which was originally a desideratum. I will revisit this shortly; it should be checked carefully for correctness and cleanliness before merging.
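
For reference, a minimal std-only sketch of the kind of parsing described above, pulled out of the handler into one helper. The names `decode_hex_fixed`, `hex_value`, and `parse_salt` are invented for this sketch and are not from the PR; the actual handler may rely on a hex crate and on the real `Salt` type instead.

```rust
/// Decode a hex string into exactly `N` bytes, rejecting wrong lengths
/// and non-hex characters. (Hypothetical helper, not from the PR.)
pub fn decode_hex_fixed<const N: usize>(s: &str) -> Result<[u8; N], String> {
    let bytes = s.as_bytes();
    if bytes.len() != 2 * N {
        return Err(format!(
            "expected {} hex characters, got {}",
            2 * N,
            bytes.len()
        ));
    }
    let mut out = [0u8; N];
    for (i, pair) in bytes.chunks_exact(2).enumerate() {
        // Each output byte is built from two hex digits: high nibble first.
        out[i] = (hex_value(pair[0])? << 4) | hex_value(pair[1])?;
    }
    Ok(out)
}

/// Convert one ASCII hex digit (either case) to its 4-bit value.
fn hex_value(c: u8) -> Result<u8, String> {
    match c {
        b'0'..=b'9' => Ok(c - b'0'),
        b'a'..=b'f' => Ok(c - b'a' + 10),
        b'A'..=b'F' => Ok(c - b'A' + 10),
        _ => Err(format!("invalid hex character: {:?}", c as char)),
    }
}

/// Parse the hex-encoded 32-byte salt field into raw bytes.
pub fn parse_salt(s: &str) -> Result<[u8; 32], String> {
    decode_hex_fixed::<32>(s)
}
```

Keeping the length check and digit validation in one place means each handler only has to convert the resulting error into an `HttpError`.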

) -> Result<HttpResponseUpdatedNoContent, HttpError> {
Ok(HttpResponseUpdatedNoContent())
}

Contributor Author

Do we want to implement these methods in the simulator or not?

pub coordinator: BaseboardId,
/// All members of the configuration and the hex-encoded SHA3-256 hash of
/// their key shares.
pub members: BTreeMap<BaseboardId, String>,
Contributor Author

The use of hex-encoded Strings here is not my favorite. As above, I'm not sure what to do about it; let's discuss.
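
One direction for that discussion, as a sketch only: an API-level newtype that owns the hex encoding, so the map holds validated digests rather than raw Strings and the API crate still does not depend on the protocol crate's types. `ShareDigest` and `parse_members` are hypothetical names, `String` stands in for `BaseboardId`, and the `Serialize`, `Deserialize`, and `JsonSchema` impls a real API type would need are omitted to keep this std-only.

```rust
use std::collections::BTreeMap;
use std::fmt;
use std::str::FromStr;

/// Hypothetical API-level wrapper for a 32-byte SHA3-256 digest.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct ShareDigest(pub [u8; 32]);

impl FromStr for ShareDigest {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Validate up front: exactly 64 ASCII hex digits, so the slicing
        // below is safe and sign prefixes such as '+' are rejected.
        if s.len() != 64 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
            return Err(format!("expected 64 hex characters, got {:?}", s));
        }
        let mut out = [0u8; 32];
        for (i, byte) in out.iter_mut().enumerate() {
            *byte = u8::from_str_radix(&s[2 * i..2 * i + 2], 16)
                .map_err(|e| format!("invalid hex at offset {}: {}", 2 * i, e))?;
        }
        Ok(ShareDigest(out))
    }
}

impl fmt::Display for ShareDigest {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Lowercase hex, two digits per byte.
        for byte in &self.0 {
            write!(f, "{:02x}", byte)?;
        }
        Ok(())
    }
}

/// Convert a map of hex strings into validated digests, failing on the
/// first malformed entry. `String` keys stand in for `BaseboardId`.
pub fn parse_members(
    raw: &BTreeMap<String, String>,
) -> Result<BTreeMap<String, ShareDigest>, String> {
    let mut out = BTreeMap::new();
    for (id, hex) in raw {
        out.insert(id.clone(), hex.parse::<ShareDigest>()?);
    }
    Ok(out)
}
```

A type like this would move validation to deserialization time, so handlers would never see a malformed digest; the trade-off is one more type to version alongside the API.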

#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize, JsonSchema)]
pub struct TrustQuorumEncryptedRackSecrets {
/// Hex-encoded 32-byte salt used to derive the encryption key.
pub salt: String,
Contributor Author

Another hex-encoded string manually parsed in the API handler: gross. How can we do this better?

/// Hex-encoded 32-byte salt used to derive the encryption key.
pub salt: String,
/// Hex-encoded encrypted data.
pub data: String,
Contributor Author

A final hex-encoded string manually parsed in the API handler.

Comment on lines +45 to +46
EncryptedRackSecrets, RackSecret, ReconstructedRackSecret, Salt,
Sha3_256Digest,
Contributor Author

Is it okay to export Salt and Sha3_256Digest here? Or is there a different or better way to handle the need for these in the API?

@plaidfinch
Contributor Author

plaidfinch commented Dec 21, 2025

The build is failing in CI because the newly added SledAgentApi methods have no implementation in nexus/mgs-updates/src/test_util/host_phase_2_test_state.rs. This does not show up with cargo build in omicron on my local workstation. Is this file feature-gated? How should these methods be filled in for HostPhase2SledAgentImpl? The LRTQ bootstore-related methods all appear to be unimplemented!(); should I just do the same here? Tentatively, I have done this in 40eecf0.
