@andrewjstone
Contributor

@andrewjstone andrewjstone commented Dec 6, 2025

EDIT: I added a bunch of usage for the new SQL. It should be fairly complete.


I tested the migrations in the CRDB SQL shell. This structure matches the property-based tests, and I think it is all that is required for Trust Quorum. It may require small tweaks as I build out the implementation. I can either make those tweaks in a follow-up migration or keep this open and push it in with some Nexus changes. Either is fine with me.

I'm mainly opening this to get eyes on it since I rarely touch SQL.

@andrewjstone
Contributor Author

I'm going to continue iterating on this and adding in db-queries to use this SQL. But please take a look at the SQL.

@andrewjstone andrewjstone marked this pull request as draft December 8, 2025 16:17
Collaborator

@davepacheco davepacheco left a comment

The SQL changes look good to me!

@andrewjstone
Contributor Author

andrewjstone commented Dec 17, 2025

There are a few things remaining on this PR that I wanted to document here before I wrap up this work:

  • Merge main and incorporate latest versioning changes and TransactionError related changes.
  • Add ability to abort a configuration if it has not yet committed
  • Tests for adding new configurations on top of committed and aborted configurations
    • Test that we can't abort committed configurations
    • Test that we can't commit aborted configurations
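The abort and commit rules listed above can be sketched as a small state machine. This is only an illustration, not the PR's implementation: the `ConfigState` and `TransitionError` names, the `Preparing` state, and the choice to treat a repeated commit or abort as a no-op are all assumptions made for the example.

```rust
/// Hypothetical lifecycle states for a trust quorum configuration.
/// The names are illustrative and are not taken from the PR's schema.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfigState {
    /// The configuration exists but has not yet committed.
    Preparing,
    /// The configuration has committed and can no longer be aborted.
    Committed,
    /// The configuration was aborted and can no longer be committed.
    Aborted,
}

/// Reasons a requested transition is rejected.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransitionError {
    AlreadyCommitted,
    AlreadyAborted,
}

impl ConfigState {
    /// Commit the configuration. Committing twice is treated as a no-op
    /// (an assumption); committing an aborted configuration is an error.
    pub fn commit(self) -> Result<ConfigState, TransitionError> {
        match self {
            ConfigState::Preparing | ConfigState::Committed => {
                Ok(ConfigState::Committed)
            }
            ConfigState::Aborted => Err(TransitionError::AlreadyAborted),
        }
    }

    /// Abort the configuration. Aborting twice is treated as a no-op
    /// (an assumption); aborting a committed configuration is an error.
    pub fn abort(self) -> Result<ConfigState, TransitionError> {
        match self {
            ConfigState::Preparing | ConfigState::Aborted => {
                Ok(ConfigState::Aborted)
            }
            ConfigState::Committed => Err(TransitionError::AlreadyCommitted),
        }
    }
}
```

In a real datastore these checks would presumably be enforced inside the database transaction rather than in memory, so that concurrent commit and abort requests cannot both succeed.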

Furthermore, this PR contains DB migrations from LRTQ to trust-quorum. It's likely that we'll want to separate those migrations and related code into a separate PR that we only merge once the rest of the TQ and LRTQ upgrade code is ready. This way if we ship a release before TQ and LRTQ upgrade are ready we won't have stale data around in the DB that may not correspond to existing sleds we want to migrate.

@andrewjstone
Contributor Author

This PR has significantly changed since I opened it originally. The SQL has been simplified and a set of DB queries have been added to utilize the SQL.

It's expected that new configurations will be added and aborted via an end-user API, and then background tasks will read the configurations and drive them to completion. That will come in follow-up PRs.

I believe this is now ready for review.

@andrewjstone andrewjstone marked this pull request as ready for review December 18, 2025 01:54
@andrewjstone
Contributor Author

Of note to reviewers, this PR also collapses the duplicate BaseboardId types to use the one in sled-agent-types, which is permitted as of #9488. I also fixed a few imports using sled-agent-types-versions::latest where I saw them. We agreed to use sled-agent-types in those cases.

@andrewjstone andrewjstone changed the title Add CRDB schema changes required for trust quorum Add CRDB schema changes and queries required for trust quorum Dec 18, 2025
@andrewjstone
Contributor Author

After I started using this in a background task, I realized that it's missing a query. Namely, it's missing the ability to retrieve the latest configuration for all racks.
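As a rough illustration of what "the latest configuration for all racks" means, the following in-memory sketch keeps the highest-epoch row per rack. The `ConfigRow` type, its fields, and the integer `rack_id` are hypothetical stand-ins; the actual query would run against the database, and rack IDs there are presumably UUIDs.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified view of one configuration row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ConfigRow {
    /// Stand-in for the rack identifier (an assumption for this sketch).
    pub rack_id: u32,
    /// Epoch of the configuration; a larger epoch is a later configuration.
    pub epoch: u64,
}

/// Return, for each rack, the row with the highest epoch.
pub fn latest_per_rack(rows: &[ConfigRow]) -> HashMap<u32, ConfigRow> {
    let mut latest: HashMap<u32, ConfigRow> = HashMap::new();
    for row in rows {
        latest
            .entry(row.rack_id)
            // Replace the stored row only if this one has a later epoch.
            .and_modify(|current| {
                if row.epoch > current.epoch {
                    *current = row.clone();
                }
            })
            // First row seen for this rack.
            .or_insert_with(|| row.clone());
    }
    latest
}
```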

I also realized, more importantly for this PR, that not having an efficient mechanism in the database to know when Nexus is no longer required to operate on the trust quorum makes everything more expensive. For instance, in the common case, all the latest epochs would always have to be retrieved and the filtering of the state of each member as committed would have to be done inside the background task.

Therefore, once Nexus has seen acked commits for all members, it should change from the Committed state to some other terminal state, like Done. I thought about changing the existing Committed state to Committing and having the terminal state remain Committed, but I'm worried about the semantic implications. As soon as Nexus marks a configuration as committed, it cannot go backwards and be aborted. Committing makes this sound less final than it is.

I'm open to other alternatives.
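A minimal sketch of the proposed terminal state, under stated assumptions: `MemberState`, `RackConfigState`, `RackConfig`, and both method names are hypothetical, and only the idea of moving from Committed to Done once every member has acked its commit comes from the comment above. With such a state, a background task could skip finished configurations by checking the top-level state instead of inspecting every member.

```rust
/// Hypothetical per-member progress for one configuration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemberState {
    Unacked,
    Prepared,
    /// The member has acknowledged the commit.
    Committed,
}

/// Hypothetical top-level states, including the proposed terminal `Done`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RackConfigState {
    Preparing,
    Committed,
    Done,
    Aborted,
}

/// Hypothetical in-memory view of a configuration and its members.
pub struct RackConfig {
    pub epoch: u64,
    pub state: RackConfigState,
    pub members: Vec<MemberState>,
}

impl RackConfig {
    /// Move `Committed` to `Done` once every member has acked the commit.
    /// Returns true only if the state changed.
    pub fn mark_done_if_fully_acked(&mut self) -> bool {
        let all_acked = !self.members.is_empty()
            && self.members.iter().all(|m| *m == MemberState::Committed);
        if self.state == RackConfigState::Committed && all_acked {
            self.state = RackConfigState::Done;
            true
        } else {
            false
        }
    }

    /// True if a background task still has work to drive for this config.
    pub fn needs_work(&self) -> bool {
        matches!(
            self.state,
            RackConfigState::Preparing | RackConfigState::Committed
        )
    }
}
```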

@andrewjstone
Contributor Author

Discussed in chat, but I'm going to remove the LRTQ table altogether and make the upgrade from LRTQ an operator operation because it can fail.

@andrewjstone
Contributor Author

> Discussed in chat, but I'm going to remove the LRTQ table altogether and make the upgrade from LRTQ an operator operation because it can fail.

Relatedly, I'm going to add a state for this operation as well to trust_quorum_configuration_state. So stay tuned.

Contributor

@jgallagher jgallagher left a comment

I haven't been following the TQ work very closely, so all my comments are basically nits on naming / clarity. I'd feel more comfortable if someone with more familiarity approved too, but I'll give it my stamp from a purely "reviewing the Rust/SQL implementation" point of view.

4 participants