
feat: VM restart with database recovery #160

Open
ARR4N wants to merge 36 commits into main from arr4n/recovery

Conversation

Collaborator

@ARR4N ARR4N commented Feb 11, 2026

Adds support for shutdown and restart without state sync, recovering entirely from the local database.

Recommended review order

  1. Changes to blocks.Block.Mark{Executed,Synchronous}() and the new Block.RestoreExecutionArtefacts(); plus accompanying tests.
  2. Changes to sae.NewVM() and sae.SinceGenesis.Initialize().
  3. Introduction of sae/recovery.go to support (2), and associated test in recovery_test.go.
  4. All other changes, which are motivated by the above and generally self-contained.

Mempool rationale

The upstream legacypool implementation expects a synchronous blockchain, initially requesting the current block and then updating based on chain-head events, in both cases opening a state.StateDB at the latest types.Header.Root. In an asynchronous implementation this results in the mempool acting on settled, not executed, state. So far this has resulted in two undesirable properties:

  1. $\tau$ seconds of empty blocks. Until settled, included transactions remain in the mempool, unblocking sae.VM.WaitForEvent(), only to be filtered out by worstcase. This also suggests an underlying inefficiency in which every BuildBlock() first discards some prefix of already-included transactions.
  2. VMs recovered after shutdown may experience a false nonce gap (discovered by sae.TestRecoverFromDatabase() in this PR) that doesn't allow their BuildBlock() method to include any transaction from an EOA with included but not settled transactions.

The wrapper returned by txgossip.NewBlockChain() addresses this by always serving the latest executed state, regardless of which root is requested. The impossibility of re-orgs makes this safe and efficient (no mempool resets), and it addresses (2) entirely. It doesn't fully resolve (1), as some empty blocks and discarded prefixes can still occur, but it significantly curtails the issue.
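A minimal sketch of the wrapper's key idea, assuming upstream go-ethereum import paths and hypothetical field names (inner, lastExecuted); the actual txgossip implementation may differ:

package txgossip

import (
	"sync/atomic"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core"
	"github.com/ethereum/go-ethereum/core/state"
	"github.com/ethereum/go-ethereum/core/types"
)

// blockChain lets the legacypool see the latest executed state even though it
// only knows about settled heads. Field names are assumptions for this sketch.
type blockChain struct {
	inner        *core.BlockChain              // underlying chain
	lastExecuted *atomic.Pointer[types.Header] // advanced as each block executes
}

// StateAt deliberately ignores the requested root. With re-orgs impossible,
// the latest executed state is always a descendant of whichever root the
// mempool asked for, so serving it is safe and avoids mempool resets.
func (bc *blockChain) StateAt(common.Hash) (*state.StateDB, error) {
	return bc.inner.StateAt(bc.lastExecuted.Load().Root)
}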

Contributor

@StephenButtolph StephenButtolph left a comment


Not doing a full review since it isn't marked as r4r yet; just dumping my thoughts as I looked through things.

Contributor

@alarso16 alarso16 left a comment


I know it's not ready, but I had some questions that might be easier to address early (especially since I'll be out starting late tomorrow)

assert.Equal(t, b, lastExecuted.Load(), "Atomic pointer to last-executed block")
require.NoError(t, b.MarkExecuted(db, gasTime, wallTime, baseFee.ToBig(), receipts, stateRoot, lastExecuted), "MarkExecuted()")

fromDB := newBlock(t, b.EthBlock(), b.ParentBlock(), b.LastSettled())
Collaborator Author


These tests are identical to the old ones, just placed into a table-driven loop so they can be run on both the original (post-MarkExecuted) and the restored Blocks.
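Roughly this shape (a sketch, with setup elided; b and fromDB as in the quoted snippet, the test name is a placeholder):

func TestBlockMarking(t *testing.T) {
	// ... existing setup producing b (post-MarkExecuted) and fromDB (restored) ...
	for _, tt := range []struct {
		name  string
		block *Block
	}{
		{"post-MarkExecuted", b},
		{"restored from DB", fromDB},
	} {
		t.Run(tt.name, func(t *testing.T) {
			// The pre-existing assertions, unchanged, now run against tt.block.
		})
	}
}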

@ARR4N ARR4N marked this pull request as ready for review February 12, 2026 16:00
Contributor

@alarso16 alarso16 left a comment


OK, a fuller review.


func (b *Block) setAncestors(parent, lastSettled *Block) error {
// SetAncestors sets the block's ancestry while enforcing invariants.
func (b *Block) SetAncestors(parent, lastSettled *Block) error {
Contributor


The comment isn't very helpful (maybe just to satisfy the linter), but why would the parent be nil? Is it just the genesis block?

Collaborator Author


You would typically have both or neither be nil. For example, in VerifyBlock() the rebuilding is performed without known ancestry (i.e. both nil via a call from New()) and then the ancestors are copied in with this function.
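Illustratively, the flow being described might look like this (assumed shape, not the exact VerifyBlock() code):

// Rebuild without known ancestry: both parent and lastSettled are nil here.
blk, err := New(ethBlock)
if err != nil {
	return err
}
// Then copy the ancestors in, with SetAncestors() enforcing the invariants.
if err := blk.SetAncestors(parent, lastSettled); err != nil {
	return err
}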

Comment on lines +25 to +26
"bounds",
"interimExecutionTime",
Contributor


cmputils question, but you didn't allow these even though they're unexported - why do you list them explicitly?

Collaborator Author


For full context:

		cmp.AllowUnexported(Block{}, ancestry{}),
		cmpopts.IgnoreFields(
			Block{},
			"bounds",
			"interimExecutionTime",
		),

The first line tells it to compare the unexported fields of Block while the second option says "buuut, ignore these ones". The latter also supports ignoring exported fields. The two ignored fields are effectively just optional scratch space and aren't critical to normal operation.
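For anyone following along, these options would typically feed a diff along these lines (sketch; want and got are placeholders):

opts := cmp.Options{
	cmp.AllowUnexported(Block{}, ancestry{}),
	cmpopts.IgnoreFields(Block{}, "bounds", "interimExecutionTime"),
}
if diff := cmp.Diff(want, got, opts); diff != "" {
	t.Errorf("Block mismatch (-want +got):\n%s", diff)
}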

lastExecuted *atomic.Pointer[Block],
) error {
if it := b.interimExecutionTime.Load(); it != nil && byGas.Compare(it) < 0 {
// The final execution time is scaled to the new gas target but interim
Contributor


I found this comment confusing. It took me several minutes to understand:

  1. What the point of the interim execution time is
  2. Why the post-target scaling is monotonic
  3. The expected relation between these two variables.

I think I was mostly confused because it's not a "rounding error", but just actually different, right?

Collaborator Author


FWIW, this code isn't introduced by this PR, it's just moved. Have a look at the call site in saexec/execution.go to see how it's set and blocks/settlement.go (LastToSettleAt()) to see how it's used.

> I think I was mostly confused because it's not a "rounding error", but just actually different, right?

It is a rounding error.

We have the interim clock that ticks for each transaction and the execution clock that ticks for the sum of per-transaction gas. In total they have both ticked by the same amount, so they are initially equal.

But then the execution clock MUST be scaled to the new gas target to comply with ACP-176. This scaling might induce a rounding error because the fractional numerator isn't evenly divisible by the new denominator. If we didn't handle that in a monotonic fashion (achieved by rounding up), then LastToSettleAt() could return different blocks depending on whether the interim or execution clock was checked.
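As a toy illustration (names are mine, not the blocks package): re-expressing 7/3 with denominator 5 gives 35/3 ≈ 11.7 fifths; flooring to 11/5 = 2.2 would fall behind 7/3 ≈ 2.33, while the ceiling, 12/5 = 2.4, never does.

// scaleCeil re-expresses num/oldDen with denominator newDen, rounding the
// numerator up so the scaled clock never lags the unscaled one:
// scaleCeil(7, 3, 5) == 12 and 12/5 >= 7/3, whereas flooring gives 11/5 < 7/3.
func scaleCeil(num, oldDen, newDen uint64) uint64 {
	return (num*newDen + oldDen - 1) / oldDen
}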

// execution so no error is returned and execution MUST continue optimistically.
// Any such log in development will cause tests to fail.
func (b *Block) CheckBaseFeeBound(actual *uint256.Int) {
if b.bounds == nil {
Contributor


This function is only used during execution, so even though the bounds aren't instantiated when loading from disk, this doesn't seem necessary. Am I missing something?

Collaborator Author


The block replay at recovery requires execution of all blocks since the last one with an available state root. The iter.Seq2 returned in recovery.go will yield blocks that hit this bit of the code.
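Sketched out (function names assumed, not the actual recovery.go API):

// blocksToReplay stands in for the iter.Seq2[*Block, error] from recovery.go.
for blk, err := range blocksToReplay(db, lastRootNumber) {
	if err != nil {
		return err
	}
	// Replayed blocks take the normal execution path, so b.bounds is nil
	// and CheckBaseFeeBound() has to tolerate that.
	if err := execute(blk); err != nil {
		return err
	}
}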

return vm.close()
}

func (vm *VM) close() error {
Contributor


nit: I don't think you need this change.

Collaborator Author


Why not? The vm.close() method is used in NewVM() to tear down things already constructed if there's a later failure.
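i.e. the construct-then-clean-up pattern, roughly (a sketch; buildRemaining is a placeholder):

func NewVM( /* ... */ ) (*VM, error) {
	vm := &VM{}
	// ... construct and attach components to vm, one at a time ...
	if err := buildRemaining(vm); err != nil {
		_ = vm.close() // tear down everything constructed so far
		return nil, err
	}
	return vm, nil
}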

Contributor

@StephenButtolph StephenButtolph left a comment


I haven't looked at any of the tests yet, but the actual code makes sense to me.

Comment on lines +41 to +43
// This would require the node to crash at such a precise point in time
// that it's not worth a preemptive fix. If this ever occurs then just
// try the root [params.CommitTrieDBEvery] blocks earlier.
Contributor


I'm a bit confused, when can this case happen? We commit the state tree before we update the head block, so shouldn't we be guaranteed that the state is always available here?

I guess can we be more specific about what precise point in time a crash would have to occur? I'm hoping to determine whether or not it is actually a problem that needs a preemptive fix haha

Collaborator Author


Good point. I had only considered the point between state.StateDB.Commit() and triedb.Database.Commit() but forgot that that would then go back 4096 blocks.

This only leaves the Firewood scenario that @alarso16 described.
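For reference, the fallback the quoted comment suggests would be roughly (hypothetical helpers haveState and rootAt):

root := head.Root()
if !haveState(root) {
	// Retry at the most recent block whose trie was committed to disk.
	root = rootAt(db, head.NumberU64()-params.CommitTrieDBEvery)
}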

Comment on lines +97 to +99
if err := canonicaliseLastSynchronous(db, lastSynchronous); err != nil {
return nil, err
}
Contributor


Could you explain the rationale for why we MarkSynchronous in SinceGenesis.Initialize and then canonicalize in NewVM?

To me, it feels like it would flow more naturally for NewVM to take in a lastSynchronous *types.Block, and then inside NewVM manage marking the block as synchronous (if needed) and making sure the state is correct.

Collaborator Author


If we did that then NewVM() would also have to take the starting gas excess. It's absolutely doable, but I'm not sure how much is gained because there's only a single "degree of freedom" in the call to MarkSynchronous() so it's not like NewVM() would be ensuring any invariants.
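For reference, the current split is roughly (assumed signatures):

// In SinceGenesis.Initialize(): the single degree of freedom is the gas excess.
if err := blk.MarkSynchronous(db, startingGasExcess); err != nil {
	return err
}

// Later, in NewVM(): nothing left to parameterise, just canonicalisation.
if err := canonicaliseLastSynchronous(db, lastSynchronous); err != nil {
	return nil, err
}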

