
feat: VM restart with database recovery #160

Open
ARR4N wants to merge 36 commits into main from arr4n/recovery

Conversation

Collaborator

@ARR4N ARR4N commented Feb 11, 2026

Adds support for shutdown and restart without state sync, recovering entirely from the local database.

Recommended review order

  1. Changes to blocks.Block.Mark{Executed,Synchronous}() and the new Block.RestoreExecutionArtefacts(); plus accompanying tests.
  2. Changes to sae.NewVM() and sae.SinceGenesis.Initialize().
  3. Introduction of sae/recovery.go to support (2), and associated test in recovery_test.go.
  4. All other changes, which are motivated by the above and generally self-contained.

Mempool rationale

The upstream legacypool implementation expects a synchronous blockchain, initially requesting the current block and then updating based on chain-head events, in both cases opening a state.StateDB at the latest types.Header.Root. In an asynchronous implementation this results in the mempool acting on settled, not executed, state. So far this has resulted in two undesirable properties:

  1. $\tau$ seconds of empty blocks. Until settled, included transactions remain in the mempool, unblocking sae.VM.WaitForEvent(), only to be filtered out by worstcase. This also suggests an underlying inefficiency in which every BuildBlock() first discards some prefix of already-included transactions.
  2. VMs recovered after shutdown may experience a false nonce gap (discovered by sae.TestRecoverFromDatabase() in this PR) that doesn't allow their BuildBlock() method to include any transaction from an EOA with included but not settled transactions.

The wrapper returned by txgossip.NewBlockChain() addresses this by always serving the latest executed state, regardless of which root is requested. The impossibility of re-orgs makes this safe and efficient (no mempool resets), and it addresses (2) entirely. It doesn't fully resolve (1), as some empty blocks and discarded prefixes can still occur, but it significantly curtails the issue.
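A minimal sketch of the wrapper's key idea, assuming upstream go-ethereum import paths and hypothetical field names (inner, lastExecuted); the actual txgossip implementation may differ:

package txgossip

import (
	"sync/atomic"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core"
	"github.com/ethereum/go-ethereum/core/state"
	"github.com/ethereum/go-ethereum/core/types"
)

// blockChain lets the legacypool see the latest executed state even though it
// only knows about settled heads. Field names are assumptions for this sketch.
type blockChain struct {
	inner        *core.BlockChain              // underlying chain
	lastExecuted *atomic.Pointer[types.Header] // advanced as each block executes
}

// StateAt deliberately ignores the requested root. With re-orgs impossible,
// the latest executed state is always a descendant of whichever root the
// mempool asked for, so serving it is safe and avoids mempool resets.
func (bc *blockChain) StateAt(common.Hash) (*state.StateDB, error) {
	return bc.inner.StateAt(bc.lastExecuted.Load().Root)
}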

Contributor

@StephenButtolph StephenButtolph left a comment


Not doing a full review since it isn't marked as r4r yet; just dumping my thoughts as I looked through things.

Contributor

@alarso16 alarso16 left a comment


I know it's not ready, but I had some questions that might be easier to address early (especially since I'll be out starting late tomorrow)

assert.Equal(t, b, lastExecuted.Load(), "Atomic pointer to last-executed block")
require.NoError(t, b.MarkExecuted(db, gasTime, wallTime, baseFee.ToBig(), receipts, stateRoot, lastExecuted), "MarkExecuted()")

fromDB := newBlock(t, b.EthBlock(), b.ParentBlock(), b.LastSettled())
Collaborator Author


These tests are identical to the old ones, just placed into a table-driven loop so they can be run on both the original (post-MarkExecuted) and the restored Blocks.
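Roughly this shape (a sketch, with setup elided; b and fromDB as in the quoted snippet, the test name is a placeholder):

func TestBlockMarking(t *testing.T) {
	// ... existing setup producing b (post-MarkExecuted) and fromDB (restored) ...
	for _, tt := range []struct {
		name  string
		block *Block
	}{
		{"post-MarkExecuted", b},
		{"restored from DB", fromDB},
	} {
		t.Run(tt.name, func(t *testing.T) {
			// The pre-existing assertions, unchanged, now run against tt.block.
		})
	}
}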

@ARR4N ARR4N marked this pull request as ready for review February 12, 2026 16:00
Contributor

@alarso16 alarso16 left a comment


OK, a fuller review.


func (b *Block) setAncestors(parent, lastSettled *Block) error {
// SetAncestors sets the block's ancestry while enforcing invariants.
func (b *Block) SetAncestors(parent, lastSettled *Block) error {
Contributor


The comment isn't very helpful (maybe just to satisfy the linter), but why would the parent be nil? Is it just the genesis block?

Collaborator Author


You would typically have both or neither be nil. For example, in VerifyBlock() the rebuilding is performed without known ancestry (i.e. both nil via a call from New()) and then the ancestors are copied in with this function.
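Illustratively, the flow being described might look like this (assumed shape, not the exact VerifyBlock() code):

// Rebuild without known ancestry: both parent and lastSettled are nil here.
blk, err := New(ethBlock)
if err != nil {
	return err
}
// Then copy the ancestors in, with SetAncestors() enforcing the invariants.
if err := blk.SetAncestors(parent, lastSettled); err != nil {
	return err
}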

Comment on lines +25 to +26
"bounds",
"interimExecutionTime",
Contributor


cmputils question, but you didn't allow these even though they're unexported - why do you list them explicitly?

Collaborator Author


For full context:

		cmp.AllowUnexported(Block{}, ancestry{}),
		cmpopts.IgnoreFields(
			Block{},
			"bounds",
			"interimExecutionTime",
		),

The first line tells it to compare the unexported fields of Block while the second option says "buuut, ignore these ones". The latter also supports ignoring exported fields. The two ignored fields are effectively just optional scratch space and aren't critical to normal operation.
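For anyone following along, these options would typically feed a diff along these lines (sketch; want and got are placeholders):

opts := cmp.Options{
	cmp.AllowUnexported(Block{}, ancestry{}),
	cmpopts.IgnoreFields(Block{}, "bounds", "interimExecutionTime"),
}
if diff := cmp.Diff(want, got, opts); diff != "" {
	t.Errorf("Block mismatch (-want +got):\n%s", diff)
}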

lastExecuted *atomic.Pointer[Block],
) error {
if it := b.interimExecutionTime.Load(); it != nil && byGas.Compare(it) < 0 {
// The final execution time is scaled to the new gas target but interim
Contributor


I found this comment confusing. It took me several minutes to understand:

  1. What the point of the interim execution time is
  2. Why the post-target scaling is monotonic
  3. The expected relation between these two variables.

I think I was mostly confused because it's not a "rounding error", but just actually different, right?

Collaborator Author


FWIW, this code isn't introduced by this PR, it's just moved. Have a look at the call site in saexec/execution.go to see how it's set and blocks/settlement.go (LastToSettleAt()) to see how it's used.

> I think I was mostly confused because it's not a "rounding error", but just actually different, right?

It is a rounding error.

We have the interim clock that ticks for each transaction and the execution clock that ticks for the sum of per-transaction gas. In total they have both ticked by the same amount, so they are initially equal.

But then the execution clock MUST be scaled to the new gas target to comply with ACP-176. This scaling might induce a rounding error because the fractional numerator isn't evenly divisible by the new denominator. If we didn't handle that in a monotonic fashion (achieved by rounding up), then LastToSettleAt() could return different blocks depending on whether the interim or execution clock was checked.
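As a toy illustration (names are mine, not the blocks package): re-expressing 7/3 with denominator 5 gives 35/3 ≈ 11.7 fifths; flooring to 11/5 = 2.2 would fall behind 7/3 ≈ 2.33, while the ceiling, 12/5 = 2.4, never does.

// scaleCeil re-expresses num/oldDen with denominator newDen, rounding the
// numerator up so the scaled clock never lags the unscaled one:
// scaleCeil(7, 3, 5) == 12 and 12/5 >= 7/3, whereas flooring gives 11/5 < 7/3.
func scaleCeil(num, oldDen, newDen uint64) uint64 {
	return (num*newDen + oldDen - 1) / oldDen
}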

// execution so no error is returned and execution MUST continue optimistically.
// Any such log in development will cause tests to fail.
func (b *Block) CheckBaseFeeBound(actual *uint256.Int) {
if b.bounds == nil {
Contributor


This function is only used during execution, so even though the bounds aren't instantiated when loading from disk, this doesn't seem necessary. Am I missing something?

Collaborator Author


The block replay at recovery requires execution of all blocks since the last one with an available state root. The iter.Seq2 returned in recovery.go will yield blocks that hit this bit of the code.
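Sketched out (function names assumed, not the actual recovery.go API):

// blocksToReplay stands in for the iter.Seq2[*Block, error] from recovery.go.
for blk, err := range blocksToReplay(db, lastRootNumber) {
	if err != nil {
		return err
	}
	// Replayed blocks take the normal execution path, so b.bounds is nil
	// and CheckBaseFeeBound() has to tolerate that.
	if err := execute(blk); err != nil {
		return err
	}
}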

return vm.close()
}

func (vm *VM) close() error {
Contributor


nit: I don't think you need this change.

Collaborator Author


Why not? The vm.close() method is used in NewVM() to tear down things already constructed if there's a later failure.
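i.e. the construct-then-clean-up pattern, roughly (a sketch; buildRemaining is a placeholder):

func NewVM( /* ... */ ) (*VM, error) {
	vm := &VM{}
	// ... construct and attach components to vm, one at a time ...
	if err := buildRemaining(vm); err != nil {
		_ = vm.close() // tear down everything constructed so far
		return nil, err
	}
	return vm, nil
}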

Contributor

@StephenButtolph StephenButtolph left a comment


I haven't looked at any of the tests yet, but the actual code makes sense to me.

Comment on lines +41 to +43
// This would require the node to crash at such a precise point in time
// that it's not worth a preemptive fix. If this ever occurs then just
// try the root [params.CommitTrieDBEvery] blocks earlier.
Contributor


I'm a bit confused, when can this case happen? We commit the state tree before we update the head block, so shouldn't we be guaranteed that the state is always available here?

I guess can we be more specific about what precise point in time a crash would have to occur? I'm hoping to determine whether or not it is actually a problem that needs a preemptive fix haha

Collaborator Author


Good point. I had only considered the point between state.StateDB.Commit() and triedb.Database.Commit() but forgot that that would then go back 4096 blocks.

This only leaves the Firewood scenario that @alarso16 described.
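For reference, the fallback the quoted comment suggests would be roughly (hypothetical helpers haveState and rootAt):

root := head.Root()
if !haveState(root) {
	// Retry at the most recent block whose trie was committed to disk.
	root = rootAt(db, head.NumberU64()-params.CommitTrieDBEvery)
}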

Comment on lines +97 to +99
if err := canonicaliseLastSynchronous(db, lastSynchronous); err != nil {
return nil, err
}
Contributor


Could you explain the rationale for why we MarkSynchronous in SinceGenesis.Initialize and then canonicalize in NewVM?

To me, it feels like it would flow more naturally for NewVM to take in a lastSynchronous *types.Block, and then inside NewVM manage marking the block as synchronous (if needed) and making sure the state is correct.

Collaborator Author


If we did that then NewVM() would also have to take the starting gas excess. It's absolutely doable, but I'm not sure how much is gained because there's only a single "degree of freedom" in the call to MarkSynchronous() so it's not like NewVM() would be ensuring any invariants.
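For reference, the current split is roughly (assumed signatures):

// In SinceGenesis.Initialize(): the single degree of freedom is the gas excess.
if err := blk.MarkSynchronous(db, startingGasExcess); err != nil {
	return err
}

// Later, in NewVM(): nothing left to parameterise, just canonicalisation.
if err := canonicaliseLastSynchronous(db, lastSynchronous); err != nil {
	return nil, err
}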

