disk: replaces lmdb with book #936

Open

matthew-levan wants to merge 42 commits into develop from ml/book

Conversation

@matthew-levan
Contributor

@matthew-levan matthew-levan commented Jan 2, 2026

This PR replaces LMDB with book, a custom append-only file-based event log persistence layer tailored to Urbit's sequential access patterns.

Motivation

Unlimited event size

LMDB's general-purpose key-value store features (random access, transactions) are unnecessary overhead for Urbit's strictly append-only event log. With LMDB, reducing log size on disk is impossible (due to B+tree) and maximum value size (event size, in our case) is limited to 4GB or less. This new API provides a simpler, more focused solution.

Faster writes

Additionally, write speeds with book will exceed LMDB's, thus removing a potential bottleneck (should we approach it after integrating SKA with the core operating function).

Implementation

Double-Buffered Headers (LMDB-style Durability)

Book uses a double-buffered header strategy inspired by LMDB to achieve single-fsync durability:

  • Two header slots exist at page-aligned offsets (0 and 4096)
  • Each header contains a monotonically increasing sequence number (seq_d)
  • On commit: write deed data, then write updated header to the inactive slot, then fsync once
  • On startup: read both headers, use the one with the higher valid seq_d

This provides crash consistency without requiring two fsyncs per commit. If a crash occurs mid-write, the old header remains valid (the new one will have an invalid CRC), and any uncommitted deed data is overwritten/truncated on next startup.
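
As an illustration, the startup-time slot selection described above might look like the following sketch. The struct mirrors u3_book_head (with standard fixed-width types for self-containedness), and `crc` is a stand-in for whatever checksum routine the log actually uses; the real code in this PR may differ in details.

```c
#include <stdint.h>
#include <stddef.h>

/* sketch: fields mirror u3_book_head */
typedef struct {
  uint32_t mag_w;  /* magic: 0x424f4f4b ("BOOK") */
  uint32_t ver_w;  /* format version: 1          */
  uint64_t fir_d;  /* first event number         */
  uint64_t las_d;  /* last event number          */
  uint64_t seq_d;  /* double-buffer sequence     */
  uint32_t crc_w;  /* checksum of prior fields   */
} head_t;

/* a header is valid if magic, version, and checksum all match */
static int _head_ok(const head_t* hed_u,
                    uint32_t (*crc)(const void*, size_t))
{
  return hed_u->mag_w == 0x424f4f4b
      && hed_u->ver_w == 1
      && hed_u->crc_w == crc(hed_u, offsetof(head_t, crc_w));
}

/* returns 0 or 1 for the slot to trust, or -1 if neither is valid */
static int _pick_slot(const head_t* a_u, const head_t* b_u,
                      uint32_t (*crc)(const void*, size_t))
{
  int a_ok = _head_ok(a_u, crc);
  int b_ok = _head_ok(b_u, crc);

  if ( a_ok && b_ok ) return (b_u->seq_d > a_u->seq_d) ? 1 : 0;
  if ( a_ok )         return 0;
  if ( b_ok )         return 1;
  return -1;
}
```

A torn write to the inactive slot corrupts only its CRC, so `_pick_slot` falls back to the other slot, which still describes the last committed state.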

File Format

Events are stored in book.log with the following layout:

Offset 0:      Header Slot A (32 bytes, padded to 4096)
Offset 4096:   Header Slot B (32 bytes, padded to 4096)
Offset 8192:   Deeds start here
/* u3_book_head: on-disk file header (32 bytes, page-aligned slots)
**
**   two header slots at offsets 0 and 4096; deeds start at 8192.
*/
typedef struct _u3_book_head {
  c3_w mag_w;      //  magic number: 0x424f4f4b ("BOOK")
  c3_w ver_w;      //  format version: 1
  c3_d fir_d;      //  first event number in file
  c3_d las_d;      //  last event number (commit marker)
  c3_d seq_d;      //  sequence number (for double-buffer)
  c3_w crc_w;      //  CRC32 checksum (of preceding fields)
} u3_book_head;

Events on-disk are written as deeds with a minimal framing format:

/* u3_book_deed: on-disk event record
**
**   on-disk format: len_d | buffer_data | let_d
**   where buffer_data is len_d bytes of opaque buffer data
**   and let_d echoes len_d for validation (used for backward scanning)
*/
typedef struct _u3_book_deed {
  c3_d len_d;    //  buffer size (bytes)
  // c3_y buf_y[];  //  variable-length buffer data
  c3_d let_d;    //  length trailer (echoes len_d)
} u3_book_deed;

The trailing let_d field enables efficient backward scanning during crash recovery—we can read the last 8 bytes to determine the previous deed's size without a forward scan.
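
The backward step that let_d enables can be sketched as follows. Here `read8` is a hypothetical stand-in for a `pread()` of 8 bytes at a given offset, and the frame arithmetic follows the `len_d | buffer | let_d` layout above; it is not the PR's actual recovery code.

```c
#include <stdint.h>

#define FRAME_W 8  /* bytes occupied by each of len_d and let_d */

/* given the offset just past a deed, jump to that deed's start
** by reading only its trailing 8-byte length echo
*/
static uint64_t _deed_start(uint64_t end_d,
                            uint64_t (*read8)(uint64_t off_d))
{
  uint64_t let_d = read8(end_d - FRAME_W);   /* trailing length  */
  return end_d - FRAME_W - let_d - FRAME_W;  /* skip buf + len_d */
}
```

Repeating this step walks the log from its tail toward its head in one small read per deed, which is what makes backward scans cheap.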

Reeds are used to represent deeds in memory:

/* u3_book_reed: in-memory event record representation for I/O
*/
typedef struct _u3_book_reed {
  c3_d  len_d;    //  total buffer size (bytes)
  c3_y* buf_y;    //  complete buffer (caller owns)
} u3_book_reed;

The u3_book structure is used for operations like reading, writing, etc.:

/* u3_book: event log handle
*/
typedef struct _u3_book {
  c3_i         fid_i;      //  file descriptor for book.log
  c3_i         met_i;      //  file descriptor for meta.bin
  c3_c*        pax_c;      //  file path to book.log
  u3_book_head hed_u;      //  cached header (current valid state)
  c3_d         las_d;      //  cached last event number
  c3_d         off_d;      //  cached append offset (end of last event)
  c3_w         act_w;      //  active header slot (0 or 1)
} u3_book;

Batch Writes with Scatter-Gather I/O

Batch writes use pwritev() with iovecs to write multiple deeds in a single syscall, avoiding both per-deed syscall overhead and buffer copying. Each deed requires 3 iovecs (len_d, buffer, let_d), chunked to respect IOV_MAX limits.
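
A minimal sketch of the iovec construction, assuming the reed layout above: because let_d simply echoes len_d, the trailing iovec can point at the same field as the leading one. IOV_MAX chunking and short-write retries are omitted for brevity; the real implementation handles both.

```c
#include <sys/uio.h>
#include <stdint.h>

/* sketch: mirrors u3_book_reed */
typedef struct {
  uint64_t len_d;  /* buffer size (bytes) */
  uint8_t* buf_y;  /* buffer data         */
} reed_t;

/* write num_i deeds at offset off with one vectored syscall;
** returns bytes written, or -1 on error
*/
static ssize_t _write_batch(int fid_i, reed_t* red_u, int num_i, off_t off)
{
  struct iovec vec_u[3 * num_i];  /* VLA, fine for a sketch */

  for ( int i = 0; i < num_i; i++ ) {
    vec_u[3*i + 0].iov_base = &red_u[i].len_d;  /* leading length */
    vec_u[3*i + 0].iov_len  = sizeof(uint64_t);
    vec_u[3*i + 1].iov_base = red_u[i].buf_y;   /* opaque payload */
    vec_u[3*i + 1].iov_len  = red_u[i].len_d;
    vec_u[3*i + 2].iov_base = &red_u[i].len_d;  /* let_d echoes len_d */
    vec_u[3*i + 2].iov_len  = sizeof(uint64_t);
  }
  return pwritev(fid_i, vec_u, 3 * num_i, off);
}
```

Gathering the frames this way avoids copying each deed into a contiguous staging buffer before writing.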

Features:

  • LMDB-style double-buffered headers for single-fsync durability
  • Automatic crash recovery via backward/forward scanning and truncation
  • Embedded key-value metadata storage (separate meta.bin file)
  • Iterator API for sequential reads (u3_book_walk_*)
  • Batch writes with scatter-gather I/O (pwritev)
  • Thread-safe when used with libuv (maintains existing async patterns)
  • ACID (at the event batch level)
  • Functional partial and full (via play -f) replay support
  • Stateless operations via pread and pwrite (no cursor position tracking)
  • Drop-in replacement for LMDB (API mirrors u3_lmdb_* functions)

Testing

Tests focus on failure modes, edge cases, recovery, and benchmarks. This PR also adds write benchmarks for LMDB (executable via zig build lmdb-test).

Run: zig build book-test

Compatibility

This PR changes how events are stored in future epochs, but it continues to use LMDB to store global pier metadata in the top-level log directory ($pier/.urb/log/data.mdb). This ensures that helpful error messages can be printed even when users attempt to boot book-style piers with old binaries. Note that the top-level metadata should be considered canonical; the metadata stored within epochs (meta.bin, as of this PR) is kept consistent with it.

Performance

Book's performance is slightly favorable in the single-event case, and marginally favorable with larger event batches. Disk use is equivalent.

Metric           book single   lmdb single   book batched   lmdb batched
Events written   1000          1000          100000         100000
Event size       128 bytes     128 bytes     1280 bytes     1280 bytes
Total data       0.12 MB       0.12 MB      122.07 MB      122.07 MB
Total time       4.020 s       4.045 s        0.625 s        0.662 s
Write speed      249 ev/s      247 ev/s    160083 ev/s    151125 ev/s
Throughput       0.03 MB/s     0.03 MB/s    195.41 MB/s    184.48 MB/s
Latency          4020.2 μs     4045.2 μs       6.2 μs         6.6 μs

To-do

  • Migrations from vere-v3,4.x piers
  • Failure mode tests

@dozreg-toplud
Contributor

Overall looks good, couple of comments:

@dozreg-toplud
Contributor

_book_scan_end iterates over every event in the file, validating them and the event count in the header, and it is called on every event log load (including on every boot) to locate the append offset.

On my laptop it took 0.320835 seconds to iterate over 119054 events. On ~dozreg-toplud (far from the busiest ship on the network) there are epochs with ~20M events. This means that it would take around a minute to just read the event log in order to boot.

Surely the last offset should just be stored in the header, and _book_scan_end should be reserved for corruption recovery. With that we could also iterate from end to the start of the iterator range in u3_book_walk_init whenever it would make sense: the deeds already have sizes in their tails.

@dozreg-toplud
Contributor

_book_scan_end will also attempt to truncate all events after a corrupted event was encountered. Is this desirable?

@matthew-levan matthew-levan marked this pull request as ready for review January 23, 2026 01:37
@matthew-levan matthew-levan requested a review from a team as a code owner January 23, 2026 01:37
@matthew-levan matthew-levan force-pushed the ml/book branch 2 times, most recently from 85c7a88 to 53751f5 Compare January 26, 2026 19:25