Overall looks good, couple of comments:
On my laptop it took 0.320835 seconds to iterate over 119054 events. On ~dozreg-toplud (far from the busiest ship on the network) there are epochs with ~20M events. This means that it would take around a minute to just read the event log in order to boot. Surely the last offset should just be stored in the header, and
`_book_scan_end` will also attempt to truncate all events after a corrupted event was encountered. Is this desirable?
This PR replaces LMDB with book, a custom append-only file-based event log persistence layer tailored to Urbit's sequential access patterns.
Motivation
Unlimited event size
LMDB's general-purpose key-value store features (random access, transactions) are unnecessary overhead for Urbit's strictly append-only event log. With LMDB, reducing the log's size on disk is impossible (due to its B+tree structure), and the maximum value size (event size, in our case) is limited to 4 GB or less. This new API provides a simpler, more focused solution.
Faster writes
Additionally, write speeds with `book` will exceed LMDB's, thus removing a potential bottleneck (should we approach it after integrating SKA with the core operating function).
Implementation
Double-Buffered Headers (LMDB-style Durability)
Book uses a double-buffered header strategy inspired by LMDB to achieve single-fsync durability:
- … `seq_d`)
- … `seq_d`
This provides crash consistency without requiring two fsyncs per commit. If a crash occurs mid-write, the old header remains valid (the new one will have an invalid CRC), and any uncommitted deed data is overwritten/truncated on next startup.
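The header-selection step described above can be sketched roughly as follows. This is an illustration only, not the PR's code: the struct layout, the `end_d` field, the choice of CRC-32, and every function name here are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header; the real book.log header layout may differ. */
typedef struct {
  uint64_t seq_d;  /* sequence number, incremented on each commit */
  uint64_t end_d;  /* assumed field: offset just past the last committed deed */
  uint32_t crc_w;  /* CRC-32 over seq_d and end_d */
} head_t;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t crc32_buf(const uint8_t* buf, size_t len)
{
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; i++) {
    crc ^= buf[i];
    for (int k = 0; k < 8; k++) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

/* Checksum the serialized fields so struct padding never enters the CRC. */
static uint32_t head_crc(const head_t* h)
{
  uint8_t raw[16];
  memcpy(raw, &h->seq_d, 8);
  memcpy(raw + 8, &h->end_d, 8);
  return crc32_buf(raw, sizeof raw);
}

/* Stamp a header with its checksum before it is written out. */
static void head_seal(head_t* h)
{
  assert(h != NULL);
  h->crc_w = head_crc(h);
}

/* A header is valid only if its stored CRC matches its contents. */
static int head_ok(const head_t* h)
{
  return h->crc_w == head_crc(h);
}

/* Return the index (0 or 1) of the valid header with the highest seq_d,
   or -1 if neither slot passes its CRC check. */
static int head_pick(const head_t slots[2])
{
  int best = -1;
  for (int i = 0; i < 2; i++) {
    if (head_ok(&slots[i]) &&
        (best < 0 || slots[i].seq_d > slots[best].seq_d)) {
      best = i;
    }
  }
  return best;
}
```

Under these assumptions, a commit would append the deed data, write a sealed header with `seq_d + 1` into the slot that `head_pick` did not return, and then fsync once. A header torn by a crash fails its CRC check, so the next startup falls back to the other slot.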
File Format
Events are stored in `book.log` with the following layout:
Events on-disk are written as `deed`s with a minimal framing format:
The trailing `let_d` field enables efficient backward scanning during crash recovery: we can read the last 8 bytes to determine the previous deed's size without a forward scan.
`reed`s are used to represent `deed`s in memory:
The `u3_book` structure is used for operations like reading, writing, etc.:
Batch Writes with Scatter-Gather I/O
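The scatter-gather batching covered in this section could look roughly like the sketch below. It is not the PR's implementation: `evt_t`, `write_batch`, host byte order for the length fields, and treating a short write as an error are all assumptions made for the example.

```c
#define _DEFAULT_SOURCE   /* exposes pwritev() on glibc */
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef IOV_MAX
#define IOV_MAX 1024      /* fallback if the platform does not define it */
#endif

/* Hypothetical in-memory event: payload length plus a pointer to its bytes. */
typedef struct {
  uint64_t len_d;
  void*    buf_v;
} evt_t;

/* Write `num` events at offset `off` of `fd`, three iovecs per deed
   (leading length, payload, trailing length), chunked to stay under
   IOV_MAX. Returns the offset just past the last byte written, or -1
   on error or short write (a simplification made for this sketch). */
static off_t write_batch(int fd, off_t off, evt_t* evt, size_t num)
{
  enum { PER_DEED = 3, MAX_DEEDS = IOV_MAX / PER_DEED };
  struct iovec iov[MAX_DEEDS * PER_DEED];
  size_t done = 0;

  assert(evt != NULL || num == 0);
  while (done < num) {
    size_t cnt  = num - done;
    size_t want = 0;
    if (cnt > MAX_DEEDS) cnt = MAX_DEEDS;

    for (size_t i = 0; i < cnt; i++) {
      evt_t* e = &evt[done + i];
      /* Both length fields point at the same value: nothing is copied. */
      iov[3 * i + 0] = (struct iovec){ &e->len_d, sizeof e->len_d };
      iov[3 * i + 1] = (struct iovec){ e->buf_v, (size_t)e->len_d };
      iov[3 * i + 2] = (struct iovec){ &e->len_d, sizeof e->len_d };
      want += 2 * sizeof e->len_d + (size_t)e->len_d;
    }

    ssize_t got = pwritev(fd, iov, (int)(cnt * PER_DEED), off);
    if (got < 0 || (size_t)got != want) return -1;
    off  += got;
    done += cnt;
  }
  return off;
}
```

With this framing, the 8 bytes immediately before the returned offset hold the size of the last deed written, which is the property the trailing length field relies on for backward scanning.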
Batch writes use `pwritev()` with iovecs to write multiple deeds in a single syscall, avoiding both per-deed syscall overhead and buffer copying. Each deed requires 3 iovecs (len_d, buffer, let_d), chunked to respect IOV_MAX limits.
Features:
- … `meta.bin` file)
- … `u3_book_walk_*`)
- … `pwritev`)
- … `libuv` (maintains existing async patterns)
- … `play -f`) replay support
- … `pread` and `pwrite` (no cursor position tracking)
- … `u3_lmdb_*` functions)
Testing
Tests focus on failure modes, edge cases, recovery, and benchmarks. This PR adds write benchmarks for LMDB as well (executable via `zig build lmdb-test`).
Run:
`zig build book-test`
Compatibility
This PR changes how events are stored in future epochs, but it continues to use LMDB to store global pier metadata in the top-level log directory (`$pier/.urb/log/data.mdb`). This ensures that helpful error messages can be printed even when users attempt to boot their book-style piers with old binaries. The top-level metadata should be considered canonical, though metadata stored within epochs (`meta.bin`, as of this PR) is kept consistent with it.
Performance
Compared with LMDB, Book's performance is slightly favorable in the single-event case and marginally favorable with larger event batches. Disk use is equivalent.
To-do