Performance: 2x faster page splitting and extraction #30

Merged
Mythie merged 10 commits into main from perf/splitting-optimizations
Feb 18, 2026

@Mythie Mythie commented Feb 18, 2026

Page splitting and extraction were slower than they needed to be. Profiling the 2000-page split workload revealed a few easy wins that compound nicely.

What changed

Sync ObjectCopier: ObjectCopier was fully async despite doing zero I/O. Every recursive call in the deep-copy graph walk went through the microtask queue without ever waiting on anything. Made all methods synchronous. (~15% on split workloads)
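As a rough illustration of the difference, here is a minimal sketch of the same recursive deep copy written both ways. The `PdfValue` type and both function names are hypothetical stand-ins, not the library's actual ObjectCopier API.

```typescript
// Hypothetical value model for illustration only; not the library's real types.
type PdfValue = number | string | PdfValue[] | { [key: string]: PdfValue };

// Async variant: each recursive call allocates a promise and resumes through
// the microtask queue, even though nothing here is actually asynchronous.
async function copyAsync(value: PdfValue): Promise<PdfValue> {
  if (Array.isArray(value)) {
    const out: PdfValue[] = [];
    for (let i = 0; i < value.length; i++) {
      out.push(await copyAsync(value[i] as PdfValue));
    }
    return out;
  }
  if (typeof value === "object") {
    const out: { [key: string]: PdfValue } = {};
    for (const key of Object.keys(value)) {
      out[key] = await copyAsync(value[key] as PdfValue);
    }
    return out;
  }
  return value; // numbers and strings are immutable, so they are returned as-is
}

// Sync variant: the same traversal on the plain call stack.
function copySync(value: PdfValue): PdfValue {
  if (Array.isArray(value)) {
    return value.map((item) => copySync(item));
  }
  if (typeof value === "object") {
    const out: { [key: string]: PdfValue } = {};
    for (const key of Object.keys(value)) {
      out[key] = copySync(value[key] as PdfValue);
    }
    return out;
  }
  return value;
}
```

Both variants produce the same copy; only the scheduling cost per node differs.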

npm lru-cache: Our internal LRU cache did Map.delete() + Map.set() on every get() to maintain recency. Replaced it with the lru-cache package, which uses a doubly linked list internally. The biggest impact is on large PDF loading, where PdfRef.of() and PdfName.of() are called thousands of times. (~17% on split, ~60% on single-page extraction from large PDFs)
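For context, the replaced pattern looks roughly like the sketch below. `MapLru` is a hypothetical name and this is not the removed internal class; it only shows the delete-then-set step that every cache hit had to pay for.

```typescript
// Hypothetical sketch of a Map-backed LRU; names are illustrative only.
class MapLru<K, V> {
  private readonly map = new Map<K, V>();
  private readonly max: number;

  constructor(max: number) {
    this.max = max;
  }

  get(key: K): V | undefined {
    const value = this.map.get(key);
    if (value === undefined) return undefined;
    // Re-insert so the key moves to the end of Map's insertion order.
    // This delete + set pair runs on every single hit.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.max) {
      // The first key in iteration order is the least recently used one.
      const oldest = this.map.keys().next().value as K;
      this.map.delete(oldest);
    }
  }

  has(key: K): boolean {
    return this.map.has(key);
  }
}
```

A linked-list-based cache updates recency by relinking a node instead of removing and re-adding a Map entry, which is where the saving on hot lookup paths comes from.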

Cached PdfName serialization: PdfName.toBytes() was calling new TextEncoder(), encoding to bytes, and iterating the result on every single write. Since names are interned, we can cache the serialized bytes on the instance. Added an ASCII fast-path that skips the encoder entirely for the 99% of PDF names that are plain ASCII. Also extracted a shared HEX_TABLE lookup used by both bytesToHex and escapeName. (~23% on split)
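The three ideas (interning, per-instance byte caching with an ASCII fast path, and a precomputed hex table) can be sketched together as follows. `NameSketch` and its helpers are hypothetical and only approximate what PdfName and buffer.ts do; the escaping rule assumed here is the usual PDF one, where `#`, delimiters, and bytes outside 0x21-0x7E are written as `#XX`.

```typescript
// Hypothetical sketch; the real PdfName class and buffer.ts helpers may differ.

// Precomputed two-digit hex string for every byte value, built once.
const HEX_TABLE: string[] = [];
for (let i = 0; i < 256; i++) {
  HEX_TABLE.push((i < 16 ? "0" : "") + i.toString(16).toUpperCase());
}

// Characters that must be escaped inside a PDF name (assumed rule set).
const DELIMITERS = "#()<>[]{}/%";

function needsEscape(b: number): boolean {
  return b < 0x21 || b > 0x7e || DELIMITERS.indexOf(String.fromCharCode(b)) !== -1;
}

function isAscii(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    if (s.charCodeAt(i) > 0x7f) return false;
  }
  return true;
}

// For ASCII text the char codes are already the bytes.
function asciiBytes(s: string): Uint8Array {
  const out = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) out[i] = s.charCodeAt(i);
  return out;
}

class NameSketch {
  private static readonly pool = new Map<string, NameSketch>();
  private cached: Uint8Array | undefined;
  readonly value: string;

  private constructor(value: string) {
    this.value = value;
  }

  // Interning: one instance per distinct name, so the cache is shared.
  static of(value: string): NameSketch {
    let name = NameSketch.pool.get(value);
    if (name === undefined) {
      name = new NameSketch(value);
      NameSketch.pool.set(value, name);
    }
    return name;
  }

  toBytes(): Uint8Array {
    if (this.cached !== undefined) return this.cached; // serialized once
    // ASCII fast path: skip TextEncoder when every char is a single byte.
    const raw = isAscii(this.value) ? asciiBytes(this.value) : new TextEncoder().encode(this.value);
    let text = "/";
    for (let i = 0; i < raw.length; i++) {
      const b = raw[i] as number;
      text += needsEscape(b) ? "#" + HEX_TABLE[b] : String.fromCharCode(b);
    }
    this.cached = asciiBytes(text);
    return this.cached;
  }
}
```

Callers must treat the returned array as read-only, since every write of the same name now shares one buffer.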

Skip deflate for tiny streams: pako's Deflate constructor zeros a 64KB hash table on every call (~0.023ms). When splitting 2000 pages, each output PDF has a few tiny unfiltered content streams (2-74 bytes). That adds up to 6000+ deflate initializations for streams that never compress meaningfully. Added a configurable compressionThreshold (default 512 bytes) to skip compression below that size. Also exposed compressStreams and compressionThreshold on SaveOptions. (~30% on split)
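The size check itself is small. A minimal sketch, assuming a hypothetical `shouldCompress` helper and a hypothetical `SaveOptionsSketch` shape; only the two option names and the 512-byte default come from this PR, and the `compressStreams` default of `true` is an assumption.

```typescript
// Hypothetical shapes for illustration; only the option names come from this PR.
interface SaveOptionsSketch {
  compressStreams?: boolean; // assumed to default to true
  compressionThreshold?: number; // in bytes; the PR's default is 512
}

const DEFAULT_COMPRESSION_THRESHOLD = 512;

// Decide whether a stream is worth handing to deflate at all.
function shouldCompress(
  data: Uint8Array,
  alreadyFiltered: boolean,
  options: SaveOptionsSketch = {},
): boolean {
  const enabled = options.compressStreams ?? true;
  const threshold = options.compressionThreshold ?? DEFAULT_COMPRESSION_THRESHOLD;
  // Streams that already carry a filter are left untouched.
  if (!enabled || alreadyFiltered) return false;
  // Below the threshold, deflate setup costs more than compression can save.
  return data.length >= threshold;
}
```

Using the figures above, 6000 skipped initializations at roughly 0.023ms each come to about 138ms of constructor overhead alone on the 2000-page split.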

Numbers

| Benchmark | Before | After | Speedup |
| --- | --- | --- | --- |
| split 2000-page PDF | 582ms | 245ms | 2.38x |
| split 100-page PDF | 31.6ms | 12.7ms | 2.48x |
| 1 page from 2000-page PDF | 40.9ms | 25.1ms | 1.63x |
| 100 pages from 2000-page PDF | 50.9ms | 26.8ms | 1.90x |

Also included

  • Benchmark suite for splitting/copying workloads (benchmarks/splitting.bench.ts)
  • CI workflow that runs benchmarks on PRs touching .ts files and posts a comparison comment
  • scripts/bench-compare.ts for the comparison logic

Add benchmarks for page splitting, copying, and merging (#26).
Synthetic 100-page and 2000-page PDFs are generated from sample.pdf
and cached to disk for reuse.

New benchmark suites:
- splitting.bench.ts: single-page extraction, full split, batch extract
- copying.bench.ts: cross-doc copy, duplication, merging
- comparison.bench.ts: head-to-head vs pdf-lib for all of the above

Report generation:
- scripts/bench-report.ts transforms vitest JSON output to markdown
- reports/benchmarks.md committed to repo, updated by CI
- .github/workflows/bench.yml runs weekly + on push to main

ObjectCopier does zero I/O: every method was async but never awaited
anything asynchronous. Removing async/await eliminates microtask
scheduling overhead on every recursive call in the deep-copy graph walk.

Benchmarks show ~15% improvement on full-split workloads:
- 100-page split: 31.6ms → 27.3ms (1.16x)
- 2000-page split: 582.5ms → 506.6ms (1.15x)

The internal LRU cache did Map.delete()+Map.set() on every get() to
maintain recency ordering. The npm lru-cache package uses a
doubly-linked-list for O(1) operations without Map rehashing.

Benchmarks show significant gains especially on large PDF parsing:
- 2000-page split: 506.6ms → 432.3ms (1.17x incremental)
- Single page from 2000p: 41.0ms → 25.5ms (1.61x incremental)
- Cumulative from baseline: 1.35x–1.60x across split workloads

Four changes:
- PdfName.toBytes() caches serialized bytes on the interned instance
  (compute once, writeBytes on every subsequent call). ASCII fast-path
  skips TextEncoder entirely for the 99% of names that are pure ASCII.
- Shared HEX_TABLE in buffer.ts replaces per-byte toString(16) calls
  in both bytesToHex and escapeName.
- Skip deflate for streams under 512 bytes (configurable via
  compressionThreshold). Deflate init zeros a 64KB hash table per call;
  for tiny streams the overhead dwarfs any savings.
- Expose compressStreams and compressionThreshold on SaveOptions.

Cumulative from baseline: 582ms → 245ms (2.38x) on 2000-page split.

Runs splitting benchmarks on both base and PR branches when .ts files
are changed. Posts a comparison table as a sticky PR comment showing
per-benchmark speedup/regression with 🟢/🔴 indicators at ±5% threshold.
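A minimal sketch of how such a per-benchmark comparison could be computed. This is not the contents of scripts/bench-compare.ts; `BenchResult`, `compare`, and the choice to apply the ±5% threshold to the relative change in mean time are all assumptions.

```typescript
// Hypothetical sketch of the comparison step; names and shapes are assumptions.
interface BenchResult {
  name: string;
  meanMs: number;
}

interface Comparison {
  name: string;
  speedup: string; // base mean divided by PR mean, e.g. "2.38x"
  indicator: string; // "🟢", "🔴", or "" when within the threshold
}

const THRESHOLD = 0.05; // ±5%

function compare(base: BenchResult, pr: BenchResult): Comparison {
  // Relative change in mean time: negative means the PR branch is faster.
  const change = (pr.meanMs - base.meanMs) / base.meanMs;
  let indicator = "";
  if (change <= -THRESHOLD) indicator = "🟢";
  else if (change >= THRESHOLD) indicator = "🔴";
  return {
    name: base.name,
    speedup: (base.meanMs / pr.meanMs).toFixed(2) + "x",
    indicator,
  };
}
```

Changes inside the band get no indicator, which keeps run-to-run noise from being flagged as a regression.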

Just run benchmarks and post results as a PR comment.
No base comparison — check manually if needed.

github-actions bot commented Feb 18, 2026

Benchmark Results

Comparison

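Load PDF

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| libpdf | 2.37ms | 2.94ms | ±1.0% | 211 |
| pdf-lib | 38.10ms | 44.73ms | ±4.1% | 14 |

Create blank PDF

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| libpdf | 56μs | 110μs | ±1.7% | 8893 |
| pdf-lib | 410μs | 1.42ms | ±2.4% | 1220 |

Add 10 pages

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| libpdf | 99μs | 149μs | ±1.1% | 5072 |
| pdf-lib | 502μs | 1.77ms | ±2.6% | 996 |

Draw 50 rectangles

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| libpdf | 763μs | 2.08ms | ±2.9% | 656 |
| pdf-lib | 1.64ms | 5.75ms | ±6.0% | 306 |

Load and save PDF

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| libpdf | 2.44ms | 3.85ms | ±1.9% | 206 |
| pdf-lib | 89.22ms | 93.70ms | ±2.0% | 10 |

Load, modify, and save PDF

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| libpdf | 57.66ms | 70.37ms | ±7.6% | 10 |
| pdf-lib | 91.15ms | 109.98ms | ±7.1% | 10 |

Extract single page from 100-page PDF

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| libpdf | 3.97ms | 6.38ms | ±2.1% | 126 |
| pdf-lib | 9.34ms | 12.10ms | ±2.7% | 54 |

Split 100-page PDF into single-page PDFs

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| libpdf | 33.80ms | 36.10ms | ±1.7% | 15 |
| pdf-lib | 87.10ms | 90.68ms | ±2.8% | 6 |

Split 2000-page PDF into single-page PDFs (0.9MB)

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| libpdf | 625.08ms | 625.08ms | ±0.0% | 1 |
| pdf-lib | 1.66s | 1.66s | ±0.0% | 1 |

Copy 10 pages between documents

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| libpdf | 4.78ms | 6.13ms | ±1.9% | 105 |
| pdf-lib | 11.76ms | 12.89ms | ±1.3% | 43 |

Merge 2 x 100-page PDFs

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| libpdf | 14.74ms | 16.63ms | ±1.2% | 34 |
| pdf-lib | 52.66ms | 53.21ms | ±0.5% | 10 |

Copying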

Copy pages between documents

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| copy 1 page | 988μs | 2.02ms | ±2.1% | 507 |
| copy 10 pages from 100-page PDF | 4.60ms | 5.32ms | ±1.8% | 109 |
| copy all 100 pages | 7.55ms | 8.44ms | ±0.9% | 67 |

Duplicate pages within same document

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| duplicate page 0 | 923μs | 1.71ms | ±1.3% | 542 |
| duplicate all pages (double the document) | 906μs | 1.63ms | ±1.0% | 552 |

Merge PDFs

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| merge 2 small PDFs | 1.51ms | 2.15ms | ±1.1% | 332 |
| merge 10 small PDFs | 7.79ms | 8.35ms | ±0.7% | 65 |
| merge 2 x 100-page PDFs | 14.14ms | 15.40ms | ±1.0% | 36 |

Drawing

benchmarks/drawing.bench.ts

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| draw 100 rectangles | 1.53ms | 3.98ms | ±4.3% | 327 |
| draw 100 circles | 3.57ms | 6.77ms | ±5.3% | 141 |
| draw 100 lines | 1.23ms | 2.80ms | ±3.1% | 408 |
| draw 100 text lines (standard font) | 3.47ms | 7.37ms | ±4.7% | 144 |
| create 10 pages with mixed content | 3.94ms | 8.73ms | ±5.9% | 128 |

Forms

benchmarks/forms.bench.ts

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| get form fields | 3.55ms | 6.45ms | ±3.6% | 141 |
| fill text fields | 11.97ms | 17.18ms | ±4.1% | 42 |
| read field values | 3.09ms | 4.02ms | ±1.3% | 162 |
| flatten form | 9.04ms | 14.32ms | ±3.2% | 56 |

Loading

benchmarks/loading.bench.ts

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| load small PDF (888B) | 65μs | 145μs | ±0.8% | 7745 |
| load medium PDF (19KB) | 97μs | 175μs | ±0.5% | 5130 |
| load form PDF (116KB) | 1.43ms | 2.60ms | ±1.5% | 350 |
| load heavy PDF (9.9MB) | 2.38ms | 2.81ms | ±0.9% | 211 |

Saving

benchmarks/saving.bench.ts

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| save unmodified (19KB) | 110μs | 268μs | ±0.9% | 4538 |
| save with modifications (19KB) | 779μs | 1.50ms | ±1.5% | 643 |
| incremental save (19KB) | 180μs | 387μs | ±1.0% | 2771 |
| save heavy PDF (9.9MB) | 2.33ms | 2.75ms | ±0.6% | 215 |
| incremental save heavy PDF (9.9MB) | 5.53ms | 9.91ms | ±3.0% | 91 |

Splitting

Extract single page

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| extractPages (1 page from small PDF) | 1.03ms | 2.52ms | ±2.9% | 487 |
| extractPages (1 page from 100-page PDF) | 3.72ms | 6.27ms | ±2.1% | 135 |
| extractPages (1 page from 2000-page PDF) | 59.36ms | 61.48ms | ±1.5% | 10 |

Split into single-page PDFs

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| split 100-page PDF (0.1MB) | 32.87ms | 37.14ms | ±3.7% | 16 |
| split 2000-page PDF (0.9MB) | 583.25ms | 583.25ms | ±0.0% | 1 |

Batch page extraction

| Benchmark | Mean | p99 | RME | Samples |
| --- | --- | --- | --- | --- |
| extract first 10 pages from 2000-page PDF | 60.31ms | 63.39ms | ±1.9% | 9 |
| extract first 100 pages from 2000-page PDF | 64.23ms | 65.84ms | ±1.4% | 8 |
| extract every 10th page from 2000-page PDF (200 pages) | 69.01ms | 87.16ms | ±8.9% | 8 |

Environment
  • Runner: Linux (X64)
  • Runtime: Bun 1.3.9

Results are machine-dependent.

@Mythie Mythie merged commit 0a1bc71 into main Feb 18, 2026
5 of 6 checks passed