Note: This code was generated by Claude (Anthropic's AI assistant) and has not been thoroughly reviewed or tested in production environments. Use at your own risk. Contributions and reviews are welcome.
Efficient gzip to BGZF transcoder using Puffin-style half-decompression.
BGZF (Blocked GNU Zip Format) is a variant of gzip that enables random access to compressed data. It's widely used in bioinformatics, e.g. for BAM files and bgzip-compressed VCF files.
Converting standard gzip files to BGZF traditionally requires:
- Full decompression - decompress the entire gzip stream
- Re-compression - compress the data into BGZF blocks
This is slow and CPU-intensive because compression is the expensive part.
This tool uses a half-decompression technique inspired by Puffin:
- Parse the DEFLATE stream to extract LZ77 tokens (literals, lengths, distances)
- Track only enough context to resolve cross-boundary references
- Re-encode the tokens directly into new BGZF blocks
This avoids the expensive compression step entirely - we're just re-serializing already-compressed data into a different block structure.
Half-decompression is significantly faster than full decompress + recompress because:
- No dictionary lookups for compression
- No hash table maintenance
- Minimal memory footprint (just the sliding window)
- Parallel encoding of independent blocks
Benchmark Results (54 MB gzip → 80 MB BGZF, original data 208 MB uncompressed FASTQ):
| Threads | Time | Throughput | Speedup |
|---|---|---|---|
| 1 | 2.92s | 19.6 MB/s | 1.0x |
| 2 | 1.76s | 32.4 MB/s | 1.66x |
| 4 | 1.79s | 32.0 MB/s | 1.63x |
| 8 | 1.78s | 32.2 MB/s | 1.64x |
Throughput measured on compressed input size. Tested on Apple M1.
Output size: BGZF output is typically ~1.5x larger than the original gzip because:
- Each BGZF block has 26 bytes of overhead (header + footer)
- Cross-boundary LZ77 references are expanded to literals
- Fixed Huffman encoding (less optimal than original dynamic tables)
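As a rough back-of-the-envelope check using the benchmark numbers above, the per-block overhead alone does not explain the growth: 208 MB / 65,280-byte blocks ≈ 3,300 blocks × 26 bytes ≈ 85 KB. At level 1, the size difference therefore comes almost entirely from the other two factors: expanded cross-boundary references and fixed Huffman coding.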
Why only 2 threads help:
The tool uses a producer-consumer architecture where the main thread sequentially parses DEFLATE blocks (this cannot be parallelized) while worker threads encode BGZF blocks in parallel. With 2 threads total:
- Main thread: parses DEFLATE, resolves cross-boundary references
- Worker thread: encodes tokens to BGZF, computes CRC32
Additional threads beyond 2 provide no benefit because the main thread's sequential parsing is the bottleneck. The single worker can easily keep up with the parsing rate.
Trade-off: BGZF files are larger than gzip, but enable random access. Best suited for format conversion pipelines where random access is needed.
```bash
git clone https://github.com/nh13/rebgzf.git
cd rebgzf
cargo install --path .
```

Or install from crates.io:

```bash
cargo install rebgzf
```

```bash
# Basic transcoding (fastest, level 1)
rebgzf -i input.gz -o output.bgz
# With progress display
rebgzf -i input.gz -o output.bgz --progress
# Better compression with dynamic Huffman (level 6)
rebgzf -i input.gz -o output.bgz --level 6
# Best compression for FASTQ files (dynamic Huffman + record-aligned blocks)
rebgzf -i reads.fastq.gz -o reads.bgz --format fastq --level 9
# Generate GZI index for random access
rebgzf -i data.gz -o data.bgz --index
# Parallel transcoding (auto-detect threads)
rebgzf -i input.gz -o output.bgz -t 0 --progress
# Check if a file is already BGZF
rebgzf --check -i input.gz
echo $? # 0 = BGZF, 1 = not BGZF
# Strict validation (all blocks, works with stdin)
cat file.bgz | rebgzf --check --strict -i -
# Force transcoding even if already BGZF
rebgzf -i input.bgz -o output.bgz --force
# Verbose output with statistics
rebgzf -i input.gz -o output.bgz -v
```

The full CLI help:

```text
Convert gzip files to BGZF format efficiently

Usage: rebgzf [OPTIONS] --input <INPUT>

Options:
  -i, --input <INPUT>            Input gzip file (use - for stdin)
  -o, --output <OUTPUT>          Output BGZF file (use - for stdout)
  -t, --threads <THREADS>        Number of threads (0 = auto, 1 = single-threaded) [default: 1]
  -l, --level <LEVEL>            Compression level 1-9 (1-3: fixed Huffman, 4-6: dynamic,
                                 7-9: dynamic + smart boundaries) [default: 1]
      --format <FORMAT>          Input format profile: default, fastq, auto [default: default]
      --block-size <BLOCK_SIZE>  BGZF block size in bytes [default: 65280]
  -v, --verbose                  Show verbose statistics
  -q, --quiet                    Quiet mode - suppress all output except errors
      --json                     Output results as JSON (for scripting)
      --check                    Check if input is BGZF and exit (0=BGZF, 1=not BGZF, 2=error)
      --strict                   Validate all BGZF blocks (slower, more thorough)
      --verify                   Verify BGZF by decompressing and checking CRC32
      --stats                    Show file statistics without transcoding
      --force                    Force transcoding even if input is already BGZF
  -p, --progress                 Show progress during transcoding
      --index [PATH]             Write GZI index file (enables random access)
  -h, --help                     Print help
  -V, --version                  Print version
```
```rust
use rebgzf::{
    CompressionLevel, FormatProfile, ParallelTranscoder,
    TranscodeConfig, Transcoder,
};
use std::fs::File;

fn main() -> rebgzf::Result<()> {
    let input = File::open("input.gz")?;
    let output = File::create("output.bgz")?;

    let config = TranscodeConfig {
        num_threads: 0,                              // auto-detect
        compression_level: CompressionLevel::Level6, // dynamic Huffman
        format: FormatProfile::Fastq,                // FASTQ-aware splitting
        build_index: true,                           // generate GZI index entries
        ..Default::default()
    };

    let mut transcoder = ParallelTranscoder::new(config);
    let stats = transcoder.transcode(input, output)?;

    println!(
        "Transcoded {} bytes -> {} bytes ({} blocks)",
        stats.input_bytes, stats.output_bytes, stats.blocks_written
    );

    // Write index if requested
    if let Some(entries) = stats.index_entries {
        println!("Generated {} index entries", entries.len());
        // Write entries to .gzi file...
    }

    Ok(())
}
```

Checking and validating existing files:

```rust
use rebgzf::{is_bgzf, validate_bgzf_strict, validate_bgzf_streaming};
use std::fs::File;
use std::io::{Seek, SeekFrom};

fn main() -> rebgzf::Result<()> {
    let mut file = File::open("input.gz")?;

    // Quick check (reads first 18 bytes)
    if is_bgzf(&mut file)? {
        println!("File appears to be BGZF");
    }

    // Strict validation (validates all blocks, requires Seek)
    file.seek(SeekFrom::Start(0))?;
    let validation = validate_bgzf_strict(&mut file)?;
    if validation.is_valid_bgzf {
        println!(
            "Valid BGZF with {} blocks, {} uncompressed bytes",
            validation.block_count.unwrap_or(0),
            validation.total_uncompressed_size.unwrap_or(0)
        );
    }

    // Streaming validation (works with stdin/pipes, no Seek required)
    // let validation = validate_bgzf_streaming(&mut stdin)?;

    Ok(())
}
```

A DEFLATE stream consists of:
- Literals: raw bytes (0-255)
- Back-references: (length, distance) pairs pointing to earlier data
- End-of-block: marker indicating block completion
We parse the Huffman-coded stream to extract these LZ77 tokens without fully reconstructing the original data. The key insight is that we only need the tokens themselves, not the decompressed bytes.
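As a concrete sketch, assuming a token type shaped like the `Literal(u8)` / `Copy{len, dist}` variants mentioned in the architecture section below (the crate's actual types may differ):

```rust
/// One LZ77 token recovered from the DEFLATE stream.
/// Per RFC 1951, lengths are 3..=258 and distances 1..=32768.
enum Token {
    /// A raw byte, emitted verbatim.
    Literal(u8),
    /// Copy `len` bytes starting `dist` bytes back in the output.
    Copy { len: u16, dist: u16 },
}

/// Replay tokens into bytes (what a worker does to compute CRC32).
fn replay(tokens: &[Token], out: &mut Vec<u8>) {
    for t in tokens {
        match *t {
            Token::Literal(b) => out.push(b),
            Token::Copy { len, dist } => {
                // Copy byte-by-byte so overlapping references
                // (len > dist) correctly repeat freshly written data.
                let start = out.len() - dist as usize;
                for i in 0..len as usize {
                    let b = out[start + i];
                    out.push(b);
                }
            }
        }
    }
}
```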
When splitting into BGZF blocks, back-references may point across block boundaries. We maintain a 32KB sliding window to:
- Detect when a reference crosses the current block boundary
- Resolve cross-boundary references by emitting the referenced bytes as literals
- Preserve intra-block references as-is (they compress better)
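A minimal sketch of the resolution step, reusing the `Token` type from the previous sketch (names and field handling are illustrative, not the crate's exact logic):

```rust
/// Expand a back-reference that reaches across the current BGZF block
/// boundary into literals; pass everything else through unchanged.
/// `window` holds the 32 KiB history and `block_len` counts bytes
/// already emitted in the current block.
fn resolve(token: Token, window: &[u8], block_len: usize, out: &mut Vec<Token>) {
    match token {
        Token::Copy { len, dist } if dist as usize > block_len => {
            // The reference reaches into the previous block: expand it
            // into literals so the new block is self-contained.
            let start = window.len() - dist as usize;
            let mut expanded = Vec::with_capacity(len as usize);
            for i in 0..len as usize {
                // Overlapping copies (len > dist) may read bytes we
                // just expanded, so consult `expanded` past the window.
                let b = if start + i < window.len() {
                    window[start + i]
                } else {
                    expanded[start + i - window.len()]
                };
                expanded.push(b);
            }
            out.extend(expanded.into_iter().map(Token::Literal));
        }
        other => out.push(other), // intra-block reference: keep as-is
    }
}
```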
Tokens are re-encoded using either:
- Fixed Huffman tables (levels 1-3): Fast encoding using pre-defined tables (RFC 1951 section 3.2.6)
- Dynamic Huffman tables (levels 4-9): Per-block optimal tables computed from token frequencies
At levels 7-9 with --format fastq, block boundaries are aligned to FASTQ record boundaries for better compression.
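For reference, the fixed tables used at levels 1-3 are the pre-defined code lengths from RFC 1951 §3.2.6; a sketch of how they are laid out:

```rust
/// Fixed literal/length code lengths from RFC 1951 §3.2.6.
/// Levels 1-3 encode every block with these pre-defined lengths,
/// skipping frequency counting entirely; all 30 distance codes
/// use fixed 5-bit codes.
fn fixed_litlen_code_lengths() -> [u8; 288] {
    let mut lens = [0u8; 288];
    for (range, bits) in [(0..144, 8u8), (144..256, 9), (256..280, 7), (280..288, 8)] {
        for sym in range {
            lens[sym] = bits;
        }
    }
    lens
}
```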
Standard Huffman decoding reads one bit at a time, requiring multiple loop iterations per symbol. We use a lookup table approach:
- Build a 1024-entry table (10-bit lookup) at decoder construction time
- Peek 10 bits from the bitstream
- Single table lookup returns both the symbol and its actual code length
- Consume only the actual code length bits
This reduces most symbol decodes from ~9 loop iterations to a single table lookup, providing a ~30% speedup in parsing.
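A sketch of the lookup (the `Bitstream` reader here is an assumed helper, not the crate's API; DEFLATE bitstreams are read LSB-first):

```rust
/// Minimal LSB-first bit reader (illustrative).
struct Bitstream<'a> {
    data: &'a [u8],
    bitpos: usize,
}

impl Bitstream<'_> {
    /// Peek `n` bits without consuming (zero-padded past the end).
    fn peek(&self, n: u8) -> u32 {
        let mut v = 0u32;
        for i in 0..n as usize {
            let p = self.bitpos + i;
            if p / 8 < self.data.len() {
                v |= (((self.data[p / 8] >> (p % 8)) & 1) as u32) << i;
            }
        }
        v
    }
    fn consume(&mut self, n: u8) {
        self.bitpos += n as usize;
    }
}

/// One table entry per 10-bit pattern; a code of length n < 10
/// fills 2^(10-n) entries so any trailing bits still match.
struct DecodeEntry {
    symbol: u16,
    bits: u8, // true code length to consume
}

/// Decode one symbol with a single table lookup instead of a
/// bit-at-a-time tree walk.
fn decode_symbol(bits: &mut Bitstream, table: &[DecodeEntry; 1024]) -> u16 {
    let idx = bits.peek(10) as usize; // peek 10 bits
    let entry = &table[idx];          // single lookup
    bits.consume(entry.bits);         // consume only the real code length
    // Codes longer than 10 bits would take a slower fallback path,
    // omitted in this sketch.
    entry.symbol
}
```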
CRC32 computation is handled differently based on thread count:
- Single-threaded: CRC computed inline during boundary resolution (cache-efficient, no duplicate work)
- Multi-threaded: Workers compute CRC in parallel by replaying tokens to bytes
This avoids adding sequential work to the main thread in parallel mode.
Uses the crc32fast crate which automatically leverages:
- SSE4.2 CRC instructions on x86_64
- ARM CRC instructions on aarch64
- Optimized software fallback on other platforms
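Typical use of the crate's incremental API:

```rust
use crc32fast::Hasher;

/// CRC32 of the uncompressed data, fed incrementally as a worker
/// replays token batches back into bytes.
fn crc32_of(chunks: &[&[u8]]) -> u32 {
    let mut hasher = Hasher::new();
    for chunk in chunks {
        hasher.update(chunk); // hardware-accelerated where available
    }
    hasher.finalize()
}
```

The overall pipeline:

```text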
┌─────────────────────────────────────────────────────────────┐
│ Input (gzip) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ GzipHeader Parser │
│ - Validates gzip magic bytes (1f 8b) │
│ - Skips optional fields (FEXTRA, FNAME, FCOMMENT, FHCRC) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ DEFLATE Parser (DeflateParser) │
│ - Parses block headers (BFINAL, BTYPE) │
│ - Decodes Huffman tables (fixed or dynamic) │
│ - Uses table-based lookup for fast symbol decoding │
│ - Extracts LZ77 tokens: Literal(u8), Copy{len, dist} │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Boundary Resolver (BoundaryResolver) │
│ - Maintains 32KB sliding window │
│ - Detects cross-boundary back-references │
│ - Resolves to literals when necessary │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Huffman Encoder (HuffmanEncoder) │
│ - Re-encodes tokens to DEFLATE format │
│ - Fixed tables (levels 1-3) or dynamic (levels 4-9) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ BGZF Writer (BgzfWriter) │
│ - Wraps DEFLATE data in BGZF block format │
│ - Adds BC extra field with block size │
│ - Computes CRC32 and writes footer │
│ - Appends EOF marker (28-byte empty block) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Output (BGZF) │
└─────────────────────────────────────────────────────────────┘
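```

A sketch of the block wrapper the BgzfWriter stage produces, following the BGZF block layout from the SAM specification (the function itself is illustrative, not the crate's API):

```rust
/// Wrap an already-encoded DEFLATE payload in a BGZF block.
/// Header is 18 bytes and footer 8 bytes: the 26 bytes of
/// per-block overhead mentioned above.
fn write_bgzf_block(out: &mut Vec<u8>, deflate: &[u8], crc32: u32, uncompressed_len: u32) {
    // BSIZE field = total block length minus one.
    let bsize = (18 + deflate.len() + 8 - 1) as u16;
    out.extend_from_slice(&[0x1f, 0x8b, 0x08, 0x04]); // gzip magic, CM=8, FLG=FEXTRA
    out.extend_from_slice(&[0, 0, 0, 0]);             // MTIME = 0
    out.extend_from_slice(&[0, 0xff]);                // XFL = 0, OS = unknown
    out.extend_from_slice(&6u16.to_le_bytes());       // XLEN = 6
    out.extend_from_slice(b"BC");                     // subfield identifier
    out.extend_from_slice(&2u16.to_le_bytes());       // SLEN = 2
    out.extend_from_slice(&bsize.to_le_bytes());      // the BC extra field
    out.extend_from_slice(deflate);                   // compressed payload
    out.extend_from_slice(&crc32.to_le_bytes());      // CRC32 of uncompressed data
    out.extend_from_slice(&uncompressed_len.to_le_bytes()); // ISIZE
}
```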
For multi-threaded transcoding, we use a producer-consumer pipeline:
- Main thread (Producer): Parses DEFLATE, accumulates tokens, resolves cross-boundary references
- Worker pool (Consumers): Replay tokens to bytes, compute CRC32, encode to BGZF blocks
- Output assembly: Blocks are reassembled in order using sequence numbers
The main thread is the bottleneck because DEFLATE parsing is inherently sequential (Huffman codes are variable-length and context-dependent). In practice, 2 threads is optimal: one for parsing, one for encoding. Additional workers have nothing to do.
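A minimal shape of that pipeline using standard-library channels (illustrative; the crate's internals differ, and `encode` here stands in for "replay tokens, compute CRC32, build one BGZF block"):

```rust
use std::collections::BTreeMap;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Sequence-numbered fan-out/fan-in: workers pull numbered batches,
/// and the writer reorders completed blocks by sequence number.
fn pipeline(batches: Vec<Vec<u8>>, workers: usize, encode: fn(Vec<u8>) -> Vec<u8>) -> Vec<Vec<u8>> {
    let (work_tx, work_rx) = mpsc::channel::<(u64, Vec<u8>)>();
    let work_rx = Arc::new(Mutex::new(work_rx));
    let (done_tx, done_rx) = mpsc::channel::<(u64, Vec<u8>)>();

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&work_rx);
            let tx = done_tx.clone();
            thread::spawn(move || loop {
                let msg = rx.lock().unwrap().recv(); // lock released here
                let Ok((seq, batch)) = msg else { break };
                tx.send((seq, encode(batch))).unwrap();
            })
        })
        .collect();
    drop(done_tx); // writer loop below ends once all workers finish

    for (seq, batch) in batches.into_iter().enumerate() {
        work_tx.send((seq as u64, batch)).unwrap();
    }
    drop(work_tx); // closing the work channel stops the workers

    // Reassemble completed blocks in input order.
    let (mut pending, mut next, mut out) = (BTreeMap::new(), 0u64, Vec::new());
    for (seq, block) in done_rx {
        pending.insert(seq, block);
        while let Some(block) = pending.remove(&next) {
            out.push(block);
            next += 1;
        }
    }
    for h in handles {
        h.join().unwrap();
    }
    out
}
```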
Run benchmarks with:

```bash
cargo bench
```

Benchmarks compare:
- Single-threaded vs parallel transcoding
- Various block sizes
```bash
# Run all tests
cargo test

# Run integration tests
cargo test --test integration

# Run with verbose output
cargo test -- --nocapture
```

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
MIT License - see LICENSE for details.