A version control system designed to handle large binary files efficiently using rolling hash algorithms, inspired by rsync's delta synchronization approach.
duh is a Git-inspired version control system that excels at managing large binary project files. While Git stores complete snapshots of binary files at each commit (leading to repository bloat), duh uses rolling hash algorithms similar to rsync to store only the differences between file versions.
Git is excellent for text files but struggles with binary files because:
- Binary files can't be meaningfully diff'd line-by-line
- Each change to a binary file requires storing the entire file again
- Large binary files quickly bloat the repository size
- Cloning and pulling become slow as history grows
duh uses a rolling hash algorithm (Rabin-Karp) to:
- Break files into chunks based on content boundaries
- Identify which chunks have changed between versions
- Store only the changed chunks (delta encoding)
- Reconstruct any version by applying the appropriate deltas
This is the same approach rsync uses to synchronize files efficiently over a network: rsync computes rolling hashes over blocks on both ends and transfers only the portions whose hashes differ. duh applies the same principle to version control.
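The rolling property is what makes this cheap: sliding the window one byte updates the hash in O(1) instead of rehashing the whole window. Here is a minimal Rabin-Karp sketch in Rust; the `RollingHash` name, base, and modulus are illustrative choices, not duh's actual code:

```rust
const BASE: u64 = 256;              // byte alphabet, as described above
const MODULUS: u64 = 1_000_000_007; // an example large prime modulus

struct RollingHash {
    hash: u64,
    // BASE^(window_len - 1) % MODULUS, used to remove the outgoing byte in O(1)
    pow: u64,
}

impl RollingHash {
    /// Hash an initial window from scratch (O(window length)).
    fn new(window: &[u8]) -> Self {
        let mut hash = 0u64;
        for &b in window {
            hash = (hash * BASE + b as u64) % MODULUS;
        }
        let mut pow = 1u64;
        for _ in 1..window.len() {
            pow = (pow * BASE) % MODULUS;
        }
        RollingHash { hash, pow }
    }

    /// Slide the window one byte: drop `out`, append `inc`, in O(1).
    fn roll(&mut self, out: u8, inc: u8) {
        self.hash = (self.hash + MODULUS - (out as u64 * self.pow) % MODULUS) % MODULUS;
        self.hash = (self.hash * BASE + inc as u64) % MODULUS;
    }
}

fn main() {
    let data = b"abcdefg";
    let mut rh = RollingHash::new(&data[0..4]); // hash of "abcd"
    rh.roll(data[0], data[4]);                  // now hash of "bcde"
    // Rolling must agree with hashing the window from scratch.
    assert_eq!(rh.hash, RollingHash::new(&data[1..5]).hash);
}
```

Because each slide is O(1), scanning an entire file costs O(n) total, which is what makes content-defined chunking practical on large binaries.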
duh is actively under development with the goal of achieving feature parity with Git where it makes sense. The core functionality is in place, including:
- ✅ Repository initialization
- ✅ File tracking and staging
- ✅ Rolling hash-based delta storage
- ✅ Commit creation with metadata
- ✅ Status reporting
- ✅ Diff computation using block signatures
- 🚧 Branch management
- 🚧 Merge operations
- 🚧 Remote repositories
- 🚧 History traversal
duh/
├── lib/ # Core library implementing version control logic
│ ├── diff.rs # Rolling hash algorithm and delta generation
│ ├── hash.rs # Content-addressable hashing (SHA-256)
│ ├── objects.rs # Object model (Commit, Tree, File, Fragment)
│ ├── repo.rs # Repository management and operations
│ └── utils.rs # Utilities and helpers
└── cli/ # Command-line interface
├── init.rs # Repository initialization
├── status.rs # Working directory status
├── track.rs # File staging
├── diff.rs # Difference visualization
└── commit.rs # Commit creation
duh uses a content-addressable storage system similar to Git, with the following object types:
- Fragment: A diff fragment representing added, unchanged, or deleted bytes
- File: References to content hash and delta fragments
- Tree: Directory structure with file and subtree references
- Commit: Snapshot of the tree with metadata (author, message, parent, timestamp)
- StagedFile: Temporary representation of files being prepared for commit
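In Rust, these object types might look roughly like the following. The field names here are hypothetical illustrations; duh's actual definitions live in lib/objects.rs and will differ in detail:

```rust
// Hypothetical shapes for duh's object model (not the real definitions).

#[allow(dead_code)]
enum Fragment {
    Added(Vec<u8>),                         // new bytes
    Unchanged { hash: String, len: usize }, // reference to a known block
    Deleted { len: usize },                 // bytes removed from the old version
}

#[allow(dead_code)]
struct File {
    content_hash: String,     // SHA-256 of the full content
    fragments: Vec<Fragment>, // delta fragments used for reconstruction
}

#[allow(dead_code)]
enum TreeEntry {
    File(File),
    Subtree(Tree),
}

#[allow(dead_code)]
struct Tree {
    entries: Vec<(String, TreeEntry)>, // name -> file or subtree
}

#[allow(dead_code)]
struct Commit {
    tree: Tree,
    parent: Option<String>, // hash of the parent commit, if any
    author: String,
    message: String,
    timestamp: u64,
}

fn main() {
    // A root commit: no parent, empty tree.
    let root = Commit {
        tree: Tree { entries: Vec::new() },
        parent: None,
        author: "Your Name".to_string(),
        message: "Initial commit".to_string(),
        timestamp: 0,
    };
    assert!(root.parent.is_none());
}
```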
The heart of duh's efficiency is its rolling hash implementation:
1. Divide file into overlapping windows
2. Calculate hash for each window position
3. Create "block signatures" of stable regions
4. Compare signatures between versions
5. Generate diff fragments:
- ADDED: New bytes not in previous version
- UNCHANGED: Bytes matching a known block
- DELETED: Bytes present in old but not new version
This allows duh to:
- Detect moved/copied content within files
- Store only actual changes, not entire files
- Efficiently reconstruct any historical version
- Keep repository size manageable even with large binaries
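The signature-comparison step can be sketched with fixed-size blocks and exact block matching. duh's real implementation uses content-defined boundaries and rolling-hash signatures, and also emits DELETED fragments, so the `diff` function and `Fragment` shape below are illustrative only:

```rust
use std::collections::HashMap;

// Simplified fragment type: UNCHANGED references a block in the old
// version; ADDED carries new bytes. (DELETED is omitted for brevity.)
#[derive(Debug, PartialEq)]
enum Fragment {
    Unchanged(usize), // index of a matching block in the old version
    Added(Vec<u8>),   // new bytes not present in the old version
}

/// Build a signature table for the old file's blocks, then walk the new
/// file and emit a fragment per block: a reference if the block is known,
/// raw bytes otherwise.
fn diff(old: &[u8], new: &[u8], block: usize) -> Vec<Fragment> {
    let mut sigs: HashMap<Vec<u8>, usize> = HashMap::new();
    for (i, chunk) in old.chunks(block).enumerate() {
        sigs.insert(chunk.to_vec(), i);
    }
    new.chunks(block)
        .map(|chunk| match sigs.get(chunk) {
            Some(&i) => Fragment::Unchanged(i),
            None => Fragment::Added(chunk.to_vec()),
        })
        .collect()
}

fn main() {
    let old = b"AAAABBBBCCCCDDDDEEEE";
    let new = b"AAAABBBBXXXXYYYYEEEE";
    let frags = diff(old, new, 4);
    assert_eq!(frags[0], Fragment::Unchanged(0));      // block A reused
    assert_eq!(frags[1], Fragment::Unchanged(1));      // block B reused
    assert!(matches!(frags[2], Fragment::Added(_)));   // block X is new
    assert_eq!(frags[4], Fragment::Unchanged(4));      // block E reused
}
```

Because the table is keyed by block content rather than position, a block that merely moved within the file still matches, which is how moved/copied content is detected.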
# Build from source
cd cli
cargo build --release
# The binary will be at cli/target/release/duh

# Initialize a new repository
duh init
# Track files for commit
duh track file1.bin file2.bin
# Check status
duh status
# View differences
duh diff
# Commit changes
duh commit -m "Initial commit"

Repository configuration is stored in .duh/config.toml:
chunk_size = 4096 # Size of blocks for rolling hash
[user]
name = "Your Name"
email = "your.email@example.com"

- Uses a Nix flake to build a single derivation for the `cli` crate (`duh`).
- Build (Nix): `nix build .#duh`; the binary appears at `./result/bin/duh`.
- Enter a development shell (Nix): `nix develop .#duh`.
- Build locally with Cargo (alternative): `cd cli && cargo build --release`; the binary is at `cli/target/release/duh`.
- Run the full demo (generates test files, init, stage/commit, show): `just demo`.
- Create test data only: `just generate-test-files <outdir>`.
- Update the vendored crates and print the Nix hash (paste into `flake.nix`): `just update-vendor-hash`. This runs `cargo vendor` (under `cli/`) and prints the `cargoVendorHash` line you should copy into the flake.
- If `nix build` fails with `cargoSha256`/`cargoHash` out of date:
  - Run `just update-vendor-hash`.
  - Replace `cargoVendorHash` in `flake.nix` with the printed value.
  - Re-run `nix build .#duh`.
- If you get Cargo path-dep errors, make sure you build the flake target (`.#duh`), which selects the `cli/` package in the workspace.
Imagine you have a 100MB binary file and you edit 1MB in the middle:
Git's approach:
- Original version: 100MB stored
- After edit: 100MB stored again
- Total: 200MB
duh's approach:
- Original version: 100MB stored as fragments
- After edit: Only the changed ~1MB stored as new fragments
- Unchanged fragments: Referenced from original version
- Total: ~101MB
Original file blocks: [A][B][C][D][E]
Modified file blocks: [A][B][X][Y][E]
duh stores:
- Reference to blocks A, B (unchanged)
- New data for blocks X, Y (added/changed)
- Reference to block E (unchanged)
- A note that blocks C, D were deleted
This allows efficient storage and reconstruction of any version in the history.
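Reconstruction is then just a walk over the fragment list, copying referenced blocks from the stored version and inlining new bytes. A sketch of the [A][B][X][Y][E] example above, using an illustrative `Fragment` shape rather than duh's actual types:

```rust
// Simplified fragment type: UNCHANGED references a stored block by index;
// ADDED carries the new bytes inline.
enum Fragment {
    Unchanged { block: usize },
    Added(Vec<u8>),
}

/// Rebuild a file version by walking its fragment list in order.
fn reconstruct(old_blocks: &[&[u8]], fragments: &[Fragment]) -> Vec<u8> {
    let mut out = Vec::new();
    for frag in fragments {
        match frag {
            Fragment::Unchanged { block } => out.extend_from_slice(old_blocks[*block]),
            Fragment::Added(bytes) => out.extend_from_slice(bytes),
        }
    }
    out
}

fn main() {
    // Original blocks [A][B][C][D][E]; modified file is [A][B][X][Y][E].
    let old: [&[u8]; 5] = [b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE"];
    let frags = vec![
        Fragment::Unchanged { block: 0 },  // A referenced
        Fragment::Unchanged { block: 1 },  // B referenced
        Fragment::Added(b"XXXX".to_vec()), // X stored as new data
        Fragment::Added(b"YYYY".to_vec()), // Y stored as new data
        Fragment::Unchanged { block: 4 },  // E referenced
    ];
    assert_eq!(reconstruct(&old, &frags), b"AAAABBBBXXXXYYYYEEEE");
}
```

Only X and Y cost new storage here; A, B, and E are references into blocks that already exist in the repository.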
- Efficient binary file handling: Store only deltas, not full copies
- Git-like workflow: Familiar commands and concepts for easy adoption
- Content integrity: Cryptographic hashing ensures data validity
- Performance: Fast operations even with large files and deep history
- Branch and tag management
- Merge strategies for binary content
- Remote repository support (push/pull)
- Repository compression and garbage collection
- Partial clone/checkout for large repositories
- Web interface for repository browsing
- Plugin system for custom diff/merge handlers
- Objects stored in MessagePack format for efficiency
- Content-addressable storage using SHA-256 hashes
- Rolling hash parameters: Base 256, large prime modulus
- Time complexity: O(n) for diff generation (n = file size)
- Space complexity: O(changed blocks) not O(file size)
- Best case: Files with localized changes
- Worst case: Completely rewritten files (falls back to full storage)
This project is under active development. Contributions are welcome! Areas where help is needed:
- Merge algorithm development for binary files
- Remote repository protocols
- Performance optimization
- Documentation and examples
- Test coverage
[Add your license here]
- Git: The inspiration for the object model and workflow
- rsync: The inspiration for rolling hash delta encoding
- Git-LFS: Alternative approach using pointer files and external storage
- Perforce: Commercial VCS with good binary file support