feat: add core content-defined chunking (CDC) infrastructure #667

kalbasit · 2026-01-31T02:42:54Z

This commit adds the foundation for content-defined chunking and deduplication:

- Implemented pkg/chunker using the FastCDC algorithm for efficient streaming deduplication.
- Added database migrations for 'chunks' and 'nar_file_chunks' tables.
- Implemented a local file-based chunk store with a two-level directory structure.
- Includes unit tests for chunking logic and local storage.
- Added github.com/jotfs/fastcdc-go dependency.

kalbasit · 2026-01-31T02:43:11Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

gemini-code-assist · 2026-01-31T02:43:14Z

Summary of Changes

Hello @kalbasit, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes the core infrastructure for content-defined chunking and data deduplication. It provides a robust mechanism to break down data streams into variable-sized chunks based on their content, rather than fixed boundaries. This foundation is crucial for optimizing storage and network transfer by identifying and storing unique data segments only once, thereby reducing redundancy across various data objects.

Highlights

- Content-Defined Chunking (CDC) Implementation: Introduced pkg/chunker utilizing the FastCDC algorithm for efficient streaming deduplication, dynamically splitting data into variable-sized chunks based on content.
- Database Schema Additions: New migration files for MySQL, PostgreSQL, and SQLite were added to create chunks and nar_file_chunks tables, supporting chunk metadata and their association with NAR files.
- Local Chunk Storage: Implemented a local file-based chunk store (pkg/storage/chunk/local.go) with a two-level directory structure for content-addressable storage, ensuring atomic writes.
- Comprehensive Unit Testing: Added unit tests for both the chunking logic (pkg/chunker/chunker_test.go) and the local storage implementation (pkg/storage/chunk/local_test.go) to ensure correctness and reliability.
- Dependency Management: Included github.com/jotfs/fastcdc-go as a new dependency to provide the FastCDC algorithm implementation.


gemini-code-assist

Code Review

This pull request introduces the foundational infrastructure for content-defined chunking (CDC). It adds a chunker package using the FastCDC algorithm, a local file-based storage for chunks, and the necessary database migrations. The implementation is solid, introducing a streaming chunking mechanism and atomic file writes for the chunk store. My review includes suggestions to improve schema efficiency in the database migrations by removing redundant indexes and using more precise column types, and to enhance the robustness of the local chunk store's file writing logic.

pkg/storage/chunk/local.go

db/migrations/mysql/20260131021850_add_chunks.sql

db/migrations/postgres/20260131021850_add_chunks.sql

db/migrations/sqlite/20260131021850_add_chunks.sql

kalbasit · 2026-01-31T02:54:24Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces the core infrastructure for content-defined chunking (CDC), which is a great step towards deduplication. The implementation of the FastCDC algorithm in pkg/chunker is solid and the use of channels for streaming is well done. The database migrations and local file store are also included, making this a comprehensive feature foundation. I've found a few areas for improvement, mainly concerning database schema correctness to prevent potential data truncation and a fragile file-writing pattern in the local chunk store that could be made more robust.

pkg/storage/chunk/local.go

db/migrations/mysql/20260131021850_add_chunks.sql

db/migrations/postgres/20260131021850_add_chunks.sql

pkg/storage/chunk/local.go

kalbasit · 2026-01-31T03:07:01Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a solid foundation for content-defined chunking (CDC) and deduplication. The implementation includes the FastCDC algorithm via pkg/chunker, corresponding database migrations and schema updates for chunks and nar_file_chunks tables, and a local file-based chunk store. The code is well-structured and includes relevant unit tests. I've identified a few areas for improvement regarding database schema consistency, potential integer overflow on 32-bit systems, and resource management in the local storage implementation. Overall, this is a great addition.

db/schema/mysql.sql

db/schema/postgres.sql

pkg/chunker/chunker.go

pkg/storage/chunk/local.go


kalbasit · 2026-01-31T06:04:06Z

/gemini review

gemini-code-assist

Code Review

This PR introduces a solid foundation for content-defined chunking. The use of the FastCDC algorithm via an external library is appropriate, and the streaming implementation using channels is well-designed. The local file storage implementation correctly uses atomic writes for data integrity. I've identified a few areas for improvement: there's a data type inconsistency for the chunk size across the different database schemas (MySQL vs. Postgres/SQLite), and the ref_count column in MySQL could be UNSIGNED. Overall, great work on adding this significant feature.

db/migrations/mysql/20260131021850_add_chunks.sql

db/migrations/postgres/20260131021850_add_chunks.sql

db/migrations/sqlite/20260131021850_add_chunks.sql

kalbasit · 2026-01-31T06:26:21Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces the foundational infrastructure for content-defined chunking (CDC) using the FastCDC algorithm. It adds new database tables chunks and nar_file_chunks with corresponding migrations for MySQL, PostgreSQL, and SQLite. A local file-based chunk store is also implemented, along with unit tests for the new chunking logic and storage.

My review focuses on ensuring data integrity in the database schemas and improving code clarity. I've found a few inconsistencies in the generated schema dump files compared to the migrations, which should be addressed. I've also suggested adding CHECK constraints to prevent invalid data in PostgreSQL and SQLite, noting that such schema changes should be applied consistently across all supported backends. Additionally, I've suggested a small refactoring in the local chunk store implementation for better readability.

db/migrations/postgres/20260131021850_add_chunks.sql

db/migrations/sqlite/20260131021850_add_chunks.sql

db/schema/postgres.sql

db/schema/sqlite.sql

pkg/storage/chunk/local.go

dosubot bot added labels Jan 31, 2026: size:XL (this PR changes 500-999 lines, ignoring generated files), enhancement (new feature or request), go (pull requests that update Go code)

This was referenced Jan 31, 2026

- feat: add S3 storage support for chunks #668 (Open)
- feat: integrate CDC with cache and implement chunk GC #669 (Open)
- feat: add CLI and Helm configuration for CDC and S3 #670 (Open)
- style: apply nix fmt to documentation and project files #671 (Open)

gemini-code-assist bot reviewed Jan 31, 2026

pkg/storage/chunk/local.go (resolved)

db/migrations/mysql/20260131021850_add_chunks.sql (outdated, resolved)

db/migrations/postgres/20260131021850_add_chunks.sql (outdated, resolved)

pkg/storage/chunk/local.go (outdated, resolved)

gemini-code-assist bot reviewed Jan 31, 2026

db/schema/mysql.sql (resolved)

db/schema/postgres.sql (resolved)

pkg/chunker/chunker.go (resolved)

pkg/storage/chunk/local.go (resolved)

kalbasit force-pushed the 01-30-feat_add_core_content-defined_chunking_cdc_infrastructure branch from 844b061 to 8b0a778 Compare January 31, 2026 06:03

gemini-code-assist bot reviewed Jan 31, 2026

db/migrations/mysql/20260131021850_add_chunks.sql (outdated, resolved)

db/migrations/postgres/20260131021850_add_chunks.sql (outdated, resolved)

db/migrations/sqlite/20260131021850_add_chunks.sql (outdated, resolved)

kalbasit added 2 commits January 30, 2026 22:16

make it uint32

21f4385

fix: use INT UNSIGNED for ref_count in MySQL and regenerate wrappers

f6f0331

gemini-code-assist bot reviewed Jan 31, 2026

fix: address PR comments regarding chunks schema and local store

7a370b5
