@kalbasit
Owner

This commit adds the foundation for content-defined chunking and deduplication:

  • Implemented pkg/chunker using the FastCDC algorithm for efficient streaming deduplication.
  • Added database migrations for 'chunks' and 'nar_file_chunks' tables.
  • Implemented a local file-based chunk store with a two-level directory structure.
  • Added unit tests for chunking logic and local storage.
  • Added github.com/jotfs/fastcdc-go dependency.

@dosubot dosubot bot added the labels size:XL (this PR changes 500-999 lines, ignoring generated files), enhancement (new feature or request), and go (pull requests that update Go code) on Jan 31, 2026
Owner Author

kalbasit commented Jan 31, 2026

@gemini-code-assist
Contributor

Summary of Changes

Hello @kalbasit, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes the core infrastructure for content-defined chunking and data deduplication. It provides a robust mechanism to break down data streams into variable-sized chunks based on their content, rather than fixed boundaries. This foundation is crucial for optimizing storage and network transfer by identifying and storing unique data segments only once, thereby reducing redundancy across various data objects.
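To make the idea of content-defined boundaries concrete, here is a minimal sketch of a gear-hash chunker. It is a simplified stand-in, not the FastCDC algorithm itself and not the API of github.com/jotfs/fastcdc-go; the names `splitCDC` and `gearTable` and all size constants are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// gearTable maps each byte value to a pseudo-random 64-bit number.
// The fixed seed is an arbitrary choice for this sketch.
var gearTable = func() [256]uint64 {
	var t [256]uint64
	r := rand.New(rand.NewSource(1))
	for i := range t {
		t[i] = r.Uint64()
	}
	return t
}()

// Illustrative limits only; these are not the values used by the PR.
const (
	minSize = 2 * 1024
	maxSize = 64 * 1024
	mask    = (1 << 13) - 1
)

// splitCDC cuts data wherever the rolling hash of recent bytes matches the
// mask, so boundaries follow content rather than absolute offsets.
func splitCDC(data []byte) [][]byte {
	var chunks [][]byte
	start := 0
	var h uint64
	for i, b := range data {
		h = (h << 1) + gearTable[b]
		n := i - start + 1
		if (n >= minSize && h&mask == 0) || n >= maxSize {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
			h = 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:]) // trailing partial chunk
	}
	return chunks
}

func main() {
	// Pseudo-random input so that hash-based cut points actually occur.
	r := rand.New(rand.NewSource(42))
	data := make([]byte, 256*1024)
	for i := range data {
		data[i] = byte(r.Intn(256))
	}

	original := splitCDC(data)
	// Prepend one byte: fixed-size blocks would all shift, while
	// content-defined boundaries are expected to realign.
	shifted := splitCDC(append([]byte{0xFF}, data...))

	seen := make(map[string]bool)
	for _, c := range original {
		seen[string(c)] = true
	}
	reused := 0
	for _, c := range shifted {
		if seen[string(c)] {
			reused++
		}
	}
	fmt.Printf("%d of %d chunks reused after a one-byte insert\n", reused, len(shifted))
}
```

With fixed-size blocks the same one-byte insert would change every block, which is why deduplication benefits from cutting on content.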

Highlights

  • Content-Defined Chunking (CDC) Implementation: Introduced pkg/chunker utilizing the FastCDC algorithm for efficient streaming deduplication, dynamically splitting data into variable-sized chunks based on content.
  • Database Schema Additions: New migration files for MySQL, PostgreSQL, and SQLite were added to create chunks and nar_file_chunks tables, supporting chunk metadata and their association with NAR files.
  • Local Chunk Storage: Implemented a local file-based chunk store (pkg/storage/chunk/local.go) with a two-level directory structure for content-addressable storage, ensuring atomic writes.
  • Comprehensive Unit Testing: Added unit tests for both the chunking logic (pkg/chunker/chunker_test.go) and the local storage implementation (pkg/storage/chunk/local_test.go) to ensure correctness and reliability.
  • Dependency Management: Included github.com/jotfs/fastcdc-go as a new dependency to provide the FastCDC algorithm implementation.
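The local chunk store described above could look roughly like the following sketch. The PR's actual layout and hash function are not shown in this conversation, so the `<root>/<2 chars>/<2 chars>/<hash>` fan-out, the use of SHA-256, and the names `chunkPath` and `putChunk` are all assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// chunkPath maps a hex hash to <root>/<first two chars>/<next two chars>/<hash>.
// The exact fan-out is an assumption; hash must be at least four characters.
func chunkPath(root, hash string) string {
	return filepath.Join(root, hash[:2], hash[2:4], hash)
}

// putChunk stores data under its SHA-256 hash. It writes to a temporary file
// in the destination directory and renames it into place, so readers never
// observe a partially written chunk.
func putChunk(root string, data []byte) (string, error) {
	sum := sha256.Sum256(data)
	hash := hex.EncodeToString(sum[:])
	dst := chunkPath(root, hash)

	if _, err := os.Stat(dst); err == nil {
		return hash, nil // already stored: identical content deduplicates
	}
	if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
		return "", err
	}
	tmp, err := os.CreateTemp(filepath.Dir(dst), "chunk-*.tmp")
	if err != nil {
		return "", err
	}
	// Cleans up on failure; after a successful rename the temporary name
	// no longer exists and the error from Remove is ignored.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return "", err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return "", err
	}
	if err := tmp.Close(); err != nil {
		return "", err
	}
	if err := os.Rename(tmp.Name(), dst); err != nil {
		return "", err
	}
	return hash, nil
}

func main() {
	root, err := os.MkdirTemp("", "chunks")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	hash, err := putChunk(root, []byte("hello"))
	if err != nil {
		panic(err)
	}
	fmt.Println(chunkPath(root, hash))
}
```

Creating the temporary file in the same directory as the destination keeps the rename on one filesystem, which is what makes it atomic on POSIX systems.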

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces the foundational infrastructure for content-defined chunking (CDC). It adds a chunker package using the FastCDC algorithm, a local file-based storage for chunks, and the necessary database migrations. The implementation is solid, introducing a streaming chunking mechanism and atomic file writes for the chunk store. My review includes suggestions to improve schema efficiency in the database migrations by removing redundant indexes and using more precise column types, and to enhance the robustness of the local chunk store's file writing logic.

@kalbasit
Owner Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces the core infrastructure for content-defined chunking (CDC), which is a great step towards deduplication. The implementation of the FastCDC algorithm in pkg/chunker is solid, and the use of channels for streaming is well done. The database migrations and local file store are also included, making this a comprehensive feature foundation. I've found a few areas for improvement, mainly in the database schema, where column types should be corrected to prevent potential data truncation, and in the local chunk store, where a fragile file-writing pattern could be made more robust.
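As an illustration of the channel-based streaming pattern mentioned in this review, the sketch below emits chunks from a goroutine and reports errors on a second channel. It is not the code from pkg/chunker: the `Chunk` type, `streamChunks`, and `collect` are hypothetical names, and fixed-size reads stand in for content-defined boundary detection.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
)

// Chunk is a hypothetical unit of streamed data.
type Chunk struct {
	Offset int64
	Data   []byte
}

// streamChunks reads r in a goroutine and sends chunks on the returned channel.
// Fixed-size reads stand in for real content-defined boundary detection.
func streamChunks(ctx context.Context, r io.Reader, size int) (<-chan Chunk, <-chan error) {
	out := make(chan Chunk)
	errc := make(chan error, 1) // buffered so the producer never blocks on an error
	go func() {
		defer close(out)
		defer close(errc)
		var offset int64
		for {
			buf := make([]byte, size)
			n, err := io.ReadFull(r, buf)
			if n > 0 {
				select {
				case out <- Chunk{Offset: offset, Data: buf[:n]}:
					offset += int64(n)
				case <-ctx.Done():
					errc <- ctx.Err()
					return
				}
			}
			if err == io.EOF || err == io.ErrUnexpectedEOF {
				return // input exhausted
			}
			if err != nil {
				errc <- err
				return
			}
		}
	}()
	return out, errc
}

// collect drains the stream for a string input; a convenience for the demo.
func collect(input string, size int) ([]Chunk, error) {
	chunks, errc := streamChunks(context.Background(), strings.NewReader(input), size)
	var got []Chunk
	for c := range chunks {
		got = append(got, c)
	}
	return got, <-errc
}

func main() {
	got, err := collect("abcdefghij", 4)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range got {
		fmt.Printf("offset=%d data=%q\n", c.Offset, c.Data)
	}
}
```

The consumer ranges over the chunk channel until it is closed and only then reads the error channel, so a failure partway through the stream is still surfaced.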

@kalbasit
Owner Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a solid foundation for content-defined chunking (CDC) and deduplication. The implementation includes the FastCDC algorithm via pkg/chunker, corresponding database migrations and schema updates for chunks and nar_file_chunks tables, and a local file-based chunk store. The code is well-structured and includes relevant unit tests. I've identified a few areas for improvement regarding database schema consistency, potential integer overflow on 32-bit systems, and resource management in the local storage implementation. Overall, this is a great addition.
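The review summary does not say where the 32-bit overflow could occur. One common source in Go is converting a 64-bit size or offset to `int`, which is only 32 bits wide on 32-bit platforms. A hedged sketch of a guarded conversion, using the hypothetical helper name `toInt`:

```go
package main

import (
	"fmt"
	"math"
)

// toInt converts a 64-bit size or offset to int, returning an error instead
// of silently wrapping on platforms where int is 32 bits wide.
func toInt(n int64) (int, error) {
	if n < math.MinInt || n > math.MaxInt {
		return 0, fmt.Errorf("value %d does not fit in int on this platform", n)
	}
	return int(n), nil
}

func main() {
	// 3 GiB fits in int on 64-bit platforms but not on 32-bit ones.
	size := int64(3) << 30
	v, err := toInt(size)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("converted:", v)
}
```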

@kalbasit kalbasit force-pushed the 01-30-feat_add_core_content-defined_chunking_cdc_infrastructure branch from 844b061 to 8b0a778 on January 31, 2026 06:03
@kalbasit
Owner Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR introduces a solid foundation for content-defined chunking. The use of the FastCDC algorithm via an external library is appropriate, and the streaming implementation using channels is well-designed. The local file storage implementation correctly uses atomic writes for data integrity. I've identified a few areas for improvement: there's a data type inconsistency for the chunk size across the different database schemas (MySQL vs. Postgres/SQLite), and the ref_count column in MySQL could be UNSIGNED. Overall, great work on adding this significant feature.

@kalbasit
Owner Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces the foundational infrastructure for content-defined chunking (CDC) using the FastCDC algorithm. It adds new database tables chunks and nar_file_chunks with corresponding migrations for MySQL, PostgreSQL, and SQLite. A local file-based chunk store is also implemented, along with unit tests for the new chunking logic and storage.

My review focuses on ensuring data integrity in the database schemas and improving code clarity. I've found a few inconsistencies in the generated schema dump files compared to the migrations, which should be addressed. I've also suggested adding CHECK constraints to prevent invalid data in PostgreSQL and SQLite, noting that such schema changes should be applied consistently across all supported backends. Additionally, I've suggested a small refactoring in the local chunk store implementation for better readability.
