-
-
Notifications
You must be signed in to change notification settings - Fork 10
feat: add core content-defined chunking (CDC) infrastructure #667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add core content-defined chunking (CDC) infrastructure #667
Conversation
Summary of ChangesHello @kalbasit, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request establishes the core infrastructure for content-defined chunking and data deduplication. It provides a robust mechanism to break down data streams into variable-sized chunks based on their content, rather than fixed boundaries. This foundation is crucial for optimizing storage and network transfer by identifying and storing unique data segments only once, thereby reducing redundancy across various data objects. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces the foundational infrastructure for content-defined chunking (CDC). It adds a chunker package using the FastCDC algorithm, a local file-based storage for chunks, and the necessary database migrations. The implementation is solid, introducing a streaming chunking mechanism and atomic file writes for the chunk store. My review includes suggestions to improve schema efficiency in the database migrations by removing redundant indexes and using more precise column types, and to enhance the robustness of the local chunk store's file writing logic.
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces the core infrastructure for content-defined chunking (CDC), which is a great step towards deduplication. The implementation of the FastCDC algorithm in pkg/chunker is solid and the use of channels for streaming is well-done. The database migrations and local file store are also included, making this a comprehensive feature foundation. I've found a few areas for improvement, mainly concerning database schema correctness to prevent potential data truncation and a fragile file-writing pattern in the local chunk store that could be made more robust.
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a solid foundation for content-defined chunking (CDC) and deduplication. The implementation includes the FastCDC algorithm via pkg/chunker, corresponding database migrations and schema updates for chunks and nar_file_chunks tables, and a local file-based chunk store. The code is well-structured and includes relevant unit tests. I've identified a few areas for improvement regarding database schema consistency, potential integer overflow on 32-bit systems, and resource management in the local storage implementation. Overall, this is a great addition.
This commit adds the foundation for content-defined chunking and deduplication: - Implemented pkg/chunker using the FastCDC algorithm for efficient streaming deduplication. - Added database migrations for 'chunks' and 'nar_file_chunks' tables. - Implemented a local file-based chunk store with a two-level directory structure. - Includes unit tests for chunking logic and local storage. - Added github.com/jotfs/fastcdc-go dependency.
844b061 to
8b0a778
Compare
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This PR introduces a solid foundation for content-defined chunking. The use of the FastCDC algorithm via an external library is appropriate, and the streaming implementation using channels is well-designed. The local file storage implementation correctly uses atomic writes for data integrity. I've identified a few areas for improvement: there's a data type inconsistency for the chunk size across the different database schemas (MySQL vs. Postgres/SQLite), and the ref_count column in MySQL could be UNSIGNED. Overall, great work on adding this significant feature.
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces the foundational infrastructure for content-defined chunking (CDC) using the FastCDC algorithm. It adds new database tables chunks and nar_file_chunks with corresponding migrations for MySQL, PostgreSQL, and SQLite. A local file-based chunk store is also implemented, along with unit tests for the new chunking logic and storage.
My review focuses on ensuring data integrity in the database schemas and improving code clarity. I've found a few inconsistencies in the generated schema dump files compared to the migrations, which should be addressed. I've also suggested adding CHECK constraints to prevent invalid data in PostgreSQL and SQLite, noting that such schema changes should be applied consistently across all supported backends. Additionally, I've suggested a small refactoring in the local chunk store implementation for better readability.

This commit adds the foundation for content-defined chunking and deduplication: