Chunking Algorithms #336

Draft
sluongng wants to merge 1 commit into bazelbuild:main from sluongng:sluongng/chunking-algo

Collaborator

@sluongng sluongng commented Jul 8, 2025

Based on #282

Introduce ChunkingFunction, an enum of known chunking
algorithms that the server can recommend to the client.

Provide FastCDC_2020 as the first explicit chunking algorithm.

The server advertises these through a new chunking_configuration field in
the CacheCapabilities message. There, the server may set the chunking
functions that it supports, as well as the relevant configuration
parameters for each chunking algorithm.
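
To make the negotiation concrete, here is a minimal Python sketch of how a client might pick a function from what the server advertises. The class, field, and helper names (`ChunkingConfiguration`, `supported_functions`, `pick_chunking_function`) and the numeric enum values are hypothetical stand-ins for illustration, not the generated proto API.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class ChunkingFunction(IntEnum):
    # Numeric values are assumptions made for this sketch.
    UNKNOWN = 0
    FASTCDC_2020 = 1


@dataclass
class ChunkingConfiguration:
    # Hypothetical mirror of the chunking_configuration capability field.
    supported_functions: list = field(default_factory=list)


def pick_chunking_function(config, client_supported):
    """Return the first server-advertised function the client also implements.

    Falls back to UNKNOWN when there is no overlap, which a client could
    treat as "do not chunk on the client side".
    """
    for fn in config.supported_functions:
        if fn in client_supported:
            return fn
    return ChunkingFunction.UNKNOWN
```

A client that only implements FastCDC would call `pick_chunking_function(config, {ChunkingFunction.FASTCDC_2020})` once after fetching capabilities and reuse the result for later requests.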


I recommend reading https://joshleeb.com/posts/fastcdc.html to understand more about the available FastCDC configuration parameters.
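
For readers who want a feel for what the minimum, average, and maximum sizes and the normalization level control, below is a simplified content-defined chunking sketch in Python. It is not the FastCDC 2020 reference algorithm: the gear table is derived from SHA-256 instead of a fixed published table, the masks use contiguous top bits, and all function and parameter names are assumptions made for this example.

```python
import hashlib

MASK64 = (1 << 64) - 1
# Deterministic 64-bit gear table (an assumption; real implementations ship a fixed table).
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big") for i in range(256)]


def _top_bits_mask(n):
    # Mask covering the n most significant bits of a 64-bit hash.
    return ((1 << n) - 1) << (64 - n)


def cut_point(data, start, min_size, avg_size, max_size, normalization=2):
    """Return the exclusive end offset of the chunk beginning at `start`."""
    if avg_size & (avg_size - 1) or avg_size.bit_length() - 1 <= normalization:
        raise ValueError("avg_size must be a power of two larger than 2**normalization")
    remaining = len(data) - start
    if remaining <= min_size:
        return len(data)
    end = start + min(remaining, max_size)
    normal = start + min(remaining, avg_size)
    bits = avg_size.bit_length() - 1
    strict = _top_bits_mask(bits + normalization)  # harder to match before avg_size
    loose = _top_bits_mask(bits - normalization)   # easier to match after avg_size
    h = 0
    # Bytes before min_size are skipped entirely, so no chunk is cut too early.
    for i in range(start + min_size, end):
        h = ((h << 1) + GEAR[data[i]]) & MASK64
        if h & (strict if i < normal else loose) == 0:
            return i + 1
    return end  # forced cut at max_size (or end of data)


def chunk(data, min_size=2048, avg_size=8192, max_size=65536, normalization=2):
    """Split `data` into (start, end) spans using the cut-point rule above."""
    spans, start = [], 0
    while start < len(data):
        end = cut_point(data, start, min_size, avg_size, max_size, normalization)
        spans.append((start, end))
        start = end
    return spans
```

The point relevant to this PR is that the cut points depend only on the bytes and the parameters, so a client and a server that agree on both can produce identical chunks independently.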

@sluongng sluongng force-pushed the sluongng/chunking-algo branch from 456e902 to b25c8e4 on July 8, 2025 14:33
@sluongng sluongng force-pushed the sluongng/chunking-algo branch from b25c8e4 to 27b0d6c on July 15, 2025 13:38
@sluongng
Collaborator Author

@mostynb I think most of your comments are meant for #282, which is what this PR is based on. Since #282 is merged, I have rebased this PR on top of the latest changes.

I would recommend creating a separate PR with the suggestions above.

@sluongng sluongng force-pushed the sluongng/chunking-algo branch from 27b0d6c to 39bbe03 on July 15, 2025 13:55
Comment on lines +1980 to +1984
// The chunking function that the client prefers to use.
//
// The server MAY use a different chunking function. The client MUST check
// the chunking function used in the response.
ChunkingFunction.Value chunking_function = 4;
Collaborator

  1. Is this field intended to be mandatory or optional (with the latter giving the server complete leeway in choosing a function)? If optional, can we document it?
  2. When the field is present, should it be required to match one of the functions declared in the server capabilities?

Collaborator

Ah, it defaults to UNKNOWN if unset, because this is proto3. (But perhaps it's still worth spelling it out - up to you.)
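
The exchange above can be sketched in Python as follows. `SplitRequest`, `SplitResponse`, and `server_honored_preference` are hypothetical stand-ins for the real messages, the enum values are assumed, and treating an unset field as "no preference" is one possible reading of the thread, not something the proto text states.

```python
from dataclasses import dataclass
from enum import IntEnum


class ChunkingFunction(IntEnum):
    # Numeric values are assumptions; proto3 enums default to the zero value.
    UNKNOWN = 0
    FASTCDC_2020 = 1


@dataclass
class SplitRequest:
    # Left unset, the field reads back as UNKNOWN, mirroring proto3 defaults.
    chunking_function: ChunkingFunction = ChunkingFunction.UNKNOWN


@dataclass
class SplitResponse:
    chunking_function: ChunkingFunction = ChunkingFunction.UNKNOWN


def server_honored_preference(request, response):
    """Check the function reported in the response against the client's preference.

    Assumption: UNKNOWN in the request means the client has no preference,
    so any function the server reports is acceptable.
    """
    if request.chunking_function == ChunkingFunction.UNKNOWN:
        return True
    return response.chunking_function == request.chunking_function
```

A client that stated a preference and gets `False` back would know the returned chunks were not produced by the function it asked for and could avoid mixing them with locally computed chunks.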

Comment on lines +2036 to +2037
// The chunking function that the client used to split the blob.
ChunkingFunction.Value chunking_function = 5;
Collaborator

Why is this necessary? Isn't the result of a splice completely independent from the function originally used to do the splitting?

It also imposes a requirement that the chunks must necessarily have originated from a split operation, which seems counter to the spirit of the original proposal (that split and splice are independent optimizations).

Contributor

Many of the requirements here come from wanting the server to be able to store and retrieve the information needed to convert between a blob and its chunks, and to act as a cache for that information.

I think this is important when a client does the chunking and tells the server how a certain blob can be reconstructed from chunks. If someone later calls Split on the same blob, the server can return that information without actually splitting the blob, provided the chunking function is the same. Caches can also share these entries, do chunking asynchronously, and so on, as long as the server and client agree on a reproducible way to create the chunks.

If the client calls Splice twice, with two different sets of digests and two different chunking functions, the server can also decide how to handle that. A server could choose to be very explicit about the types of chunking it supports, for example if it needs to be able to reproduce the chunks somewhere else, and so that it stores chunks that are more likely to de-duplicate stored content.
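
A minimal Python sketch of the caching idea described in this comment: the server keys each blob-to-chunks mapping by both the blob digest and the chunking function, so a later Split can be answered from a mapping recorded by an earlier Splice. `ChunkMappingStore` and its method names are hypothetical, and digests are plain strings here for brevity.

```python
class ChunkMappingStore:
    """In-memory stand-in for a server-side blob-to-chunks index."""

    def __init__(self):
        # (blob_digest, chunking_function) -> ordered tuple of chunk digests
        self._mappings = {}

    def record_splice(self, blob_digest, chunking_function, chunk_digests):
        # Two Splice calls with different functions are stored side by side
        # instead of overwriting each other.
        self._mappings[(blob_digest, chunking_function)] = tuple(chunk_digests)

    def lookup_split(self, blob_digest, chunking_function):
        # Returns None when no mapping exists for this function, in which
        # case the server would have to split the blob itself.
        return self._mappings.get((blob_digest, chunking_function))
```

Keying on the function is what lets a server keep mappings from different chunkers apart, or reject functions it cannot reproduce elsewhere.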

// If any of the advertised parameters are not within the expected range,
// the client SHOULD ignore FastCDC chunking function support.
message FastCDCParams {
// The normalization level for the FastCDC chunking algorithm.
Contributor

@tyler-french tyler-french Dec 9, 2025

We should consider exposing just avg_chunk_size_bytes, with all the other parameters set to documented defaults.
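
As a rough illustration of this suggestion, the Python sketch below derives the remaining parameters from a single average size and rejects out-of-range values. Only `avg_chunk_size_bytes` comes from the thread. The other field names, the ratios (min = avg / 4, max = avg * 4), the default normalization level of 2, and the accepted ranges are placeholder assumptions, not values proposed in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FastCDCParams:
    avg_chunk_size_bytes: int
    # The fields below are hypothetical names for the derived parameters.
    min_chunk_size_bytes: int
    max_chunk_size_bytes: int
    normalization_level: int


def params_from_avg(avg_chunk_size_bytes):
    """Fill in placeholder defaults from the average chunk size alone."""
    return FastCDCParams(
        avg_chunk_size_bytes=avg_chunk_size_bytes,
        min_chunk_size_bytes=avg_chunk_size_bytes // 4,  # assumed ratio
        max_chunk_size_bytes=avg_chunk_size_bytes * 4,   # assumed ratio
        normalization_level=2,                           # assumed default
    )


def is_usable(params):
    """Return False when a client should ignore the advertised FastCDC support."""
    sizes_ordered = (
        0 < params.min_chunk_size_bytes
        <= params.avg_chunk_size_bytes
        <= params.max_chunk_size_bytes
    )
    # The 0..3 range for the normalization level is an assumption of this sketch.
    return sizes_ordered and 0 <= params.normalization_level <= 3
```

With a single knob, the range check a client has to perform shrinks to validating one number, which is part of the appeal of the simpler message.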
