Chunking Algorithms by sluongng · Pull Request #336 · bazelbuild/remote-apis

sluongng · 2025-07-08T13:47:35Z

Based on #282

Introduce ChunkingFunction, an enum of known chunking
algorithms that the server can recommend to the client.

Provide FastCDC_2020 as the first explicit chunking algorithm.

The server advertises these through a new chunking_configuration field in
the CacheCapabilities message. There, the server may set the chunking
functions that it supports as well as the relevant configuration
parameters for each chunking algorithm.

I recommend reading https://joshleeb.com/posts/fastcdc.html to understand more about the available FastCDC configuration parameters.
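For readers who want a feel for what the min/avg/max size and normalization parameters control, here is a minimal sketch of gear-hash content-defined chunking with normalized cut masks. It is a simplification and not the FastCDC_2020 function proposed in this PR: the seeded gear table, the top-bits masks, the default sizes, and the names `cut_point` and `chunks` are all assumptions made for illustration.

```python
import random

MASK64 = (1 << 64) - 1

# Assumption: a seeded pseudo-random gear table, for illustration only.
# Real implementations ship a fixed table so that all parties agree on cuts.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def _top_bits_mask(bits):
    """Mask selecting the `bits` most significant bits of a 64-bit hash."""
    return ((1 << bits) - 1) << (64 - bits)


def cut_point(data, start, min_size, avg_size, max_size, normalization):
    """Return the end offset (exclusive) of the chunk beginning at `start`."""
    remaining = len(data) - start
    if remaining <= min_size:
        return len(data)
    end = start + min(remaining, max_size)
    normal = start + min(remaining, avg_size)
    bits = avg_size.bit_length() - 1  # log2(avg_size) for powers of two
    strict = _top_bits_mask(bits + normalization)   # harder to match, used before avg_size
    relaxed = _top_bits_mask(bits - normalization)  # easier to match, used after avg_size
    h = 0
    # Bytes before min_size are skipped, so no cut can happen there.
    for i in range(start + min_size, end):
        h = ((h << 1) + GEAR[data[i]]) & MASK64
        mask = strict if i < normal else relaxed
        if h & mask == 0:
            return i + 1
    return end  # forced cut at max_size or at the end of the input


def chunks(data, min_size=256, avg_size=1024, max_size=4096, normalization=2):
    """Yield consecutive chunks of `data`; the sizes here are hypothetical."""
    start = 0
    while start < len(data):
        end = cut_point(data, start, min_size, avg_size, max_size, normalization)
        yield data[start:end]
        start = end
```

A higher normalization level makes cuts less likely before the average size and more likely after it, which pulls the chunk-size distribution toward the average.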

build/bazel/remote/execution/v2/remote_execution.proto

sluongng · 2025-07-15T13:40:46Z

@mostynb I think most of your comments are meant for #282, which is what this PR is based on. Since #282 is merged, I have rebased this PR on top of the latest changes.

I would recommend creating a separate PR with the suggestions above.


tjgq · 2025-07-17T12:24:49Z

build/bazel/remote/execution/v2/remote_execution.proto

+  // The chunking function that the client prefers to use.
+  //
+  // The server MAY use a different chunking function. The client MUST check
+  // the chunking function used in the response.
+  ChunkingFunction.Value chunking_function = 4;


Is this field intended to be mandatory or optional (with the latter giving the server complete leeway in choosing a function)? If optional, can we document it?

When the field is present, should it be required to match one of the functions declared in the server capabilities?

Ah, it defaults to UNKNOWN if unset, because this is proto3. (But perhaps it's still worth spelling it out - up to you.)
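To make the semantics under discussion concrete, here is a rough client-side sketch. It assumes that an unset field reads as UNKNOWN (the proto3 default noted above) and that a client only states a preference the server advertises; the enum's numeric values and the helper names `preferred_function` and `can_reproduce_locally` are hypothetical and not part of the proto.

```python
from enum import IntEnum


class ChunkingFunction(IntEnum):
    UNKNOWN = 0       # what an unset proto3 enum field reads as
    FASTCDC_2020 = 1  # numeric value is an assumption for this sketch


def preferred_function(client_supported, server_advertised):
    """Pick the first client-supported function that the server advertises.

    Falls back to UNKNOWN, which leaves the choice entirely to the server.
    """
    for fn in client_supported:
        if fn in server_advertised:
            return fn
    return ChunkingFunction.UNKNOWN


def can_reproduce_locally(response_function, client_supported):
    """Check the function reported in the response before relying on it.

    The server MAY have used a different function than the one requested.
    """
    return (
        response_function != ChunkingFunction.UNKNOWN
        and response_function in client_supported
    )
```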

tjgq · 2025-07-17T12:27:28Z

build/bazel/remote/execution/v2/remote_execution.proto

+  // The chunking function that the client used to split the blob.
+  ChunkingFunction.Value chunking_function = 5;


Why is this necessary? Isn't the result of a splice completely independent from the function originally used to do the splitting?

It also imposes a requirement that the chunks must necessarily have originated from a split operation, which seems counter to the spirit of the original proposal (that split and splice are independent optimizations).

A lot of the requirements for this come from the desire for the server to be able to store and retrieve information on how to convert between a blob and chunks, and act as a cache for that information.

I think this is important when a client does the chunking and tells the server to store information on how a certain blob can be reconstructed from chunks. If someone later calls Split on the same blob, the server can return this information to the client without actually splitting the blob, provided the chunking function is the same. Caches can also share these entries, do chunking asynchronously, and so on, as long as the server and client agree on a reproducible way to create these chunks.

If the client calls Splice twice, with two different sets of digests and two different chunking functions, the server can also decide how to handle this. For example, a server could be very explicit about the types of chunking it supports if it needs to be able to reproduce the chunks somewhere else, or so that it stores chunks that are more likely to de-duplicate stored content.
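One way to picture the caching behaviour described here is a server-side index keyed by blob digest and chunking function. This is only a sketch of the idea, not how any server implements it; the class `SpliceIndex`, its method names, and the string function identifiers are hypothetical.

```python
class SpliceIndex:
    """Remembers how a blob can be rebuilt from chunks, per chunking function."""

    def __init__(self):
        # (blob_digest, chunking_function) -> ordered list of chunk digests
        self._recipes = {}

    def record_splice(self, blob_digest, chunking_function, chunk_digests):
        """Store the chunk list that a client supplied in a Splice call."""
        self._recipes[(blob_digest, chunking_function)] = list(chunk_digests)

    def lookup_split(self, blob_digest, chunking_function):
        """Return the stored chunk list for a Split call, or None on a miss.

        On a miss the server would have to chunk the blob itself.
        """
        recipe = self._recipes.get((blob_digest, chunking_function))
        return None if recipe is None else list(recipe)


index = SpliceIndex()
index.record_splice("blob-abc", "FASTCDC_2020", ["c1", "c2", "c3"])
# A second Splice of the same blob with another function is kept separately.
index.record_splice("blob-abc", "OTHER_FUNCTION", ["d1", "d2"])
```

Keying on the chunking function is what lets a later Split with the same function be answered from the index, while two Splice calls with different functions do not overwrite each other.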

tyler-french · 2025-12-09T15:25:05Z

build/bazel/remote/execution/v2/remote_execution.proto

+  // If any of the advertised parameters are not within the expected range,
+  // the client SHOULD ignore FastCDC chunking function support.
+  message FastCDCParams {
+    // The normalization level for the FastCDC chunking algorithm.


We should consider having just avg_chunk_size_bytes, with all the other parameters set to documented defaults.
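As a sketch of what this suggestion could look like on the client side, the snippet below derives the remaining parameters from the average chunk size and checks them against an expected range. The specific defaults (min = avg / 4, max = avg * 4, normalization level 2), the range bounds, and every field name other than avg_chunk_size_bytes are assumptions for illustration, not values taken from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FastCDCParams:
    avg_chunk_size_bytes: int
    min_chunk_size_bytes: int   # hypothetical field name
    max_chunk_size_bytes: int   # hypothetical field name
    normalization_level: int    # hypothetical field name


def params_from_avg(avg_chunk_size_bytes):
    """Derive all parameters from the average size using assumed defaults."""
    avg = avg_chunk_size_bytes
    if avg <= 0 or avg & (avg - 1):
        raise ValueError("average chunk size must be a positive power of two")
    return FastCDCParams(avg, avg // 4, avg * 4, 2)


def in_expected_range(params, lowest_avg=1024, highest_avg=1 << 20):
    """Return False when the client should ignore FastCDC support.

    The bounds are placeholders; the real expected ranges would be documented
    alongside the proto.
    """
    return (
        lowest_avg <= params.avg_chunk_size_bytes <= highest_avg
        and params.min_chunk_size_bytes < params.avg_chunk_size_bytes
        and params.avg_chunk_size_bytes < params.max_chunk_size_bytes
        and 0 <= params.normalization_level <= 3
    )
```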

sluongng force-pushed the sluongng/chunking-algo branch from 456e902 to b25c8e4 Compare July 8, 2025 14:33

mostynb reviewed Jul 12, 2025

sluongng force-pushed the sluongng/chunking-algo branch from b25c8e4 to 27b0d6c Compare July 15, 2025 13:38

sluongng force-pushed the sluongng/chunking-algo branch from 27b0d6c to 39bbe03 Compare July 15, 2025 13:55

sluongng force-pushed the sluongng/chunking-algo branch from 39bbe03 to 19f1152 Compare July 15, 2025 14:05

mostynb mentioned this pull request Jul 16, 2025

Clean up the split/splice documentation #337

Merged

tjgq requested changes Jul 17, 2025

tyler-french reviewed Dec 9, 2025

sluongng wants to merge 1 commit into bazelbuild:main from sluongng:sluongng/chunking-algo

4 participants
