feat(SplitBlob/SpliceBlob): add chunking algorithm #357
Conversation
Pull request overview
This pull request adds Content-Defined Chunking (CDC) algorithm negotiation to the Remote Execution API, enabling distributed, deterministic, and reproducible chunking between clients and servers. It introduces FastCDC 2020 as the first supported algorithm with configuration parameters for optimal deduplication.
Changes:
- Adds ChunkingFunction enum and ChunkingConfiguration message to define supported chunking algorithms and their parameters
- Extends SplitBlobRequest, SplitBlobResponse, and SpliceBlobRequest messages with chunking_function fields
- Introduces FastCDC 2020 algorithm support with configurable parameters (avg_chunk_size_bytes, normalization_level, seed) and sensible defaults (512 KiB average, 2 MiB threshold); a rough sketch of these shapes follows below
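As a rough illustration of the shapes described above (the message, enum, and field names come from this summary; the Go struct layout, field types, and enum values are assumptions, not the generated bindings):

```go
package main

import "fmt"

// Illustrative mirror of the new proto additions; not the generated code.
type ChunkingFunction int32

const (
	CHUNKING_FUNCTION_UNKNOWN ChunkingFunction = 0 // assumed zero value
	FASTCDC_2020              ChunkingFunction = 1 // first supported algorithm
)

// FastCDC 2020 parameters named in this PR; Go types are guesses.
type FastCDC2020Params struct {
	AvgChunkSizeBytes  uint32 // avg_chunk_size_bytes
	NormalizationLevel uint32 // normalization_level
	Seed               uint64 // seed
}

// ChunkingConfiguration mirrors what the server advertises: which functions
// it supports plus their parameters.
type ChunkingConfiguration struct {
	ChunkingFunctions []ChunkingFunction
	FastCDC2020       *FastCDC2020Params
}

// SplitBlobRequest gains a chunking_function field so the client can say
// which algorithm it expects the returned chunks to follow.
type SplitBlobRequest struct {
	// ...existing fields elided...
	ChunkingFunction ChunkingFunction
}

func main() {
	cfg := ChunkingConfiguration{
		ChunkingFunctions: []ChunkingFunction{FASTCDC_2020},
		FastCDC2020:       &FastCDC2020Params{AvgChunkSizeBytes: 512 * 1024}, // 512 KiB default
	}
	fmt.Printf("%+v\n", cfg)
}
```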
sluongng
left a comment
I'm a bit occupied this week, so I plan to give this a better read next week. Got some small nits, but lgtm overall.
I think it would be nice if we could provide a test vector for this, similar to https://github.com/bazelbuild/remote-apis/blob/main/build/bazel/remote/execution/v2/sha256tree_test_vectors.txt, which was added for sha256tree. This way, folks can test their implementation against the test vector to verify that the generated chunks are identical across implementations.
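For what it's worth, a verification harness against such a vector could stay very small. This is only a sketch: the one-`size digest`-per-line vector format and the chunkFile helper are invented here for illustration, not anything this repo or the sha256tree vector actually specifies.

```go
package main

// Sketch of how an implementation could check itself against a shared test
// vector: chunk a fixture file and compare each chunk's size and SHA-256
// digest against the expected list.

import (
	"bufio"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

// chunkFile stands in for the FastCDC implementation under test; it should
// return the raw bytes of each chunk, in order.
func chunkFile(path string) ([][]byte, error) {
	return nil, fmt.Errorf("plug in the chunker being verified")
}

func verify(fixture, vectorPath string) error {
	chunks, err := chunkFile(fixture)
	if err != nil {
		return err
	}
	f, err := os.Open(vectorPath)
	if err != nil {
		return err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for i := 0; s.Scan(); i++ {
		var wantSize int
		var wantHash string
		if _, err := fmt.Sscanf(s.Text(), "%d %s", &wantSize, &wantHash); err != nil {
			return err
		}
		if i >= len(chunks) {
			return fmt.Errorf("expected at least %d chunks, got %d", i+1, len(chunks))
		}
		sum := sha256.Sum256(chunks[i])
		if len(chunks[i]) != wantSize || hex.EncodeToString(sum[:]) != wantHash {
			return fmt.Errorf("chunk %d mismatch", i)
		}
	}
	return s.Err()
}

func main() {
	if err := verify("SekienAkashita.jpg", "fastcdc2020_test_vectors.txt"); err != nil {
		fmt.Println(err)
	}
}
```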
Have you benchmarked this against https://github.com/buildbarn/go-cdc?tab=readme-ov-file#maxcdc-content-defined-chunking-with-lookahead ?
@EdSchouten, I did, and I will likely need to tune my parameters a bit for this, as I expect I'm not getting the full benefit of MaxCDC by keeping the chunking threshold > max chunk size bytes, which simplifies the implementation a lot. Here's my fairly crude benchmark with a 2 MiB threshold and a 512 KiB average size, against my ~300 GB disk cache from the last few weeks. Probably ignore the throughput, since I'm running with a lot of parallelism and the bottleneck is usually not the chunker but the compression or hashing; all chunkers are sufficiently fast to not be the bottleneck. My guess is that sometimes cutting off the small end of a file, especially in GoLink artifacts, allows for a matching chunk, so only that last small bit gets discarded, although I didn't really dig into specific examples. (Note: nc2 means normalization coefficient = 2.) Here's the benchmark code for reference: buildbuddy-io/buildbuddy#11223
You're likely getting bad performance out of MaxCDC because you made the spread between min/max far too big (16). At least in my tests I observed the optimum to be slightly below 4. Also, to make it an apples-to-apples comparison, you should measure that both algorithms produce the same average chunk size, because the way you set it right now, the actual average chunk size is about twice as big. Can you please rerun your tests for MaxCDC with something like this?
minSize := averageSize * 2 / 5 // 1 - 3/5
maxSize := minSize * 4         // 1 + 3/5
As demonstrated in the go-cdc repo, both algorithms can be implemented with the same amount of code. It's just that MaxCDC's optimized implementation spans more lines because I wanted to document it extensively. The 'simple' implementation in the go-cdc repo is even smaller than FastCDC. Furthermore, MaxCDC only takes two configuration parameters (min/max size), whereas FastCDC takes at least five (min/average/max size, MaskS, MaskL). It is perfectly reasonable to state that you prefer FastCDC because there happen to be some existing implementations out there. But stating that it's 'slightly simpler' is objectively false.
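Plugging the 512 KiB average used elsewhere in this thread into that suggestion gives the following (plain arithmetic for illustration, not any particular chunker API):

```go
package main

import "fmt"

func main() {
	averageSize := 512 * 1024      // 512 KiB target average chunk size
	minSize := averageSize * 2 / 5 // 209715 bytes, ~205 KiB = avg * (1 - 3/5)
	maxSize := minSize * 4         // 838860 bytes, ~819 KiB ≈ avg * (1 + 3/5)
	fmt.Println(minSize, maxSize)  // a 4x spread between min and max
}
```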
@EdSchouten Nice! That worked great, thanks for the parameter recommendations, and sorry about the false assertion about simplicity. Looking at it again, the two-parameter design is very simple. Below are the results I got with those parameters. Deduplication looks slightly better with MaxCDC, though it is pretty similar, likely because blobs are on average being split into slightly more chunks (42.8 vs 35 avg chunks per blob). I can see the benefits of MaxCDC in its throughput (4877 MB/s vs ~4000 MB/s) and in having the tightest min/max spread. I really like the MaxCDC algorithm and think it's a great design. That said, at least to start, getting the community aligned is the most important thing. Starting with FastCDC is helpful because it matches an IEEE paper and has many existing implementations and usages; I simply picked it because I think it would be the most likely to get alignment and progress. I have also made a lot of progress on the client and server, getting everything tuned and working, but I'm happy to adjust if there's alignment. Rather than spend too much time comparing without making progress, I would be happy to add both algorithms. I can update the PR to include MaxCDC, and then the Bazel client can just pick which one to use if the server advertises both of them. Do you prefer this? EDIT: I created tyler-french#1 against this branch. If we prefer keeping both, we can merge that into here before this goes to
Yeah, that would be great!
@tjgq 👋🏻 would you have time to take a look at this PR? Thanks!
sluongng
left a comment
I think this is clean and nice. A few nits to make the spec stricter.
Given the context of the other PR, I do get why you would want to add chunking_function to SpliceRequest and SplitResponse.
If the server were doing all the chunking and concatenation live, I don't think those fields would be needed. But if it is simply looking up existing "big blob manifests", then those values might be useful for the client/server to store and look up the manifests? Is that the intention?
@EdSchouten @tjgq, since the current two algorithms both have their respective open-source implementations available, do we still want a test vector in this repo, similar to the sha256tree test vector, to help others validate their implementations? I know we asked for test vectors in a meeting many moons ago when we discussed the chunking algorithms; I'm not entirely sure they are still necessary. I suspect that if the two maintainers just provide released binaries for each algorithm in their repos, perhaps with the config knobs as flags, then we can simply refer to those CLIs as test vectors?
I have been using this as a test vector: https://github.com/nlfiedler/fastcdc-rs/blob/master/test/fixtures/SekienAkashita.jpg
I added the same test here: https://github.com/buildbuddy-io/fastcdc2020/blob/main/fastcdc/fastcdc_test.go#L11-L36
The resulting chunks are:
Should I add this?
Perhaps we can use the permalink https://github.com/nlfiedler/fastcdc-rs/blob/49c3d0b8043a7c1c2d9aca75e868d3791ffedcf3/test/fixtures/SekienAkashita.jpg with the SHA256 fingerprint to identify the blob? The file is 107 KB, so it's not too bad.
Tyler and I had a private chat via Slack, but I thought I should share this here as well. I got nerd sniped by the discussion we had about the addition of
This got me thinking: would it be possible to design a somewhat decent chunking algorithm that always yields chunks of size
I just added such an algorithm to the go-cdc repository under the name RepMaxCDC. Benchmark results look very promising. In fact, I wasn't able to find any measurable difference between MaxCDC and RepMaxCDC, even though the latter has far tighter bounds on object size. @tyler-french (and others), I would really like to invite you to give this a try. For this PR my recommendation would be to:
Introduce ChunkingFunction, an enum of known chunking algorithms that the server can recommend to the client. Provide FastCDC_2020 as the first explicit chunking algorithm. The server advertises these through a new chunking_configuration field in the CacheCapabilities message. There, the server may set the chunking functions that it supports, as well as the relevant configuration parameters for each chunking algorithm.
Added test vectors and referenced them here: buildbuddy-io/fastcdc2020#4, which can be used to verify
@EdSchouten @fmeum @sluongng made some changes; please take another look when you get the chance! Summary is:
@tjgq Please take another look when you get the chance! Once this gets merged, we should cut a new release.
Even though I already approved this PR a couple of days ago, just wanted to say that this is good to land w.r.t. RepMaxCDC. The reference implementation available at https://github.com/buildbarn/go-cdc works reliably now, and performance is on par with FastCDC. |
For CDC (Content-Defined Chunking), having the client and server agree upon a chunking algorithm unlocks a new level of possible improvements, where we can have distributed, deterministic, and reproducible chunking.
This PR adds chunking algorithm negotiation to GetCapabilities. Most notably, it:
If a chunking algorithm is announced by the server, the client will expect Split responses to be chunked using one of these algorithms. The client should also try to chunk using the same algorithm, so that the server will accept new chunking information via Splice, and so that many clients can de-duplicate storage with shared chunks.
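As a rough sketch of that negotiation on the client side (the type and field names below are illustrative stand-ins for whatever the generated bindings end up exposing, not the actual API):

```go
package main

import "fmt"

// Illustrative stand-ins for the proto-generated types; the real names come
// from the remote-apis bindings once this PR lands.
type ChunkingFunction int32

const (
	FASTCDC_2020 ChunkingFunction = 1
	REPMAXCDC    ChunkingFunction = 2 // hypothetical value for the second algorithm
)

type CacheCapabilities struct {
	SupportedChunkingFunctions []ChunkingFunction
}

// pickChunkingFunction returns the first server-advertised algorithm the
// client also implements, or false if chunked Split/Splice should be skipped.
func pickChunkingFunction(caps *CacheCapabilities, supported map[ChunkingFunction]bool) (ChunkingFunction, bool) {
	for _, fn := range caps.SupportedChunkingFunctions {
		if supported[fn] {
			return fn, true
		}
	}
	return 0, false
}

func main() {
	caps := &CacheCapabilities{SupportedChunkingFunctions: []ChunkingFunction{FASTCDC_2020}}
	clientSupports := map[ChunkingFunction]bool{FASTCDC_2020: true}
	if fn, ok := pickChunkingFunction(caps, clientSupports); ok {
		// Use fn both when interpreting SplitBlobResponse chunks and when
		// producing chunks for SpliceBlobRequest, so server and client agree.
		fmt.Println("chunking with", fn)
	}
}
```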
Why FastCDC 2020?
FastCDC 2020 is fast (~5 GB/s on an AMD Ryzen 9950) and is backed by a clear spec described in the paper (IEEE paper pdf). It is popular, with https://github.com/nlfiedler/fastcdc-rs mirroring the paper's implementation.
This algorithm is:
Why RepMaxCDC?
RepMaxCDC is another fast chunking algorithm with many benefits, and it is used in https://github.com/buildbarn/bonanza. This algorithm is also included to give users multiple options for configuring chunking.
FastCDC Config Defaults
Why the FastCDC 2020 threshold = max chunk size
We only chunk blobs larger than the threshold size. Having the threshold be >= the max chunk size is an important design decision for the first iterations of the implementation here, mainly because a cyclic re-chunking possibility arises if it's not upheld. For example, if the chunking threshold is only 1 MB and a chunker (512 KB average, 2 MB max) produces a 1.5 MB chunk, the server is going to try to re-chunk that chunk again. It also means the CAS can't tell whether a stored blob is a full blob, a chunk, or a file that was chunked into a single chunk. For FastCDC 2020, the threshold will be set to the max chunk size, i.e. 4*avg_size. This also has the benefit of guaranteeing that any blob we do split always yields more than one chunk, which simplifies implementations.
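A minimal sketch of how this invariant plays out, assuming a hypothetical maybeSplit helper (and a dummy chunker) rather than the real server code:

```go
package main

import "fmt"

const (
	avgChunkSize   = 512 * 1024       // FastCDC 2020 average target
	maxChunkSize   = 4 * avgChunkSize // 2 MiB
	chunkThreshold = maxChunkSize     // only blobs larger than this get split
)

// maybeSplit decides whether a blob should be chunked at all. Because the
// threshold equals the max chunk size, any chunk a compliant chunker emits
// (<= maxChunkSize) is itself at or below the threshold and will never be
// re-chunked, and any blob that is split always yields at least two chunks.
func maybeSplit(blob []byte) [][]byte {
	if len(blob) <= chunkThreshold {
		return [][]byte{blob} // store as a single, whole blob
	}
	return fastCDCSplit(blob) // placeholder for the actual FastCDC 2020 chunker
}

// fastCDCSplit stands in for a real implementation; here it just cuts fixed
// maxChunkSize pieces so the sketch runs.
func fastCDCSplit(blob []byte) [][]byte {
	var chunks [][]byte
	for len(blob) > maxChunkSize {
		chunks = append(chunks, blob[:maxChunkSize])
		blob = blob[maxChunkSize:]
	}
	return append(chunks, blob)
}

func main() {
	fmt.Println(len(maybeSplit(make([]byte, 3*1024*1024)))) // 2 chunks for a 3 MiB blob
}
```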
Why the default chunk size?
512 KiB hits the sweet spot for Bazel artifacts in medium-sized repos like https://github.com/buildbuddy-io/buildbuddy with some overlapping GoLink objects.
Using my --disk_cache from the past 1-2 months of development, I ran some benchmarking with different sizes and got the following results: de-duplication is strong at 512 KiB (at 35%), and this only affects 4% of files.
We could drop to 64 KiB, but we'd only get ~10% more de-duplication savings and still need to chunk 3x the number of files.
The sizes selected for RepMaxCDC similarly produce an average chunk size of around 512 KiB.
Thank you to @sluongng for much of the initial version of this PR