
feat(SplitBlob/SpliceBlob): add chunking algorithm #357

Merged
tjgq merged 2 commits into bazelbuild:main from tyler-french:chunking-algo on Feb 10, 2026

Conversation

@tyler-french
Contributor

@tyler-french tyler-french commented Jan 7, 2026

For CDC (Content-Defined Chunking), having the client and server agree on a chunking algorithm unlocks a new class of improvements: distributed, deterministic, and reproducible chunking.

This PR adds a chunking algorithm negotiation to GetCapabilities. Most notably, it:

  • Defaults to no algorithm (server is not expected to be able to chunk, but can verify/store chunking info if it has Split/Splice capabilities)
  • Adds FastCDC 2020 and RepMaxCDC as supported algorithms with suggested defaults.

If a chunking algorithm is announced by the server, the client will expect Split responses to be chunked using one of these algorithms. The client should also try to chunk using the same algorithm, so that the server will accept new chunking information via Splice and so that many clients can de-duplicate storage with shared chunks.
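
To make the negotiation concrete, here is a minimal Go sketch of the client-side selection. The struct and field names are simplified placeholders, not the actual remote-apis proto types added in this PR:

```go
// ChunkingConfig is a simplified stand-in for the server-advertised chunking
// capabilities (the real fields live in CacheCapabilities in the protos).
type ChunkingConfig struct {
	Function     string // e.g. "FASTCDC_2020" or "REPMAXCDC"
	AvgChunkSize int64
}

// pickChunker returns the first server-advertised algorithm the client also
// implements, or nil if there is no overlap. With no overlap, the client
// uploads whole blobs and the server may still Split/Splice on its own.
func pickChunker(serverConfigs []ChunkingConfig, clientSupported map[string]bool) *ChunkingConfig {
	for i := range serverConfigs {
		if clientSupported[serverConfigs[i].Function] {
			return &serverConfigs[i]
		}
	}
	return nil
}
```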

Why FastCDC 2020?

FastCDC 2020 is fast (~5 GB/s on an AMD Ryzen 9950) and is backed by a clear spec described in the paper (IEEE paper PDF). It is popular, with https://github.com/nlfiedler/fastcdc-rs mirroring the paper's implementation.
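
For readers new to this family of algorithms, here is a rough Go sketch of a gear-hash cut-point search in the FastCDC style. It is illustrative only: it omits FastCDC 2020's normalized chunking (the stricter/looser masks around the average size), and the gear table contents and seed handling are defined by the spec and test vectors, not here:

```go
// gearTable holds 256 pseudo-random 64-bit values; the real table is fixed by
// the spec (and, in this proposal, derived from the seed parameter). Zeroed
// here purely for brevity.
var gearTable [256]uint64

// cutPoint returns the length of the next chunk. avgSize is assumed to be a
// power of two so that avgSize-1 can serve as the hash mask.
func cutPoint(data []byte, minSize, avgSize, maxSize int) int {
	if len(data) <= minSize {
		return len(data)
	}
	if len(data) > maxSize {
		data = data[:maxSize]
	}
	mask := uint64(avgSize) - 1
	var h uint64
	for i := minSize; i < len(data); i++ {
		h = (h << 1) + gearTable[data[i]] // gear rolling hash
		if h&mask == 0 {
			return i + 1 // content-defined cut point
		}
	}
	return len(data) // no cut point found before maxSize
}
```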


Why RepMaxCDC?

RepMaxCDC is another fast chunking algorithm with a number of benefits, and it is used in https://github.com/buildbarn/bonanza. It is included so that users have more than one option when configuring chunking.

FastCDC Config Defaults

Why the FastCDC 2020 threshold = max chunk size

We only chunk blobs larger than the threshold size. Keeping the threshold >= the max chunk size is an important design decision for the first iterations of the implementation here, mainly because a cyclic re-chunking possibility opens up if it isn't upheld. For example, if the chunking threshold is only 1MB and a chunker (avg size 512k, max 2MB) produces a 1.5MB chunk, the server will try to re-chunk it again. Not upholding it also means the CAS doesn't know whether a stored blob is a full blob, a chunk, or a file that was chunked into a single chunk. For FastCDC2020, the threshold will be set to the max chunk size, i.e. 4*avg_size. This also has the benefit of guaranteeing that we'll always get >1 chunk, which simplifies implementations.
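
A minimal Go sketch of that rule, with the chunker argument standing in for whichever negotiated CDC implementation is in use:

```go
// maybeChunk stores blobs at or below the threshold whole. Because the
// threshold equals the maximum chunk size, any blob that does get chunked is
// strictly larger than any single chunk can be, so chunking always yields
// more than one chunk and a chunk is never a candidate for re-chunking.
func maybeChunk(blob []byte, maxChunkSize int, chunker func([]byte) [][]byte) [][]byte {
	threshold := maxChunkSize // for FastCDC 2020 in this proposal: 4 * avg_chunk_size
	if len(blob) <= threshold {
		return [][]byte{blob} // stored as a single, unchunked blob
	}
	return chunker(blob)
}
```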

Why the default chunk size?

512kB hits the sweet spot for Bazel artifacts in medium-sized repos like https://github.com/buildbuddy-io/buildbuddy with some overlapping GoLink objects.

Using my --disk_cache from the past 1-2 months of development, I ran some benchmarking using different sizes, and got the following results:

AvgSize  │ %Chunked   │ %Reused    │ %Unique    │ Dedup%   │ Saved
─────────────────────────────────────────────────────────────────────────────────────
16KB    │      18.1% │      48.0% │      52.0% │    50.0% │    141.91 GB
32KB    │      15.8% │      46.0% │      54.0% │    48.2% │    136.90 GB
64KB    │      13.2% │      42.3% │      57.7% │    45.8% │    130.18 GB
128KB   │       7.5% │      38.4% │      61.6% │    43.1% │    122.24 GB
256KB   │       6.1% │      32.5% │      67.5% │    39.8% │    112.96 GB
512KB   │       4.2% │      24.0% │      76.0% │    35.8% │    101.52 GB
1MB     │       1.9% │      16.4% │      83.6% │    31.0% │     88.00 GB
2MB     │       1.6% │       8.8% │      91.2% │    25.9% │     73.47 GB

De-duplication is strong at 512kiB (at 35%), and chunking only affects 4% of files.

We could drop to 64k, but we'd only get ~10% more de-duplication savings and still need to chunk 3x the number of files.

The sizes for RepMaxCDC are chosen to similarly produce an average chunk size of around 512kiB.

Thank you to @sluongng for much of the initial version of this PR

@tyler-french tyler-french force-pushed the chunking-algo branch 2 times, most recently from 5050f99 to e7d3f11 on January 26, 2026 19:22
@tyler-french tyler-french changed the title from "WIP: server should prefer a chunking algorithm" to "feat(SplitBlob/SpliceBlob): add chunking algorithm" on Jan 26, 2026
@tyler-french tyler-french marked this pull request as ready for review January 27, 2026 03:42
Copilot AI review requested due to automatic review settings January 27, 2026 03:42

Copilot AI left a comment


Pull request overview

This pull request adds Content-Defined Chunking (CDC) algorithm negotiation to the Remote Execution API, enabling distributed, deterministic, and reproducible chunking between clients and servers. It introduces FastCDC 2020 as the first supported algorithm with configuration parameters for optimal deduplication.

Changes:

  • Adds ChunkingFunction enum and ChunkingConfiguration message to define supported chunking algorithms and their parameters
  • Extends SplitBlobRequest, SplitBlobResponse, and SpliceBlobRequest messages with chunking_function fields
  • Introduces FastCDC 2020 algorithm support with configurable parameters (avg_chunk_size_bytes, normalization_level, seed) and sensible defaults (512 KiB average, 2 MiB threshold)


@tyler-french tyler-french force-pushed the chunking-algo branch 2 times, most recently from 685803d to d2d2404 on January 29, 2026 02:37
Collaborator

@sluongng sluongng left a comment


I'm a bit occupied this week, so I plan to give this a better read next week. Got some small nits, but LGTM overall.

I think it would be nice if we could provide a test vector for this, similar to https://github.com/bazelbuild/remote-apis/blob/main/build/bazel/remote/execution/v2/sha256tree_test_vectors.txt, which was added for sha256tree. This way, folks can test their implementation against the test vector to verify that the generated chunks are identical across implementations.

@tyler-french tyler-french force-pushed the chunking-algo branch 3 times, most recently from 215416a to eeb78e5 on February 2, 2026 16:12
@EdSchouten
Collaborator

Have you benchmarked this against buildbarn/go-cdc#maxcdc-content-defined-chunking-with-lookahead ?

@tyler-french
Contributor Author

tyler-french commented Feb 2, 2026

Have you benchmarked this against buildbarn/go-cdc#maxcdc-content-defined-chunking-with-lookahead ?

@EdSchouten, I did, and I will likely need to tune my parameters a bit for this, as I expect I'm not getting the full benefits of MaxCDC by keeping the chunking threshold > max chunk size bytes, which simplifies the implementation a lot.
I think it should be very easy to add MaxCDC as another supported algorithm in Bazel and also in remote-apis, but I think FastCDC is a safe starting point since it's a bit better known. I was able to implement it quickly in the Bazel Java codebase.

Here's my pretty crude benchmark with a 2MiB threshold and a 512kiB average size, against my ~300GB disk cache from the last few weeks.

Probably ignore the throughput, since I'm doing a lot of parallelism and the bottlenecks are usually not the chunkers, but the compression or hashing. All chunkers are sufficiently fast to not be the bottleneck.

My guess is that sometimes cutting off the small end of a file, especially in GoLink artifacts, allows for a matching chunk, so only that last small bit gets discarded; I didn't really dig into specific examples though.

(note the nc2 means normalization coefficient = 2)

Algorithm         │ Dedup%   │ Saved        │ Chunks/F │ Throughput │ Time        
─────────────────────────────────────────────────────────────────────────────────────
fastcdc-2016      │   31.39% │     88.67 GB │     36.0 │  4310.2 MB/s │ 1.1m
fastcdc-2020-nc0  │   29.00% │     81.90 GB │     31.4 │  4508.4 MB/s │ 1.1m
fastcdc-2020-nc1  │   30.01% │     84.77 GB │     32.4 │  4482.8 MB/s │ 1.1m
fastcdc-2020-nc2  │   30.81% │     87.02 GB │     36.1 │  4409.2 MB/s │ 1.1m
fastcdc-2020-nc3  │   30.81% │     87.03 GB │     38.9 │  4382.2 MB/s │ 1.1m
go-cdc-max        │   27.57% │     77.89 GB │     21.9 │  4545.8 MB/s │ 1.1m

Here's the test benchmark code for ref: buildbuddy-io/buildbuddy#11223

@EdSchouten
Collaborator

EdSchouten commented Feb 3, 2026

You're likely getting bad performance out of MaxCDC because you made the spread between min/max far too big (16). At least in my tests I observed the optimal to be slightly below 4.

Also, to make it an apples-to-apples comparison, you should check that both algorithms produce the same average chunk size. The way you set it right now, the actual average chunk size is about twice averageSize: (4 + 1/4) / 2 = 2.125.

Can you please rerun your tests for MaxCDC with something like this?

	minSize := averageSize * 2 / 5 // 1 - 3/5
	maxSize := minSize * 4         // 1 + 3/5
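
(Working those formulas through with the 512 KB average used elsewhere in this thread: minSize = 512 KB × 2/5 ≈ 204.8 KB and maxSize = 4 × minSize ≈ 819.2 KB, a min/max spread of 4 whose midpoint lands back on the average.)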

FastCDC is a safe starting point since its slightly simpler, [...]

As demonstrated in the go-cdc repo, both algorithms can be implemented with the same amount of code. It's just that MaxCDC's optimized implementation spans more lines because I wanted to document it extensively. The 'simple' implementation in the go-cdc repo is even smaller than FastCDC. Furthermore, MaxCDC only takes two configuration parameters (min/max size), whereas FastCDC takes at least five (min/average/max size, MaskS, MaskL).

It is perfectly reasonable to state that you prefer FastCDC because there happen to be some existing implementations out there. But stating that it's 'slightly simpler' is objectively false.

@tyler-french
Contributor Author

tyler-french commented Feb 3, 2026

@EdSchouten Nice! That worked great, thanks for the parameter recommendations, and sorry about the false assertion about simplicity. Looking at it again, the two-parameter design is very simple.

Below are the results I got with those parameters. Deduplication looks slightly better with MaxCDC, though it is pretty similar, likely because blobs are on average being split into slightly more chunks (42.8 vs 35 avg chunks per blob). I can see the benefits of MaxCDC in its throughput (4877 MB/s vs ~4000 MB/s) and in having the tightest min/max spread.

I do really like the MaxCDC algorithm and I think it's a great design. I think, at least to start, getting the community aligned is the most important thing. Starting with FastCDC is helpful because it matches an IEEE paper and has many existing implementations and usages; I simply picked it because I think it is the most likely to get alignment and progress. I have also made a lot of progress getting everything tuned and working on the client and server, but I am happy to adjust if there's alignment.

Rather than spend too much time comparing without making progress, I would be happy to add both algorithms. I can update the PR to include MaxCDC. Then, the Bazel client can just pick which one to use if the server advertises both of them. Do you prefer this?

EDIT: I created tyler-french#1 against this branch. If we prefer keeping both, we can merge that into here before this goes to main.

Algorithm         │ Dedup%   │ Saved        │ Chunks/F │ Throughput │ Time        
─────────────────────────────────────────────────────────────────────────────────────
fastcdc-2016      │   33.59% │     88.63 GB │     35.5 │  4045.6 MB/s │ 1.1m
fastcdc-2020-nc0  │   31.01% │     81.81 GB │     31.0 │  3941.7 MB/s │ 1.1m
fastcdc-2020-nc1  │   32.10% │     84.70 GB │     31.9 │  3938.9 MB/s │ 1.1m
fastcdc-2020-nc2  │   32.97% │     86.99 GB │     35.6 │  4066.8 MB/s │ 1.1m
fastcdc-2020-nc3  │   32.98% │     87.02 GB │     38.4 │  3991.4 MB/s │ 1.1m
go-cdc-max        │   34.91% │     92.12 GB │     42.8 │  4877.3 MB/s │ 55.4s

Algorithm         │ Avg        │ Stdev      │ Min        │ Max       
──────────────────────────────────────────────────────────────────────
fastcdc-2016      │  615.64 KB │  226.77 KB │      508 B │    2.00 MB
fastcdc-2020-nc0  │  705.67 KB │  525.70 KB │       90 B │    2.00 MB
fastcdc-2020-nc1  │  684.07 KB │  374.91 KB │       90 B │    2.00 MB
fastcdc-2020-nc2  │  614.55 KB │  241.66 KB │       59 B │    2.00 MB
fastcdc-2020-nc3  │  569.57 KB │  175.89 KB │       90 B │    2.00 MB
go-cdc-max        │  510.60 KB │  184.35 KB │  204.80 KB │  819.20 KB

@EdSchouten
Collaborator

Yeah, that would be great!

@tyler-french
Contributor Author

@tjgq 👋🏻 would you have time to take a look at this PR? Thanks!

Collaborator

@sluongng sluongng left a comment


I think this is clean and nice. A few nits to make the spec stricter.

Given the context of the other PR, I do get why you would want to add chunking_function into the SpliceRequest and SplitResponse.

If the server were doing all the Chunking and Concatting live, I don't think those fields are needed. But if they are simply looking up existing "big blob manifests", then those values might be useful for the client/server to store and look up the manifests? Is that the intention?

@sluongng
Collaborator

sluongng commented Feb 4, 2026

@EdSchouten @tjgq, since the current 2 algorithms both have their respective open-source implementations available, do we still want a test vector in this repo, similar to the sha256tree test vector, to help others validate their implementations?

I know we asked for test vectors in a meeting many moons ago when we discussed the chunking algorithms. Not entirely sure if they are still necessary.

I suspect if the 2 maintainers just provide released binaries for each algorithm in their repo, perhaps with the config knobs as flags, then we can just refer to those CLIs as test vectors?

@tyler-french
Contributor Author

@EdSchouten @tjgq, since the current 2 algorithms all have their respective open source implementation available. Do we still want a test vector in this repo, similar to the sha256tree test vector, to help others validate their implementation?

I know we asked for test vectors in a meeting many moons ago when we discussed the chunking algorithms. Not entirely sure if they are still necessary.

I suspect if the 2 maintainers just provide released binaries for each algorithm in their repo, perhaps with the config knobs as flags, then we can just refer to those CLIs as test vectors?

I have been using this as a test vector: https://github.com/nlfiedler/fastcdc-rs/blob/master/test/fixtures/SekienAkashita.jpg
The issue is I don't know if we want to check an image into this repo. (The upstream test: https://github.com/nlfiedler/fastcdc-rs/blob/master/src/v2020/mod.rs#L902-L925)

I added the same test here: https://github.com/buildbuddy-io/fastcdc2020/blob/main/fastcdc/fastcdc_test.go#L11-L36

The resulting chunks are:

		{17968276318003433923, 21325},
		{8197189939299398838, 17140},
		{13019990849178155730, 28084},
		{4509236223063678303, 18217},
		{2504464741100432583, 24700},

Should I add this?
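
For anyone checking another implementation against this vector, here is a rough Go sketch of how it could be consumed. Assumptions: each pair above is (chunk hash, chunk length in bytes), following the fastcdc-rs fixture, and splitFile is a hypothetical stand-in for the FastCDC 2020 implementation under test:

```go
package fastcdc_test

import "testing"

type chunk struct {
	hash   uint64
	length int
}

// want mirrors the (hash, length) pairs listed above.
var want = []chunk{
	{17968276318003433923, 21325},
	{8197189939299398838, 17140},
	{13019990849178155730, 28084},
	{4509236223063678303, 18217},
	{2504464741100432583, 24700},
}

// splitFile is a placeholder; plug in the implementation under test here.
func splitFile(path string) ([]chunk, error) { panic("not implemented") }

func TestSekienAkashitaVector(t *testing.T) {
	got, err := splitFile("testdata/SekienAkashita.jpg")
	if err != nil {
		t.Fatal(err)
	}
	if len(got) != len(want) {
		t.Fatalf("got %d chunks, want %d", len(got), len(want))
	}
	for i := range want {
		if got[i] != want[i] {
			t.Errorf("chunk %d: got %+v, want %+v", i, got[i], want[i])
		}
	}
}
```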

@sluongng
Collaborator

sluongng commented Feb 4, 2026

Perhaps we can use the permalink https://github.com/nlfiedler/fastcdc-rs/blob/49c3d0b8043a7c1c2d9aca75e868d3791ffedcf3/test/fixtures/SekienAkashita.jpg with the SHA256 fingerprint to identify the blob?

The file is 107KB, so it's not too bad.

@EdSchouten
Collaborator

Tyler and I had a private chat via Slack, but I thought I should share this here as well.

I got nerd sniped by the discussion we had about the addition of chunking_threshold_bytes and PR #358. Namely, algorithms like FastCDC and MaxCDC don't really allow you to easily distinguish between chunked and unchunked blobs. For example, you can't just look at a blob's size to know whether repeated application of chunking yields anything different.

This got me thinking: would it be possible to design a somewhat decent chunking algorithm that always yields chunks of size $[n, 2n)$? Because if you had an algorithm like that, you know that anything smaller than 2n cannot be chunked further, while anything at least as large as 2n can definitely be chunked. So it turns out we can. Namely, by applying MaxCDC repeatedly!

I just added such an algorithm to the go-cdc repository under the name RepMaxCDC. Benchmark results look very promising. In fact, I wasn't able to find any measurable difference between MaxCDC and RepMaxCDC, even though the latter has far tighter bounds on object size. @tyler-french (and others), I would really like to invite you to give this a try.
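
As a tiny illustration of why that bound is useful (assuming n is the configured minimum chunk size and every chunk falls in [n, 2n)):

```go
// isAtomic reports whether a blob of the given size can no longer be chunked:
// below 2n it cannot be split further, while anything at or above 2n can
// definitely be split.
func isAtomic(blobSize, n int64) bool {
	return blobSize < 2*n
}
```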

For this PR my recommendation would be to:

  • Change references of MaxCDC to RepMaxCDC.
  • Rename RepMaxCDC's max_chunk_size_bytes to horizon_size_bytes.
  • Move chunking_threshold_bytes into FastCdc2020Params, as RepMaxCDC does not depend on it.
  • Maybe eliminate ChunkingConfiguration and inline its fields into CacheCapabilities. The reason being that its presence/absence does not imply anything.

Introduce a ChunkingFunction enum: a set of known chunking algorithms that the server can recommend to the client.

Provide FastCDC_2020 as the first explicit chunking algorithm.

The server advertises these through a new chunking_configuration field in the CacheCapabilities message. There, the server may set the chunking functions it supports as well as the relevant configuration parameters for each chunking algorithm.

@tyler-french
Contributor Author

Perhaps we can use the perma link nlfiedler/fastcdc-rs@49c3d0b/test/fixtures/SekienAkashita.jpg with the SHA256 fingerprint to identify the blob?

The file is 107KB, so it's not too bad.

Added test vectors and referenced them here: buildbuddy-io/fastcdc2020#4, which can be used to verify implementations.

@tyler-french tyler-french force-pushed the chunking-algo branch 3 times, most recently from ed13212 to 57aec35 on February 5, 2026 01:48
@tyler-french
Contributor Author

@EdSchouten @fmeum @sluongng made some changes, please take another look when you get the chance!

Summary is:

  • Removed the threshold: RepMaxCDC doesn't need one, and for fastcdc2020 the threshold will be == the max chunk size
  • Added test vectors for fastcdc2020

@tyler-french
Contributor Author

@tjgq Please take another look when you get the chance! Once this gets merged, we should cut a new release.

@EdSchouten
Collaborator

Even though I already approved this PR a couple of days ago, just wanted to say that this is good to land w.r.t. RepMaxCDC. The reference implementation available at https://github.com/buildbarn/go-cdc works reliably now, and performance is on par with FastCDC.

@tjgq tjgq merged commit de5501d into bazelbuild:main Feb 10, 2026
1 check passed