feat(SplitBlob/SpliceBlob): add chunking algorithm #357
Conversation
Pull request overview
This pull request adds Content-Defined Chunking (CDC) algorithm negotiation to the Remote Execution API, enabling distributed, deterministic, and reproducible chunking between clients and servers. It introduces FastCDC 2020 as the first supported algorithm with configuration parameters for optimal deduplication.
Changes:
- Adds ChunkingFunction enum and ChunkingConfiguration message to define supported chunking algorithms and their parameters
- Extends SplitBlobRequest, SplitBlobResponse, and SpliceBlobRequest messages with chunking_function fields
- Introduces FastCDC 2020 algorithm support with configurable parameters (avg_chunk_size_bytes, normalization_level, seed) and sensible defaults (512 KiB average, 2 MiB threshold); a rough sketch of these shapes follows below
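As a rough illustration of the shapes described above (the message, enum, and field names come from this summary; the Go struct layout, field types, and enum values are assumptions, not the generated bindings):

```go
package main

import "fmt"

// Illustrative mirror of the new proto additions; not the generated code.
type ChunkingFunction int32

const (
	CHUNKING_FUNCTION_UNKNOWN ChunkingFunction = 0 // assumed zero value
	FASTCDC_2020              ChunkingFunction = 1 // first supported algorithm
)

// FastCDC 2020 parameters named in this PR; Go types are guesses.
type FastCDC2020Params struct {
	AvgChunkSizeBytes  uint32 // avg_chunk_size_bytes
	NormalizationLevel uint32 // normalization_level
	Seed               uint64 // seed
}

// ChunkingConfiguration mirrors what the server advertises: which functions
// it supports plus their parameters.
type ChunkingConfiguration struct {
	ChunkingFunctions []ChunkingFunction
	FastCDC2020       *FastCDC2020Params
}

// SplitBlobRequest gains a chunking_function field so the client can say
// which algorithm it expects the returned chunks to follow.
type SplitBlobRequest struct {
	// ...existing fields elided...
	ChunkingFunction ChunkingFunction
}

func main() {
	cfg := ChunkingConfiguration{
		ChunkingFunctions: []ChunkingFunction{FASTCDC_2020},
		FastCDC2020:       &FastCDC2020Params{AvgChunkSizeBytes: 512 * 1024}, // 512 KiB default
	}
	fmt.Printf("%+v\n", cfg)
}
```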
sluongng
left a comment
I'm a bit occupied this week, so I plan to give this a better read next week. Got some small nits, but lgtm overall.
I think it would be nice if we could provide a test vector for this, similar to https://github.com/bazelbuild/remote-apis/blob/main/build/bazel/remote/execution/v2/sha256tree_test_vectors.txt, which was added for sha256tree. This way, folks can test their implementation against the test vector to verify that the generated chunks are identical across implementations.
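For what it's worth, a verification harness against such a vector could stay very small. This is only a sketch: the one-`size digest`-per-line vector format and the chunkFile helper are invented here for illustration, not anything this repo or the sha256tree vector actually specifies.

```go
package main

// Sketch of how an implementation could check itself against a shared test
// vector: chunk a fixture file and compare each chunk's size and SHA-256
// digest against the expected list.

import (
	"bufio"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

// chunkFile stands in for the FastCDC implementation under test; it should
// return the raw bytes of each chunk, in order.
func chunkFile(path string) ([][]byte, error) {
	return nil, fmt.Errorf("plug in the chunker being verified")
}

func verify(fixture, vectorPath string) error {
	chunks, err := chunkFile(fixture)
	if err != nil {
		return err
	}
	f, err := os.Open(vectorPath)
	if err != nil {
		return err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for i := 0; s.Scan(); i++ {
		var wantSize int
		var wantHash string
		if _, err := fmt.Sscanf(s.Text(), "%d %s", &wantSize, &wantHash); err != nil {
			return err
		}
		if i >= len(chunks) {
			return fmt.Errorf("expected at least %d chunks, got %d", i+1, len(chunks))
		}
		sum := sha256.Sum256(chunks[i])
		if len(chunks[i]) != wantSize || hex.EncodeToString(sum[:]) != wantHash {
			return fmt.Errorf("chunk %d mismatch", i)
		}
	}
	return s.Err()
}

func main() {
	if err := verify("SekienAkashita.jpg", "fastcdc2020_test_vectors.txt"); err != nil {
		fmt.Println(err)
	}
}
```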
Have you benchmarked this against https://github.com/buildbarn/go-cdc?tab=readme-ov-file#maxcdc-content-defined-chunking-with-lookahead ?
@EdSchouten, I did, and I will likely need to tune my parameters a bit for this, as I expect I'm not getting the full benefit of MaxCDC by keeping the chunking threshold > max chunk size bytes, which simplifies the implementation a lot. Here's my fairly crude benchmark with a 2 MiB threshold and a 512 KiB average size, against my ~300 GB disk cache from the last few weeks. Probably ignore the throughput, since I'm running with a lot of parallelism and the bottleneck is usually not the chunker but the compression or hashing; all chunkers are sufficiently fast to not be the bottleneck. My guess is that sometimes cutting off the small end of a file, especially in GoLink artifacts, allows for a matching chunk, so only that last small bit gets discarded, although I didn't really dig into specific examples. (Note: nc2 means normalization coefficient = 2.) Here's the benchmark code for reference: buildbuddy-io/buildbuddy#11223
You're likely getting bad performance out of MaxCDC because you made the spread between min/max far too big (16). At least in my tests I observed the optimum to be slightly below 4. Also, to make it an apples-to-apples comparison, you should measure that both algorithms produce the same average chunk size, because the way you set it right now, the actual average chunk size is about twice as big. Can you please rerun your tests for MaxCDC with something like this?
minSize := averageSize * 2 / 5 // 1 - 3/5
maxSize := minSize * 4         // 1 + 3/5
As demonstrated in the go-cdc repo, both algorithms can be implemented with the same amount of code. It's just that MaxCDC's optimized implementation spans more lines because I wanted to document it extensively. The 'simple' implementation in the go-cdc repo is even smaller than FastCDC. Furthermore, MaxCDC only takes two configuration parameters (min/max size), whereas FastCDC takes at least five (min/average/max size, MaskS, MaskL). It is perfectly reasonable to state that you prefer FastCDC because there happen to be some existing implementations out there. But stating that it's 'slightly simpler' is objectively false.
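Plugging the 512 KiB average used elsewhere in this thread into that suggestion gives the following (plain arithmetic for illustration, not any particular chunker API):

```go
package main

import "fmt"

func main() {
	averageSize := 512 * 1024      // 512 KiB target average chunk size
	minSize := averageSize * 2 / 5 // 209715 bytes, ~205 KiB = avg * (1 - 3/5)
	maxSize := minSize * 4         // 838860 bytes, ~819 KiB ≈ avg * (1 + 3/5)
	fmt.Println(minSize, maxSize)  // a 4x spread between min and max
}
```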
@EdSchouten Nice! That worked great, thanks for the parameter recommendations, and sorry about the false assertion about simplicity. Looking at it again, the two-parameter design is very simple. Below are the results I got with those parameters. Deduplication looks slightly better with MaxCDC, though it is pretty similar, likely because blobs are on average being split into slightly more chunks (42.8 vs 35 avg chunks per blob). I can see the benefits of MaxCDC in its throughput (4877 MB/s vs ~4000 MB/s) and in having the tightest min/max spread. I really like the MaxCDC algorithm and think it's a great design. That said, at least to start, getting the community aligned is the most important thing. Starting with FastCDC is helpful because it matches an IEEE paper and has many existing implementations and usages; I simply picked it because I think it would be the most likely to get alignment and progress. I have also made a lot of progress on the client and server, getting everything tuned and working, but I'm happy to adjust if there's alignment. Rather than spend too much time comparing without making progress, I would be happy to add both algorithms. I can update the PR to include MaxCDC, and then the Bazel client can just pick which one to use if the server advertises both of them. Do you prefer this? EDIT: I created tyler-french#1 against this branch. If we prefer keeping both, we can merge that into here before this goes to
Yeah, that would be great!
@tjgq 👋🏻 would you have time to take a look at this PR? Thanks!
sluongng
left a comment
I think this is clean and nice. A few nits to make the spec stricter.
Given the context of the other PR, I do get why you would want to add chunking_function to SpliceRequest and SplitResponse.
If the server were doing all the chunking and concatenation live, I don't think those fields would be needed. But if it is simply looking up existing "big blob manifests", then those values might be useful for the client/server to store and look up the manifests? Is that the intention?
@EdSchouten @tjgq, since the current two algorithms both have their respective open-source implementations available, do we still want a test vector in this repo, similar to the sha256tree test vector, to help others validate their implementations? I know we asked for test vectors in a meeting many moons ago when we discussed the chunking algorithms; I'm not entirely sure they are still necessary. I suspect that if the two maintainers just provide released binaries for each algorithm in their repos, perhaps with the config knobs as flags, then we can simply refer to those CLIs as test vectors?
I have been using this as a test vector: https://github.com/nlfiedler/fastcdc-rs/blob/master/test/fixtures/SekienAkashita.jpg
I added the same test here: https://github.com/buildbuddy-io/fastcdc2020/blob/main/fastcdc/fastcdc_test.go#L11-L36
The resulting chunks are:
Should I add this?
Perhaps we can use the permalink https://github.com/nlfiedler/fastcdc-rs/blob/49c3d0b8043a7c1c2d9aca75e868d3791ffedcf3/test/fixtures/SekienAkashita.jpg with the SHA256 fingerprint to identify the blob? The file is 107 KB, so it's not too bad.
Tyler and I had a private chat via Slack, but I thought I should share this here as well. I got nerd sniped by the discussion we had about the addition of
This got me thinking: would it be possible to design a somewhat decent chunking algorithm that always yields chunks of size
I just added such an algorithm to the go-cdc repository under the name RepMaxCDC. Benchmark results look very promising. In fact, I wasn't able to find any measurable difference between MaxCDC and RepMaxCDC, even though the latter has far tighter bounds on object size. @tyler-french (and others), I would really like to invite you to give this a try. For this PR my recommendation would be to:
Introduce ChunkingFunction, an enum of known chunking algorithms that the server can recommend to the client. Provide FastCDC_2020 as the first explicit chunking algorithm. The server advertises these through a new chunking_configuration field in the CacheCapabilities message. There, the server may set the chunking functions that it supports, as well as the relevant configuration parameters for each chunking algorithm.
Added test vectors and referenced them here: buildbuddy-io/fastcdc2020#4, which can be used to verify
@EdSchouten @fmeum @sluongng made some changes; please take another look when you get the chance! Summary is:
@tjgq Please take another look when you get the chance! Once this gets merged, we should cut a new release.
Even though I already approved this PR a couple of days ago, just wanted to say that this is good to land w.r.t. RepMaxCDC. The reference implementation available at https://github.com/buildbarn/go-cdc works reliably now, and performance is on par with FastCDC. |
For CDC (Content-Defined Chunking), having the client and server agree upon a chunking algorithm unlocks a new level of possible improvements, where we can have distributed, deterministic, and reproducible chunking.
This PR adds chunking algorithm negotiation to GetCapabilities. Most notably, it:
If a chunking algorithm is announced by the server, the client will expect Split responses to be chunked using one of these algorithms. The client should also try to chunk using the same algorithm, so that the server will accept new chunking information via Splice, and so that many clients can de-duplicate storage with shared chunks.
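As a rough sketch of that negotiation on the client side (the type and field names below are illustrative stand-ins for whatever the generated bindings end up exposing, not the actual API):

```go
package main

import "fmt"

// Illustrative stand-ins for the proto-generated types; the real names come
// from the remote-apis bindings once this PR lands.
type ChunkingFunction int32

const (
	FASTCDC_2020 ChunkingFunction = 1
	REPMAXCDC    ChunkingFunction = 2 // hypothetical value for the second algorithm
)

type CacheCapabilities struct {
	SupportedChunkingFunctions []ChunkingFunction
}

// pickChunkingFunction returns the first server-advertised algorithm the
// client also implements, or false if chunked Split/Splice should be skipped.
func pickChunkingFunction(caps *CacheCapabilities, supported map[ChunkingFunction]bool) (ChunkingFunction, bool) {
	for _, fn := range caps.SupportedChunkingFunctions {
		if supported[fn] {
			return fn, true
		}
	}
	return 0, false
}

func main() {
	caps := &CacheCapabilities{SupportedChunkingFunctions: []ChunkingFunction{FASTCDC_2020}}
	clientSupports := map[ChunkingFunction]bool{FASTCDC_2020: true}
	if fn, ok := pickChunkingFunction(caps, clientSupports); ok {
		// Use fn both when interpreting SplitBlobResponse chunks and when
		// producing chunks for SpliceBlobRequest, so server and client agree.
		fmt.Println("chunking with", fn)
	}
}
```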
Why FastCDC 2020?
FastCDC 2020 is fast (~5 GB/s on an AMD Ryzen 9950) and is backed by a clear spec described in the paper (IEEE paper pdf). It is popular, with https://github.com/nlfiedler/fastcdc-rs mirroring the paper's implementation.
This algorithm is:
Why RepMaxCDC?
RepMaxCDC is another fast chunking algorithm with many benefits, and it is used in https://github.com/buildbarn/bonanza. This algorithm is also included to give users multiple options for configuring chunking.
FastCDC Config Defaults
Why the FastCDC 2020 threshold = max chunk size
We only chunk blobs larger than the threshold size. Having the threshold be >= the max chunk size is an important design decision for the first iterations of the implementation here, mainly because a cyclic re-chunking possibility arises if it's not upheld. For example, if the chunking threshold is only 1 MB and a chunker (512 KB average, 2 MB max) produces a 1.5 MB chunk, the server is going to try to re-chunk that chunk again. It also means the CAS can't tell whether a stored blob is a full blob, a chunk, or a file that was chunked into a single chunk. For FastCDC 2020, the threshold will be set to the max chunk size, i.e. 4*avg_size. This also has the benefit of guaranteeing that any blob we do split always yields more than one chunk, which simplifies implementations.
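A minimal sketch of how this invariant plays out, assuming a hypothetical maybeSplit helper (and a dummy chunker) rather than the real server code:

```go
package main

import "fmt"

const (
	avgChunkSize   = 512 * 1024       // FastCDC 2020 average target
	maxChunkSize   = 4 * avgChunkSize // 2 MiB
	chunkThreshold = maxChunkSize     // only blobs larger than this get split
)

// maybeSplit decides whether a blob should be chunked at all. Because the
// threshold equals the max chunk size, any chunk a compliant chunker emits
// (<= maxChunkSize) is itself at or below the threshold and will never be
// re-chunked, and any blob that is split always yields at least two chunks.
func maybeSplit(blob []byte) [][]byte {
	if len(blob) <= chunkThreshold {
		return [][]byte{blob} // store as a single, whole blob
	}
	return fastCDCSplit(blob) // placeholder for the actual FastCDC 2020 chunker
}

// fastCDCSplit stands in for a real implementation; here it just cuts fixed
// maxChunkSize pieces so the sketch runs.
func fastCDCSplit(blob []byte) [][]byte {
	var chunks [][]byte
	for len(blob) > maxChunkSize {
		chunks = append(chunks, blob[:maxChunkSize])
		blob = blob[maxChunkSize:]
	}
	return append(chunks, blob)
}

func main() {
	fmt.Println(len(maybeSplit(make([]byte, 3*1024*1024)))) // 2 chunks for a 3 MiB blob
}
```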
Why the default chunk size?
512 KiB hits the sweet spot for Bazel artifacts in medium-sized repos like https://github.com/buildbuddy-io/buildbuddy with some overlapping GoLink objects.
Using my --disk_cache from the past 1-2 months of development, I ran some benchmarking with different sizes and got the following results: de-duplication is strong at 512 KiB (at 35%), and this only affects 4% of files.
We could drop to 64 KiB, but we'd only get ~10% more de-duplication savings and still need to chunk 3x the number of files.
The sizes selected for RepMaxCDC similarly produce an average chunk size of around 512 KiB.
Thank you to @sluongng for much of the initial version of this PR