Introduce a ChunkingFunction enum: a set of known chunking algorithms that the server can recommend to the client. FastCDC_2020 is provided as the first explicit chunking algorithm. The server advertises these through a new chunking_configuration field in the CacheCapabilities message, where it may set the chunking functions it supports as well as the relevant configuration parameters for each algorithm.
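A rough proto3 sketch of the shape being described; the names ChunkingFunction, FASTCDC_2020, chunking_configuration, and FastCDCParams come from this discussion, but the surrounding message layout and field numbers here are illustrative guesses, not the actual proposal:

```proto
syntax = "proto3";

// Known chunking algorithms that the server can recommend to the client.
message ChunkingFunction {
  enum Value {
    UNKNOWN = 0;       // proto3 default when the field is unset
    FASTCDC_2020 = 1;  // first explicit chunking algorithm
  }
}

// Advertised by the server so clients can pick a compatible function.
message ChunkingConfiguration {
  // Chunking functions the server supports.
  repeated ChunkingFunction.Value supported_functions = 1;
  // Per-function configuration parameters (placement illustrative).
  FastCDCParams fastcdc_params = 2;
}

message CacheCapabilities {
  // ... existing fields elided ...
  ChunkingConfiguration chunking_configuration = 100;  // field number illustrative
}
```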
// The chunking function that the client prefers to use.
//
// The server MAY use a different chunking function. The client MUST check
// the chunking function used in the response.
ChunkingFunction.Value chunking_function = 4;
- Is this field intended to be mandatory or optional (with the latter giving the server complete leeway in choosing a function)? If optional, can we document it?
- When the field is present, should it be required to match one of the functions declared in the server capabilities?
Ah, it defaults to UNKNOWN if unset, because this is proto3. (But perhaps it's still worth spelling it out - up to you.)
// The chunking function that the client used to split the blob.
ChunkingFunction.Value chunking_function = 5;
Why is this necessary? Isn't the result of a splice completely independent from the function originally used to do the splitting?
It also imposes a requirement that the chunks must have necessarily originated from a split operation, which seems counter to the spirit of the original proposal (that split and splice are independent optimizations).
A lot of the requirements for this come from the desire for the server to be able to store and retrieve information on how to convert between a blob and chunks, and act as a cache for that information.
I think this is important if a client is doing the chunking, and is telling the server to store information on how a certain blob can be reconstructed from chunks. Then, if someone calls Split on the same blob, the server can then return this information to the client, without needing to actually split the blob, given that the chunking operation is the same. Caches can also share these entries and do chunking async, etc, if the server and client are aligned on a reproducible way to create these chunks.
If the client calls Splice twice with two different sets of digests and two different chunking functions, the server can also decide how to handle this. A server could choose to be very explicit about the types of chunking it supports, for example if it needs to be able to reproduce the chunks somewhere else, or so that it stores chunks that are more likely to de-duplicate stored content.
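The caching behaviour described above can be sketched with a toy in-memory server. MockCAS, its fields, and its method shapes are all illustrative inventions, not the actual SplitBlob/SpliceBlob API; the point is only that recording the chunking function at Splice time lets a later Split for the same function be answered from cache without re-chunking:

```python
import hashlib
from typing import Dict, List, Tuple


def digest(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()


class MockCAS:
    """Toy CAS that remembers, per (blob digest, chunking function),
    how a blob decomposes into chunks."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}
        # (blob_digest, chunking_function) -> ordered chunk digests
        self.recipes: Dict[Tuple[str, str], List[str]] = {}

    def splice(self, chunk_digests: List[str], chunking_function: str) -> str:
        # The client asserts these chunks were produced by `chunking_function`;
        # the server records the recipe alongside the spliced blob.
        blob = b"".join(self.blobs[d] for d in chunk_digests)
        blob_digest = digest(blob)
        self.blobs[blob_digest] = blob
        self.recipes[(blob_digest, chunking_function)] = list(chunk_digests)
        return blob_digest

    def split(self, blob_digest: str, chunking_function: str) -> List[str]:
        key = (blob_digest, chunking_function)
        if key in self.recipes:
            # Served from the recorded recipe; no chunking work needed.
            return self.recipes[key]
        raise NotImplementedError("server would actually chunk the blob here")
```

A later Split with a *different* chunking function misses the recipe cache, which is exactly why the function identity matters on the Splice side.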
// If any of the advertised parameters are not within the expected range,
// the client SHOULD ignore FastCDC chunking function support.
message FastCDCParams {
  // The normalization level for the FastCDC chunking algorithm.
We should consider having just avg_chunk_size_bytes, with all the other parameters set to defaults and documented.
Based on #282
I recommend reading https://joshleeb.com/posts/fastcdc.html to understand more about the available FastCDC configuration parameters.
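To make the parameters under discussion concrete, here is a minimal Python sketch of FastCDC-style normalized chunking. The gear table, seeds, and exact mask construction are simplified assumptions for illustration (real implementations use a fixed published gear table), but it shows how min/avg/max chunk sizes and the normalization level interact: a stricter mask applies before the average size and a looser one after, pulling chunk sizes toward the average:

```python
import hashlib


def make_gear_table(seed: int = 0) -> list:
    # Deterministic 256-entry table of 64-bit values (stand-in gear table).
    return [
        int.from_bytes(
            hashlib.sha256(bytes([seed]) + i.to_bytes(2, "big")).digest()[:8], "big"
        )
        for i in range(256)
    ]


GEAR = make_gear_table()
MASK64 = (1 << 64) - 1


def fastcdc_cut(data: bytes, min_size=2048, avg_size=8192,
                max_size=65536, norm_level=2) -> int:
    """Return the offset of the first cut point in `data`."""
    n = len(data)
    if n <= min_size:
        return n
    bits = avg_size.bit_length() - 1            # log2(avg_size)
    mask_strict = (1 << (bits + norm_level)) - 1  # harder to match before avg
    mask_loose = (1 << (bits - norm_level)) - 1   # easier to match after avg
    n = min(n, max_size)
    fp = 0
    i = min_size                                 # never cut below min_size
    barrier = min(avg_size, n)
    while i < barrier:
        fp = ((fp << 1) + GEAR[data[i]]) & MASK64
        if fp & mask_strict == 0:
            return i
        i += 1
    while i < n:
        fp = ((fp << 1) + GEAR[data[i]]) & MASK64
        if fp & mask_loose == 0:
            return i
        i += 1
    return i                                     # forced cut at max_size (or end)


def chunk(data: bytes) -> list:
    chunks, off = [], 0
    while off < len(data):
        cut = fastcdc_cut(data[off:])
        chunks.append(data[off:off + cut])
        off += cut
    return chunks
```

Note how avg_chunk_size_bytes alone is enough to derive both masks once the normalization level and the min/max bounds are fixed to documented defaults, which is the simplification suggested above.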