Conversation
GITSHA1 is the digest function used by git. Details here: https://git-scm.com/book/be/v2/Git-Internals-Git-Objects.
|
@jgalmes2 could you provide a but more context here? What is your use case? this was previously discussed in the working group and the decision was to hold until there are more interest |
|
We're trying to integrate Bazel with a remote execution engine whose native hashing algorithm is GITSHA1. |
|
Did you have a conversation with Bazel team about adding such a digest function? Im also curious, how do you do the collision detection? afaik SHA1 in Git is modified with collision in mind. Btw, if you are just looking for git-interop, worth noting that Git upstream dev branches have a few patch sets that include sha256 inter-op with sha1. There has been a bit of progress there in recent months. I wonder if we can do GITSHA256 instead? |
|
Oh yeah Lukács Berki and Chi Wang are fully aware of this change. @lberki @coeuvre |
This used to be the standard until 2017. The problem with probable collision attacks is that most CAS implementations today tend to assume a strong, collision-free cryptographic hash in use, thus skipping on collision check. At the very least, GITSHA1 should be defined as https://github.com/git/git/blob/master/sha1dc/sha1.h, which comes with additional checks against collision attacks. Im also curious: Do you use git tree objects or DirectoryNode proto to present file trees here? |
|
My $.02: I do realize SHA-1 is quite long in the tooth these days. However, my understanding is that migrating a large code base is non-trivial since it changes every commithash and it looks like GitHub doesn't even support SHA-256? In addition to the above adoption concerns, SHA-1 will always be an opt-in on the Bazel side so the default is a secure hash. Does this allay your concerns? |
|
Note the intention is just to use GITSHA1 to obtain hashes from extended attributes from CASFS, there's no intention of using git tree objects or anything like that |
|
Just curious, are you referring to https://github.com/google/casfs (archived) or something else? |
|
@jgalmes2 At minimum, I think this PR needs additional documentation. At current state, folks don't know how to implement this new digest function on the servers/clients. Yes there is instruction on how to hash a blob, but there isn't clear instructions on what to do with a tree/directory object: do you use the git tree, or the Directory proto message? If the proto message is used, then should we hash all the proto messages with the blob prefix? or some with the tree prefix? See https://github.com/bazelbuild/remote-apis/pull/257/files for a similar attempt. I would also encourage you (or whoever is implementing this on the Bazel side) to join the monthly working group meeting to clarify this further. I would love it if @aehlig and @asartori86 could help review this PR, since they have already implemented a version of GITSHA1 in their own build tools. We want to make sure that different interested parties can agree on the same spec here. |
As @sluongng already mentioned, the documentation could certainly be improved and the handling of trees be clarified. In the justbuild project, we found that the best interaction with I understand that the goal of this PR is different; but as it is, at the pure code (not comment) level identical to earlier PR, the scope, use case, and the way trees should be treated should be clarified. |
How about we replace the comment added in this PR with something like: If we so insist, we could also add a followup sentence like "Use of this digest function has no effect on the representation of directory trees", but I feel like being precise about this being the hash function used by Git blobs (and not other Git objects), as well as spelling out the algorithm instead of linking to the Git documentation, does 90% of the work to avoid the implication that the directory representation is affected. |
|
Channeling @jgalmes2 here, the intent is to be able to upload files individually, but not directories. The use case is that the have a file system backed a git repository. So they can apparently do directory listings cheaply and they also have the GITSHA1 hash (but not any other hash) of every file available without having to checksum it locally. Which makes GITSHA1 the natural kind of hash to use for remote execution, too. I suppose using tree hashes would be marginally better, but that would only come into play if one would need to upload a full directory tree to remote execution (i.e. |
|
Worth to note here that the digest function is not only applied to blobs in a build, it's also applied to all protobuf messages, such as Action, Command and Directory. Typically the client would send an ExecuteRequest to the server with a digest function specified inside the request message. That digest function will then be used for all related protobuf messages and blobs inside the Action merkle tree. The Directory message is used in this merkle tree to describe the input root tree of each action, as well as potential output directories. If the GITSHA1 digest function is only meant for blob and not proto messages (including Directory), then the documents should be specific about what should be used instead. If we were to apply this to all proto messages, then we should be explicit about tree objects should be presented as Directory proto messages and not as git tree. And that they should be hashed with a blob prefix. |
|
+Haamed Gheibi ***@***.***>
…On Wed, Dec 10, 2025 at 1:03 AM Son Luong Ngoc ***@***.***> wrote:
*sluongng* left a comment (bazelbuild/remote-apis#352)
<#352 (comment)>
Worth to note here that the digest function is not only applied to blobs
in a build, it's also applied to all protobuf messages, such as Action,
Command and Directory.
Typically the client would send an ExecuteRequest to the server with a
digest function specified inside the request message. That digest function
will then be used for all related protobuf messages and blobs inside the
Action merkle tree.
The Directory message is used in this merkle tree to describe the input
root tree of each action, as well as potential output directories.
If the GITSHA1 digest function is only meant for blob and not proto
messages (including Directory), then the documents should be specific about
what should be used instead. If we were to apply this to all proto
messages, then we should be explicit about tree objects should be presented
as Directory proto messages and not as git tree. And that they should be
hashed with a blob prefix.
—
Reply to this email directly, view it on GitHub
<#352 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/BSAW26MAE6N4VBJLLA4C4ED4A7OUPAVCNFSM6AAAAACM2P6BVCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTMMZWGA2TINJWGY>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
I think we already agree that we don't want to change the representation of tree objects. If you regard the prepending of I'm not even sure the protocol could work under those circumstances: doesn't the I think once you define a new hash function, the notion that the hashing algorithm must apply identically to every object stored in the CAS under that hash function is the only reasonable reading of the spec. @sluongng if you still believe further clarification is warranted, may I ask you to propose an alternative to the wording I suggested in my other comment above? |
| // See https://github.com/BLAKE3-team/BLAKE3. | ||
| BLAKE3 = 9; | ||
|
|
||
| // The SHA1 variant used by git. |
There was a problem hiding this comment.
This could use a more thorough description. Maybe something like this?
Identical to SHA1, except that "blob ${sizeBytes}\0" is prepended to the blob's contents before hashing, where ${sizeBytes} corresponds to the decimal size of the original blob. This allows hashes of files to be converted from and to the ones used by the Git version control system.
There was a problem hiding this comment.
Wait. Now I'm confused. Do we literally prepend "blob \0", or should we include the size as well? @tjgq mentions "blob \0", but the Git documentation tells me otherwise. https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
There was a problem hiding this comment.
You do need the size
master ~/work/buildbuddy-io/buildbuddy> { printf "blob %s\0" "$(stat -c%s README.md)"; cat README.md; } | sha1sum
46ec5d933a27bcac82000d1c386f071219273b45 -
master ~/work/buildbuddy-io/buildbuddy> cat README.md| git hash-object --stdin
46ec5d933a27bcac82000d1c386f071219273b45|
I agree with the description Ed proposed above. I would still like to push for at least 1 sentence noting that GITSHA1 (and SHA1) is not recommended for multi-tenant setup due to collision attacks, and implementations should look into https://github.com/git/git/blob/e85ae279b0d58edc2f4c3fd5ac391b51e1223985/Documentation/technical/hash-function-transition.adoc?plain=1#L1 to either try the sha256 transition, or at least use Git's SHA1DC for collision detection. Disclosing these vulnerabilities would help newer folks avoid unnecessary footguns. This can be done in a separate PR though. I realized that for directory, unless you use the entire git checkout (working copy) as input tree for every action, the tree objects for individual action would rarely match the tree objects inside the git repo (because we often just use some sub dirs, or even just some files inside a dir). So I don't see much benefit in I hope that whatever we are proposing here could work for multiple parties (i.e. JustBuild and Bazel) so that we can have interoperability between different clients and servers. |
GITSHA1 is the digest function used by git. Details here: https://git-scm.com/book/be/v2/Git-Internals-Git-Objects.
|
I updated the comment to match Ed's description above. |
Having tree as first class objects (including the CAS invariant that the presence of a tree promises the presence of its parts) also reduces the effort of ensuring trees are present (only ask for one hash instead of a long list of of files), e.g.,
While most of those concepts are not implemented in bazel, they are the reason why Justbuild prefers an extension of the protocol with first-class trees.
Justbuild can already interoperate with standard remote exeuction (basically by rehashing trees to directories and verifying all blobs of a tree input are present in CAS) and many of the mentioned savings require proper trees. So the benefit for Justbuild of this addition to the protocol will be minimal (if there is the capacity to implement it). cc @spc90 That said, I do not oppose this addition, if properly documented. |
|
|
||
| // Identical to SHA1, except that "blob ${sizeBytes}\0" is prepended to | ||
| // the blob's contents before hashing, where ${sizeBytes} corresponds to | ||
| // the decimal size of the original blob. This allows hashes of files to |
There was a problem hiding this comment.
IMO "decimal size" makes this sound like a non-integer value. Would something like "where ${sizeBytes} is the number of bytes in the original blob" be better?
There was a problem hiding this comment.
I explicitly used the term "decimal", because it shouldn't be provided in little/big endian binary form, or as hexadecimal characters, or ...
There was a problem hiding this comment.
How about "decimal integer size" ?
There was a problem hiding this comment.
Since we have multiple approvals and the sole remaining disagreement is on how the comment is worded (which can always be amended later), I'll merge this now.
GITSHA1 is the digest function used by git.
Details here: https://git-scm.com/book/be/v2/Git-Internals-Git-Objects.