Skip to content

Comments

Add GITSHA1 digest function.#352

Merged
tjgq merged 2 commits intobazelbuild:mainfrom
jgalmes2:gitsha1
Jan 20, 2026
Merged

Add GITSHA1 digest function.#352
tjgq merged 2 commits intobazelbuild:mainfrom
jgalmes2:gitsha1

Conversation

@jgalmes2
Copy link
Contributor

GITSHA1 is the digest function used by git.
Details here: https://git-scm.com/book/be/v2/Git-Internals-Git-Objects.

GITSHA1 is the digest function used by git.
Details here: https://git-scm.com/book/be/v2/Git-Internals-Git-Objects.
@sluongng
Copy link
Collaborator

@jgalmes2 could you provide a but more context here? What is your use case?

this was previously discussed in the working group and the decision was to hold until there are more interest

@jgalmes2
Copy link
Contributor Author

We're trying to integrate Bazel with a remote execution engine whose native hashing algorithm is GITSHA1.

@sluongng
Copy link
Collaborator

Did you have a conversation with Bazel team about adding such a digest function?
It would be a lot easier to know that they are open to such a change so that we have a pair of client/server implementations willing to adopt the proposal.

Im also curious, how do you do the collision detection? afaik SHA1 in Git is modified with collision in mind.

Btw, if you are just looking for git-interop, worth noting that Git upstream dev branches have a few patch sets that include sha256 inter-op with sha1. There has been a bit of progress there in recent months. I wonder if we can do GITSHA256 instead?

@jgalmes2
Copy link
Contributor Author

jgalmes2 commented Nov 24, 2025

Oh yeah Lukács Berki and Chi Wang are fully aware of this change. @lberki @coeuvre
The GITSHA1 implementation is just the SHA1 of the bytes with the "blob <size>\0" header prefix prepended to the bytes.
Details here: https://git-scm.com/book/be/v2/Git-Internals-Git-Objects

@sluongng
Copy link
Collaborator

The GITSHA1 implementation is just the SHA1 of the bytes with the "blob \0" header prefix prepended to the bytes.
Details here: https://git-scm.c

This used to be the standard until 2017.
See https://github.com/git/git/blob/6ab38b7e9cc7adafc304f3204616a4debd49c6e9/Documentation/technical/hash-function-transition.adoc#background for more info.

The problem with probable collision attacks is that most CAS implementations today tend to assume a strong, collision-free cryptographic hash in use, thus skipping on collision check. At the very least, GITSHA1 should be defined as https://github.com/git/git/blob/master/sha1dc/sha1.h, which comes with additional checks against collision attacks.

Im also curious: Do you use git tree objects or DirectoryNode proto to present file trees here?

@lberki
Copy link

lberki commented Nov 26, 2025

My $.02: I do realize SHA-1 is quite long in the tooth these days. However, my understanding is that migrating a large code base is non-trivial since it changes every commithash and it looks like GitHub doesn't even support SHA-256?

In addition to the above adoption concerns, SHA-1 will always be an opt-in on the Bazel side so the default is a secure hash. Does this allay your concerns?

@jgalmes2
Copy link
Contributor Author

jgalmes2 commented Dec 9, 2025

Note the intention is just to use GITSHA1 to obtain hashes from extended attributes from CASFS, there's no intention of using git tree objects or anything like that

@fmeum
Copy link
Collaborator

fmeum commented Dec 9, 2025

Just curious, are you referring to https://github.com/google/casfs (archived) or something else?

@sluongng
Copy link
Collaborator

sluongng commented Dec 9, 2025

@jgalmes2 At minimum, I think this PR needs additional documentation. At current state, folks don't know how to implement this new digest function on the servers/clients. Yes there is instruction on how to hash a blob, but there isn't clear instructions on what to do with a tree/directory object: do you use the git tree, or the Directory proto message? If the proto message is used, then should we hash all the proto messages with the blob prefix? or some with the tree prefix?

See https://github.com/bazelbuild/remote-apis/pull/257/files for a similar attempt.

I would also encourage you (or whoever is implementing this on the Bazel side) to join the monthly working group meeting to clarify this further.

I would love it if @aehlig and @asartori86 could help review this PR, since they have already implemented a version of GITSHA1 in their own build tools. We want to make sure that different interested parties can agree on the same spec here.

@aehlig
Copy link

aehlig commented Dec 9, 2025

I would love it if @aehlig and @asartori86 could help review this PR, since they have already implemented a version of GITSHA1 in their own build tools. We want to make sure that different interested parties can agree on the same spec here.

As @sluongng already mentioned, the documentation could certainly be improved and the handling of trees be clarified.

In the justbuild project, we found that the best interaction with git (which is used by many projects as version-control system) was to also include trees as first-class objects. In this way, not only the hashes of individual files could be read off directly from git without looking at the files, but also whole directories. The latter was especially useful (IO, traffic, and round trips) when whole trees were already known to the remote-execution, e.g., include directories, or project trees for foreign rules. Based on these considerations, the custom protocol extension that #257 refers to was designed and the justbuild project made good experience with it.

I understand that the goal of this PR is different; but as it is, at the pure code (not comment) level identical to earlier PR, the scope, use case, and the way trees should be treated should be clarified.

@tjgq
Copy link
Collaborator

tjgq commented Dec 9, 2025

As @sluongng already mentioned, the documentation could certainly be improved and the handling of trees be clarified.

How about we replace the comment added in this PR with something like:

// The SHA-1 variant used by Git blob objects. A GITSHA1 digest is obtained by prepending the 6-byte string "blob \0" to the message to be digested and hashing the result with the SHA-1 algorithm.

If we so insist, we could also add a followup sentence like "Use of this digest function has no effect on the representation of directory trees", but I feel like being precise about this being the hash function used by Git blobs (and not other Git objects), as well as spelling out the algorithm instead of linking to the Git documentation, does 90% of the work to avoid the implication that the directory representation is affected.

@lberki
Copy link

lberki commented Dec 10, 2025

Channeling @jgalmes2 here, the intent is to be able to upload files individually, but not directories.

The use case is that the have a file system backed a git repository. So they can apparently do directory listings cheaply and they also have the GITSHA1 hash (but not any other hash) of every file available without having to checksum it locally. Which makes GITSHA1 the natural kind of hash to use for remote execution, too.

I suppose using tree hashes would be marginally better, but that would only come into play if one would need to upload a full directory tree to remote execution (i.e. glob(["dir/**"]) without and BUILD files in this tree), which I imagine is not something that happens that frequently. It would also require a way to transmit the tree hash from the file system to Bazel so it's a lot of work for not a lot of benefit.

@sluongng
Copy link
Collaborator

Worth to note here that the digest function is not only applied to blobs in a build, it's also applied to all protobuf messages, such as Action, Command and Directory.

Typically the client would send an ExecuteRequest to the server with a digest function specified inside the request message. That digest function will then be used for all related protobuf messages and blobs inside the Action merkle tree.

The Directory message is used in this merkle tree to describe the input root tree of each action, as well as potential output directories.

If the GITSHA1 digest function is only meant for blob and not proto messages (including Directory), then the documents should be specific about what should be used instead. If we were to apply this to all proto messages, then we should be explicit about tree objects should be presented as Directory proto messages and not as git tree. And that they should be hashed with a blob prefix.

@jgalmes2
Copy link
Contributor Author

jgalmes2 commented Dec 10, 2025 via email

@tjgq
Copy link
Collaborator

tjgq commented Dec 11, 2025

If the GITSHA1 digest function is only meant for blob and not proto messages (including Directory), then the documents should be specific about what should be used instead. If we were to apply this to all proto messages, then we should be explicit about tree objects should be presented as Directory proto messages and not as git tree. And that they should be hashed with a blob prefix.

I think we already agree that we don't want to change the representation of tree objects.

If you regard the prepending of blob \0 as part of the definition of the GITSHA1 hash function (which in my opinion is the correct way to think about this), then not imposing the same requirement on proto messages would amount to saying that protos stored in the CAS must use a different hash function than non-proto blobs.

I'm not even sure the protocol could work under those circumstances: doesn't the ExecuteRequest.digest_function field (or the UpdateActionResult.digest_function field for the uploading case) effectively determine not only the hash function used by the Action proto, but also by the Command proto and the Merkle tree, since there's no separate digest_function field in the Action or Command messages?

I think once you define a new hash function, the notion that the hashing algorithm must apply identically to every object stored in the CAS under that hash function is the only reasonable reading of the spec.

@sluongng if you still believe further clarification is warranted, may I ask you to propose an alternative to the wording I suggested in my other comment above?

// See https://github.com/BLAKE3-team/BLAKE3.
BLAKE3 = 9;

// The SHA1 variant used by git.
Copy link
Collaborator

@EdSchouten EdSchouten Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use a more thorough description. Maybe something like this?

Identical to SHA1, except that "blob ${sizeBytes}\0" is prepended to the blob's contents before hashing, where ${sizeBytes} corresponds to the decimal size of the original blob. This allows hashes of files to be converted from and to the ones used by the Git version control system.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait. Now I'm confused. Do we literally prepend "blob \0", or should we include the size as well? @tjgq mentions "blob \0", but the Git documentation tells me otherwise. https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do need the size

master ~/work/buildbuddy-io/buildbuddy> { printf "blob %s\0" "$(stat -c%s README.md)"; cat README.md; } | sha1sum
46ec5d933a27bcac82000d1c386f071219273b45  -


master ~/work/buildbuddy-io/buildbuddy> cat README.md| git hash-object --stdin
46ec5d933a27bcac82000d1c386f071219273b45

@sluongng
Copy link
Collaborator

I agree with the description Ed proposed above.

I would still like to push for at least 1 sentence noting that GITSHA1 (and SHA1) is not recommended for multi-tenant setup due to collision attacks, and implementations should look into https://github.com/git/git/blob/e85ae279b0d58edc2f4c3fd5ac391b51e1223985/Documentation/technical/hash-function-transition.adoc?plain=1#L1 to either try the sha256 transition, or at least use Git's SHA1DC for collision detection. Disclosing these vulnerabilities would help newer folks avoid unnecessary footguns. This can be done in a separate PR though.

I realized that for directory, unless you use the entire git checkout (working copy) as input tree for every action, the tree objects for individual action would rarely match the tree objects inside the git repo (because we often just use some sub dirs, or even just some files inside a dir). So I don't see much benefit in include trees as first-class objects as @aehlig described. Perhaps I am missing something obvious?

I hope that whatever we are proposing here could work for multiple parties (i.e. JustBuild and Bazel) so that we can have interoperability between different clients and servers.

GITSHA1 is the digest function used by git.
Details here: https://git-scm.com/book/be/v2/Git-Internals-Git-Objects.
@jgalmes2 jgalmes2 requested a review from werkt as a code owner December 11, 2025 15:52
@jgalmes2
Copy link
Contributor Author

I updated the comment to match Ed's description above.

@aehlig
Copy link

aehlig commented Dec 11, 2025

I realized that for directory, unless you use the entire git checkout (working copy) as input tree for every action, the tree objects for individual action would rarely match the tree objects inside the git repo (because we often just use some sub dirs, or even just some files inside a dir). So I don't see much benefit in include trees as first-class objects as @aehlig described. Perhaps I am missing something obvious?

Having tree as first class objects (including the CAS invariant that the presence of a tree promises the presence of its parts) also reduces the effort of ensuring trees are present (only ask for one hash instead of a long list of of files), e.g.,

While most of those concepts are not implemented in bazel, they are the reason why Justbuild prefers an extension of the protocol with first-class trees.

I hope that whatever we are proposing here could work for multiple parties (i.e. JustBuild and Bazel) so that we can have interoperability between different clients and servers.

Justbuild can already interoperate with standard remote exeuction (basically by rehashing trees to directories and verifying all blobs of a tree input are present in CAS) and many of the mentioned savings require proper trees. So the benefit for Justbuild of this addition to the protocol will be minimal (if there is the capacity to implement it).

cc @spc90

That said, I do not oppose this addition, if properly documented.


// Identical to SHA1, except that "blob ${sizeBytes}\0" is prepended to
// the blob's contents before hashing, where ${sizeBytes} corresponds to
// the decimal size of the original blob. This allows hashes of files to
Copy link
Contributor

@mostynb mostynb Dec 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO "decimal size" makes this sound like a non-integer value. Would something like "where ${sizeBytes} is the number of bytes in the original blob" be better?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explicitly used the term "decimal", because it shouldn't be provided in little/big endian binary form, or as hexadecimal characters, or ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "decimal integer size" ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ambivalent.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have multiple approvals and the sole remaining disagreement is on how the comment is worded (which can always be amended later), I'll merge this now.

@tjgq tjgq merged commit b02e15a into bazelbuild:main Jan 20, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants