Add GITSHA1 digest function. by jgalmes2 · Pull Request #352 · bazelbuild/remote-apis

jgalmes2 · 2025-11-21T16:51:18Z

GITSHA1 is the digest function used by git.
Details here: https://git-scm.com/book/be/v2/Git-Internals-Git-Objects.

GITSHA1 is the digest function used by git. Details here: https://git-scm.com/book/be/v2/Git-Internals-Git-Objects.

sluongng · 2025-11-21T16:56:10Z

@jgalmes2 could you provide a but more context here? What is your use case?

this was previously discussed in the working group and the decision was to hold until there are more interest

jgalmes2 · 2025-11-24T15:43:51Z

We're trying to integrate Bazel with a remote execution engine whose native hashing algorithm is GITSHA1.

sluongng · 2025-11-24T22:22:36Z

Did you have a conversation with Bazel team about adding such a digest function?
It would be a lot easier to know that they are open to such a change so that we have a pair of client/server implementations willing to adopt the proposal.

Im also curious, how do you do the collision detection? afaik SHA1 in Git is modified with collision in mind.

Btw, if you are just looking for git-interop, worth noting that Git upstream dev branches have a few patch sets that include sha256 inter-op with sha1. There has been a bit of progress there in recent months. I wonder if we can do GITSHA256 instead?

jgalmes2 · 2025-11-24T22:48:03Z

Oh yeah Lukács Berki and Chi Wang are fully aware of this change. @lberki @coeuvre
The GITSHA1 implementation is just the SHA1 of the bytes with the "blob <size>\0" header prefix prepended to the bytes.
Details here: https://git-scm.com/book/be/v2/Git-Internals-Git-Objects

sluongng · 2025-11-25T08:42:38Z

The GITSHA1 implementation is just the SHA1 of the bytes with the "blob \0" header prefix prepended to the bytes.
Details here: https://git-scm.c

This used to be the standard until 2017.
See https://github.com/git/git/blob/6ab38b7e9cc7adafc304f3204616a4debd49c6e9/Documentation/technical/hash-function-transition.adoc#background for more info.

The problem with probable collision attacks is that most CAS implementations today tend to assume a strong, collision-free cryptographic hash in use, thus skipping on collision check. At the very least, GITSHA1 should be defined as https://github.com/git/git/blob/master/sha1dc/sha1.h, which comes with additional checks against collision attacks.

Im also curious: Do you use git tree objects or DirectoryNode proto to present file trees here?

lberki · 2025-11-26T06:30:05Z

My $.02: I do realize SHA-1 is quite long in the tooth these days. However, my understanding is that migrating a large code base is non-trivial since it changes every commithash and it looks like GitHub doesn't even support SHA-256?

In addition to the above adoption concerns, SHA-1 will always be an opt-in on the Bazel side so the default is a secure hash. Does this allay your concerns?

jgalmes2 · 2025-12-09T17:29:17Z

Note the intention is just to use GITSHA1 to obtain hashes from extended attributes from CASFS, there's no intention of using git tree objects or anything like that

fmeum · 2025-12-09T17:33:43Z

Just curious, are you referring to https://github.com/google/casfs (archived) or something else?

sluongng · 2025-12-09T18:22:49Z

@jgalmes2 At minimum, I think this PR needs additional documentation. At current state, folks don't know how to implement this new digest function on the servers/clients. Yes there is instruction on how to hash a blob, but there isn't clear instructions on what to do with a tree/directory object: do you use the git tree, or the Directory proto message? If the proto message is used, then should we hash all the proto messages with the blob prefix? or some with the tree prefix?

See https://github.com/bazelbuild/remote-apis/pull/257/files for a similar attempt.

I would also encourage you (or whoever is implementing this on the Bazel side) to join the monthly working group meeting to clarify this further.

I would love it if @aehlig and @asartori86 could help review this PR, since they have already implemented a version of GITSHA1 in their own build tools. We want to make sure that different interested parties can agree on the same spec here.

aehlig · 2025-12-09T19:36:51Z

I would love it if @aehlig and @asartori86 could help review this PR, since they have already implemented a version of GITSHA1 in their own build tools. We want to make sure that different interested parties can agree on the same spec here.

As @sluongng already mentioned, the documentation could certainly be improved and the handling of trees be clarified.

In the justbuild project, we found that the best interaction with git (which is used by many projects as version-control system) was to also include trees as first-class objects. In this way, not only the hashes of individual files could be read off directly from git without looking at the files, but also whole directories. The latter was especially useful (IO, traffic, and round trips) when whole trees were already known to the remote-execution, e.g., include directories, or project trees for foreign rules. Based on these considerations, the custom protocol extension that #257 refers to was designed and the justbuild project made good experience with it.

I understand that the goal of this PR is different; but as it is, at the pure code (not comment) level identical to earlier PR, the scope, use case, and the way trees should be treated should be clarified.

tjgq · 2025-12-09T19:55:17Z

As @sluongng already mentioned, the documentation could certainly be improved and the handling of trees be clarified.

How about we replace the comment added in this PR with something like:

// The SHA-1 variant used by Git blob objects. A GITSHA1 digest is obtained by prepending the 6-byte string "blob \0" to the message to be digested and hashing the result with the SHA-1 algorithm.

If we so insist, we could also add a followup sentence like "Use of this digest function has no effect on the representation of directory trees", but I feel like being precise about this being the hash function used by Git blobs (and not other Git objects), as well as spelling out the algorithm instead of linking to the Git documentation, does 90% of the work to avoid the implication that the directory representation is affected.

lberki · 2025-12-10T08:17:21Z

Channeling @jgalmes2 here, the intent is to be able to upload files individually, but not directories.

The use case is that the have a file system backed a git repository. So they can apparently do directory listings cheaply and they also have the GITSHA1 hash (but not any other hash) of every file available without having to checksum it locally. Which makes GITSHA1 the natural kind of hash to use for remote execution, too.

I suppose using tree hashes would be marginally better, but that would only come into play if one would need to upload a full directory tree to remote execution (i.e. glob(["dir/**"]) without and BUILD files in this tree), which I imagine is not something that happens that frequently. It would also require a way to transmit the tree hash from the file system to Bazel so it's a lot of work for not a lot of benefit.

sluongng · 2025-12-10T09:02:39Z

Worth to note here that the digest function is not only applied to blobs in a build, it's also applied to all protobuf messages, such as Action, Command and Directory.

Typically the client would send an ExecuteRequest to the server with a digest function specified inside the request message. That digest function will then be used for all related protobuf messages and blobs inside the Action merkle tree.

The Directory message is used in this merkle tree to describe the input root tree of each action, as well as potential output directories.

If the GITSHA1 digest function is only meant for blob and not proto messages (including Directory), then the documents should be specific about what should be used instead. If we were to apply this to all proto messages, then we should be explicit about tree objects should be presented as Directory proto messages and not as git tree. And that they should be hashed with a blob prefix.

jgalmes2 · 2025-12-10T20:33:15Z

+Haamed Gheibi ***@***.***>

…

On Wed, Dec 10, 2025 at 1:03 AM Son Luong Ngoc ***@***.***> wrote: *sluongng* left a comment (bazelbuild/remote-apis#352) <#352 (comment)> Worth to note here that the digest function is not only applied to blobs in a build, it's also applied to all protobuf messages, such as Action, Command and Directory. Typically the client would send an ExecuteRequest to the server with a digest function specified inside the request message. That digest function will then be used for all related protobuf messages and blobs inside the Action merkle tree. The Directory message is used in this merkle tree to describe the input root tree of each action, as well as potential output directories. If the GITSHA1 digest function is only meant for blob and not proto messages (including Directory), then the documents should be specific about what should be used instead. If we were to apply this to all proto messages, then we should be explicit about tree objects should be presented as Directory proto messages and not as git tree. And that they should be hashed with a blob prefix. — Reply to this email directly, view it on GitHub <#352 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/BSAW26MAE6N4VBJLLA4C4ED4A7OUPAVCNFSM6AAAAACM2P6BVCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTMMZWGA2TINJWGY> . You are receiving this because you were mentioned.Message ID: ***@***.***>

tjgq · 2025-12-11T10:40:21Z

If the GITSHA1 digest function is only meant for blob and not proto messages (including Directory), then the documents should be specific about what should be used instead. If we were to apply this to all proto messages, then we should be explicit about tree objects should be presented as Directory proto messages and not as git tree. And that they should be hashed with a blob prefix.

I think we already agree that we don't want to change the representation of tree objects.

If you regard the prepending of blob \0 as part of the definition of the GITSHA1 hash function (which in my opinion is the correct way to think about this), then not imposing the same requirement on proto messages would amount to saying that protos stored in the CAS must use a different hash function than non-proto blobs.

I'm not even sure the protocol could work under those circumstances: doesn't the ExecuteRequest.digest_function field (or the UpdateActionResult.digest_function field for the uploading case) effectively determine not only the hash function used by the Action proto, but also by the Command proto and the Merkle tree, since there's no separate digest_function field in the Action or Command messages?

I think once you define a new hash function, the notion that the hashing algorithm must apply identically to every object stored in the CAS under that hash function is the only reasonable reading of the spec.

@sluongng if you still believe further clarification is warranted, may I ask you to propose an alternative to the wording I suggested in my other comment above?

EdSchouten · 2025-12-11T12:22:46Z

build/bazel/remote/execution/v2/remote_execution.proto

    // See https://github.com/BLAKE3-team/BLAKE3.
    BLAKE3 = 9;
+
+    // The SHA1 variant used by git.


This could use a more thorough description. Maybe something like this?

Identical to SHA1, except that "blob ${sizeBytes}\0" is prepended to the blob's contents before hashing, where ${sizeBytes} corresponds to the decimal size of the original blob. This allows hashes of files to be converted from and to the ones used by the Git version control system.

Wait. Now I'm confused. Do we literally prepend "blob \0", or should we include the size as well? @tjgq mentions "blob \0", but the Git documentation tells me otherwise. https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

You do need the size

master ~/work/buildbuddy-io/buildbuddy> { printf "blob %s\0" "$(stat -c%s README.md)"; cat README.md; } | sha1sum 46ec5d933a27bcac82000d1c386f071219273b45 - master ~/work/buildbuddy-io/buildbuddy> cat README.md| git hash-object --stdin 46ec5d933a27bcac82000d1c386f071219273b45

sluongng · 2025-12-11T13:21:59Z

I agree with the description Ed proposed above.

I would still like to push for at least 1 sentence noting that GITSHA1 (and SHA1) is not recommended for multi-tenant setup due to collision attacks, and implementations should look into https://github.com/git/git/blob/e85ae279b0d58edc2f4c3fd5ac391b51e1223985/Documentation/technical/hash-function-transition.adoc?plain=1#L1 to either try the sha256 transition, or at least use Git's SHA1DC for collision detection. Disclosing these vulnerabilities would help newer folks avoid unnecessary footguns. This can be done in a separate PR though.

I realized that for directory, unless you use the entire git checkout (working copy) as input tree for every action, the tree objects for individual action would rarely match the tree objects inside the git repo (because we often just use some sub dirs, or even just some files inside a dir). So I don't see much benefit in include trees as first-class objects as @aehlig described. Perhaps I am missing something obvious?

I hope that whatever we are proposing here could work for multiple parties (i.e. JustBuild and Bazel) so that we can have interoperability between different clients and servers.

GITSHA1 is the digest function used by git. Details here: https://git-scm.com/book/be/v2/Git-Internals-Git-Objects.

jgalmes2 · 2025-12-11T15:55:46Z

I updated the comment to match Ed's description above.

aehlig · 2025-12-11T16:11:05Z

I realized that for directory, unless you use the entire git checkout (working copy) as input tree for every action, the tree objects for individual action would rarely match the tree objects inside the git repo (because we often just use some sub dirs, or even just some files inside a dir). So I don't see much benefit in include trees as first-class objects as @aehlig described. Perhaps I am missing something obvious?

Having tree as first class objects (including the CAS invariant that the presence of a tree promises the presence of its parts) also reduces the effort of ensuring trees are present (only ask for one hash instead of a long list of of files), e.g.,

header include directories for actions (likely a build against that set of libraries happened before),
synchronizing local trees, e.g., tool-chain configurations needed for delegated builds against repositories not present locally, or
handling tree outputs of actions, e.g., for computed repositories

While most of those concepts are not implemented in bazel, they are the reason why Justbuild prefers an extension of the protocol with first-class trees.

I hope that whatever we are proposing here could work for multiple parties (i.e. JustBuild and Bazel) so that we can have interoperability between different clients and servers.

Justbuild can already interoperate with standard remote exeuction (basically by rehashing trees to directories and verifying all blobs of a tree input are present in CAS) and many of the mentioned savings require proper trees. So the benefit for Justbuild of this addition to the protocol will be minimal (if there is the capacity to implement it).

cc @spc90

That said, I do not oppose this addition, if properly documented.

mostynb · 2025-12-13T18:15:01Z

build/bazel/remote/execution/v2/remote_execution.proto

+
+    // Identical to SHA1, except that "blob ${sizeBytes}\0" is prepended to
+    // the blob's contents before hashing, where ${sizeBytes} corresponds to
+    // the decimal size of the original blob. This allows hashes of files to


IMO "decimal size" makes this sound like a non-integer value. Would something like "where ${sizeBytes} is the number of bytes in the original blob" be better?

I explicitly used the term "decimal", because it shouldn't be provided in little/big endian binary form, or as hexadecimal characters, or ...

How about "decimal integer size" ?

I'm ambivalent.

Since we have multiple approvals and the sole remaining disagreement is on how the comment is worded (which can always be amended later), I'll merge this now.

Add GITSHA1 digest function.

aa1dffe

GITSHA1 is the digest function used by git. Details here: https://git-scm.com/book/be/v2/Git-Internals-Git-Objects.

jgalmes2 requested review from EdSchouten, EricBurnett, bwkimmel, fmeum, sluongng, sstriker, tjgq and ulfjack as code owners November 21, 2025 16:51

EdSchouten reviewed Dec 11, 2025

View reviewed changes

Add GITSHA1 digest function.

7ba7909

GITSHA1 is the digest function used by git. Details here: https://git-scm.com/book/be/v2/Git-Internals-Git-Objects.

jgalmes2 requested a review from werkt as a code owner December 11, 2025 15:52

EdSchouten approved these changes Dec 12, 2025

View reviewed changes

fmeum approved these changes Dec 13, 2025

View reviewed changes

mostynb reviewed Dec 13, 2025

View reviewed changes

tjgq merged commit b02e15a into bazelbuild:main Jan 20, 2026
1 check passed

EdSchouten mentioned this pull request Jan 26, 2026

Add support for digest function GITSHA1 buildbarn/bb-storage#308

Merged

Comments

Conversation

jgalmes2 commented Nov 21, 2025

Uh oh!

sluongng commented Nov 21, 2025

Uh oh!

jgalmes2 commented Nov 24, 2025

Uh oh!

sluongng commented Nov 24, 2025

Uh oh!

jgalmes2 commented Nov 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sluongng commented Nov 25, 2025

Uh oh!

lberki commented Nov 26, 2025

Uh oh!

jgalmes2 commented Dec 9, 2025

Uh oh!

fmeum commented Dec 9, 2025

Uh oh!

sluongng commented Dec 9, 2025

Uh oh!

aehlig commented Dec 9, 2025

Uh oh!

tjgq commented Dec 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lberki commented Dec 10, 2025

Uh oh!

sluongng commented Dec 10, 2025

Uh oh!

jgalmes2 commented Dec 10, 2025 via email

Uh oh!

tjgq commented Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

EdSchouten Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

EdSchouten Dec 11, 2025

Choose a reason for hiding this comment

Uh oh!

sluongng Dec 11, 2025

Choose a reason for hiding this comment

Uh oh!

sluongng commented Dec 11, 2025

Uh oh!

jgalmes2 commented Dec 11, 2025

Uh oh!

aehlig commented Dec 11, 2025

Uh oh!

mostynb Dec 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

EdSchouten Dec 13, 2025

Choose a reason for hiding this comment

Uh oh!

mostynb Dec 13, 2025

Choose a reason for hiding this comment

Uh oh!

EdSchouten Dec 13, 2025

Choose a reason for hiding this comment

Uh oh!

tjgq Jan 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

jgalmes2 commented Nov 24, 2025 •

edited

Loading

tjgq commented Dec 9, 2025 •

edited

Loading

tjgq commented Dec 11, 2025 •

edited

Loading

EdSchouten Dec 11, 2025 •

edited

Loading

mostynb Dec 13, 2025 •

edited

Loading