Description
The folks at Cursor recently published a pretty interesting blog post, https://cursor.com/blog/secure-codebase-indexing, in which they said:
When a new user joins, the client computes the Merkle tree for a new codebase and derives a value called a similarity hash (simhash) from that tree. This is a single value that acts as a summary of the file content hashes in the codebase.
The client uploads the simhash to the server. The server then uses it as a vector to search in a vector database composed of all the other current simhashes for all other indexes in Cursor in the same team (or from the same user) as the client. For each result returned by the vector database, we check whether it matches the client similarity hash above a threshold value. If it does, we use that index as the initial index for the new codebase.
Such a simhash could be derived from the Merkle tree calculation and added to a vector database.
Merkle trees whose simhashes are close together are more likely to have overlapping files than those whose simhashes are further apart.
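To make the idea concrete, here is a minimal sketch of how such a simhash could be computed from the file content digests in a Merkle tree. It is only an illustration under assumptions: the Cursor post does not describe their exact construction, and the 64-bit width, the use of SHA-256 as the per-file feature hash, and the names `simhash` and `hamming_distance` are all hypothetical choices made here.

```python
import hashlib
from typing import Iterable

SIMHASH_BITS = 64  # assumed width; the blog post does not specify one


def _feature_hash(digest: str) -> int:
    """Map one file content digest to a 64-bit integer (assumed scheme)."""
    raw = hashlib.sha256(digest.encode("utf-8")).digest()
    return int.from_bytes(raw[:8], "big")


def simhash(file_digests: Iterable[str]) -> int:
    """Classic simhash over the set of file content digests.

    Each digest votes +1 or -1 on every bit position; the final bit is 1
    when the positive votes win. Trees sharing most files therefore agree
    on most bits.
    """
    counts = [0] * SIMHASH_BITS
    for digest in set(file_digests):  # set: order and duplicates don't matter
        h = _feature_hash(digest)
        for bit in range(SIMHASH_BITS):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    result = 0
    for bit, votes in enumerate(counts):
        if votes > 0:
            result |= 1 << bit
    return result


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; smaller means more similar trees."""
    return bin(a ^ b).count("1")
```

With this construction, two input roots that differ in a handful of files should end up a few bits apart, while unrelated roots should differ in roughly half of the bits. A vector database is one way to search these values; for small fleets a linear scan over Hamming distances would do the same job.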
In practice, this means that the client can compute this simhash when it creates the input root Merkle tree. The simhash could then be added to the request message (or sent as a header). Schedulers can use the hash to route actions to workers that have recently downloaded similar input roots, reducing input downloads.
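On the scheduler side, the routing step could look roughly like the sketch below. Everything here is hypothetical and not part of any existing API: the `Worker` type, the `recent_simhashes` field, the `pick_worker` function, and the `max_distance` threshold of 8 bits are placeholders to show the shape of the idea.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Worker:
    """Hypothetical worker record kept by the scheduler."""
    name: str
    # Simhashes of input roots this worker recently downloaded.
    recent_simhashes: List[int] = field(default_factory=list)


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two simhashes."""
    return bin(a ^ b).count("1")


def pick_worker(
    workers: List[Worker], action_simhash: int, max_distance: int = 8
) -> Optional[Worker]:
    """Return the worker whose recent input roots look most similar.

    Returns None when no worker is within max_distance, so the caller
    can fall back to its normal scheduling policy.
    """
    best: Optional[Worker] = None
    best_distance = max_distance + 1
    for worker in workers:
        for seen in worker.recent_simhashes:
            distance = hamming_distance(action_simhash, seen)
            if distance < best_distance:
                best, best_distance = worker, distance
    return best
```

Since the simhash would only be a routing hint, a miss or a stale entry costs nothing beyond the input download that would have happened anyway.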