Routing hint for Execute rpc #360

@sluongng

The folks at Cursor recently published a pretty interesting blog post, https://cursor.com/blog/secure-codebase-indexing, in which they say:

When a new user joins, the client computes the Merkle tree for a new codebase and derives a value called a similarity hash (simhash) from that tree. This is a single value that acts as a summary of the file content hashes in the codebase.

The client uploads the simhash to the server. The server then uses it as a vector to search in a vector database composed of all the other current simhashes for all other indexes in Cursor in the same team (or from the same user) as the client. For each result returned by the vector database, we check whether it matches the client similarity hash above a threshold value. If it does, we use that index as the initial index for the new codebase.

Such a simhash could be derived from the Merkle tree calculation and added to a vector database.
Merkle trees with simhashes close together are more likely to have overlapping files than the ones with simhashes further away.
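To make the idea concrete, here is a minimal sketch of a classic 64-bit simhash folded over file content digests. The blog post does not spell out Cursor's exact construction, so the function names, the 64-bit width, and the re-hashing step with SHA-256 are all assumptions for illustration only.

```python
import hashlib

SIMHASH_BITS = 64  # assumed fingerprint width, not taken from the blog post


def simhash(file_digests):
    """Fold an iterable of file digest strings into one 64-bit fingerprint."""
    counts = [0] * SIMHASH_BITS
    for digest in file_digests:
        # Re-hash each digest and keep the first 8 bytes as a 64-bit feature.
        feature = int.from_bytes(
            hashlib.sha256(digest.encode()).digest()[:8], "big"
        )
        # Each feature votes +1 or -1 on every bit position.
        for bit in range(SIMHASH_BITS):
            counts[bit] += 1 if (feature >> bit) & 1 else -1
    # A bit is set in the result when the positive votes win.
    result = 0
    for bit, votes in enumerate(counts):
        if votes > 0:
            result |= 1 << bit
    return result


def hamming_distance(a, b):
    """Number of differing bits; a smaller value suggests more shared files."""
    return bin(a ^ b).count("1")
```

Because every file votes independently, the result does not depend on the order in which the tree is walked, and changing a few files out of many tends to flip only a few bits.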

In practice, this means the client can compute the simhash when it builds the Input Root Merkle Tree. The simhash could then be added to the request message (or sent as a header). Schedulers could use the hash to route actions to workers that have recently downloaded similar input roots, reducing input downloads.
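A rough sketch of what the scheduler side could look like, assuming 64-bit simhashes compared by Hamming distance. `SimhashRouter`, `MAX_DISTANCE`, and `HISTORY` are hypothetical names and values, and a linear scan stands in for the vector database that Cursor describes.

```python
from collections import deque

MAX_DISTANCE = 16  # assumed similarity threshold, in differing bits
HISTORY = 32       # assumed number of recent simhashes kept per worker


def hamming_distance(a, b):
    """Number of differing bits between two simhashes."""
    return bin(a ^ b).count("1")


class SimhashRouter:
    """Hypothetical helper that tracks recent input-root simhashes per worker."""

    def __init__(self, worker_ids):
        self.recent = {w: deque(maxlen=HISTORY) for w in worker_ids}

    def record(self, worker_id, simhash):
        # Call after a worker has downloaded the input root for an action.
        self.recent[worker_id].append(simhash)

    def pick(self, simhash):
        # Return the worker with the closest recent simhash, or None when
        # nothing is within MAX_DISTANCE.
        best_worker, best_distance = None, MAX_DISTANCE + 1
        for worker_id, hashes in self.recent.items():
            for seen in hashes:
                distance = hamming_distance(simhash, seen)
                if distance < best_distance:
                    best_worker, best_distance = worker_id, distance
        return best_worker
```

When `pick` returns `None`, the scheduler would fall back to its normal placement logic, so the hint stays purely advisory.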
