Securely indexing large codebases

5 min read · beginner

Overview

This article explains how Cursor optimizes semantic search indexing for large codebases by using Merkle trees and similarity hashing to securely reuse existing indexes across team members. The approach reduces time-to-first-query from hours to seconds for the largest repositories while maintaining strict security guarantees that prevent code leakage between users.

What You'll Learn

1. How Merkle trees enable efficient incremental codebase indexing by detecting exactly which files changed
2. How similarity hashing allows secure reuse of teammate indexes to dramatically reduce onboarding time
3. Why cryptographic content proofs prevent code leakage when sharing indexes across team members
4. How embedding caching by chunk content avoids redundant computation for unchanged code

Prerequisites & Requirements

  • Basic understanding of cryptographic hash functions (e.g., SHA-256)
  • Familiarity with tree data structures and how Merkle trees work
  • Understanding of semantic search and vector embeddings (optional)
  • Experience working with large codebases in a team environment (optional)

Key Questions Answered

How does Cursor index large codebases efficiently for semantic search?
Cursor uses a Merkle tree structure where every file has a cryptographic hash, and folder hashes are derived from their children. When files change, only the affected branches of the tree are compared and synced, avoiding full reprocessing. Changed files are split into syntactic chunks, and embeddings are cached by chunk content so unchanged chunks skip the expensive embedding step entirely.
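The tree walk described above can be sketched as follows. This is a minimal illustration, not Cursor's implementation: the nested-dict layout, the `name:hash` folder encoding, and the function names `build_tree` and `changed_paths` are all assumptions made for the example, and deletions are not handled.

```python
import hashlib
from typing import Optional

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def _hash_dirs(node: dict) -> str:
    """Fill in folder hashes bottom-up and return this node's hash."""
    if "children" in node:
        # Assumed encoding: a folder hash covers its sorted "name:child_hash" lines.
        lines = [f"{name}:{_hash_dirs(child)}"
                 for name, child in sorted(node["children"].items())]
        node["hash"] = sha256_hex("\n".join(lines).encode("utf-8"))
    return node["hash"]

def build_tree(files: dict) -> dict:
    """Build a Merkle tree from a {"dir/file.py": bytes} mapping."""
    root = {"children": {}}
    for path, content in files.items():
        node = root
        parts = path.split("/")
        for part in parts[:-1]:
            node = node["children"].setdefault(part, {"children": {}})
        node["children"][parts[-1]] = {"hash": sha256_hex(content)}
    _hash_dirs(root)
    return root

def changed_paths(local: dict, remote: Optional[dict], prefix: str = "") -> list:
    """Return added or modified file paths, descending only where hashes differ."""
    if remote is not None and local["hash"] == remote["hash"]:
        return []  # identical subtree: skip it entirely
    if "children" not in local:
        return [prefix]  # a file whose hash differs or is new
    remote_children = remote.get("children", {}) if remote else {}
    out = []
    for name, child in sorted(local["children"].items()):
        path = f"{prefix}/{name}" if prefix else name
        out.extend(changed_paths(child, remote_children.get(name), path))
    return out

old = build_tree({"src/a.py": b"a", "src/b.py": b"b", "README.md": b"r"})
new = build_tree({"src/a.py": b"a2", "src/b.py": b"b", "README.md": b"r"})
print(changed_paths(new, old))  # ['src/a.py']
```

Editing `src/a.py` changes only that file's hash, the `src` folder hash, and the root, so the walk never looks inside unchanged subtrees.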
How does Cursor securely share codebase indexes between team members?
When a new user joins, their client computes a similarity hash (simhash) from the Merkle tree and uploads it. The server searches a vector database of existing simhashes from the same team. If a match exceeds a threshold, the existing index is copied. The client uploads its full Merkle tree as content proofs, and the server filters search results to only return files the client can cryptographically prove it has.
What is a similarity hash and how is it used for index matching?
A similarity hash (simhash) is a single value derived from a Merkle tree that summarizes all the file content hashes in a codebase. It acts as a compact fingerprint for the entire repository. The server uses it as a vector to search against all other simhashes in the same team's vector database, finding the most similar existing index to reuse as a starting point.
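The source does not spell out how the simhash is computed, so the sketch below uses the classic bit-voting simhash as a stand-in. The 64-bit width, the function names, and the synthetic file hashes are assumptions for illustration only.

```python
import hashlib

BITS = 64  # assumed fingerprint width

def simhash(file_hashes: list) -> int:
    """Fold many hex file hashes into one fingerprint by per-bit voting."""
    votes = [0] * BITS
    for h in file_hashes:
        value = int(h[:16], 16)  # first 64 bits of the hex digest
        for i in range(BITS):
            votes[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)

def similarity(a: int, b: int) -> float:
    """Fraction of fingerprint bits that agree (1.0 means identical)."""
    return 1 - bin(a ^ b).count("1") / BITS

def fake_hashes(label: str, count: int) -> list:
    # Synthetic stand-ins for real file content hashes.
    return [hashlib.sha256(f"{label}-{i}".encode()).hexdigest() for i in range(count)]

base = fake_hashes("file", 200)
edited = base[:190] + fake_hashes("edited", 10)   # teammate clone with a few edits
unrelated = fake_hashes("other", 200)             # a different repository

near = similarity(simhash(base), simhash(edited))
far = similarity(simhash(base), simhash(unrelated))
print(near > far)
```

Because each bit is decided by a majority vote over all files, changing a small share of files flips only a few bits, which is what lets a nearest-neighbor search over fingerprints surface a teammate's nearly identical clone.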
How does Cursor prevent code leakage when reusing shared indexes?
Cursor leverages the cryptographic properties of the Merkle tree. Each node is a hash that can only be computed with the actual file content. When using a copied index, the client uploads its Merkle tree as content proofs. During search, the server checks result file hashes against the client's tree. If the client can't prove it has a file, that result is dropped from search results.
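A minimal sketch of the filtering step, assuming search hits carry a `file_hash` field and the proofs are reduced to a set of file-level hashes (the source describes uploading the full Merkle tree; the field and function names here are hypothetical):

```python
import hashlib

def sha256_hex(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def filter_by_proofs(hits: list, client_proofs: set) -> list:
    """Keep only hits whose file hash appears in the client's proofs."""
    return [hit for hit in hits if hit["file_hash"] in client_proofs]

# The client can only produce hashes for files it actually holds.
client_files = {"src/app.py": b"print('hi')", "README.md": b"# demo"}
client_proofs = {sha256_hex(content) for content in client_files.values()}

# Hits from the copied index may include files the client never had.
index_hits = [
    {"path": "src/app.py", "file_hash": sha256_hex(b"print('hi')")},
    {"path": "secrets/keys.py", "file_hash": sha256_hex(b"API_KEY = '...'")},
]

visible = filter_by_proofs(index_hits, client_proofs)
print([hit["path"] for hit in visible])  # ['src/app.py']
```

The client cannot forge a proof for `secrets/keys.py` without the file's contents, so that hit is dropped.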
How much faster is codebase indexing with index reuse compared to building from scratch?
For the median repo, time-to-first-query drops from 7.87 seconds to 525 milliseconds. At the 90th percentile, it falls from 2.82 minutes to 1.87 seconds. At the 99th percentile, it drops from 4.03 hours to just 21 seconds. The improvement compounds with repository size, with the largest repos seeing the most dramatic speedups.
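To make the compounding concrete, the quoted figures can be converted to seconds and divided. The numbers are taken from the answer above; only the unit conversion is added.

```python
# (without reuse, with reuse), both in seconds
timings = {
    "median": (7.87, 0.525),
    "p90": (2.82 * 60, 1.87),     # 2.82 minutes
    "p99": (4.03 * 3600, 21.0),   # 4.03 hours
}

speedups = {name: before / after for name, (before, after) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.0f}x faster")
# median: 15x faster
# p90: 90x faster
# p99: 691x faster
```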
Why does Cursor use Merkle trees instead of comparing all files directly?
In a workspace with 50,000 files, just filenames and SHA-256 hashes total roughly 3.2 MB. Without a Merkle tree, this entire dataset would need to be transferred on every update. With the tree structure, Cursor walks only the branches where hashes differ, significantly reducing data transfer. Small edits change only the edited file's hash and its parent directories up to the root.
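One way to arrive at a figure near 3.2 MB is shown below. The 32-byte average path length, the tree depth of 5, and the fanout of 20 are assumptions picked for illustration; the source gives only the 50,000-file count and the approximate total.

```python
FILES = 50_000
HASH_BYTES = 32        # SHA-256 digest size
AVG_NAME_BYTES = 32    # assumption: average path length in bytes

# Flat comparison: every name and hash is sent on every update.
flat_bytes = FILES * (HASH_BYTES + AVG_NAME_BYTES)
print(flat_bytes / 1_000_000)  # 3.2

# Merkle walk for a single edited file (assumed tree shape):
DEPTH = 5      # assumption: folders between the root and the edited file
FANOUT = 20    # assumption: entries per folder
walk_bytes = DEPTH * FANOUT * (HASH_BYTES + AVG_NAME_BYTES)
print(walk_bytes)  # 6400
```

Under these assumptions, a single-file edit compares a few kilobytes of hashes along one path instead of the full 3.2 MB list.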
How similar are team members' copies of the same codebase?
Clones of the same codebase average 92% similarity across users within an organization. This high degree of overlap is what makes index reuse practical and effective, since most of the embedding work has already been done by a teammate and can be shared rather than recomputed from scratch.
What happens after a copied index is used for initial queries?
While the client queries the copied index immediately, a background sync reconciles the remaining differences between the client's local codebase and the copied index. Once the client and server Merkle tree roots match completely, the server deletes the content proofs and all future queries run against the fully synced index without any filtering overhead.
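The hand-off from filtered to unfiltered queries can be sketched as a small piece of server-side state. The class and method names are hypothetical, and the root hashes are placeholder strings.

```python
class CopiedIndexSession:
    """Hypothetical per-client state while a copied index is being reconciled."""

    def __init__(self, client_root: str, index_root: str, proofs: set):
        self.client_root = client_root
        self.index_root = index_root
        self.proofs = set(proofs)  # becomes None once the roots match

    def search(self, hits: list) -> list:
        if self.proofs is None:
            return hits  # fully synced: no filtering needed
        return [h for h in hits if h["file_hash"] in self.proofs]

    def on_sync_progress(self, new_index_root: str) -> None:
        """Called as background sync updates the server-side index."""
        self.index_root = new_index_root
        if self.index_root == self.client_root:
            self.proofs = None  # roots match: delete the content proofs

session = CopiedIndexSession("root-client", "root-teammate", {"h1"})
hits = [{"path": "a.py", "file_hash": "h1"}, {"path": "b.py", "file_hash": "h2"}]
print([h["path"] for h in session.search(hits)])  # ['a.py']

session.on_sync_progress("root-client")
print(session.proofs is None)  # True
```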

Key Statistics & Figures

  • Semantic search accuracy improvement: 12.5% (average improvement in response accuracy when using semantic search)
  • Codebase similarity across team members: 92% (average similarity of clones of the same codebase within an organization)
  • Median time-to-first-query without reuse: 7.87 seconds (median repo indexing time without index sharing)
  • Median time-to-first-query with reuse: 525 milliseconds (median repo indexing time with teammate index reuse)
  • P90 time-to-first-query without reuse: 2.82 minutes (90th percentile repo indexing time without index sharing)
  • P90 time-to-first-query with reuse: 1.87 seconds (90th percentile repo indexing time with teammate index reuse)
  • P99 time-to-first-query without reuse: 4.03 hours (99th percentile repo indexing time without index sharing)
  • P99 time-to-first-query with reuse: 21 seconds (99th percentile repo indexing time with teammate index reuse)
  • Hash data size for 50K files: 3.2 MB (approximate size of filenames and SHA-256 hashes for a 50,000-file workspace)
  • Semantic search availability threshold: 80% (percentage of indexing that must be completed before semantic search becomes available)

Technologies & Tools

  • Merkle Tree (Data Structure): Incremental file change detection and sync optimization using cryptographic hashes
  • SHA-256 (Cryptography): Cryptographic hashing of file contents for change detection and content proofs
  • Vector Database (Database): Searching similarity hashes to find reusable indexes from teammates
  • Semantic Search (Search): Enabling natural language code search across large codebases using embeddings
  • Embeddings (AI/ML): Converting syntactic code chunks into vector representations for semantic search

Key Actionable Insights

1. Use Merkle trees for incremental synchronization of large datasets. By structuring data as a tree of cryptographic hashes, you can identify exactly which portions have changed and sync only the affected branches, avoiding costly full comparisons.
   This is especially effective for codebases or file systems where most content remains stable between updates. For a 50,000-file workspace, sync overhead drops from roughly 3.2 MB of hash data per update to only the hashes along the changed branches.
2. Cache embeddings by content hash rather than by file path to maximize reuse. When files are split into syntactic chunks and embeddings are cached by chunk content, unchanged chunks automatically hit the cache even if the file they belong to has been modified elsewhere.
   This approach avoids the expensive embedding generation step for the majority of code that hasn't changed, keeping search responses fast without redundant computation at inference time.
3. Leverage similarity hashing (simhash) to find near-duplicate datasets across users and enable index sharing. By deriving a single representative vector from a Merkle tree, you can efficiently search for similar indexes in a vector database and reuse them as starting points.
   This technique is particularly valuable in team environments where codebases average 92% similarity across users, making most of the indexing work redundant if computed independently for each user.
4. Implement cryptographic content proofs to enforce access control when sharing indexes. Rather than relying on access control lists, use the inherent property of cryptographic hashes (in practice, a hash can only be produced from the original content) to verify that a user actually possesses a file before returning search results for it.
   This gives a proof-of-possession guarantee, similar in spirit to a zero-knowledge proof though not a formal one, that prevents code leakage between team members with different file access while still allowing immediate querying against a shared index.
5. Design onboarding workflows that allow immediate use while background processing catches up. By letting users query against a copied index with filtered results while full synchronization happens asynchronously, you eliminate the wait time that blocks productivity.
   This pattern of optimistic access with background reconciliation reduces time-to-first-query from hours to seconds on large repos and can be applied to any system where initial setup is expensive.
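The second insight, caching embeddings by chunk content, can be illustrated with a small sketch. The blank-line splitter stands in for real syntactic chunking, and `fake_embed` stands in for a real embedding model; both are placeholders, as are the class and function names.

```python
import hashlib

class EmbeddingCache:
    """Cache keyed by the SHA-256 of chunk text, not by file path."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._vectors = {}
        self.misses = 0  # number of times the embedding function actually ran

    def embed(self, chunk: str):
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._vectors:
            self.misses += 1
            self._vectors[key] = self._embed_fn(chunk)
        return self._vectors[key]

def split_chunks(source: str) -> list:
    # Placeholder for syntactic chunking: split on blank lines.
    return [c for c in source.split("\n\n") if c.strip()]

def fake_embed(chunk: str) -> list:
    # Placeholder for a real embedding model call.
    return [float(len(chunk))]

cache = EmbeddingCache(fake_embed)
v1 = "def a():\n    return 1\n\ndef b():\n    return 2\n\ndef c():\n    return 3\n"
v2 = v1.replace("return 2", "return 20")  # edit one function only

for chunk in split_chunks(v1):
    cache.embed(chunk)
first_pass = cache.misses                 # 3: every chunk is new

for chunk in split_chunks(v2):
    cache.embed(chunk)
second_pass = cache.misses - first_pass   # 1: only the edited chunk is re-embedded

print(first_pass, second_pass)  # 3 1
```

Keying on content rather than path also means a chunk that moves to another file, or that a teammate has already embedded, hits the same cache entry.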

Common Pitfalls

1. Rebuilding the entire codebase index from scratch for every new user or machine. Without index reuse, each team member independently processes all files, generating embeddings that are identical to what teammates have already computed. This creates hours of redundant work on large repos.
   Since team codebases average 92% similarity, the vast majority of this work is duplicated. Implementing index sharing eliminates this repeated computation.
2. Transferring full file hash lists on every sync instead of using a hierarchical comparison structure. A naive approach of comparing all filenames and hashes for a 50,000-file workspace sends roughly 3.2 MB per update, even when only a single file has changed.
   Using a Merkle tree allows walking only the branches where hashes differ, dramatically reducing the amount of data transferred during incremental syncs.
3. Sharing indexes across team members without access control verification, which could leak code between users who have access to different parts of the codebase. Simply copying an index and allowing full search against it would expose files the new user may not have.
   Cryptographic content proofs based on the Merkle tree ensure that search results are filtered to only include files the querying user can prove they possess.
4. Blocking users from querying until the full index is built or synced. Making semantic search unavailable during the indexing process forces users to wait, especially on large repos where indexing can take hours.
   Allowing immediate queries against a copied index with content-proof filtering lets users start working right away while background sync reconciles differences.

Related Concepts

Merkle Trees
Cryptographic Hash Functions
SHA-256
Similarity Hashing (simhash)
Vector Databases
Semantic Search
Vector Embeddings
Content-addressable Storage
Incremental Synchronization
Code Indexing
Zero-knowledge Proofs
Embedding Caching
Syntactic Code Chunking