January 2026

Merkle Trees

How a simple tree structure powers everything from Git to Bitcoin.

I was listening to this podcast with the Cursor team on Lex Fridman a while back. It's a really interesting listen, with so many great ideas discussed. In the Scaling Problems section, Aman Sanger mentioned Merkle Trees. It rang a bell. Why did it sound familiar? So I googled it and found out why - it's used in blockchain. I took a class on blockchain and cryptocurrency in my final year at college. I heard about it there but never dug deeper (I took it just for credits!). I found it interesting, but then I got busy listening to the podcast and it slipped out of my mind.

Then, one late night, I was discussing with my friends (and colleagues as well) how to chunk up our codebase and write automated tests for it. I proposed that we find the smallest logical, unit-testable components (functions, classes, etc.) in a file and extract them into separate files. This would help us write focused tests, increase coverage, and make the codebase modular and composable. During this, Keval mentioned Merkle Trees. "Cursor does something similar with Merkle Trees, for indexing their codebase," he said. Again, Cursor and Merkle Trees were mentioned together. Now I had to learn about it.

So, I started digging in. What is it? How does it work? What are the applications? I was surprised to find out how widespread its use is. It's everywhere - Git uses it to track changes, Bitcoin uses it to verify transactions, distributed databases use it to sync data across nodes, and apparently, Cursor uses it to sync files. It was a fun mental excursion and I learned a lot. But because of my bad short-term memory, I knew I was going to forget it all. So this article is a way of making sure I remember it - just the way I learned it.

The Problem: Syncing at Scale

Let's take the specific problem from the podcast. Cursor needs to index your codebase by computing embeddings for it. These embeddings live in a database on their servers. Why do they need to embed the codebase? Well, that's a blog on its own! For now, think of it as context of the codebase. This way, your codebase is in sync with their servers.

But wait, what if you make a change in a file? Now the server and your local codebase are out of sync. How do they know which files have changed and need to be synced? A typical codebase has thousands of files. A monorepo might have hundreds of thousands. And we haven't even considered multiple people working on the same codebase, each with their own copy and their own branches.

On every change, Cursor has to decide which files have changed and need to be synced. On every sync, Cursor's server needs to answer a simple question: "What's different between what I have and what you have?"

Solution #1: Re-index Everything

The simplest approach: every time you want to sync, send your entire codebase to the server and re-embed everything. Pure brute force.

This is obviously wasteful. If you edited one file in a 10,000-file project, you're re-processing 9,999 files that haven't changed. Sending all these files on each sync is ridiculous. First, embedding your codebase is expensive, both computationally and financially. Second, sending your codebase on each sync will hammer your wifi. Third, it will take a long time to sync.

Solution #2: Track Timestamps

What if we just track when files were modified? Instead of sending the entire codebase, send over only the metadata - a list of changed files with their lastModified timestamps. The server can compare these against its stored timestamps, check if the files have been modified, and (if modified) figure out what changed.
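A minimal sketch of this metadata diff in TypeScript (the shape of `FileMeta` and the function name are my own, purely illustrative):

```typescript
interface FileMeta {
  path: string;
  lastModified: number; // epoch milliseconds
}

// Server-side: compare the client's timestamps against the stored ones.
// Any mismatch (or unknown file) is treated as "changed".
function changedSince(client: FileMeta[], server: Map<string, number>): string[] {
  return client
    .filter((f) => f.lastModified !== (server.get(f.path) ?? -1))
    .map((f) => f.path);
}

const server = new Map([["a.ts", 1000], ["b.ts", 2000]]);
const client: FileMeta[] = [
  { path: "a.ts", lastModified: 1000 }, // untouched
  { path: "b.ts", lastModified: 2500 }, // edited - or merely re-saved!
];
console.log(changedSince(client, server)); // ["b.ts"]
```

Note that the comparison can only see the timestamp, not the content - which is exactly where the problems below come from.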

This is much better than sending the entire codebase. We save so much bandwidth and time, for starters. We also don't have to re-embed the entire codebase - saving us a lot of computation and money. However, there are still some not-so-obvious problems with this:

  • Skewed clocks: You can't be sure your machine and the server are synchronized. Different machines have different clocks; your laptop might be 30 seconds ahead of the server. This can lead to false positives (or, more dangerously, false negatives) when comparing timestamps.
  • File operations: Copying, moving, or restoring from a backup can change timestamps without changing content. This causes unnecessary re-embeds.
  • Git operations: Git updates timestamps when you check out a branch, rebase, or cherry-pick a commit. Files will look changed because of the timestamp change, but the content hasn't changed at all.
  • No integrity check: Timestamps tell you when something changed, not what changed - that still has to be figured out on the server. They also can't tell you whether a file is corrupt.

Solution #3: Hash Every File

Better idea: compute a hash of each file's contents. Hashes are deterministic - the same content always produces the same hash. Send the server a list of (filename, hash) pairs. The server compares them against its stored hashes and identifies what changed. Problem solved! Now we can reliably tell when a file's content has changed.

But here's the scaling issue: you still need to compare every single file, every time. With 10,000 files, you need 10,000 hash comparisons. With 100,000 files, you need 100,000 comparisons. The work grows linearly with your codebase size. This is O(n) complexity.
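A minimal sketch of this per-file approach - the toy hash here mirrors the one in the implementation later in this article, and is not cryptographically secure:

```typescript
// Toy hash (same idea as the implementation later in this article);
// NOT cryptographically secure - use SHA-256 in practice
function hash(data: string): string {
  let h = 0;
  for (let i = 0; i < data.length; i++) {
    h = (h << 5) - h + data.charCodeAt(i);
    h = h & h; // keep h in 32-bit range
  }
  return Math.abs(h).toString(16).padStart(8, "0");
}

// Compare a local snapshot against the server's stored hashes.
// Every file is checked on every sync: O(n) comparisons.
function diffByHash(
  local: Map<string, string>,  // filename -> current content
  server: Map<string, string>, // filename -> stored hash
): string[] {
  const changed: string[] = [];
  for (const [file, content] of local) {
    if (server.get(file) !== hash(content)) changed.push(file);
  }
  return changed;
}

const local = new Map([["a.ts", "old"], ["b.ts", "new content"]]);
const server = new Map([["a.ts", hash("old")], ["b.ts", hash("old")]]);
console.log(diffByHash(local, server)); // ["b.ts"]
```

Correct, but the loop touches every file every time - that's the linear cost the next section removes.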

[Interactive demo: a 16-file codebase (index.ts, App.tsx, utils.ts, …) with 2 files changed since the last sync. Toggle between the brute-force and Merkle approaches and watch the comparison count.]

Brute Force: Check every single file by comparing its contents with the server. If you have 1,000 files, that's 1,000 comparisons — even if only one file changed.

Try the demo above: toggle between approaches and watch the comparison count. The difference becomes even more dramatic at scale:

Operations to find one changed file:

Files in codebase | Brute Force   | Merkle Tree | Savings
100               | 100           | 7           | 93% fewer
1,000             | 1,000         | 10          | 99% fewer
10,000            | 10,000        | 14          | ~100% fewer
100,000           | 100,000       | 17          | ~100% fewer
1,000,000         | 1,000,000     | 20          | ~100% fewer

As codebases grow, the difference becomes astronomical. Checking a million files individually would be unbearably slow; with Merkle trees, it's instant.

Enter the Merkle Tree

What if we could skip most of those comparisons? What if we could look at a single hash and know instantly whether anything in the entire codebase changed?

That's exactly what a Merkle tree gives us. It's a tree where every leaf node contains a hash of some data, and every non-leaf node contains a hash of its children's hashes combined. The tree can have any branching factor - I've shown binary (two children) trees in the demos for simplicity, but the tree could just as well match your directory structure, where each folder contains multiple files. The key insight: the root hash acts as a fingerprint for the entire dataset.

Let's look at another example. We have four files. Instead of comparing each file individually, you build a tree:

Edit the file contents below and watch the hashes update:

Root Hash: 6b773931

This single hash represents all four files. Change any file and watch the root hash change completely.

The bottom level has the hashes of each file. The next level up combines pairs: Hash(A) + Hash(B) gets hashed together, and Hash(C) + Hash(D) gets hashed together. Then, those two results combine to form the root hash.

Now comes the magic: to sync the files, you first compare just the root hashes. Same root? Nothing changed. You're done in one comparison. Different root? Drill down one level. Compare the left subtree hash. Same? Skip that entire half. Different? Go deeper into just that subtree.

You only traverse the paths where hashes differ. For a tree with a million files, finding a single changed file takes about 20 comparisons instead of a million. That's O(log n) instead of O(n) - exponentially fewer comparisons as the codebase grows.
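Here's a sketch of that drill-down in TypeScript. The node shape and function names are my own (not Cursor's actual implementation), and the toy hash is the same one used later in this article:

```typescript
// Toy hash from this article's demo; use SHA-256 in practice
function hash(data: string): string {
  let h = 0;
  for (let i = 0; i < data.length; i++) {
    h = (h << 5) - h + data.charCodeAt(i);
    h = h & h;
  }
  return Math.abs(h).toString(16).padStart(8, "0");
}

interface MerkleNode {
  hash: string;
  file?: string; // set only on leaves
  left?: MerkleNode;
  right?: MerkleNode;
}

// Build a binary Merkle tree over (filename, content) pairs
function build(files: [string, string][]): MerkleNode {
  let level: MerkleNode[] = files.map(([file, content]) => ({ hash: hash(content), file }));
  while (level.length > 1) {
    const next: MerkleNode[] = [];
    for (let i = 0; i < level.length; i += 2) {
      const left = level[i];
      const right = level[i + 1] ?? left; // duplicate last node when odd
      next.push({ hash: hash(left.hash + right.hash), left, right });
    }
    level = next;
  }
  return level[0];
}

// Descend only into subtrees whose hashes differ
function findChanged(a: MerkleNode, b: MerkleNode, out: string[] = []): string[] {
  if (a.hash === b.hash) return out; // identical subtree: skip it entirely
  if (a.file !== undefined) {        // differing leaf: record it
    out.push(a.file);
    return out;
  }
  findChanged(a.left!, b.left!, out);
  findChanged(a.right!, b.right!, out);
  return out;
}

const before = build([["a.ts", "x"], ["b.ts", "y"], ["c.ts", "z"], ["d.ts", "w"]]);
const after  = build([["a.ts", "x"], ["b.ts", "y2"], ["c.ts", "z"], ["d.ts", "w"]]);
console.log(findChanged(after, before)); // ["b.ts"]
```

The key line is the first check in `findChanged`: a matching hash prunes the whole subtree, so unchanged halves of the codebase cost exactly one comparison.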

How Hashing Works

One of the most important parts of a Merkle Tree is hashing, so let's talk about hashing. To generate a hash from an input, you need a hash function. A hash function takes any input and produces a fixed-size output. Hash functions have a few key properties that make them useful:

  • Deterministic: The same input always produces the same output
  • One-way: You can't reverse a hash to find the original input (unless you are really good at guessing!)
  • Avalanche effect: A tiny change in input creates a completely different hash
[Interactive demo: type an input and watch the simulated SHA-256 output - the input length, output length, and number of changed characters update live.]

Notice how changing just one character completely transforms the output. This is the avalanche effect.

In the demo above, try changing a single character and notice how the hash changes completely. This sensitivity is what makes Merkle trees powerful: if any single bit of your data changes, the root hash will be entirely different.
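You can see both properties with real SHA-256 using Node's built-in crypto module (a quick sketch; `sha256` is just a small wrapper I'm defining here):

```typescript
import { createHash } from "node:crypto";

function sha256(data: string): string {
  return createHash("sha256").update(data).digest("hex");
}

const a = sha256("hello world");
const b = sha256("hello worlds"); // one character added

// Deterministic: same input, same output
console.log(sha256("hello world") === a); // true

// Avalanche: count how many hex characters of the two digests differ
let diff = 0;
for (let i = 0; i < a.length; i++) if (a[i] !== b[i]) diff++;
console.log(`${diff} of ${a.length} hex chars differ`);
```

Typically around 60 of the 64 hex characters differ - exactly the "completely different output" the demo shows.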

Building a Merkle Tree Step by Step

We've seen what a Merkle tree looks like. Now, let's walk through constructing a Merkle tree from scratch. For this, we will take an example of Blockchain. Say we have four transactions in a Bitcoin block. Watch the tree being built layer by layer:

Step 1 of 11

Raw Transactions

We start with four transactions that need to be included in a block.

The process follows three steps:

Step 1: Hash the leaves. In our case, leaves will be transactions. Each transaction gets hashed individually. These hashes become the leaf nodes of our tree. In Bitcoin, this uses SHA-256 (actually double SHA-256, but let's keep it simple).

Step 2: Pair and hash. We take adjacent pairs, concatenate their hashes, and hash that result. Hash(T1) + Hash(T2) becomes a new hash. Same for Hash(T3) + Hash(T4).

Step 3: Repeat until one hash remains. Keep pairing and hashing until you're left with a single hash, which is the Merkle root.

Neat, right? But one question: we took an example with an even number of transactions. What if you have an odd number? How do you build a tree then? As we saw earlier, Merkle Trees can have any branching factor. In blockchain, however, we stick to binary Merkle Trees so we can easily check whether a transaction is part of a block (covered in the next section). An odd number of transactions would make the tree lopsided. How do we handle that? The standard approach is to duplicate the last item: if you have transactions T1, T2, T3, you'd pair T3 with itself to create the fourth hash.

As we saw in the demo, the contents are embedded in the structure of the tree itself. The beauty lies in that structure: every parent is a cryptographic commitment to its children. Change any leaf, and its parent's hash changes. By transitivity, the change bubbles up to the root, changing every hash on the path. It's a chain of accountability built right into the math.

Merkle Proofs

So we've built a Merkle tree. Cool. But here's where it gets really interesting. What if someone asks you: "Hey, was transaction X included in this block?" How do you prove it? From the Cursor example, we learned that Merkle Trees are really fast at figuring out whether something changed. Now, through the blockchain example, we'll see how to check whether a piece of data is part of a Merkle Tree.

The naive approach would be to hand them the entire tree. All million transactions. "Here, check for yourself." That's a lot of data to ship and a lot of work to verify. This is where Merkle proofs come in. They let you prove that a specific piece of data exists in a tree without revealing the entire tree. You only need about 20 hashes to prove inclusion in a tree of a million items. That's it. 20 hashes.

What's in a Proof?

Think about it this way. To verify that a leaf belongs to a tree, you need to reconstruct the path from that leaf to the root. And to do that, you don't need the entire tree. You just need the sibling at each level. With that, you can compute the root hash and verify that it matches the expected root. If yes, then the leaf is indeed part of the tree. If no, then the leaf is not part of the tree. Simple.

A Merkle proof contains three things:

  • The leaf data: The actual data you're proving exists (a transaction in our case)
  • The sibling hashes: The hash of the sibling node at each level of the tree - the minimum information needed to compute up to the root
  • The expected root: The trusted root hash to verify against

For a tree with a million leaves (height ~20), the proof is just 20 hashes - about 640 bytes with SHA-256. That's incredibly compact! Compare that to sending the entire tree, which would contain roughly 2 million hashes, or about 64 MB.

Generating a Proof

To generate a proof, you start at your target leaf and walk up to the root, collecting the sibling hash at each level. One thing to watch out for, though: the order (left or right) matters, because we are concatenating hashes and concatenation isn't commutative: Hash(A + B) ≠ Hash(B + A). So we need to track which side each sibling is on.
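Sketched in TypeScript with the same toy hash (the `ProofStep` shape and function name are my own):

```typescript
// Toy hash from this article's demo; use SHA-256 in practice
function hash(data: string): string {
  let h = 0;
  for (let i = 0; i < data.length; i++) {
    h = (h << 5) - h + data.charCodeAt(i);
    h = h & h;
  }
  return Math.abs(h).toString(16).padStart(8, "0");
}

// "side" records where the SIBLING sits relative to our running hash
interface ProofStep {
  hash: string;
  side: "left" | "right";
}

// Walk from the leaf at `index` up to the root, collecting siblings.
// The last hash is duplicated on levels with an odd count.
function generateProof(leaves: string[], index: number): ProofStep[] {
  let level = leaves.map(hash);
  let i = index;
  const proof: ProofStep[] = [];
  while (level.length > 1) {
    const siblingIndex = i % 2 === 0 ? i + 1 : i - 1;
    const sibling = level[siblingIndex] ?? level[i]; // no right sibling: pair with self
    proof.push({ hash: sibling, side: i % 2 === 0 ? "right" : "left" });
    // Compute the next level up
    const next: string[] = [];
    for (let j = 0; j < level.length; j += 2) {
      next.push(hash(level[j] + (level[j + 1] ?? level[j])));
    }
    level = next;
    i = Math.floor(i / 2);
  }
  return proof;
}

const txs = ["tx1", "tx2", "tx3", "tx4"];
const proof = generateProof(txs, 2); // prove tx3
console.log(proof); // 2 steps: hash(tx4) on the right, then the (tx1,tx2) node on the left
```

For four leaves the proof has exactly two steps - one sibling per level, as the demo below shows.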

Try clicking on different transactions in the demo below to see how the proof changes:

Click on any transaction (leaf node) to see its proof:

Generated Proof:

Leaf: tx_alice→bob
Leaf hash: 60122568
Proof:
  [0] 4a9c4859 (right)
  [1] 42ff0454 (right)
Root: 3cc45098

The proof contains only 2 hashes — the sibling at each level. Anyone with the proof can verify the transaction was included without needing the other 3 transactions.

Verifying a Proof

Verification is super simple. You start with the leaf, combine it with each sibling hash in order, and check whether you arrive at the expected root. If you do, the data was definitely in the tree. If you don't, someone's lying.
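The verifier side is just a fold over the sibling hashes - a sketch with the same toy hash (`verifyProof` and `ProofStep` are my own names):

```typescript
// Toy hash from this article's demo; use SHA-256 in practice
function hash(data: string): string {
  let h = 0;
  for (let i = 0; i < data.length; i++) {
    h = (h << 5) - h + data.charCodeAt(i);
    h = h & h;
  }
  return Math.abs(h).toString(16).padStart(8, "0");
}

interface ProofStep {
  hash: string;
  side: "left" | "right";
}

// Fold the sibling hashes from leaf to root, then compare with the trusted root
function verifyProof(leafData: string, proof: ProofStep[], expectedRoot: string): boolean {
  let current = hash(leafData);
  for (const step of proof) {
    // Order matters: hash(A + B) !== hash(B + A)
    current = step.side === "left" ? hash(step.hash + current) : hash(current + step.hash);
  }
  return current === expectedRoot;
}

// Tiny four-leaf example: prove "C" is in the tree
const leaves = ["A", "B", "C", "D"].map(hash);
const root = hash(hash(leaves[0] + leaves[1]) + hash(leaves[2] + leaves[3]));
const proofForC: ProofStep[] = [
  { hash: leaves[3], side: "right" },                  // sibling D
  { hash: hash(leaves[0] + leaves[1]), side: "left" }, // the (A,B) node
];
console.log(verifyProof("C", proofForC, root));        // true
console.log(verifyProof("tampered", proofForC, root)); // false
```

Notice the verifier never sees files A, B, or D - only their hashes.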

Walk through the verification step by step:

Step 1 of 6

Start with File C

We want to verify that File C is authentic. First, compute its hash.

Hash(C)
Result: 00000043

The beautiful thing here is that anyone can verify a proof with just log2(n) hash operations. No trusted third party needed - the structure of the tree and the math itself provide the guarantee. This is why light clients in Bitcoin can verify transactions without downloading the entire blockchain.

The Math Behind the Efficiency

I mentioned O(log n) a couple of times. So, let me actually put some numbers to this because the scaling is genuinely impressive. For a dataset with n items, a binary Merkle tree has:

  • Tree height: log2(n) levels
  • Proof size: log2(n) hashes
  • Verification time: O(log n) hash operations

These are just formulas though. Let me show you what this actually means in practice:

Proof size comparison:

Items         | Without Merkle | Merkle Proof | Savings
1,000         | 1,000          | 10           | 99% fewer
1,000,000     | 1,000,000      | 20           | 99.998% fewer
1,000,000,000 | 1,000,000,000  | 30           | ~100% fewer

Look at that last row. A billion items, and you only need 30 hashes to verify any single one. Every time you multiply your data by 1,000, you add only 10 more hashes to your proof.
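These numbers fall straight out of log2 - you can compute them yourself:

```typescript
// Proof size for a binary Merkle tree with n leaves: one sibling hash per level
const proofSize = (n: number): number => Math.ceil(Math.log2(n));

for (const n of [1_000, 1_000_000, 1_000_000_000]) {
  console.log(`${n} items -> ${proofSize(n)} hashes`);
}
// 1000 items -> 10 hashes
// 1000000 items -> 20 hashes
// 1000000000 items -> 30 hashes
```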

This is similar to what we saw earlier in the Cursor file-comparison example. This efficiency is what makes Merkle Trees so much fun, and it's why they are everywhere, especially in distributed systems. Without it, things like Bitcoin's light clients, Cursor's instant embeddings, or Git's fast comparisons simply wouldn't be practical.

A Simple Implementation

Alright, enough theory. Let's get our hands dirty with some code. I've put together a minimal Merkle tree implementation in TypeScript below. It's about 30 lines - nothing fancy, just enough to see the core idea in action.

Try editing the file names and hit Run to see it execute step by step:

Input files:
// Simple hash function (use proper hashing algorithm in production)
function hash(data: string): string {
  let h = 0;
  for (let i = 0; i < data.length; i++) {
    h = (h << 5) - h + data.charCodeAt(i);
    h = h & h;
  }
  return Math.abs(h).toString(16).padStart(8, '0');
}

function buildMerkleTree(leaves: string[]): string {
  if (leaves.length === 0) return hash('');

  // Hash all leaves
  let level = leaves.map(hash);

  // Build tree bottom-up
  while (level.length > 1) {
    const nextLevel: string[] = [];

    for (let i = 0; i < level.length; i += 2) {
      const left = level[i];
      const right = level[i + 1] || left; // duplicate the last hash when the level has an odd count
      nextLevel.push(hash(left + right));
    }
    level = nextLevel;
  }

  return level[0]; // Root hash
}

// Run it! (We hash file names here for simplicity; a real system would hash file contents)
const files = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt'];
const root = buildMerkleTree(files);
console.log('Merkle root:', root);

That's really all there is to it. The algorithm is surprisingly simple: hash the leaves, pair them up, hash the pairs, repeat until you have one hash left.

Of course, production systems are more complex than this: proper cryptographic hashes (SHA-256 instead of our toy hash), handling of binary data, proof generation and verification, serialization for storage, and so on. But the core idea is the same. It's right there in those 30 lines.
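For instance, swapping the toy hash for real SHA-256 is a one-line change using Node's built-in crypto module (a sketch of the same algorithm, not any production system's actual code):

```typescript
import { createHash } from "node:crypto";

// Same algorithm as above, but with a real cryptographic hash
function sha256(data: string): string {
  return createHash("sha256").update(data).digest("hex");
}

function buildMerkleTree(leaves: string[]): string {
  if (leaves.length === 0) return sha256("");

  // Hash all leaves, then pair-and-hash bottom-up
  let level = leaves.map(sha256);
  while (level.length > 1) {
    const next: string[] = [];
    for (let i = 0; i < level.length; i += 2) {
      // Duplicate the last hash when the level has an odd count
      next.push(sha256(level[i] + (level[i + 1] ?? level[i])));
    }
    level = next;
  }
  return level[0]; // Root hash
}

const root = buildMerkleTree(["file1.txt", "file2.txt", "file3.txt"]);
console.log(root); // a 64-hex-character root
```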

Real-World Applications

Remember how I mentioned Merkle trees are everywhere? Let me walk through some concrete examples. You've probably used at least a couple of these today, directly or indirectly.

Git: Every time you run git commit, Git creates a tree object that's essentially a Merkle tree of your repository's contents. This is why Git can instantly tell if two repositories are identical (same root hash) or find exactly which files differ (traverse down where hashes diverge). So, next time you run git diff, know that there's a Merkle Tree comparison happening under the hood.

Bitcoin: This is the application that made Merkle trees famous (or at least famous outside of academic circles). Each block contains a Merkle root of all transactions in that block. Light clients can verify a transaction was included without downloading the entire blockchain. They just need the Merkle proof.

Cursor (and similar tools): And we're back to where we started! When syncing files between your machine and Cursor's servers, they use Merkle trees to quickly identify which subtrees have changed. Changed subtree? Go deeper. Same hash? Skip the whole thing. This is how they can sync a massive codebase efficiently.

Certificate Transparency: This one was new to me as well. Google's Certificate Transparency logs use Merkle trees to create an append-only, tamper-evident log of issued TLS certificates. Anyone can verify that a certificate was logged without downloading the entire log. This is one of the ways misissued or rogue certificates get caught.

IPFS: The InterPlanetary File System uses Merkle DAGs (a generalization of Merkle trees) for content addressing. Files are identified by their hash. Large files are split into chunks organized in a Merkle structure. Change one chunk, and only that chunk needs to be re-uploaded instead of the entire file.

Variations and Extensions

What we've seen so far is the basic Merkle tree, and it's just the starting point. Over the years, people have come up with variations to solve specific problems. I won't go too deep into these (this article is already long, and each variation could be its own article!), but here's a quick tour:

Patricia Merkle Tries: Ethereum uses these. They combine the hash verification of Merkle trees with the key-value lookup of radix tries. This lets you efficiently prove things about any account or smart contract in the entire Ethereum state. Don't fully understand it? Me neither. Sounds pretty clever, though.

Sparse Merkle Trees: What if most of your possible keys are empty? Say you have a key space of 2^256 possible values but only a few thousand actual entries. Sparse Merkle trees handle this by using default hashes for empty subtrees - you don't actually store the empty parts.

Merkle Mountain Ranges: Regular Merkle trees are great, but they assume you have all your data upfront. What if data keeps getting added? MMRs are an append-only structure that allows efficient proofs even as new data is added.

Verkle Trees: The cool kid on the block. Instead of hashes, they use vector commitments. The big win? Much smaller proof sizes. Ethereum is seriously considering these for future upgrades because proof size is a major bottleneck for scalability. Again, I don't understand them deeply yet. I might write about them separately once I do!

Common Pitfalls

So far we've sung the praises of Merkle Trees. But every coin has two sides, and Merkle Trees are no exception - though most of the downsides are implementation-specific. If you ever implement a Merkle tree yourself, here are some gotchas that have tripped people up. Some of these have caused real security vulnerabilities in production systems.

Second preimage attacks: This one's subtle. If your internal nodes and leaf nodes use the same hashing scheme, an attacker might craft data that produces the same hash as an internal node. Suddenly they can fake proofs. The fix is simple: prefix leaf hashes with 0x00 and internal hashes with 0x01 (or use different hash functions for each).
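The fix, sketched with this article's toy hash (`hashLeaf`, `hashNode`, and `merkleRoot` are my own names): tag leaves and internal nodes differently so one can never masquerade as the other.

```typescript
// Toy hash from this article's demo; use SHA-256 in practice
function hash(data: string): string {
  let h = 0;
  for (let i = 0; i < data.length; i++) {
    h = (h << 5) - h + data.charCodeAt(i);
    h = h & h;
  }
  return Math.abs(h).toString(16).padStart(8, "0");
}

// Domain separation: prefix 0x00 for leaves, 0x01 for internal nodes
const hashLeaf = (data: string) => hash("\x00" + data);
const hashNode = (left: string, right: string) => hash("\x01" + left + right);

function merkleRoot(leaves: string[]): string {
  let level = leaves.map(hashLeaf);
  while (level.length > 1) {
    const next: string[] = [];
    for (let i = 0; i < level.length; i += 2) {
      next.push(hashNode(level[i], level[i + 1] ?? level[i]));
    }
    level = next;
  }
  return level[0];
}

// A crafted leaf whose content equals "left-hash + right-hash" can no longer
// collide with the internal node built from those same two hashes
const l = hashLeaf("A"), r = hashLeaf("B");
console.log(hashLeaf(l + r) === hashNode(l, r)); // false: different prefixes
```

This is the same leaf/node separation Certificate Transparency (RFC 6962) bakes into its tree hashing.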

Odd leaf handling: We saw this earlier. What happens when you have 5 leaves instead of 4? Different implementations handle this differently. Some duplicate the last leaf, others promote it directly to the next level. But make sure you are consistent. If you're not consistent, you'll get verification failures when two systems try to talk to each other. Pick one approach and stick to it.

Hash function choice: This should go without saying in 2026, but don't use weaker hash functions like MD5 or SHA-1 for anything security-critical. They're broken for collision resistance. Stick with SHA-256 or better. The toy hash function I used in the demos above is great for visualization but terrible for actual security. DO NOT USE THAT IN PRODUCTION.

Conclusion

What I find beautiful about Merkle Trees is how much efficiency they unlock through such a simple idea. It's just hashes organized in a tree. That's it. Such a simple idea, and yet so powerful, so elegant, so brilliant. It's an eye-opener for people like me who equate complexity with brilliance. Complexity is not always better. It's a reminder that sometimes, simplicity is the best solution.

Ralph Merkle - the man behind this brilliant idea - patented this concept in 1979. 1979! That's 20 years before I was even born, before the web existed, before most of the applications we've discussed were even conceivable. And yet, his idea became fundamental infrastructure for Git, Bitcoin, Ethereum, IPFS, and countless other systems.

So, next time you git commit something or hear about blockchain verification, know that there's a humble tree of hashes doing the heavy lifting. And maybe, like me, you'll appreciate how a simple data structure can make such a big impact.