Key Question
How does IPFS represent a file system using a directed acyclic graph of content-addressed blocks?
Deep Dive
IPFS doesn't just hash files; it builds a Merkle DAG (Directed Acyclic Graph) in which every node is content-addressed. Git uses the same kind of data structure, and it lets IPFS represent files, directories, and even entire file systems as linked graphs of hashes.
Three node types in the IPFS data model:
| Node type | What it holds | Size limit |
|---|---|---|
| blob | Raw file data | ≤ 256 KB |
| list | Ordered list of other node CIDs | Unlimited (for large files) |
| tree | Map of names → CIDs (like a directory) | Unlimited |
How a small file (≤ 256 KB) is stored:
"hello.txt" (150 KB)
     │
     ▼
┌─────────┐
│  blob   │
│         │
│ <bytes> │ ← CID = QmHash(file bytes)
└─────────┘
Simple: the file is one blob node. The file's CID is the blob's hash.
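A minimal sketch of the single-blob case, assuming SHA-256 as the hash and using a hex digest as a stand-in for a real CID. Real CIDs add multihash and base encoding on top of the digest, and `toy_cid` and `add_small_file` are hypothetical helpers, not IPFS APIs.

```python
import hashlib

BLOB_LIMIT = 256 * 1024  # 256 KB blob size limit from the table above


def toy_cid(data: bytes) -> str:
    """Stand-in for a CID: the SHA-256 hex digest of the node's bytes."""
    return hashlib.sha256(data).hexdigest()


def add_small_file(data: bytes) -> str:
    """Store a file that fits in one blob; the file's CID is the blob's hash."""
    if len(data) > BLOB_LIMIT:
        raise ValueError("file too large for a single blob")
    return toy_cid(data)


hello = b"x" * (150 * 1024)  # a 150 KB file, like hello.txt in the diagram
print(add_small_file(hello))
```

Adding the same bytes again always yields the same identifier, which is what makes the address a property of the content rather than of its location.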
How a large file is stored:
"big_movie.mp4" (1 GB)
     │
     ▼
┌─────────┐
│  list   │ ← CID = QmHash(list of child CIDs)
│         │
│ ┌─────┐ │
│ │blob0│ │ ← 256 KB chunk
│ ├─────┤ │
│ │blob1│ │ ← 256 KB chunk
│ ├─────┤ │
│ │...  │ │
│ ├─────┤ │
│ │blobN│ │ ← final chunk (≤ 256 KB)
│ └─────┘ │
└─────────┘
IPFS splits the file into 256 KB chunks. Each chunk becomes a blob node, and a parent list node links them in order. The file's CID is the list node's hash. To download, you fetch the list node, then fetch the chunks in parallel. Because each chunk is verified against its own CID, partial downloads are possible.
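The chunking step can be sketched as follows. This is an illustration under assumptions, not the IPFS implementation: SHA-256 hex digests stand in for CIDs, the list node is serialized as newline-joined child identifiers, and `add_large_file` is a hypothetical name.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # 256 KB chunks, as described above


def toy_cid(data: bytes) -> str:
    """Stand-in for a CID: SHA-256 hex digest (not a real CID encoding)."""
    return hashlib.sha256(data).hexdigest()


def add_large_file(data: bytes) -> tuple[str, list[str]]:
    """Split data into chunks, hash each blob, then hash the ordered CID list."""
    chunk_cids = [
        toy_cid(data[offset:offset + CHUNK_SIZE])
        for offset in range(0, len(data), CHUNK_SIZE)
    ]
    # The list node's content is the ordered child CIDs, so its hash
    # commits to every chunk and to their order.
    list_node = "\n".join(chunk_cids).encode()
    return toy_cid(list_node), chunk_cids


data = bytes(1_000_000)  # about 1 MB of zero bytes, for illustration
root, chunks = add_large_file(data)
print(root, len(chunks))
```

Identical chunks hash to the same identifier, so a store keyed by CID would keep only one copy of them.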
How a directory is stored:
/docs/
├── readme.md
├── image.png
└── src/
    └── main.go

┌─────────┐
│  tree   │ ← CID = QmHash(directory entries)
│         │
│ readme.md ───► QmX... (blob, "readme.md" content)
│ image.png ───► QmY... (blob, image bytes)
│ src       ───► QmZ... (tree, subdirectory)
└─────────┘
     │
     ▼
┌─────────┐
│  tree   │ ← Subdirectory node
│         │
│ main.go ───► QmW... (blob)
└─────────┘
A tree node maps names to CIDs; it's IPFS's version of a directory. Tree nodes can point to blobs (files), lists (large files), or other tree nodes (subdirectories). The entire directory tree has one root CID.
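A rough sketch of how a tree node could be hashed, assuming a directory is given as a nested Python dict and the name-to-CID map is serialized as sorted JSON. Both choices, and the `add_tree` helper, are assumptions made for illustration; IPFS uses its own node encoding.

```python
import hashlib
import json


def toy_cid(data: bytes) -> str:
    """Stand-in for a CID: SHA-256 hex digest (not a real CID encoding)."""
    return hashlib.sha256(data).hexdigest()


def add_tree(entries: dict) -> str:
    """Hash a directory given as {name: bytes | nested dict}; return its CID."""
    links = {}
    for name, value in entries.items():
        if isinstance(value, dict):
            links[name] = add_tree(value)   # subdirectory -> tree node
        else:
            links[name] = toy_cid(value)    # file -> blob node
    # Serialize the name -> CID map deterministically before hashing.
    encoded = json.dumps(links, sort_keys=True).encode()
    return toy_cid(encoded)


docs = {
    "readme.md": b"# Docs\n",
    "image.png": b"placeholder image bytes",
    "src": {"main.go": b"package main\n"},
}
print(add_tree(docs))
```

The single value returned for `docs` plays the role of the root CID: it commits to every name and every file beneath it.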
Why this closely mirrors Git:
Git object model:
commit ──► tree ──► blob
            │
            └──► tree ──► blob
IPFS object model:
tree ──► blob
 │
 └──► tree ──► blob
Key similarity: Both use content-addressed Merkle DAGs.
If you change one byte in a file, its blob CID changes,
the parent tree CID changes, and the root CID changes:
a cascading hash update all the way up.
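The cascade can be made concrete with a toy index that records an identifier for every path in a small directory. Everything here is an assumption for illustration: SHA-256 hex digests stand in for CIDs, tree nodes are serialized as sorted JSON, and `index_tree` is a hypothetical helper.

```python
import hashlib
import json


def toy_cid(data: bytes) -> str:
    """Stand-in for a CID: SHA-256 hex digest (not a real CID encoding)."""
    return hashlib.sha256(data).hexdigest()


def index_tree(entries: dict, path: str = "/") -> dict:
    """Return {path: CID} for every blob and tree under `entries`."""
    cids, links = {}, {}
    for name, value in entries.items():
        child = path + name
        if isinstance(value, dict):
            cids.update(index_tree(value, child + "/"))
            links[name] = cids[child + "/"]
        else:
            cids[child] = links[name] = toy_cid(value)
    cids[path] = toy_cid(json.dumps(links, sort_keys=True).encode())
    return cids


before = index_tree({"a.txt": b"one", "src": {"main.go": b"v1"}})
after = index_tree({"a.txt": b"one", "src": {"main.go": b"v2"}})
changed = sorted(p for p in before if before[p] != after[p])
print(changed)  # ['/', '/src/', '/src/main.go']
```

Only the edited blob and its ancestors get new identifiers; the untouched sibling `/a.txt` keeps its old one and can be reused as-is.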
The Immutability Trade-off: Because every node is content-addressed, you can't "edit" a file in place. Changing a file creates a new root CID. To represent mutable state (like "the current version of my document"), IPFS uses IPNS (InterPlanetary Name System), a pointer from a public key to a CID that can be updated.
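The idea of a stable name pointing at changing content can be sketched with an in-memory registry. This is only a toy: it skips the signature check that a key-based naming system depends on, and `ToyNameRecord`, `ToyNameSystem`, and the example names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToyNameRecord:
    cid: str        # the content the name currently points to
    sequence: int   # increases with every update


class ToyNameSystem:
    """Hypothetical in-memory stand-in for a mutable name -> CID pointer."""

    def __init__(self) -> None:
        self._records: dict[str, ToyNameRecord] = {}

    def publish(self, name: str, cid: str) -> ToyNameRecord:
        # A real system would verify a signature made with the key behind
        # `name`; this sketch skips that and only bumps the sequence number.
        previous = self._records.get(name)
        sequence = previous.sequence + 1 if previous else 0
        self._records[name] = ToyNameRecord(cid, sequence)
        return self._records[name]

    def resolve(self, name: str) -> str:
        """Return the CID the name currently points to."""
        return self._records[name].cid


names = ToyNameSystem()
names.publish("alice-key", "cid-of-version-1")
names.publish("alice-key", "cid-of-version-2")
print(names.resolve("alice-key"))  # cid-of-version-2
```

The content behind each CID stays immutable; only the small record that says which CID is current gets replaced.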
Check Your Understanding
- A directory contains two files. You modify one file and re-add it to IPFS. Which CIDs change?
- Why does IPFS chunk large files into 256 KB blocks instead of storing them as single blob nodes?
- What is the relationship between Git's object model and IPFS's Merkle DAG?
The "So What?"
The Merkle DAG is the data structure that makes IPFS more than just a "content-addressed network." It can represent arbitrarily large files (via chunking), directories (via tree nodes), and any hash-linked data structure. The same design underlies Git, IPFS, and many blockchain data structures, which makes it a common foundation for verifiable, linked data.
Exercises
IPFS & Decentralized Storage: Exercises
Exercise 1: CID Mutability
Alice creates a file hello.txt with content "Hello, world!" and adds it to IPFS. She gets CID QmPZ9gcCe5rsLKbrJfFQW9dLLgKNLoJN7da8uDNmhCWZqJ. She then changes the file to "Hello, world?" (changing ! to ?) and adds it again.
- Does the CID change? Why or why not?
- Bob downloads both files. How can he verify which one is the original?
- If Alice wants to share a link that always points to her latest version, what IPFS mechanism does she need?
Exercise 2: IPFS vs BitTorrent
Consider downloading a 2 GB open-source operating system ISO. Compare IPFS and BitTorrent:
- Discovery: How does each system find peers who have the file?
- Verification: How does each system verify that downloaded data is correct?
- Incentives: How does each system encourage peers to upload after downloading?
- Merkle structure: Both BitTorrent and IPFS use a Merkle tree / DAG. Is there a conceptual difference in how they structure and address data?
Exercise 3: PoRep Necessity
Filecoin miners earn money by storing clients' data. A dishonest miner considers the following attack:
- Client wants to store 100 copies of a 1 GB dataset D.
- Miner stores 1 copy of D and claims to have 100.
- When challenged with PoRep, miner quickly generates the sealed data from the single copy.
Explain why this attack fails due to the design of Proof-of-Replication. Be specific about the sealing process and what makes each sealed copy unique to a specific miner and a specific deal.
Bonus: Could this attack work if Filecoin used only Proof-of-Spacetime without Proof-of-Replication?
IPFS & Decentralized Storage: Solutions
Exercise 1 Solution
1. Does the CID change?
Yes. The CID is derived from a cryptographic hash of the file's content. Changing even one byte completely changes the hash output (avalanche effect). Assuming SHA-256 is the hash function:
- Original: SHA-256("Hello, world!") → QmPZ9gcCe5rsLKbrJfFQW9dLLgKNLoJN7da8uDNmhCWZqJ
- Modified: SHA-256("Hello, world?") → completely different CID
The two CIDs share no relationship. You cannot derive one from the other.
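The avalanche effect is easy to observe directly. The sketch below hashes the two strings with SHA-256 and counts differing bits. Note the assumption: it prints raw hex digests of the text, not `Qm...` CIDs, since a real CID is computed over an encoded block and then base-encoded.

```python
import hashlib

original = hashlib.sha256(b"Hello, world!").digest()
modified = hashlib.sha256(b"Hello, world?").digest()

# XOR the digests byte by byte and count the bits that differ.
differing_bits = sum(bin(a ^ b).count("1") for a, b in zip(original, modified))

print(original.hex())
print(modified.hex())
print(f"{differing_bits} of 256 bits differ")
```

For a well-behaved hash, roughly half of the output bits are expected to flip even though the inputs differ in a single character.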
2. How to verify which is original?
Download each file and compute its hash. If the hash matches the CID Alice published, you have the file she intended. Since the CIDs differ, you can tell the files apart. Without Alice telling you which CID is the "original," you can't know the authorial intention, but you can be certain about the content.
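Bob's check can be sketched as a one-line comparison. The sketch assumes Alice announced a plain SHA-256 hex digest instead of a full CID, and `matches_digest` is a hypothetical helper name.

```python
import hashlib


def matches_digest(content: bytes, expected_hex: str) -> bool:
    """Return True if content hashes to the digest the publisher announced."""
    return hashlib.sha256(content).hexdigest() == expected_hex.lower()


announced = hashlib.sha256(b"Hello, world!").hexdigest()  # what Alice shared
print(matches_digest(b"Hello, world!", announced))  # True
print(matches_digest(b"Hello, world?", announced))  # False
```

The check tells Bob which bytes match which identifier; it cannot tell him which identifier Alice considers the original.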
3. Always pointing to the latest version?
Alice needs IPNS (InterPlanetary Name System). IPNS creates a pointer from Alice's PeerID (public key hash) to a CID. Alice can update the pointer, so ipns://QmAlicePeerID always resolves to her latest CID. Users who trust Alice's public key will always get her latest file.
Exercise 2 Solution
1. Discovery:
| BitTorrent | IPFS |
|---|---|
| Centralized tracker or DHT with infohash | Kademlia DHT keyed by CID |
| Tracker returns list of peers | DHT returns provider records |
| PEX (Peer Exchange) for gossip | BitSwap handles peer discovery during exchange |
BitTorrent traditionally relied on centralized trackers. Modern BitTorrent also uses a DHT (Mainline DHT), which inspired IPFS's DHT. IPFS's approach is fully decentralized from the start.
2. Verification:
| BitTorrent | IPFS |
|---|---|
| Merkle tree: root hash (infohash), 256 KB piece hashes | Merkle DAG: every node is content-addressed |
| Verify each piece against its hash in the torrent metadata | Verify each block against its CID on download |
| Root hash in .torrent file or magnet link | CID is the root hash |
Both use Merkle verification. The key difference: BitTorrent's Merkle tree is flat (one level of pieces), while IPFS's Merkle DAG can be nested (trees within trees).
3. Incentives:
| BitTorrent | IPFS |
|---|---|
| Tit-for-tat: "I'll only upload to you if you upload to me" | BitSwap barter: "I'll trade blocks I have for blocks I want" |
| Strict: peer is choked if they don't reciprocate | Soft: based on credit/debit ratios |
| Leeching is directly punished | Leeching is indirectly punished (credit score drops) |
BitTorrent's tit-for-tat is more aggressive about enforcing sharing. IPFS's BitSwap is more flexible: a peer with low credit can still fetch data, just at lower priority.
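A soft, ratio-based policy can be sketched as a debt ratio fed into a sigmoid. The shape below follows the example strategy in the original IPFS whitepaper; treat the constants as illustrative, and note that deployed BitSwap implementations may use different logic. The function names are hypothetical.

```python
import math


def debt_ratio(bytes_sent: int, bytes_recv: int) -> float:
    """Debt ratio of a peer: how much we sent them vs. received from them."""
    return bytes_sent / (bytes_recv + 1)


def send_probability(r: float) -> float:
    """Sigmoid that drops quickly as the peer's debt ratio grows."""
    return 1 - 1 / (1 + math.exp(6 - 3 * r))


for sent, recv in [(0, 0), (1_000, 1_000), (4_000, 1_000)]:
    r = debt_ratio(sent, recv)
    print(f"sent={sent} recv={recv} r={r:.2f} P(send)={send_probability(r):.3f}")
```

Unlike a hard choke, the probability never drops to exactly zero, which matches the idea that an indebted peer is deprioritized, not cut off.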
4. Merkle structure difference:
BitTorrent's Merkle tree is a static structure: the piece list is fixed when the torrent is created. You cannot add files or reorganize without creating a new torrent.
IPFS's Merkle DAG is a dynamic structure: you can add files, create directories, and link objects arbitrarily. Each change yields a new root CID, but unchanged nodes keep their CIDs and are reused. IPFS directories are DAG nodes; BitTorrent has no concept of directory hierarchy in its data structure.
Exercise 3 Solution
Why the attack fails:
Proof-of-Replication involves sealing, a sequential, resource-intensive encoding process that ties a specific copy of the data to a specific miner:
Sealing process (simplified):
Original data D
│
├──► Miner ID (M) ───────┐
├──► Deal ID (Deal) ─────┤
├──► Random nonce ───────┤
├────────────────────────┘
│
▼
Layer-by-layer encoding (AES + PoSW)
│
▼
Sealed sector S = Seal(D, M, Deal, nonce)
Time: ~30 minutes per sector
Each sealed copy S is different because:
- The miner ID is different (each miner has a unique public key)
- The deal ID is different (each deal is a separate contract)
- The nonce is different (random value per deal)
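A toy model shows why the same data produces different sealed outputs. This is not Filecoin's encoding: a real seal produces a replica as large as the data and is far more expensive, while this sketch only chains SHA-256 over hypothetical inputs to show the dependence on miner ID, deal ID, and nonce.

```python
import hashlib


def toy_seal(data: bytes, miner_id: str, deal_id: str, nonce: bytes,
             rounds: int = 1000) -> str:
    """Toy stand-in for Seal(D, M, Deal, nonce); NOT Filecoin's real encoding."""
    state = hashlib.sha256(
        miner_id.encode() + b"|" + deal_id.encode() + b"|" + nonce
    ).digest()
    # Each round depends on the previous one, so the work cannot be
    # parallelized within a single seal (a crude model of "sequential").
    for _ in range(rounds):
        state = hashlib.sha256(state + data).digest()
    return state.hex()


D = b"dataset D"
seal_1 = toy_seal(D, "miner-42", "deal-1", b"nonce-1")
seal_2 = toy_seal(D, "miner-42", "deal-2", b"nonce-2")
print(seal_1 != seal_2)  # True: same data, different deal, different seal
```

Because the deal-specific inputs are mixed in before the slow loop starts, one finished seal gives no shortcut toward another.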
For the miner claiming 100 copies stored:
- They would need 100 different sealed sectors: S₁, S₂, …, S₁₀₀
- Each requires ~30 minutes of sequential computation
- They cannot compute 100 proofs from 1 copy because the sealing input (miner ID, deal ID) differs per deal
- PoRep challenges ask about the sealed data, which is unique per copy
If caught cheating (unable to produce the correct PoRep), the miner's entire collateral for all 100 deals is slashed.
Bonus: Without PoRep (PoSt only):
The attack would likely succeed. PoSt proves that you currently hold some data, but it doesn't prove that the data is a unique copy. With PoSt alone:
- Store 1 copy of D
- Generate PoSt for that one copy
- Have the same PoSt serve as proof for all 100 deals (since PoSt just proves "I have this hash")
- Collect 100× payment for 1× storage
This is why PoRep is necessary: it binds each deal to a physically distinct, computation-bound encoding.