Key Question
What if you asked for a file by its content instead of where it lives?
Deep Dive
The web runs on location addressing. When you type https://example.com/cat.jpg, your browser resolves the domain to an IP address, connects to that server, and downloads /cat.jpg. The address tells you where to find the file — but it says nothing about what the file contains.
Location Addressing (HTTP):
    "cat.jpg" ─────► DNS lookup ─────► example.com (192.0.2.1)
                                              │
                                              ▼
                                   Server delivers /cat.jpg
                                              │
                                              ▼
                                Hopefully it's the right image?
                            (No way to verify without downloading)
Problems with this model:
- Server goes down → file vanishes (link rot)
- Server moves the file → broken link
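- Server modifies the file → you can’t tell (HTTPS authenticates the server, not the content; detecting a change requires a hash obtained out of band, such as Subresource Integrity)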
- Popular file → everyone hits the same server (bandwidth bottleneck)
Content addressing flips this upside down. Instead of “where is this file?”, you ask “what is this file?”. The address is a cryptographic hash of the contents:
Content Addressing (IPFS):
File contents ───► SHA-256 ───► QmX9IwhXpcJhMqE2CRuvgQLGgB9iMf7PyFbWMDhnKwFfQf
This hash IS the address
The Content Identifier (CID) is the hash prefixed with metadata about the hash algorithm. A typical IPFS CID looks like:
QmX9IwhXpcJhMqE2CRuvgQLGgB9iMf7PyFbWMDhnKwFfQf
This encodes:
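- Multihash prefix: identifies the hash function (SHA-256, Blake2b, etc.) and the digest length
- Hash digest: the actual hash output
- CID version: v0 is a bare base58btc-encoded SHA-256 multihash (which is why it starts with Qm); v1 adds explicit multibase, version, and multicodec prefixes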
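The multihash format lets IPFS change hash functions without changing the CID format. If SHA-256 were broken, newly added content could use a different function, and every CID would still state which function produced it. Content published under SHA-256 would have to be re-added to get CIDs under the new function.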
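The multihash layout can be illustrated with a short sketch. It hashes raw bytes with SHA-256, prepends the multihash code (0x12) and digest length (0x20), and encodes the result in base58btc, which yields a CIDv0-shaped string. The function names are hypothetical, and the output will not match what `ipfs add` prints for the same bytes, because IPFS by default wraps file data in a UnixFS node before hashing.

```python
import hashlib

# Bitcoin-style base58 alphabet: no 0, O, I, or l.
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58btc_encode(data: bytes) -> str:
    """Encode bytes as base58btc text."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    # Each leading zero byte is written as a leading '1'.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out


def sha256_multihash(payload: bytes) -> bytes:
    """Prefix a SHA-256 digest with its multihash code and length."""
    digest = hashlib.sha256(payload).digest()
    return bytes([0x12, len(digest)]) + digest


def toy_cid_v0(payload: bytes) -> str:
    """CIDv0-shaped identifier for raw bytes (illustration only)."""
    return base58btc_encode(sha256_multihash(payload))


cid_a = toy_cid_v0(b"Hello, world!")
cid_b = toy_cid_v0(b"Hello, world?")
print(cid_a)
print(cid_b)
```

Both strings start with Qm because the first two bytes of every SHA-256 multihash are the same.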
Four benefits of content addressing:
| Property | What it means |
|---|---|
| Immutability | Change one byte → CID changes. The address guarantees the content. |
| Deduplication | Same bytes → same CID. A node stores each block once, however many files or users reference it. |
| Any peer serves it | The CID identifies the data, not a location. Any peer with the data can respond. |
| Verification | After downloading, hash the data. If it matches the CID, you got exactly what you asked for. |
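The properties in the table can be seen in a minimal in-memory store keyed by content hash. This is a sketch under simplifying assumptions: the `ContentStore` class is hypothetical, it uses a plain SHA-256 hex digest as the key rather than a real CID, and it ignores chunking.

```python
import hashlib


class ContentStore:
    """Toy content-addressed block store (not the IPFS datastore)."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical bytes always map to the same key, so a repeat
        # put() overwrites the entry with identical data.
        self._blocks[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blocks[key]
        # Re-hash on read: the key itself tells us what to expect.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("stored block does not match its key")
        return data

    def __len__(self) -> int:
        return len(self._blocks)


store = ContentStore()
first = store.put(b"same photo bytes")
second = store.put(b"same photo bytes")   # deduplicated
edited = store.put(b"same photo bytes!")  # one byte added: new key
print(first == second, first == edited, len(store))
```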
Bringing it together:
Location-addressed: "Get me the file at this URL"
↓
Server 404 → file is gone. Tough luck.
Content-addressed: "Get me the file with this hash"
↓
Peer A has it? Peer B has it? Anyone?
↓
Whoever responds first, and I verify the hash.
In IPFS, ipfs get QmX9IwhXpcJhMqE2CRuvgQLGgB9iMf7PyFbWMDhnKwFfQf says: “find me a peer who has the data with this hash, download it from them, and confirm the hash matches.” The network handles the routing; you just care about the content.
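The "ask anyone, then verify" flow can be simulated in a few lines. In this sketch peers are plain dictionaries and the key is a SHA-256 hex digest; both are stand-ins chosen for illustration, and `fetch_verified` is a hypothetical helper, not an IPFS API.

```python
import hashlib


def content_key(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def fetch_verified(key: str, peers) -> bytes:
    """Return the first response whose hash matches the requested key."""
    for peer in peers:
        data = peer.get(key)
        if data is None:
            continue  # this peer does not have the block
        if content_key(data) == key:
            return data  # verified: exactly the bytes we asked for
        # Hash mismatch: discard the response and try the next peer.
    raise LookupError("no peer returned data matching " + key)


photo = b"cat picture bytes"
key = content_key(photo)

empty_peer = {}
tampering_peer = {key: b"dog picture bytes"}
honest_peer = {key: photo}

result = fetch_verified(key, [empty_peer, tampering_peer, honest_peer])
print(result == photo)
```

The requester never needs to trust a peer, because a wrong answer is detected by the hash check and skipped.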
Check Your Understanding
- If a website changes its logo image but keeps the same filename, can users of HTTP tell? Can users of IPFS tell?
- Why does IPFS use the multihash format instead of just SHA-256 directly?
- Two users with the same photo upload it to IPFS. How many copies are stored?
The “So What?”
Content addressing underpins decentralized storage systems such as IPFS, Filecoin, Arweave, and BitTorrent. It turns data from something held hostage by a server into something self-verifying that any holder can serve. Without this conceptual shift, decentralized storage doesn’t work: you’d still be asking one server “give me this file,” which defeats the purpose.
IPFS & Decentralized Storage: Exercises
Exercise 1: CID Mutability
Alice creates a file hello.txt with content “Hello, world!” and adds it to IPFS. She gets CID QmPZ9gcCe5rsLKbrJfFQW9dLLgKNLoJN7da8uDNmhCWZqJ. She then changes the file to “Hello, world?” (changing ! to ?) and adds it again.
- Does the CID change? Why or why not?
- Bob downloads both files. How can he verify which one is the original?
- If Alice wants to share a link that always points to her latest version, what IPFS mechanism does she need?
Exercise 2: IPFS vs BitTorrent
Consider downloading a 2 GB open-source operating system ISO. Compare IPFS and BitTorrent:
- Discovery: How does each system find peers who have the file?
- Verification: How does each system verify that downloaded data is correct?
- Incentives: How does each system encourage peers to upload after downloading?
- Merkle structure: Both BitTorrent and IPFS verify data against hash structures (piece hashes or Merkle trees in BitTorrent, a Merkle DAG in IPFS). Is there a conceptual difference in how they structure and address data?
Exercise 3: PoRep Necessity
Filecoin miners earn money by storing clients’ data. A dishonest miner considers the following attack:
- Client wants to store 100 copies of a 1 GB dataset D.
- Miner stores 1 copy of D and claims to have 100.
- When challenged with PoRep, miner quickly generates the sealed data from the single copy.
Explain why this attack fails due to the design of Proof-of-Replication. Be specific about the sealing process and what makes each sealed copy unique to a specific miner and a specific deal.
Bonus: Could this attack work if Filecoin used only Proof-of-Spacetime without Proof-of-Replication?
IPFS & Decentralized Storage: Solutions
Exercise 1 Solution
1. Does the CID change?
Yes. The CID is the cryptographic hash of the file’s content. Changing even one byte completely changes the hash output (avalanche effect). Assuming SHA-256 is the hash function:
- Original: SHA-256(“Hello, world!”) → QmPZ9gcCe5rsLKbrJfFQW9dLLgKNLoJN7da8uDNmhCWZqJ
- Modified: SHA-256(“Hello, world?”) → completely different CID
The two CIDs share no relationship. You cannot derive one from the other.
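The avalanche effect is easy to observe directly. The sketch below hashes the two strings from the exercise with plain SHA-256 (not a full CID computation) and counts how many of the 256 output bits differ; `bit_difference` is a hypothetical helper name.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count differing bits between the SHA-256 digests of a and b."""
    da = hashlib.sha256(a).digest()
    db = hashlib.sha256(b).digest()
    return sum(bin(x ^ y).count("1") for x, y in zip(da, db))


changed_bits = bit_difference(b"Hello, world!", b"Hello, world?")
print(f"{changed_bits} of 256 digest bits differ")
```

For a well-behaved hash function, roughly half of the bits are expected to flip even though only one input character changed.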
2. How to verify which is original?
Download each file and compute the hash. If the hash matches the CID claimed by Alice, you have the file she intended. Since the CIDs are different, you can tell they’re different files. Without Alice telling you which CID is the “original,” you can’t know the authorial intention — but you can be certain about the content.
3. Always pointing to the latest version?
Alice needs IPNS (InterPlanetary Name System). IPNS publishes a signed record that maps a name derived from Alice’s public key (her PeerID) to a CID. Only the holder of the matching private key can update the record, so ipns://QmAlicePeerID always resolves to the latest CID Alice has published. Users who trust Alice’s public key will always get her latest file.
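The idea of a mutable pointer over immutable content can be sketched as a name table in which the record with the highest sequence number wins. This is a deliberately reduced model: real IPNS records are signed and carry a validity period, and the signature check is omitted here. `NameRecord` and `ToyNameSystem` are hypothetical names, and the CIDs are placeholder strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameRecord:
    value: str     # CID the name currently points to
    sequence: int  # must increase with every update


class ToyNameSystem:
    """Toy resolver; assumes records were already authenticated."""

    def __init__(self):
        self._records = {}

    def publish(self, name: str, record: NameRecord) -> bool:
        current = self._records.get(name)
        # Reject stale or replayed records.
        if current is None or record.sequence > current.sequence:
            self._records[name] = record
            return True
        return False

    def resolve(self, name: str) -> str:
        return self._records[name].value


names = ToyNameSystem()
names.publish("QmAlicePeerID", NameRecord("cid-of-version-1", sequence=1))
names.publish("QmAlicePeerID", NameRecord("cid-of-version-2", sequence=2))
accepted = names.publish("QmAlicePeerID", NameRecord("cid-of-version-1", sequence=1))
print(names.resolve("QmAlicePeerID"), accepted)
```

The name stays fixed while the CID behind it changes, and every CID it has ever pointed to remains individually verifiable.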
Exercise 2 Solution
1. Discovery:
| BitTorrent | IPFS |
|---|---|
| Centralized tracker or DHT with infohash | Kademlia DHT keyed by CID |
| Tracker returns list of peers | DHT returns provider records |
| PEX (Peer Exchange) for gossip | BitSwap handles peer discovery during exchange |
BitTorrent traditionally relied on trackers (centralized). Modern BitTorrent uses DHT (Mainline DHT), which inspired IPFS’s DHT. IPFS’s approach is fully decentralized from the start.
2. Verification:
| BitTorrent | IPFS |
|---|---|
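| Piece hashes: v1 torrents carry a flat list of SHA-1 piece hashes; v2 torrents use per-file SHA-256 Merkle trees | Merkle DAG: every node is content-addressed |
| Verify each piece against its hash in the torrent metadata | Verify each block against its CID on download |
| Infohash (hash of the torrent’s info dictionary), derived from the .torrent file or carried in a magnet link | CID is the root hash |
Both verify data piece by piece against hashes fixed in advance. The key difference: a BitTorrent v1 torrent lists its piece hashes flat (one level) and v2 builds one Merkle tree per file, while IPFS’s Merkle DAG can be nested arbitrarily (trees within trees) and can share subtrees between objects.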
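Verifying a single piece against a Merkle root can be sketched as follows. The tree layout is an assumption made for illustration (SHA-256, with the last node duplicated on odd-sized levels); it is not the exact layout used by BitTorrent or by IPFS. At least one piece is required.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _next_level(level):
    if len(level) % 2:  # duplicate the last node on odd-sized levels
        level = level + [level[-1]]
    return [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(pieces) -> bytes:
    level = [h(p) for p in pieces]
    while len(level) > 1:
        level = _next_level(level)
    return level[0]


def merkle_proof(pieces, index: int):
    """Sibling hashes from leaf to root, with a flag for left siblings."""
    proof = []
    level = [h(p) for p in pieces]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof


def verify_piece(piece: bytes, proof, root: bytes) -> bool:
    node = h(piece)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


pieces = [b"piece-%d" % i for i in range(5)]
root = merkle_root(pieces)
proof = merkle_proof(pieces, 3)
print(verify_piece(pieces[3], proof, root))
print(verify_piece(b"tampered", proof, root))
```

A downloader that knows only the root can check any single piece with a proof whose length grows with the logarithm of the piece count.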
3. Incentives:
| BitTorrent | IPFS |
|---|---|
| Tit-for-tat: “I’ll only upload to you if you upload to me” | BitSwap barter: “I’ll trade blocks I have for blocks I want” |
| Strict: peer is choked if they don’t reciprocate | Soft: based on credit/debit ratios |
| Leeching is directly punished | Leeching is indirectly punished (credit score drops) |
BitTorrent’s tit-for-tat is more aggressive about enforcing sharing. IPFS’s BitSwap is more flexible — a peer with low credit can still fetch data, just at lower priority.
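The credit/debit idea can be made concrete with the debt-ratio rule described in the IPFS whitepaper: a peer's debt ratio is r = bytes_sent / (bytes_recv + 1), and the probability of sending to that peer is 1 − 1 / (1 + exp(6 − 3r)). The sketch below implements only that formula; current IPFS implementations may use different strategies, and the function names are hypothetical.

```python
import math


def debt_ratio(bytes_sent: int, bytes_recv: int) -> float:
    """How much we have sent to a peer relative to what it sent us."""
    return bytes_sent / (bytes_recv + 1)


def send_probability(r: float) -> float:
    """Probability of serving a peer whose debt ratio is r."""
    return 1 - 1 / (1 + math.exp(6 - 3 * r))


for sent, received in [(0, 0), (1_000, 1_000), (2_000, 999), (10_000, 1_000)]:
    r = debt_ratio(sent, received)
    print(f"sent={sent} recv={received} r={r:.2f} p={send_probability(r):.4f}")
```

The probability stays close to 1 for peers with little debt and falls off quickly as the ratio passes 2, so a freeloader is deprioritized rather than cut off outright.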
4. Merkle structure difference:
BitTorrent’s Merkle tree is a static structure: the piece list is fixed when the torrent is created. You cannot add files or reorganize without creating a new torrent.
IPFS’s Merkle DAG is a dynamic structure: you can add files, create directories, and link objects arbitrarily. IPFS directories are themselves DAG nodes with their own CIDs. A torrent records file paths in its metadata, but its directories are not independently hash-addressed objects that other torrents can link to.
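A directory as a DAG node can be sketched by hashing a list of (name, child hash) links. The JSON serialization and the helper names here are assumptions for illustration; IPFS uses its own node encoding.

```python
import hashlib
import json


def file_id(data: bytes) -> str:
    """Content hash of a leaf (file) node."""
    return hashlib.sha256(data).hexdigest()


def dir_id(entries: dict) -> str:
    """Content hash of a directory node: a sorted list of name -> child hash."""
    encoded = json.dumps(sorted(entries.items())).encode()
    return hashlib.sha256(encoded).hexdigest()


readme = file_id(b"read me first")
logo_v1 = file_id(b"old logo")
logo_v2 = file_id(b"new logo")

assets_v1 = dir_id({"logo.png": logo_v1})
assets_v2 = dir_id({"logo.png": logo_v2})

root_v1 = dir_id({"README": readme, "assets": assets_v1})
root_v2 = dir_id({"README": readme, "assets": assets_v2})

print(root_v1 != root_v2)  # the change propagates up to the root
print(readme)              # unchanged node keeps its hash in both versions
```

Editing one file changes the hash of every ancestor directory, while untouched siblings keep their hashes and can be shared between the two versions.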
Exercise 3 Solution
Why the attack fails:
Proof-of-Replication involves sealing — a sequential, resource-intensive encoding process that ties a specific copy of the data to a specific miner:
Sealing process (simplified):
    Original data D ─────────┐
    Miner ID (M) ────────────┤
    Deal ID (Deal) ──────────┤
    Random nonce ────────────┘
                             │
                             ▼
            Layer-by-layer encoding (AES + PoSW)
                             │
                             ▼
          Sealed sector S = Seal(D, M, Deal, nonce)
                Time: ~30 minutes per sector
Each sealed copy S is different because:
- The miner ID is bound into the encoding (each miner has a unique public key), so a replica sealed by another miner cannot be reused
- The deal ID is different (each deal is a separate contract)
- The nonce is different (random value per deal)
For the miner claiming 100 copies stored:
- They would need 100 different sealed sectors: S₁, S₂, …, S₁₀₀
- Each requires ~30 minutes of sequential computation, so a missing sector cannot be re-sealed on demand before a challenge must be answered
- They cannot answer for 100 sealed sectors from 1 stored copy because the sealing inputs (deal ID, nonce) differ per deal
- PoRep challenges ask about the sealed data, which is unique per copy
If caught cheating (unable to produce the correct PoRep), the miner’s entire collateral for all 100 deals is slashed.
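The uniqueness argument can be illustrated with a toy seal function. This is not Filecoin's sealing algorithm and offers no security: it derives a key from the miner ID, deal ID, and nonce, runs a sequential hash chain as a stand-in for the slow encoding, and XORs the data with the resulting keystream. All names and the `rounds` parameter are hypothetical.

```python
import hashlib


def toy_seal(data: bytes, miner_id: bytes, deal_id: bytes, nonce: bytes,
             rounds: int = 1000) -> bytes:
    """Toy stand-in for Seal(D, M, Deal, nonce). Illustration only."""
    key = hashlib.sha256(miner_id + b"|" + deal_id + b"|" + nonce).digest()
    # Sequential work: each round needs the previous round's output,
    # so the chain cannot be parallelized.
    for _ in range(rounds):
        key = hashlib.sha256(key).digest()
    # Expand the key into a keystream as long as the data.
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


dataset = b"dataset D " * 10
sealed_1 = toy_seal(dataset, b"miner-M", b"deal-1", b"nonce-a")
sealed_2 = toy_seal(dataset, b"miner-M", b"deal-2", b"nonce-b")

print(sealed_1 != sealed_2)  # same data, same miner, different deal
# XOR is its own inverse, so sealing again with the same inputs unseals.
print(toy_seal(sealed_1, b"miner-M", b"deal-1", b"nonce-a") == dataset)
```

Because every deal yields different sealed bytes, holding one sealed copy says nothing about the other 99; a real scheme additionally makes re-sealing too slow to do between a challenge and its deadline.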
Bonus: Without PoRep (PoSt only):
The attack would likely succeed. PoSt proves that you currently hold some data, but it doesn’t prove that the data is a unique copy. With PoSt alone:
- Store 1 copy of D
- Generate PoSt for that one copy
- Have the same PoSt serve as proof for all 100 deals (since PoSt just proves “I have this hash”)
- Collect 100× payment for 1× storage
This is why PoRep is necessary: it binds each deal to a physically distinct, computation-bound encoding.