Key Question
Why not just use a bigger computer? What forces drive organizations to adopt distributed architectures?
Deep Dive
When a single server works, it is dramatically simpler than a distributed system. No network calls, no clock disagreements, no partial failures. Yet every major internet service uses distribution. The reason comes down to three forces: scalability, reliability, and latency. Each one pushes against the limits of what a single machine can do.
Scalability is the ability of a system to handle increased load by adding resources. There are two approaches. Vertical scaling means replacing your server with a bigger one — more CPU cores, more RAM, faster disks. This works up to a point, but the biggest machine you can buy is limited and costs exponentially more per unit of capacity. A server with 4 TB of RAM costs over 100x more than one with 32 GB, yet it is not 100x faster. Horizontal scaling means adding more machines of modest size. If one server handles 1,000 requests per second, ten servers handle roughly 10,000 — at linear cost. Horizontal scaling is what allows systems like Google Search or Netflix to serve billions of users. The price is complexity: you must now route requests, handle failures, and keep data consistent across machines.
Reliability is the probability that a system continues to function correctly over time. A single server has a hard reliability ceiling: hard drives fail, power supplies die, network cards stop responding, and eventually the machine dies. Mean Time Between Failures (MTBF) for a typical server is measured in years. If you have one server and it fails, your service is down. If you have 100 servers and one fails, you lose 1% of capacity. By replicating data across multiple machines — keeping three copies of every piece of data, for instance — you can survive individual machine failures without any data loss and without downtime. This is why critical infrastructure runs on clusters, not single machines. Amazon S3 famously replicates data across multiple data centers; it has survived disk failures, rack failures, and even entire data center outages without losing customer data.
Latency is the time between initiating an action and seeing its result. For a user in Tokyo, a server in Tokyo can respond in 5 milliseconds. A server in Virginia takes 150 milliseconds — the speed of light across the Pacific is a physical limit you cannot beat. Distributing servers geographically — edge computing — places computation and data close to users to reduce latency. CDNs like Cloudflare and Akamai cache content at hundreds of locations worldwide. AWS has regions on every continent so applications can serve users from a nearby data center. The tradeoff is that geographically distributed systems must handle replication lag, network partitions, and conflicting updates — problems that simply do not exist in a single data center.
These three forces interact. You scale horizontally to handle load, which makes the system more reliable through redundancy, and you distribute globally to reduce latency. But each decision adds complexity. The art of distributed systems engineering is deciding how much complexity to accept for the required scalability, reliability, and latency.
Check Your Understanding
- If you need to handle 10x the traffic, which approach scales better: replacing your server with one that has 10x the capacity, or adding 10 servers? Why?
- A payment processing system must never lose a transaction. Why is a single server insufficient even if it has 99.999% uptime?
- An online game must respond to player actions within 50ms. Where should you place your servers?
The “So What?”
When you design a system, you will make tradeoffs between scalability, reliability, and latency. A caching layer improves latency but adds complexity. Replication improves reliability but creates consistency challenges. Understanding these forces helps you choose the right architecture: a startup’s MVP probably runs on one server, while a global payments system needs a distributed cluster. Know which problem you are solving before you add distribution.
✏️ Exercises
Topic 1: Distributed Systems Defined — Exercises
Exercise 1: Is It Distributed?
A company deploys a website on a single physical server. A load balancer distributes incoming HTTP requests across two application processes running on the same machine. Is this a distributed system? Explain your reasoning in terms of the three key characteristics: concurrency, no shared clock, and independent failures.
Think about whether two processes on the same machine have a shared clock, whether they can fail independently, and whether they truly operate concurrently.
Exercise 2: Scaling Showdown
Your social media startup has 10,000 users and runs on a single server with 4 CPU cores, 16 GB RAM, and a 500 GB SSD. After a year, you have 1,000,000 users and your server is at 95% CPU utilization during peak hours.
Describe a scenario where horizontal scaling is clearly better than vertical scaling, and one where vertical scaling might be preferable. Include cost, complexity, and maintenance considerations.
Exercise 3: Fallacy Identification
For each of the following real-world incidents, identify which of the eight fallacies of distributed computing was violated:
(a) An application makes 200 sequential database queries on page load. In development (localhost), the page renders in 200ms. In production (database on a separate server), the page takes 12 seconds to load.
(b) A team deploys a microservice that talks to a payment API. The payment API is maintained by a different team. During a deployment, the payment team changes their API response format from XML to JSON without updating the documentation. The microservice crashes.
(c) A server has been running for 400 days without a restart. One night, a network engineer replaces a faulty switch in the rack. The server’s network card renegotiates its link speed, and the application’s hardcoded TCP keep-alive settings cause it to stop accepting new connections until the machine is rebooted.
Exercise 4: The Replication Transparency Dilemma
You are designing a distributed key-value store that replicates every write to three servers. Users expect “strong consistency” — a read immediately after a write must return the written value. Explain why achieving replication transparency (the replicas appear as one system) is fundamentally hard when you require strong consistency.
Consider what happens when a network partition separates the replicas. What options do you have? What tradeoff is this forcing you to make?
👁️ View Solutions
Topic 1: Distributed Systems Defined — Solutions
Solution 1
No, this is not a distributed system. While there are two processes running concurrently, they share the same machine, which means:
- Shared clock: Both processes use the same system clock, so they agree on time.
- Shared fate: A hardware failure (e.g., power supply, motherboard) takes down both processes simultaneously. They fail dependently, not independently.
- Shared memory space: They can communicate via shared memory, pipes, or local sockets without the network unreliability that defines distributed systems.
The load balancer adds complexity but does not change the fundamental architecture. True distributed systems involve independent computers that communicate over a network and can fail independently. Two processes on one machine are just concurrent processes, not a distributed system.
Solution 2
When horizontal scaling is better: Your user base grows to 10,000,000. The single-machine approach hits hard limits — no single server has enough RAM to hold the active user sessions, and the CPU cannot handle the request rate. By adding 10 servers behind a load balancer, you get 10x the capacity at roughly 10x the cost. When one server fails, 90% of traffic is still served. Cost is predictable (add servers linearly), and you can use commodity hardware.
When vertical scaling is preferable: Your dataset fits entirely in memory (say, 100 GB), and your query pattern is read-heavy. Upgrading from 64 GB to 256 GB RAM on one server may be cheaper and far simpler than sharding the data across 4 servers (which requires a distributed join, consistency protocol, and more failure handling). For a small team with limited operations bandwidth, vertical scaling often wins — the complexity of distribution is not worth it until the single machine hits a fundamental ceiling.
Solution 3
(a) Fallacy 2: Latency is zero. The developer assumed database queries cost the same as local function calls. In development, the database is on localhost (microsecond latency). In production, each query incurs network round trip latency (milliseconds). 200 sequential queries × 2ms RTT = 400ms just in network time, plus query execution. The fix: batch queries, use eager loading, or reduce query count.
(b) Fallacy 6: There is one administrator. The team does not control the payment API team’s deployment schedule or communication practices. The payment API team changed their API format without coordination. This is the essence of the “one administrator” fallacy — in a distributed system with multiple owners, changes happen outside your control.
(c) Fallacy 5: Topology doesn’t change. The system assumed the network switch would remain the same configuration forever. Switch replacement caused link renegotiation, which interacted poorly with hardcoded TCP settings. The system was not designed for a topology change, so what should have been a routine maintenance operation caused an outage.
Solution 4
Replication transparency means the user should not know there are three copies. Strong consistency means if you write value X and immediately read, you must get X.
The problem arises during a network partition. Suppose replicas A, B, and C are separated: A can talk to the client but not to B and C. The client writes X to A. Now A tries to replicate to B and C but cannot reach them.
You have two options:
-
Block the write (refuse to acknowledge until all replicas confirm). This preserves strong consistency but sacrifices availability — the system becomes unavailable during the partition. This is the CP (Consistency + Partition tolerance) choice in the CAP theorem.
-
Accept the write on A and acknowledge to the client. Now B and C do not have X. A subsequent read from B returns the old value, violating strong consistency. This sacrifices consistency for availability.
Replication transparency with strong consistency is hard because it requires coordination across machines before acknowledging the operation. Coordination means waiting, and waiting conflicts with the goal of hiding the complexity of replication. The fundamental tradeoff is between the transparency of the replication illusion and the speed at which the system can respond.