Distributed & Decentralized Systems Curriculum
Time Causality · Distributed Systems Defined

Key Question

What exactly is a distributed system, and how is it different from a regular computer program?

Deep Dive

Leslie Lamport, one of the pioneers of distributed systems, once defined the field with a darkly humorous observation: “A distributed system is one where the failure of a computer you didn’t even know existed can render your own computer unusable.” This gets at the essential truth — distributed systems are defined not by what they are, but by the problems they introduce. The formal definition is more precise: a distributed system is a collection of independent computers that appear to its users as a single coherent system.

These independent computers communicate by passing messages over a network. They do not share memory or a common clock. Each computer (called a “node”) runs its own operating system and its own copy of the software. When you type a query into Google Search, that single action touches thousands of machines. Yet the result comes back as if one computer handled it. That illusion of unity is the defining challenge of distributed systems.

Three characteristics separate distributed systems from single-machine software. First, concurrency: multiple nodes run simultaneously, and their operations interleave in unpredictable ways. Two users editing the same document may send updates at the same instant. Second, no shared clock: each node has its own notion of time. If Node A records an event at “10:00:00.001” and Node B records an event at “10:00:00.002,” you cannot trust that A’s event happened first — their clocks may disagree. Third, independent failures: some nodes can fail while others continue running. A web server in Virginia can crash while its replica in Dublin keeps serving traffic.

Concrete examples help anchor the definition. Google Search is a distributed system: your query goes to a load balancer, which forwards it to one of hundreds of frontend servers, which fan out to thousands of indexing servers, which return results that are merged and ranked. You see a single page. DNS (the Domain Name System) is the oldest and largest distributed system in production — it maps domain names like google.com to IP addresses across a hierarchy of millions of servers. Bitcoin is a distributed system where thousands of nodes maintain a shared ledger through a consensus protocol, with no central authority. Each of these looks like a single service to the user.

The hard part is maintaining that illusion. When you send a request, any one of a dozen machines could fail, and you should still get an answer. When a network cable is cut, the system must route around it. When a server’s clock drifts, messages may be misordered. The distributed system is defined not by the hardware it runs on, but by the properties it must maintain despite the chaos underneath.

Check Your Understanding

  1. What are the three key characteristics that distinguish a distributed system from a single-machine system?
  2. Explain the contradiction in Lamport’s quote: how can a computer you “didn’t even know existed” affect your work?
  3. Think of one system you use daily that could NOT be a distributed system. What makes it different from Google Search or DNS?

The “So What?”

Every time you write code that talks to a database, calls an API, or handles a user request, you are working inside a distributed system. The failure modes are fundamentally different from local programming: messages get lost, clocks disagree, and two machines can partially disagree on the truth. Understanding what a distributed system is — and what makes it hard — is the prerequisite for designing systems that survive the inevitable chaos.


✏️ Exercises

Topic 1: Distributed Systems Defined — Exercises

Exercise 1: Is It Distributed?

A company deploys a website on a single physical server. A load balancer distributes incoming HTTP requests across two application processes running on the same machine. Is this a distributed system? Explain your reasoning in terms of the three key characteristics: concurrency, no shared clock, and independent failures.

Think about whether two processes on the same machine have a shared clock, whether they can fail independently, and whether they truly operate concurrently.


Exercise 2: Scaling Showdown

Your social media startup has 10,000 users and runs on a single server with 4 CPU cores, 16 GB RAM, and a 500 GB SSD. After a year, you have 1,000,000 users and your server is at 95% CPU utilization during peak hours.

Describe a scenario where horizontal scaling is clearly better than vertical scaling, and one where vertical scaling might be preferable. Include cost, complexity, and maintenance considerations.


Exercise 3: Fallacy Identification

For each of the following real-world incidents, identify which of the eight fallacies of distributed computing was violated:

(a) An application makes 200 sequential database queries on page load. In development (localhost), the page renders in 200ms. In production (database on a separate server), the page takes 12 seconds to load.

(b) A team deploys a microservice that talks to a payment API. The payment API is maintained by a different team. During a deployment, the payment team changes their API response format from XML to JSON without updating the documentation. The microservice crashes.

(c) A server has been running for 400 days without a restart. One night, a network engineer replaces a faulty switch in the rack. The server’s network card renegotiates its link speed, and the application’s hardcoded TCP keep-alive settings cause it to stop accepting new connections until the machine is rebooted.


Exercise 4: The Replication Transparency Dilemma

You are designing a distributed key-value store that replicates every write to three servers. Users expect “strong consistency” — a read immediately after a write must return the written value. Explain why achieving replication transparency (the replicas appear as one system) is fundamentally hard when you require strong consistency.

Consider what happens when a network partition separates the replicas. What options do you have? What tradeoff is this forcing you to make?

👁️ View Solutions

Topic 1: Distributed Systems Defined — Solutions

Solution 1

No, this is not a distributed system. While there are two processes running concurrently, they share the same machine, which means:

  • Shared clock: Both processes use the same system clock, so they agree on time.
  • Shared fate: A hardware failure (e.g., power supply, motherboard) takes down both processes simultaneously. They fail dependently, not independently.
  • Shared memory space: They can communicate via shared memory, pipes, or local sockets without the network unreliability that defines distributed systems.

The load balancer adds complexity but does not change the fundamental architecture. True distributed systems involve independent computers that communicate over a network and can fail independently. Two processes on one machine are just concurrent processes, not a distributed system.


Solution 2

When horizontal scaling is better: Your user base grows to 10,000,000. The single-machine approach hits hard limits — no single server has enough RAM to hold the active user sessions, and the CPU cannot handle the request rate. By adding 10 servers behind a load balancer, you get 10x the capacity at roughly 10x the cost. When one server fails, 90% of traffic is still served. Cost is predictable (add servers linearly), and you can use commodity hardware.

When vertical scaling is preferable: Your dataset fits entirely in memory (say, 100 GB), and your query pattern is read-heavy. Upgrading from 64 GB to 256 GB RAM on one server may be cheaper and far simpler than sharding the data across 4 servers (which requires a distributed join, consistency protocol, and more failure handling). For a small team with limited operations bandwidth, vertical scaling often wins — the complexity of distribution is not worth it until the single machine hits a fundamental ceiling.


Solution 3

(a) Fallacy 2: Latency is zero. The developer assumed database queries cost the same as local function calls. In development, the database is on localhost (microsecond latency). In production, each query incurs network round trip latency (milliseconds). 200 sequential queries × 2ms RTT = 400ms just in network time, plus query execution. The fix: batch queries, use eager loading, or reduce query count.

(b) Fallacy 6: There is one administrator. The team does not control the payment API team’s deployment schedule or communication practices. The payment API team changed their API format without coordination. This is the essence of the “one administrator” fallacy — in a distributed system with multiple owners, changes happen outside your control.

(c) Fallacy 5: Topology doesn’t change. The system assumed the network switch would remain the same configuration forever. Switch replacement caused link renegotiation, which interacted poorly with hardcoded TCP settings. The system was not designed for a topology change, so what should have been a routine maintenance operation caused an outage.


Solution 4

Replication transparency means the user should not know there are three copies. Strong consistency means if you write value X and immediately read, you must get X.

The problem arises during a network partition. Suppose replicas A, B, and C are separated: A can talk to the client but not to B and C. The client writes X to A. Now A tries to replicate to B and C but cannot reach them.

You have two options:

  1. Block the write (refuse to acknowledge until all replicas confirm). This preserves strong consistency but sacrifices availability — the system becomes unavailable during the partition. This is the CP (Consistency + Partition tolerance) choice in the CAP theorem.

  2. Accept the write on A and acknowledge to the client. Now B and C do not have X. A subsequent read from B returns the old value, violating strong consistency. This sacrifices consistency for availability.

Replication transparency with strong consistency is hard because it requires coordination across machines before acknowledging the operation. Coordination means waiting, and waiting conflicts with the goal of hiding the complexity of replication. The fundamental tradeoff is between the transparency of the replication illusion and the speed at which the system can respond.