Time Causality · Distributed Systems Defined

Key Question

What assumptions do developers naively make about networks that cause distributed systems to fail in production?

Deep Dive

In 1994, Peter Deutsch at Sun Microsystems published a list of false assumptions that programmers make when building networked applications. Known as the “Eight Fallacies of Distributed Computing,” this list has aged remarkably well. Every fallacy describes a belief that is true in a single-machine system but dangerously false once you connect computers. Violating any one of them can cause outages that are baffling if you do not know the fallacy exists.

Fallacy 1: The network is reliable. This is the most dangerous assumption. A developer builds and tests on a local machine or a cozy data center LAN where packets rarely drop. In production, networks are unreliable: switches fail, cables are cut, wireless interference corrupts packets, and routers drop traffic under load. Real-world packet loss on the public internet ranges from 0.1% to 5% or more. Ignoring this means your system may hang forever waiting for a response that will never arrive. The fix: timeouts, retries, and idempotent operations.

Fallacy 2: Latency is zero. Developers assume that a remote call returns as fast as a local memory access. A local function call takes nanoseconds. A network round trip within a data center takes 500 microseconds. Across continents, it takes 100-300 milliseconds — a factor of a million slower than local code. This fallacy leads to chatty protocols that make hundreds of round trips, each one adding the latency cost. The fix: batch requests, cache aggressively, and design for data locality.

Fallacy 3: Bandwidth is infinite. Bandwidth — the amount of data you can send per second — is finite and shared. A single server on a 1 Gbps link cannot stream video to 10,000 users. Worse, sending large payloads consumes bandwidth that other services need. The fallacy manifests as verbose serialization formats (e.g., XML without compression) or sending entire data structures when only a few fields are needed. The fix: compress data, use efficient formats (Protobuf, Avro), and send only what is required.

Fallacy 4: The network is secure. A developer assumes that because the data center has a firewall, internal traffic is safe. In reality, networks are porous: a compromised machine can sniff traffic, an attacker can intercept DNS, and a rogue employee can access internal services. The Target data breach (2013) began when attackers stole credentials from an HVAC vendor — an “external” partner — and used them to reach the internal payment network. The fix: encrypt everything (TLS), authenticate every request, and never trust internal IP addresses.

Fallacy 5: Topology doesn’t change. The network layout is assumed to be static. In cloud environments, topology changes constantly: servers are replaced, load balancers are reconfigured, and entire data centers are added or decommissioned. Applications that hardcode IP addresses break when a server is recycled. The fix: use service discovery (DNS, Consul, Kubernetes DNS) and design for dynamic membership.

Fallacy 6: There is one administrator. A developer assumes they control the entire network path. In a distributed system, different parts are managed by different teams or organizations. A database might be operated by one team, the application server by another, the CDN by a third party, and the client by the user. You cannot control the configuration, software versions, or security practices on machines you do not administer.

Fallacy 7: Transport cost is zero. Network calls appear free because there is no line-item charge per packet. But network costs are real: CPU cycles spent on serialization and deserialization, memory for buffering, and the operational cost of managing connections. A service that makes hundreds of microservice calls per request can spend more time on network overhead than on actual computation.

Fallacy 8: The network is homogeneous. Developers assume all connected machines run the same OS, use the same data formats, and speak the same protocol version. In reality, a system may include Linux servers, Windows clients, ARM-based IoT devices, and mobile phones running different software versions. Big-endian and little-endian byte orders, different character encodings, and protocol version mismatches all cause subtle failures.

These eight fallacies form a checklist. Every distributed system violates some of them, and a resilient design acknowledges which fallacies it is embracing and how it mitigates the risk.

Check Your Understanding

You implement retry logic with exponential backoff. Which fallacy does this address?
A microservice sends the entire customer object (50 fields, 10 KB) when only the customer name is needed. Which fallacy is being violated?
Your system works perfectly in staging but fails intermittently in production with random “connection reset” errors. Which fallacy is likely at play?
An engineer assumes “we’re behind the firewall, so we don’t need TLS between services.” Which fallacy (or fallacies) should make them reconsider?

The “So What?”

Whenever you encounter a production incident where “it worked in dev but not in prod,” chances are one of the eight fallacies is the root cause. Print this list and put it on your wall. When you design a system, go through each fallacy and ask: “What happens if this assumption is false?” That habit alone will prevent more outages than any monitoring tool.

✏️ Exercises

Topic 1: Distributed Systems Defined — Exercises

Exercise 1: Is It Distributed?

A company deploys a website on a single physical server. A load balancer distributes incoming HTTP requests across two application processes running on the same machine. Is this a distributed system? Explain your reasoning in terms of the three key characteristics: concurrency, no shared clock, and independent failures.

Think about whether two processes on the same machine have a shared clock, whether they can fail independently, and whether they truly operate concurrently.

Exercise 2: Scaling Showdown

Your social media startup has 10,000 users and runs on a single server with 4 CPU cores, 16 GB RAM, and a 500 GB SSD. After a year, you have 1,000,000 users and your server is at 95% CPU utilization during peak hours.

Describe a scenario where horizontal scaling is clearly better than vertical scaling, and one where vertical scaling might be preferable. Include cost, complexity, and maintenance considerations.

Exercise 3: Fallacy Identification

For each of the following real-world incidents, identify which of the eight fallacies of distributed computing was violated:

(a) An application makes 200 sequential database queries on page load. In development (localhost), the page renders in 200ms. In production (database on a separate server), the page takes 12 seconds to load.

(b) A team deploys a microservice that talks to a payment API. The payment API is maintained by a different team. During a deployment, the payment team changes their API response format from XML to JSON without updating the documentation. The microservice crashes.

(c) A server has been running for 400 days without a restart. One night, a network engineer replaces a faulty switch in the rack. The server’s network card renegotiates its link speed, and the application’s hardcoded TCP keep-alive settings cause it to stop accepting new connections until the machine is rebooted.

Exercise 4: The Replication Transparency Dilemma

You are designing a distributed key-value store that replicates every write to three servers. Users expect “strong consistency” — a read immediately after a write must return the written value. Explain why achieving replication transparency (the replicas appear as one system) is fundamentally hard when you require strong consistency.

Consider what happens when a network partition separates the replicas. What options do you have? What tradeoff is this forcing you to make?

👁️ View Solutions

Topic 1: Distributed Systems Defined — Solutions

Solution 1

No, this is not a distributed system. While there are two processes running concurrently, they share the same machine, which means:

Shared clock: Both processes use the same system clock, so they agree on time.
Shared fate: A hardware failure (e.g., power supply, motherboard) takes down both processes simultaneously. They fail dependently, not independently.
Shared memory space: They can communicate via shared memory, pipes, or local sockets without the network unreliability that defines distributed systems.

The load balancer adds complexity but does not change the fundamental architecture. True distributed systems involve independent computers that communicate over a network and can fail independently. Two processes on one machine are just concurrent processes, not a distributed system.

Solution 2

When horizontal scaling is better: Your user base grows to 10,000,000. The single-machine approach hits hard limits — no single server has enough RAM to hold the active user sessions, and the CPU cannot handle the request rate. By adding 10 servers behind a load balancer, you get 10x the capacity at roughly 10x the cost. When one server fails, 90% of traffic is still served. Cost is predictable (add servers linearly), and you can use commodity hardware.

When vertical scaling is preferable: Your dataset fits entirely in memory (say, 100 GB), and your query pattern is read-heavy. Upgrading from 64 GB to 256 GB RAM on one server may be cheaper and far simpler than sharding the data across 4 servers (which requires a distributed join, consistency protocol, and more failure handling). For a small team with limited operations bandwidth, vertical scaling often wins — the complexity of distribution is not worth it until the single machine hits a fundamental ceiling.

Solution 3

(a) Fallacy 2: Latency is zero. The developer assumed database queries cost the same as local function calls. In development, the database is on localhost (microsecond latency). In production, each query incurs network round trip latency (milliseconds). 200 sequential queries × 2ms RTT = 400ms just in network time, plus query execution. The fix: batch queries, use eager loading, or reduce query count.

(b) Fallacy 6: There is one administrator. The team does not control the payment API team’s deployment schedule or communication practices. The payment API team changed their API format without coordination. This is the essence of the “one administrator” fallacy — in a distributed system with multiple owners, changes happen outside your control.

(c) Fallacy 5: Topology doesn’t change. The system assumed the network switch would remain the same configuration forever. Switch replacement caused link renegotiation, which interacted poorly with hardcoded TCP settings. The system was not designed for a topology change, so what should have been a routine maintenance operation caused an outage.

Solution 4

Replication transparency means the user should not know there are three copies. Strong consistency means if you write value X and immediately read, you must get X.

The problem arises during a network partition. Suppose replicas A, B, and C are separated: A can talk to the client but not to B and C. The client writes X to A. Now A tries to replicate to B and C but cannot reach them.

You have two options:

Block the write (refuse to acknowledge until all replicas confirm). This preserves strong consistency but sacrifices availability — the system becomes unavailable during the partition. This is the CP (Consistency + Partition tolerance) choice in the CAP theorem.
Accept the write on A and acknowledge to the client. Now B and C do not have X. A subsequent read from B returns the old value, violating strong consistency. This sacrifices consistency for availability.

Replication transparency with strong consistency is hard because it requires coordination across machines before acknowledging the operation. Coordination means waiting, and waiting conflicts with the goal of hiding the complexity of replication. The fundamental tradeoff is between the transparency of the replication illusion and the speed at which the system can respond.