Time Causality · Distributed Systems Defined

Key Question

What does it mean for a distributed system to “appear as a single system,” and what forms of transparency make that illusion possible?

Deep Dive

The defining promise of a distributed system is that it should look like a single, coherent computer to its users. This illusion is built on eight forms of transparency, first codified in the ANSA Reference Model and later adopted by the ISO Reference Model for Open Distributed Processing. Each transparency hides a specific kind of complexity. No real system achieves all eight perfectly, but understanding them reveals what engineers are trying to simplify.

Access transparency means the same interface is used regardless of how the resource is accessed — local or remote. A function call to a local object should look the same as a call to an object on another machine. Remote Procedure Calls (RPC) like gRPC achieve this: the client calls getUser(id) and the framework handles the network serialization and transport. The programmer writes getUser(id) in their code and does not think about bytes on the wire.

Location transparency hides where a resource lives. The system does not require you to know its physical or network address. DNS provides location transparency: you type google.com, not 142.250.80.14. The Domain Name Service resolves the name to an IP address for you. If Google moves its service to a new server with a different IP, DNS updates the mapping, and you never notice. You refer to resources by name, not address.

Replication transparency makes multiple copies of a resource behave as one. If a database stores three replicas of your data in three data centers, a read from any replica should return the correct result. The system hides the fact that there are copies. This is one of the hardest transparencies to achieve because strong consistency across replicas is expensive — it requires coordination that adds latency. Amazon DynamoDB offers “eventual consistency” for reads (replication transparency with relaxed guarantees) because strict replication transparency would limit scalability.

Failure transparency means the system hides its own failures and recovers automatically. If a server crashes, the system routes requests to a healthy replica without the user noticing. TCP achieves partial failure transparency: if a packet is lost, TCP retransmits it automatically. At the application level, retry logic, circuit breakers, and failover mechanisms provide failure transparency.

Mobility transparency allows resources and clients to move without disrupting the system. When your phone switches from Wi-Fi to cellular, your TCP connection should survive. When a virtual machine migrates from one physical host to another, its network connections should persist. This is transparency over movement.

Performance transparency hides performance variations caused by system load. When a server is under heavy load, the system may add more replicas to handle the demand, and users should not see degradation. Auto-scaling groups in AWS provide performance transparency: new instances are spun up and requests are redistributed automatically.

Scaling transparency allows the system to grow without changing its structure or API. You should be able to add more servers without rewriting your application. Microservice architectures achieve this by communicating over standard protocols (HTTP, gRPC) and using load balancers that discover new instances.

Each form of transparency hides a distinct pain point. Together, they define the ideal: a system that is resilient to change, failure, and growth, while presenting a simple, stable interface to the outside world.

Check Your Understanding

What is the difference between access transparency and location transparency? Give an example of each.
Why is replication transparency particularly difficult when strong consistency is required?
How does a load balancer contribute to both performance transparency and scaling transparency?

The “So What?”

Every abstraction you use in distributed programming — from HTTP to DNS to database connection pools — is a form of transparency in action. When you choose a technology, you are implicitly choosing which complexities to hide and which to expose. The uncomfortable truth is that no system achieves perfect transparency. At some point, the network partition, the timeout, or the consistency error leaks through. Understanding transparency helps you predict where those leaks will appear and design systems that handle them gracefully.

✏️ Exercises

Topic 1: Distributed Systems Defined — Exercises

Exercise 1: Is It Distributed?

A company deploys a website on a single physical server. A load balancer distributes incoming HTTP requests across two application processes running on the same machine. Is this a distributed system? Explain your reasoning in terms of the three key characteristics: concurrency, no shared clock, and independent failures.

Think about whether two processes on the same machine have a shared clock, whether they can fail independently, and whether they truly operate concurrently.

Exercise 2: Scaling Showdown

Your social media startup has 10,000 users and runs on a single server with 4 CPU cores, 16 GB RAM, and a 500 GB SSD. After a year, you have 1,000,000 users and your server is at 95% CPU utilization during peak hours.

Describe a scenario where horizontal scaling is clearly better than vertical scaling, and one where vertical scaling might be preferable. Include cost, complexity, and maintenance considerations.

Exercise 3: Fallacy Identification

For each of the following real-world incidents, identify which of the eight fallacies of distributed computing was violated:

(a) An application makes 200 sequential database queries on page load. In development (localhost), the page renders in 200ms. In production (database on a separate server), the page takes 12 seconds to load.

(b) A team deploys a microservice that talks to a payment API. The payment API is maintained by a different team. During a deployment, the payment team changes their API response format from XML to JSON without updating the documentation. The microservice crashes.

(c) A server has been running for 400 days without a restart. One night, a network engineer replaces a faulty switch in the rack. The server’s network card renegotiates its link speed, and the application’s hardcoded TCP keep-alive settings cause it to stop accepting new connections until the machine is rebooted.

Exercise 4: The Replication Transparency Dilemma

You are designing a distributed key-value store that replicates every write to three servers. Users expect “strong consistency” — a read immediately after a write must return the written value. Explain why achieving replication transparency (the replicas appear as one system) is fundamentally hard when you require strong consistency.

Consider what happens when a network partition separates the replicas. What options do you have? What tradeoff is this forcing you to make?

👁️ View Solutions

Topic 1: Distributed Systems Defined — Solutions

Solution 1

No, this is not a distributed system. While there are two processes running concurrently, they share the same machine, which means:

Shared clock: Both processes use the same system clock, so they agree on time.
Shared fate: A hardware failure (e.g., power supply, motherboard) takes down both processes simultaneously. They fail dependently, not independently.
Shared memory space: They can communicate via shared memory, pipes, or local sockets without the network unreliability that defines distributed systems.

The load balancer adds complexity but does not change the fundamental architecture. True distributed systems involve independent computers that communicate over a network and can fail independently. Two processes on one machine are just concurrent processes, not a distributed system.

Solution 2

When horizontal scaling is better: Your user base grows to 10,000,000. The single-machine approach hits hard limits — no single server has enough RAM to hold the active user sessions, and the CPU cannot handle the request rate. By adding 10 servers behind a load balancer, you get 10x the capacity at roughly 10x the cost. When one server fails, 90% of traffic is still served. Cost is predictable (add servers linearly), and you can use commodity hardware.

When vertical scaling is preferable: Your dataset fits entirely in memory (say, 100 GB), and your query pattern is read-heavy. Upgrading from 64 GB to 256 GB RAM on one server may be cheaper and far simpler than sharding the data across 4 servers (which requires a distributed join, consistency protocol, and more failure handling). For a small team with limited operations bandwidth, vertical scaling often wins — the complexity of distribution is not worth it until the single machine hits a fundamental ceiling.

Solution 3

(a) Fallacy 2: Latency is zero. The developer assumed database queries cost the same as local function calls. In development, the database is on localhost (microsecond latency). In production, each query incurs network round trip latency (milliseconds). 200 sequential queries × 2ms RTT = 400ms just in network time, plus query execution. The fix: batch queries, use eager loading, or reduce query count.

(b) Fallacy 6: There is one administrator. The team does not control the payment API team’s deployment schedule or communication practices. The payment API team changed their API format without coordination. This is the essence of the “one administrator” fallacy — in a distributed system with multiple owners, changes happen outside your control.

(c) Fallacy 5: Topology doesn’t change. The system assumed the network switch would remain the same configuration forever. Switch replacement caused link renegotiation, which interacted poorly with hardcoded TCP settings. The system was not designed for a topology change, so what should have been a routine maintenance operation caused an outage.

Solution 4

Replication transparency means the user should not know there are three copies. Strong consistency means if you write value X and immediately read, you must get X.

The problem arises during a network partition. Suppose replicas A, B, and C are separated: A can talk to the client but not to B and C. The client writes X to A. Now A tries to replicate to B and C but cannot reach them.

You have two options:

Block the write (refuse to acknowledge until all replicas confirm). This preserves strong consistency but sacrifices availability — the system becomes unavailable during the partition. This is the CP (Consistency + Partition tolerance) choice in the CAP theorem.
Accept the write on A and acknowledge to the client. Now B and C do not have X. A subsequent read from B returns the old value, violating strong consistency. This sacrifices consistency for availability.

Replication transparency with strong consistency is hard because it requires coordination across machines before acknowledging the operation. Coordination means waiting, and waiting conflicts with the goal of hiding the complexity of replication. The fundamental tradeoff is between the transparency of the replication illusion and the speed at which the system can respond.