Distributed & Decentralized Systems Curriculum
Production Engineering Resilience · Observability Tracing

Key Question

How do we debug a single request that flows through dozens of different microservices?

Deep Dive

In a monolith, a stack trace usually shows where a request failed. In a distributed system, no single process sees the whole request; it is like a ghost moving through walls. If a user’s checkout fails, the cause might be a slow Auth Service, an Inventory DB timeout, or an error returned by the Payment Gateway.

Distributed Tracing solves this by assigning every request a unique identifier (its “DNA”) that travels with it through each service it touches.

The Trace and The Span

  1. Trace: The entire journey of a request from start to finish.
  2. Span: A single unit of work within that journey (e.g., one database query or one RPC call).
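
To make the two terms concrete, here is a minimal sketch of how a span might be represented in memory. The class and field names (Span, parent_span_id, and so on) are illustrative assumptions, not the data model of any particular tracing library.

```python
from dataclasses import dataclass, field
import secrets
import time
from typing import Optional


@dataclass
class Span:
    """One unit of work inside a trace. All names here are hypothetical."""
    trace_id: str                          # shared by every span in the same trace
    name: str                              # e.g. "inventory query" or "POST /charge"
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_span_id: Optional[str] = None   # None marks the root span of the trace
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

    def finish(self) -> None:
        """Record the end time of this unit of work."""
        self.end = time.time()


# One trace (the whole checkout) containing two spans.
trace_id = secrets.token_hex(16)
root = Span(trace_id, "checkout")
query = Span(trace_id, "inventory query", parent_span_id=root.span_id)
query.finish()
root.finish()
```

Both spans carry the same trace_id, which is what marks them as parts of one journey; the parent_span_id field records which span caused which.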

How it Works: Context Propagation

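When a request enters the system (at the Load Balancer or API Gateway), the entry point generates a Trace ID. This ID, together with the ID of the calling span, must be “carried” in the headers of every subsequent network call, either in a custom header such as X-Trace-Id or in the traceparent header defined by the W3C Trace Context standard.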
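
As an illustration, the W3C traceparent header packs four dash-separated fields into one value: a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and a flags byte whose lowest bit marks the request as sampled. The sketch below parses and builds such a value. The function names are hypothetical, and the parser is deliberately simplified: it accepts only version 00 and ignores the companion tracestate header.

```python
import re
import secrets

# version "00" - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(value):
    """Return the header fields as a dict, or None if the value is malformed."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the standard.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def format_traceparent(trace_id, span_id, sampled=True):
    """Build the header value a service would attach to an outgoing call."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
context = parse_traceparent(incoming)

# The outgoing header keeps the trace ID but names this service's own span.
my_span_id = secrets.token_hex(8)
outgoing = format_traceparent(context["trace_id"], my_span_id, context["sampled"])
```

Note that the trace ID is copied through unchanged while the second ID is replaced at every hop, so the next service learns who called it.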

Each service along the way:

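  • Reads the incoming Trace ID and the caller’s Span ID.
  • Creates its own Span ID and records the caller’s Span ID as its parent.
  • Records the start/end time and any metadata (e.g., user_id, http_status).
  • Passes the Trace ID and its own Span ID along in the headers of any calls it makes.
  • Sends this data asynchronously to a Tracing Collector (like Jaeger or Zipkin).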
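
The per-service steps can be sketched as one wrapper function. Everything here is a simplified assumption for illustration: handle_request and export_queue are hypothetical names, headers are a plain dict with lowercase keys, and the queue stands in for the buffer that a background exporter would drain and ship to the collector.

```python
import queue
import secrets
import time

# Stand-in for an asynchronous exporter: the request path only enqueues,
# and a background worker (not shown) would send spans to the collector.
export_queue = queue.Queue()


def handle_request(incoming_headers, service_name, work):
    """Run work(outgoing_headers) inside a span and return its HTTP status."""
    parts = incoming_headers.get("traceparent", "").split("-")
    if len(parts) == 4:
        _, trace_id, parent_span_id, flags = parts       # continue the caller's trace
    else:
        trace_id, parent_span_id, flags = secrets.token_hex(16), None, "01"  # start a new trace

    span_id = secrets.token_hex(8)                       # this service's own span
    outgoing_headers = {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}

    start = time.time()
    status = work(outgoing_headers)                      # downstream calls reuse these headers
    export_queue.put({
        "trace_id": trace_id,
        "span_id": span_id,
        "parent_span_id": parent_span_id,
        "service": service_name,
        "start": start,
        "end": time.time(),
        "http_status": status,
    })
    return status


# Usage: a gateway that calls an auth service within the same trace.
def auth_work(headers):
    return 200


def gateway_work(headers):
    return handle_request(headers, "auth", auth_work)


handle_request({}, "gateway", gateway_work)
auth_span = export_queue.get_nowait()      # the inner span finishes first
gateway_span = export_queue.get_nowait()
```

The auth span ends up with the gateway’s span ID as its parent, which is the link the collector later relies on.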

The collector then groups spans by Trace ID and arranges them using each span’s parent reference and timestamps, creating a “Gantt chart” of the request (often called a waterfall view). You can now see exactly which “hop” took 2 seconds and where the bottleneck lies.
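
A rough sketch of that stitching step follows, using made-up span records for one checkout request. The record layout, the function names, and the timings are all invented for illustration; real collectors store far more detail. The self_time helper assumes that a span’s children do not overlap in time.

```python
from collections import defaultdict

# Hypothetical spans as they might arrive at the collector, in any order.
spans = [
    {"trace_id": "t1", "span_id": "d", "parent_span_id": "c", "name": "inventory-db", "start_ms": 80, "end_ms": 2080},
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "name": "gateway", "start_ms": 0, "end_ms": 2300},
    {"trace_id": "t1", "span_id": "e", "parent_span_id": "a", "name": "payment", "start_ms": 2110, "end_ms": 2290},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "name": "auth", "start_ms": 10, "end_ms": 60},
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "a", "name": "inventory", "start_ms": 70, "end_ms": 2100},
]


def build_tree(spans, trace_id):
    """Index the spans of one trace by parent, with siblings in start order."""
    children = defaultdict(list)
    for span in spans:
        if span["trace_id"] == trace_id:
            children[span["parent_span_id"]].append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s["start_ms"])
    return children


def render(children, parent_id=None, depth=0):
    """Return indented lines, one per span, walking down from the root."""
    lines = []
    for span in children.get(parent_id, []):
        duration = span["end_ms"] - span["start_ms"]
        lines.append(f"{'  ' * depth}{span['name']} {duration}ms")
        lines.extend(render(children, span["span_id"], depth + 1))
    return lines


def self_time(children, span):
    """Time spent in the span itself, excluding its (non-overlapping) children."""
    nested = sum(c["end_ms"] - c["start_ms"] for c in children.get(span["span_id"], []))
    return span["end_ms"] - span["start_ms"] - nested


tree = build_tree(spans, "t1")
print("\n".join(render(tree)))
bottleneck = max(spans, key=lambda s: self_time(tree, s))
print("bottleneck:", bottleneck["name"])
```

In this invented data the gateway span lasts 2300 ms, but almost all of that time belongs to the inventory-db span nested two levels down, which is the kind of conclusion the waterfall view makes visible.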

Key Takeaways

  • Logs tell you what happened in one place; Traces tell you the relationship between places.
  • Context Propagation is the “glue” that lets a trace cross network boundaries.
  • Tracing is essential for diagnosing Long Tail Latency (the “p99” problem): the slowest 1% of requests, which averages tend to hide.
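
A small worked example of the “p99” idea, using invented latencies: 98 requests that take 100 ms and 2 that take 2000 ms. The percentile function below uses the simple nearest-rank method, which is one of several common definitions.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


latencies_ms = [100] * 98 + [2000] * 2   # made-up sample of 100 requests

mean = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(mean, p50, p99)   # 138.0 100 2000
```

The mean and the median both look healthy, yet one request in fifty takes two seconds. Aggregate metrics can reveal that the tail exists; a trace of one of those slow requests shows which hop caused it.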

Exercises

  1. What is the difference between a Trace ID and a Span ID?
  2. If Service A calls Service B, but Service B doesn’t support tracing headers, what happens to the trace?
  3. Why do production systems usually “sample” traces (e.g., only record 1% of requests) instead of tracing everything?

Solutions

  1. A Trace ID identifies the entire request journey; a Span ID identifies one specific segment or operation within that journey.
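  2. The “causal link” is broken. Service A will show it called something, but Service B’s work will appear as a separate, disconnected trace (if B starts its own trace) or not at all (if B records nothing). Anything B calls downstream is cut off from the original trace as well.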
  3. Performance and storage costs. Tracing every request adds overhead to each call and generates a volume of trace data that can exceed the application traffic itself, while a small sample is usually enough to reveal latency patterns.
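
One common way to implement sampling is to derive the decision from the Trace ID itself, so that every service reaches the same verdict for the same request without coordinating. The sketch below is a simplified illustration of that idea under assumed names (should_sample is hypothetical) and is not the algorithm of any specific tracing library.

```python
import secrets


def should_sample(trace_id: str, rate: float) -> bool:
    """Keep roughly `rate` of all traces, deterministically per trace ID.

    Treats the last 16 hex characters of the ID as a 64-bit number and keeps
    the trace if that number falls in the lowest `rate` fraction of the range.
    Assumes trace IDs are uniformly random hex strings.
    """
    bucket = int(trace_id[-16:], 16)
    return bucket < rate * 2**64


trace_ids = [secrets.token_hex(16) for _ in range(20000)]
kept = [t for t in trace_ids if should_sample(t, 0.01)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

Because the decision depends only on the Trace ID, a trace is either recorded by every service or by none, which avoids storing fragments. The trade-off is that a fixed 1% sample can miss rare slow or failing requests, so some systems decide after the request completes instead.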