Production Engineering & Resilience: Distributed Tracing


Key Question

How do we debug a single request that flows through dozens of different microservices?

Deep Dive

In a monolith, a stack trace tells you exactly what happened. In a distributed system, a request is like a ghost moving through walls. If a user’s checkout fails, it might be because the Auth Service was slow, the Inventory DB timed out, or the Payment Gateway returned an error.

Distributed Tracing solves this by assigning a “DNA” to every request: a unique identifier that travels with the request through every service it touches.

The Trace and The Span

Trace: The entire journey of a request from start to finish.
Span: A single unit of work within that journey (e.g., one database query or one RPC call).
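To make the two terms concrete, here is a minimal sketch of how a span might be modeled. The field names (trace_id, span_id, parent_span_id, attributes) are illustrative assumptions, not the schema of any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str                   # shared by every span in the same request
    span_id: str                    # unique to this one unit of work
    parent_span_id: Optional[str]   # None for the root span of the trace
    name: str
    start_ms: float
    end_ms: float
    attributes: dict = field(default_factory=dict)  # e.g. user_id, http_status

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


# A trace is simply the set of spans that share one trace_id.
checkout_trace = [
    Span("t1", "a", None, "POST /checkout", 0, 120),
    Span("t1", "b", "a", "auth.verify", 5, 25),
    Span("t1", "c", "a", "inventory.query", 30, 110),
]
```

In this sketch the trace has no object of its own; it exists only as the shared trace_id, which is why that ID has to survive every network hop.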

How it Works: Context Propagation

When a request enters the system (at the Load Balancer or API Gateway), the system generates a Trace ID. This ID must be “carried” in the headers of every subsequent network call, either in a custom header like X-Trace-Id or in the traceparent header defined by the W3C Trace Context standard.
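As an illustration of what the header carries, the sketch below builds and parses a version 00 traceparent value, which has the shape version-traceid-parentid-flags (a 32-hex-character trace ID, a 16-hex-character ID of the calling span, and a flags byte whose lowest bit means "sampled"). The helper names are hypothetical, and the parser deliberately handles only version 00.

```python
import re
import secrets

# version "00", 32-hex trace id, 16-hex parent (caller span) id, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_id() -> str:
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def new_span_id() -> str:
    return secrets.token_hex(8)   # 8 random bytes -> 16 hex characters


def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are reserved as invalid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)
```

A service receiving an invalid or missing header would typically start a fresh trace rather than fail the request.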

Each service along the way:

Reads the incoming Trace ID.
Creates its own Span ID and notes the caller’s Span ID as its parent.
Logs the start/end time and any metadata (e.g., user_id, http_status).
Sends this data asynchronously to a Tracing Collector (like Jaeger or Zipkin).
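The steps above can be sketched as a small wrapper that each service runs around its own work. This is a simplified illustration, not a real tracing SDK: X-Parent-Span-Id is an assumed header name, handle_request and export_queue are hypothetical, and the background worker that would drain the queue and ship spans to a collector is left out.

```python
import queue
import secrets
import time

# Hypothetical stand-in for a collector client: finished spans are queued here,
# and a background worker (not shown) would batch and send them to the collector.
export_queue = queue.Queue()


def handle_request(headers, service_name, work):
    """Wrap one service's work in a span.

    `work` receives the headers to forward on outgoing calls and
    returns an HTTP status code.
    """
    # 1. Read the incoming Trace ID, or start a new trace at the edge.
    trace_id = headers.get("X-Trace-Id") or secrets.token_hex(16)
    parent_span_id = headers.get("X-Parent-Span-Id")  # assumed header name

    # 2. Create this service's own Span ID and prepare headers for downstream calls.
    span_id = secrets.token_hex(8)
    outgoing = {"X-Trace-Id": trace_id, "X-Parent-Span-Id": span_id}

    # 3. Record start/end time and metadata around the actual work.
    start = time.time()
    status = 500  # reported if `work` raises
    try:
        status = work(outgoing)
        return status
    finally:
        # 4. Hand the finished span off without blocking the request path.
        export_queue.put_nowait({
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_span_id": parent_span_id,
            "service": service_name,
            "start": start,
            "end": time.time(),
            "http_status": status,
        })
```

If a checkout service calls handle_request({}, "checkout", work) and its work function passes the outgoing headers to an inventory service that does the same, both queued spans share one trace_id, and the inventory span's parent_span_id equals the checkout span's span_id.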

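The collector then stitches these spans back together: the Trace ID groups them into one request, and each span’s link to its parent span arranges them into a tree. The result is a “Gantt chart” of the request, in which you can see exactly which “hop” took 2 seconds and where the bottleneck lies.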
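A rough sketch of that stitching step follows. It assumes each span arrives as a dictionary with trace_id, span_id, parent_span_id, name, start_ms, and end_ms keys (an invented shape, not Jaeger's or Zipkin's storage format), and it estimates the bottleneck as the span with the most time not accounted for by its direct children, which is only accurate when those children run one after another.

```python
from collections import defaultdict


def group_by_trace(spans):
    """Bucket spans, which arrive in arbitrary order, by their trace_id."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    return traces


def index_children(trace_spans):
    """Map each parent_span_id to its child spans, ordered by start time."""
    children = defaultdict(list)
    for span in trace_spans:
        children[span["parent_span_id"]].append(span)
    for kids in children.values():
        kids.sort(key=lambda s: s["start_ms"])
    return children


def render(trace_spans, width=40):
    """Draw a text Gantt chart. Spans whose parent is missing are not drawn."""
    children = index_children(trace_spans)
    t0 = min(s["start_ms"] for s in trace_spans)
    total = (max(s["end_ms"] for s in trace_spans) - t0) or 1
    lines = []

    def walk(parent_id, depth):
        for s in children.get(parent_id, []):
            offset = int((s["start_ms"] - t0) / total * width)
            length = max(1, int((s["end_ms"] - s["start_ms"]) / total * width))
            label = "  " * depth + s["name"]
            lines.append(f"{label:<22}|{' ' * offset}{'#' * length}")
            walk(s["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return "\n".join(lines)


def bottleneck(trace_spans):
    """Return the span with the largest self time (duration minus direct children)."""
    children = index_children(trace_spans)

    def self_time(s):
        child_ms = sum(c["end_ms"] - c["start_ms"] for c in children.get(s["span_id"], []))
        return (s["end_ms"] - s["start_ms"]) - child_ms

    return max(trace_spans, key=self_time)


def span(trace_id, span_id, parent, name, start_ms, end_ms):
    return {"trace_id": trace_id, "span_id": span_id, "parent_span_id": parent,
            "name": name, "start_ms": start_ms, "end_ms": end_ms}


received = [
    span("t1", "d", "a", "payment.charge", 2110, 2280),
    span("t2", "x", None, "GET /health", 0, 3),
    span("t1", "a", None, "POST /checkout", 0, 2300),
    span("t1", "c", "a", "inventory.query", 70, 2100),
    span("t1", "b", "a", "auth.verify", 10, 60),
]

checkout = group_by_trace(received)["t1"]
print(render(checkout))
print("slowest hop:", bottleneck(checkout)["name"])
```

In the sample data the inventory query accounts for roughly 2 of the 2.3 seconds, so it is reported as the slowest hop even though the root span has the longest total duration.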

Key Takeaways

Logs tell you what happened in one place; Traces tell you the relationship between places.
Context Propagation is the “glue” that allows traces to span across network boundaries.
Tracing is essential for diagnosing long-tail latency (the “p99” problem): the slowest 1% of requests, which averages tend to hide.

Exercises

What is the difference between a Trace ID and a Span ID?
If Service A calls Service B, but Service B doesn’t support tracing headers, what happens to the trace?
Why do production systems usually “sample” traces (e.g., only record 1% of requests) instead of tracing everything?

Solutions

A Trace ID identifies the entire request journey; a Span ID identifies one specific segment or operation within that journey.
The “causal link” is broken. Service A will show it called something, but Service B’s work will appear as a separate, disconnected trace (or not at all).
Performance and storage costs. Tracing every request generates a massive volume of data, and the tracing data can end up larger than the application traffic it describes.
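One way to picture sampling is a decision derived from the Trace ID itself, so that every service reaches the same verdict for the same request and sampled traces stay complete. The sketch below assumes Trace IDs are uniformly random hex strings; should_sample is a hypothetical helper, and real tracers often make the decision once at the entry point and propagate it in the header flags instead.

```python
def should_sample(trace_id: str, rate: float) -> bool:
    """Keep roughly `rate` (0.0 to 1.0) of all traces, deterministically per trace."""
    # Map the last 8 hex characters (32 bits) of the ID to a number in [0, 1).
    bucket = int(trace_id[-8:], 16) / 2**32
    return bucket < rate


# The same trace ID always yields the same decision, in any service.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
decision_in_service_a = should_sample(trace_id, 0.01)
decision_in_service_b = should_sample(trace_id, 0.01)
```

Because the decision depends only on the ID and the rate, no coordination between services is needed, but all services must be configured with the same rate for traces to remain whole.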