Observability: Turning On the Lights
Stop guessing why your code is slow and start seeing what is actually happening.

When you are coding on your laptop, debugging is easy. You have your terminal right there, and if something breaks, the error message pops up immediately.
In a production backend, your code is running on a server thousands of miles away. It’s talking to three different databases and a dozen external APIs. When a user experiences a "500 Internal Server Error," you can’t just "look at the screen."
You need Observability. This is the practice of building your system so that it tells you how it's feeling from the inside. We do this using the "Three Pillars": Logging, Metrics, and Tracing.
Why Logs Stop Being Enough
Early systems feel understandable. A request comes in. You log a few lines. Something breaks. You read the logs. This works when:
execution is linear
failures are local
time is predictable
Queues break this illusion. A request enqueues a job. The job runs later. On another machine. Possibly more than once. Possibly not at all.
Logs still exist. But they no longer tell a story by themselves.
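One common way to stitch the story back together is to create an ID when the request arrives and carry it inside the job payload, so the worker's log lines can be matched to the request that caused them. Here is a minimal sketch of that idea. The in-process queue stands in for a real message broker, and the names handle_request, run_worker_once, and send_receipt are made up for illustration.

```python
import json
import logging
import queue
import uuid

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("orders")

# Hypothetical stand-in for a real message broker.
jobs = queue.Queue()


def handle_request(user_id: int) -> str:
    """Web handler: create a request ID and enqueue a job that carries it."""
    request_id = uuid.uuid4().hex
    log.info("request_id=%s enqueue send_receipt user_id=%s", request_id, user_id)
    jobs.put(json.dumps({"request_id": request_id, "user_id": user_id}))
    return request_id


def run_worker_once() -> dict:
    """Worker: runs later, possibly elsewhere, and logs the same request ID."""
    payload = json.loads(jobs.get())
    log.info(
        "request_id=%s start send_receipt user_id=%s",
        payload["request_id"],
        payload["user_id"],
    )
    return payload


# Both log lines share one request_id, so a search for it shows the whole path.
rid = handle_request(user_id=402)
job = run_worker_once()
```

If the job runs twice, you will see two "start" lines with the same request_id. If it never runs, you will see the "enqueue" line with nothing after it. Either way, the logs start answering "why" again.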
The Core Problem: Lost Causality
Backend bugs are rarely about what happened. They’re about why it happened.
Did the job fail because the payload was wrong? The worker crashed? The retry ran twice? The database timed out? The cache returned stale data?
Without observability, all you see is: “Something didn’t work.”
That’s not actionable. That’s anxiety.
Observability Is Not Monitoring
This distinction matters.
Monitoring answers:
“Is the system up?”
“Is CPU high?”
“Are errors increasing?”
Observability answers:
“What is the system doing right now?”
“How did this request flow through the system?”
“Where did time disappear?”
“Where did intent get lost?”
Monitoring tells you there is smoke.
Observability helps you find the fire.
1. Logging: The "What Happened?"
Logs are the most basic form of observability. They are a chronological record of events.
Bad Log:
Error: something went wrong. (Useless)
Good Log:
Order processing failed for User #402. Reason: Database timeout after 500ms. (Actionable)
The Rule: Logs should tell a story. But be careful—logging too much is like trying to read a book while someone is throwing confetti in your face. Log the important stuff: errors, system starts, and critical business milestones.
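To make the difference concrete, here is a small sketch using Python's standard logging module. The describe_failure helper and the "orders" logger name are assumptions for the example, not part of any particular framework.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("orders")


def describe_failure(user_id: int, reason: str, timeout_ms: int) -> str:
    """Build a log message that says who was affected and why."""
    return (
        f"Order processing failed for user_id={user_id}. "
        f"Reason: {reason} after {timeout_ms}ms."
    )


# Bad: no context, nothing to search for.
log.error("something went wrong")

# Good: names the operation, the affected user, and the cause.
log.error(describe_failure(user_id=402, reason="Database timeout", timeout_ms=500))
```

Writing the user as user_id=402 is a deliberate choice: key=value pairs are easy to search and filter on when you are digging through thousands of lines.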
2. Metrics: The "How is it Feeling?"
Metrics are numbers. They don't tell you why something is wrong, but they tell you that something is wrong. They are the dashboard of your car.
Common metrics include:
Latency: How long is a request taking?
Error Rate: What percentage of requests are failing?
Throughput: How many requests per second are we handling?
Saturation: Is the CPU at 99%? Is the memory full?
When you see a spike in the Error Rate graph, you don't go fix the graph—you go look at the Logs to find out which specific error is causing the spike.
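Three of these numbers can be computed from nothing more than a list of finished requests. The sketch below assumes each request is recorded with its duration and HTTP status, counts 5xx responses as errors, and uses the nearest-rank method for the 95th percentile. RequestRecord and summarize are hypothetical names. Saturation is left out because it comes from the host (CPU, memory), not from request records.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_ms: float
    status: int


def summarize(records: list[RequestRecord], window_seconds: float) -> dict:
    """Compute throughput, error rate, and p95 latency for one time window."""
    if not records:
        return {"throughput_rps": 0.0, "error_rate": 0.0, "p95_latency_ms": 0.0}

    durations = sorted(r.duration_ms for r in records)
    # Nearest-rank percentile: the smallest value with at least 95% of
    # samples at or below it.
    rank = math.ceil(0.95 * len(durations))
    errors = sum(1 for r in records if r.status >= 500)

    return {
        "throughput_rps": len(records) / window_seconds,
        "error_rate": errors / len(records),
        "p95_latency_ms": durations[rank - 1],
    }
```

In practice a metrics library usually does this bookkeeping for you. The point of the sketch is only that each metric is a plain aggregate over many requests, which is why it can tell you that something is wrong but not why.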
3. Tracing: The "Where did it go?"
In a modern backend, a single request might touch five different services.
The API Gateway sends it to the Auth Service.
The Auth Service checks the Cache.
The Order Service writes to the DB and puts a task in the Queue.
If that whole process is slow, which part is to blame? Tracing gives the request a unique "Trace ID." As the request travels through your system, each service records how long the request spent there and tags that record with the Trace ID.
Tracing allows you to see: "The request took 2 seconds, and 1.8 seconds of that was just waiting for the database to respond."
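Here is a toy version of that idea in a single process. It is a sketch, not a real tracer: the span helper, the step names, and the sleep calls standing in for cache and database work are all assumptions. In a real system the Trace ID would travel between services (typically in a request header) and the timing records would be sent to a tracing backend.

```python
import time
import uuid
from contextlib import contextmanager

# Collected timing records; a real tracer would export these somewhere.
spans: list[dict] = []


@contextmanager
def span(trace_id: str, name: str):
    """Time one step of a request and record it under the shared Trace ID."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


def handle_order() -> str:
    trace_id = uuid.uuid4().hex  # one ID for the whole request
    with span(trace_id, "auth.check_cache"):
        time.sleep(0.005)  # stand-in for a cache lookup
    with span(trace_id, "orders.write_db"):
        time.sleep(0.05)  # stand-in for a slow database write
    return trace_id


def slowest_span(trace_id: str) -> dict:
    """Return the step that took the most time in this trace."""
    own = [s for s in spans if s["trace_id"] == trace_id]
    return max(own, key=lambda s: s["duration_ms"])
```

Calling slowest_span(handle_order()) points at the database write, which is exactly the kind of answer you want when a request feels slow and you don't know where to look.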
The Goal: MTTR (Mean Time To Resolution)
The ultimate goal of observability isn't just to "have data." It’s to reduce the time it takes to fix a problem.
In a "dark" system, an outage might last hours while engineers guess what's wrong. In an "observable" system:
An Alert triggers (Metric).
You see exactly which service is slow (Trace).
You read the specific error message (Log).
You fix the bug in minutes.
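MTTR itself is simple arithmetic: for each incident, take the time between detection and resolution, then average across incidents. A minimal sketch, assuming you keep a (detected, resolved) timestamp pair per incident. The function name is made up for the example.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average of (resolved - detected) across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)
```

Tracking this number over time is one way to check whether your logs, metrics, and traces are actually helping.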
Why “It Works on My Machine” Dies Here
Local environments lie. They don’t have traffic, concurrency, retries, partial failures, or network delays.
Production does.
Observability is how backend engineers stay humble in production.
It replaces confidence with evidence.
Summary
Logging tells you the story of a specific event.
Metrics give you the big-picture health of your system.
Tracing tracks a single request as it jumps between services.
What Comes Next
In the next one, we’ll talk about Graceful Shutdown.
When you see a "Red Light" in your Observability dashboard, your first instinct is to restart the server. If you don't have a Graceful Shutdown, that restart might actually cause more errors.
Thinking in Backend means realizing that "Code that Works" is only half the job. The other half is "Code that is Visible." If you can't measure it, you can't manage it.
This article is part of the Thinking in Backend series, where we learn backend engineering by understanding how systems think, not just how databases execute.



