Observability vs Monitoring
Monitoring vs Observability
| | Monitoring | Observability |
|---|---|---|
| Core question | “Is it up? Is it slow?” | “Why is it slow? What’s affected?” |
| Approach | Predefined checks, thresholds, dashboards | Exploratory, ad-hoc queries over raw signals |
| Failure model | Known failure modes (known unknowns) | Unknown failure modes (unknown unknowns) |
| Data model | Aggregated metrics, health checks | High-cardinality: traces, logs, metrics, profiles |
| Who acts | Alert fires → on-call reacts | Engineer explores → discovers root cause |
| Scales to | Monolith, small service count | Distributed systems, polyglot, ephemeral infra |
Known unknowns vs unknown unknowns
This is the key conceptual difference.
Monitoring covers known unknowns — things you anticipated might break, so you wrote a check for them. “CPU > 90%”, “disk full”, “service returns 5xx”.
Observability lets you investigate unknown unknowns (failures you never predicted) by querying raw telemetry after the fact.
Example: A new deployment causes one specific user segment to get slow responses due to an unexpected database query plan change. No alert fires — because you never anticipated this exact scenario. With observability, an engineer notices elevated p99 latency in traces, filters by user attributes, and finds the slow SQL query in span details — all without deploying new code or adding new metrics.
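The contrast in this example can be sketched in a few lines of Python. This is a minimal illustration, not how any specific backend works: the span records, the field names (`segment`, `duration_ms`, `db_statement`), the SQL strings, and the alert thresholds are all made up for the sketch. The predefined checks stay silent, while an ad-hoc group-by over the raw spans surfaces the affected segment and its slow query.

```python
import math
from collections import defaultdict

# Hypothetical SQL statements and span records; field names are illustrative
# assumptions, not the schema of any particular tracing backend.
BY_ID = "SELECT * FROM orders WHERE id = ?"
BY_TENANT = "SELECT * FROM orders WHERE tenant_id = ?"
SPANS = [
    {"segment": "free", "duration_ms": 40, "db_statement": BY_ID},
    {"segment": "free", "duration_ms": 42, "db_statement": BY_ID},
    {"segment": "free", "duration_ms": 45, "db_statement": BY_ID},
    {"segment": "enterprise", "duration_ms": 41, "db_statement": BY_ID},
    {"segment": "enterprise", "duration_ms": 900, "db_statement": BY_TENANT},
    {"segment": "enterprise", "duration_ms": 950, "db_statement": BY_TENANT},
]


def fired_alerts(cpu_percent, error_rate):
    """Monitoring: evaluate checks that were written in advance."""
    checks = {
        "CPU > 90%": cpu_percent > 90,
        "5xx rate > 1%": error_rate > 0.01,
    }
    return [name for name, breached in checks.items() if breached]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def p99_by(spans, attribute):
    """Observability: group raw spans by any attribute, chosen after the fact."""
    groups = defaultdict(list)
    for span in spans:
        groups[span[attribute]].append(span["duration_ms"])
    return {key: percentile(durations, 99) for key, durations in groups.items()}


def slowest_statement(spans, segment):
    """Drill into one segment and return the SQL of its slowest span."""
    candidates = [s for s in spans if s["segment"] == segment]
    return max(candidates, key=lambda s: s["duration_ms"])["db_statement"]


if __name__ == "__main__":
    print(fired_alerts(cpu_percent=35.0, error_rate=0.0))
    print(p99_by(SPANS, "segment"))
    print(slowest_statement(SPANS, "enterprise"))
```

The point of the design is that `p99_by` accepts any attribute name: the question ("is it one segment?") did not have to be known when the telemetry was emitted.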
Four signals of observability
Observability rests on four complementary signals:
| Signal | Answers | Tool in our stack |
|---|---|---|
| Metrics | How much? How fast? What’s the trend? | Prometheus / Mimir |
| Logs | What happened? What was the context? | Loki |
| Traces | What was the path of this request across services? | Tempo |
| Profiles | Which function is consuming CPU/memory? | Pyroscope |
No single signal is sufficient. Each one fills gaps the others leave:
- Metrics tell you something is wrong (high error rate).
- Traces tell you where it’s wrong (which service, which dependency).
- Logs tell you what went wrong (error message, stack trace).
- Profiles tell you why the code is slow (which function, which line).
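The hand-off between the four signals can be sketched as a small Python function that pivots from one signal to the next. Everything here is an assumption made for illustration: the in-memory records, the field names, the 1% error budget, and the idea that logs carry a `trace_id` and that profiles are keyed by service. Real backends expose this through their own query languages rather than Python dictionaries.

```python
# All records are made up for illustration; field names are assumptions.
ERROR_RATE = {"checkout": 0.12, "catalog": 0.001}  # metrics: share of failed requests

SPANS = [  # traces: span IDs are assumed unique across traces
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "checkout", "status": "error"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "payments", "status": "error"},
    {"trace_id": "t2", "span_id": "c", "parent_id": None, "service": "checkout", "status": "ok"},
]

LOGS = [  # logs: each line carries the trace ID it belongs to
    {"trace_id": "t1", "service": "payments", "message": "connection pool exhausted"},
    {"trace_id": "t2", "service": "checkout", "message": "order placed"},
]

PROFILES = {  # profiles: hypothetical sample counts per function
    "payments": {"acquire_connection": 870, "serialize_response": 40},
}


def investigate(error_rates, spans, logs, profiles, budget=0.01):
    # Metrics: which services are over the error budget? (something is wrong)
    suspects = sorted(svc for svc, rate in error_rates.items() if rate > budget)

    # Traces: follow failing requests to the deepest failing span. (where)
    bad_traces = {
        s["trace_id"] for s in spans
        if s["service"] in suspects and s["status"] == "error"
    }
    failed = [s for s in spans if s["trace_id"] in bad_traces and s["status"] == "error"]
    failed_parents = {s["parent_id"] for s in failed}
    origin = next(s for s in failed if s["span_id"] not in failed_parents)

    # Logs: read what that service logged for the same trace. (what)
    messages = [
        entry["message"] for entry in logs
        if entry["trace_id"] == origin["trace_id"] and entry["service"] == origin["service"]
    ]

    # Profiles: find the function with the most samples in that service. (why)
    samples = profiles[origin["service"]]
    hottest = max(samples, key=samples.get)

    return {
        "suspect_services": suspects,
        "failing_service": origin["service"],
        "log_messages": messages,
        "hottest_function": hottest,
    }


if __name__ == "__main__":
    print(investigate(ERROR_RATE, SPANS, LOGS, PROFILES))
```

Note that the metric points at `checkout`, but the trace shows the failure originates one hop further down, which is exactly the gap a single signal leaves.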
Why is classic monitoring no longer sufficient?
- CPU and memory metrics are not enough. A service can sit at 10% CPU usage and still return errors because downstream calls fail or a connection pool is exhausted.
- Distributed systems. A single user request can touch 10+ services. A dashboard per service doesn’t show the cross-service picture.
- Polyglot environments. You need a vendor-neutral standard (OpenTelemetry) to instrument Java, Go, Python, and .NET services consistently.
- Ephemeral resources. Pods live for minutes, serverless functions for seconds. You can’t SSH in to debug; the telemetry data is all you have.
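To illustrate the cross-service point, here is a rough Python sketch that takes the spans of one request and computes how much time each service spent on its own work. The span records and field names are hypothetical, and the calculation assumes child spans run one after another (no parallel calls), which real traces often violate.

```python
from collections import defaultdict

# Hypothetical spans for a single request; field names are assumptions.
SPANS = [
    {"trace_id": "t1", "span_id": "1", "parent_id": None, "service": "gateway", "duration_ms": 480},
    {"trace_id": "t1", "span_id": "2", "parent_id": "1", "service": "checkout", "duration_ms": 460},
    {"trace_id": "t1", "span_id": "3", "parent_id": "2", "service": "inventory", "duration_ms": 30},
    {"trace_id": "t1", "span_id": "4", "parent_id": "2", "service": "payments", "duration_ms": 400},
]


def self_time_by_service(spans, trace_id):
    """Time each service spent itself, excluding time spent waiting on children.

    Assumes children of a span execute sequentially.
    """
    trace = [s for s in spans if s["trace_id"] == trace_id]

    # Sum the durations of the direct children of every span.
    child_time = defaultdict(int)
    for span in trace:
        if span["parent_id"] is not None:
            child_time[span["parent_id"]] += span["duration_ms"]

    # Self time = own duration minus time covered by direct children.
    result = defaultdict(int)
    for span in trace:
        result[span["service"]] += span["duration_ms"] - child_time[span["span_id"]]
    return dict(result)


if __name__ == "__main__":
    breakdown = self_time_by_service(SPANS, "t1")
    print(breakdown)
    print(max(breakdown, key=breakdown.get))
```

A per-service latency dashboard would show `gateway` and `checkout` as slow as well, because their durations include the downstream wait. Only the trace, stitched together by `trace_id` and `parent_id`, attributes the time to `payments`.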
Monitoring doesn’t go away
Observability does not replace monitoring. You still need uptime checks, threshold alerts, and dashboards. Observability extends monitoring by adding the ability to ask new questions without deploying new code.
In this training we will build both:
- Monitoring — alerts, dashboards, SLO tracking (Prometheus, Grafana Alerting)
- Observability — exploratory analysis across all four signals (Drilldown, TraceQL, LogQL)