Observability vs Monitoring

Monitoring vs Observability

Aspect	Monitoring	Observability
Core question	“Is it up? Is it slow?”	“Why is it slow? What’s affected?”
Approach	Predefined checks, thresholds, dashboards	Exploratory, ad-hoc queries over raw signals
Failure model	Known failure modes (known unknowns)	Unknown failure modes (unknown unknowns)
Data model	Aggregated metrics, health checks	High-cardinality: traces, logs, metrics, profiles
Workflow	Alert fires → on-call reacts	Engineer explores → discovers root cause
Scales to	Monolith, small service count	Distributed systems, polyglot, ephemeral infra

Known unknowns vs unknown unknowns

The key conceptual difference is the kind of failure each approach can handle.

Monitoring covers known unknowns — things you anticipated might break, so you wrote a check for them. “CPU > 90%”, “disk full”, “service returns 5xx”.
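
The three example checks above amount to fixed predicates evaluated against the latest metric sample. A minimal Python sketch, assuming a hypothetical sample dictionary with the keys `cpu_percent`, `disk_used_percent`, and `http_5xx_count` (these names are illustrative and do not come from any real tool):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Check:
    name: str
    # Returns True when the alert should fire for the given sample.
    should_fire: Callable[[Dict[str, float]], bool]


# Hypothetical rules mirroring the examples in the text.
CHECKS = [
    Check("cpu_high", lambda m: m["cpu_percent"] > 90),
    Check("disk_full", lambda m: m["disk_used_percent"] >= 100),
    Check("http_5xx", lambda m: m["http_5xx_count"] > 0),
]


def firing_alerts(sample: Dict[str, float]) -> List[str]:
    """Return the names of all checks whose condition holds for this sample."""
    return [check.name for check in CHECKS if check.should_fire(sample)]


# Illustrative samples (assumed values, not measurements).
quiet = {"cpu_percent": 10, "disk_used_percent": 40, "http_5xx_count": 0}
overloaded = {"cpu_percent": 97, "disk_used_percent": 40, "http_5xx_count": 3}

print(firing_alerts(quiet))
print(firing_alerts(overloaded))
```

Every rule has to be written before the failure happens. A sample that trips none of them produces no alert, whatever else is going wrong.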

Observability lets you investigate unknown unknowns: failures you never predicted, which you diagnose by querying raw telemetry after the fact.

Example: A new deployment causes one specific user segment to get slow responses due to an unexpected database query plan change. No alert fires — because you never anticipated this exact scenario. With observability, an engineer notices elevated p99 latency in traces, filters by user attributes, and finds the slow SQL query in span details — all without deploying new code or adding new metrics.
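
The investigation in this example boils down to grouping already-collected spans by an attribute and comparing tail latency per group. A rough Python sketch of that idea, assuming spans are plain dictionaries with a `duration_ms` field and an `attributes` map. The attribute names `user.segment` and `db.statement`, and all values, are hypothetical. In the stack described here the same question would be asked with a trace query rather than hand-written code.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List


def p99(values: List[float]) -> float:
    """99th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def p99_by_attribute(spans: Iterable[dict], attribute: str) -> Dict[str, float]:
    """Group span durations by one attribute value and compute p99 per group."""
    groups = defaultdict(list)
    for span in spans:
        key = span["attributes"].get(attribute, "unknown")
        groups[key].append(span["duration_ms"])
    return {key: p99(durations) for key, durations in groups.items()}


def slowest_statement(spans: Iterable[dict], attribute: str, value: str) -> str:
    """Return the db.statement of the slowest span within one group."""
    matching = [s for s in spans if s["attributes"].get(attribute) == value]
    slowest = max(matching, key=lambda s: s["duration_ms"])
    return slowest["attributes"]["db.statement"]


# Hypothetical spans: most users are fast, one segment is slow.
spans = [
    {"duration_ms": 40.0,
     "attributes": {"user.segment": "free", "db.statement": "fast_lookup_query"}}
    for _ in range(50)
] + [
    {"duration_ms": 1800.0,
     "attributes": {"user.segment": "enterprise", "db.statement": "slow_report_query"}}
    for _ in range(5)
]

print(p99_by_attribute(spans, "user.segment"))
print(slowest_statement(spans, "user.segment", "enterprise"))
```

The grouping attribute is chosen at query time, which is why high-cardinality attributes on the raw data matter: a pre-aggregated latency metric would have averaged the slow segment away.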

Four signals of observability

Observability rests on four complementary signals:

Signal	Answers	Tool in our stack
Metrics	How much? How fast? What’s the trend?	Prometheus / Mimir
Logs	What happened? What was the context?	Loki
Traces	What was the path of this request across services?	Tempo
Profiles	Which function is consuming CPU/memory?	Pyroscope

No single signal is sufficient. Each one fills gaps the others leave:

Metrics tell you something is wrong (high error rate).
Traces tell you where it’s wrong (which service, which dependency).
Logs tell you what went wrong (error message, stack trace).
Profiles tell you why the code is slow (which function, which line).
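
The hand-off between signals described above can be sketched as a drill-down over records that share a trace ID. This is a simplified Python illustration, not how Tempo or Loki store data: the record shapes, field names, and values are all assumptions, and the profiling step is left out.

```python
from typing import Dict, List, Set

# Hypothetical, heavily simplified telemetry records.
SPANS: List[Dict[str, str]] = [
    {"trace_id": "t1", "service": "checkout", "status": "ok"},
    {"trace_id": "t2", "service": "checkout", "status": "ok"},
    {"trace_id": "t2", "service": "payments", "status": "error"},
]
LOGS: List[Dict[str, str]] = [
    {"trace_id": "t1", "service": "checkout", "message": "order placed"},
    {"trace_id": "t2", "service": "payments", "message": "connection pool exhausted"},
]


def failed_traces(spans: List[Dict[str, str]]) -> Set[str]:
    """IDs of traces that contain at least one failed span."""
    return {s["trace_id"] for s in spans if s["status"] == "error"}


def error_rate(spans: List[Dict[str, str]]) -> float:
    """Metric view: fraction of traces containing at least one failed span."""
    all_traces = {s["trace_id"] for s in spans}
    return len(failed_traces(spans)) / len(all_traces)


def failing_services(spans: List[Dict[str, str]]) -> Set[str]:
    """Trace view: which services produced the failed spans."""
    return {s["service"] for s in spans if s["status"] == "error"}


def error_messages(logs: List[Dict[str, str]], trace_ids: Set[str],
                   services: Set[str]) -> List[str]:
    """Log view: messages from the failing services within the failed traces."""
    return [
        entry["message"]
        for entry in logs
        if entry["trace_id"] in trace_ids and entry["service"] in services
    ]


print(error_rate(SPANS))        # something is wrong
print(failing_services(SPANS))  # where it is wrong
print(error_messages(LOGS, failed_traces(SPANS), failing_services(SPANS)))  # what went wrong
```

The drill-down only works because every record carries the same trace ID. Without that shared key, each signal would have to be searched separately by timestamp.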

Why is classic monitoring no longer sufficient?

CPU and memory metrics are not enough. A service can sit at 10% CPU usage and still return errors because downstream calls fail or a connection pool is exhausted.
Distributed systems. A single user request can touch 10+ services. A dashboard per service doesn’t show the cross-service picture.
Polyglot environments. You need a vendor-neutral standard (OpenTelemetry) to instrument Java, Go, Python, .NET consistently.
Ephemeral resources. Pods may live for minutes, serverless functions for seconds. You can’t SSH into a workload that no longer exists, so the telemetry it emitted is all you have.

Monitoring doesn’t go away

Observability does not replace monitoring. You still need uptime checks, threshold alerts, and dashboards. Observability extends monitoring by adding the ability to ask new questions without deploying new code.

In this training we will build both:

Monitoring — alerts, dashboards, SLO tracking (Prometheus, Grafana Alerting)
Observability — exploratory analysis across all four signals (Drilldown, TraceQL, LogQL)
