Observability vs Monitoring

Monitoring vs Observability

| | Monitoring | Observability |
|---|---|---|
| Core question | “Is it up? Is it slow?” | “Why is it slow? What’s affected?” |
| Approach | Predefined checks, thresholds, dashboards | Exploratory, ad-hoc queries over raw signals |
| Failure model | Known failure modes (known unknowns) | Unknown failure modes (unknown unknowns) |
| Data model | Aggregated metrics, health checks | High-cardinality: traces, logs, metrics, profiles |
| Who acts | Alert fires → on-call reacts | Engineer explores → discovers root cause |
| Scales to | Monolith, small service count | Distributed systems, polyglot, ephemeral infra |

Known unknowns vs unknown unknowns

This is the key conceptual difference.

Monitoring covers known unknowns — things you anticipated might break, so you wrote a check for them. “CPU > 90%”, “disk full”, “service returns 5xx”.
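A predefined check of this kind can be sketched in a few lines of Python. This is a minimal illustration only, not how Prometheus or Grafana Alerting evaluate rules; the metric names, the thresholds, and the `evaluate_checks` helper are hypothetical.

```python
# Hypothetical threshold checks: each rule was written in advance
# for a failure mode that someone anticipated.
CHECKS = {
    "cpu_percent": lambda v: v > 90,          # "CPU > 90%"
    "disk_used_percent": lambda v: v >= 100,  # "disk full"
    "http_5xx_per_min": lambda v: v > 0,      # "service returns 5xx"
}


def evaluate_checks(snapshot):
    """Return the sorted names of the checks that fire for a metrics snapshot."""
    return sorted(
        name
        for name, is_firing in CHECKS.items()
        if name in snapshot and is_firing(snapshot[name])
    )


# A snapshot in which none of the anticipated conditions holds.
quiet = {"cpu_percent": 10, "disk_used_percent": 40, "http_5xx_per_min": 0}
print(evaluate_checks(quiet))  # []

# A snapshot that matches two anticipated failure modes.
busy = {"cpu_percent": 97, "disk_used_percent": 40, "http_5xx_per_min": 3}
print(evaluate_checks(busy))   # ['cpu_percent', 'http_5xx_per_min']
```

The checks can only ever answer the questions that were encoded in `CHECKS` ahead of time, which is exactly the limitation of the known-unknowns model.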

Observability lets you investigate unknown unknowns: failures you never predicted, which you diagnose by querying raw telemetry after the fact.

Example: A new deployment changes a database query plan, and one specific user segment starts getting slow responses. No alert fires, because nobody anticipated this exact scenario. With observability, an engineer notices elevated p99 latency in traces, filters by user attributes, and finds the slow SQL query in the span details, all without deploying new code or adding new metrics.
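The exploration in this example can be sketched as an ad-hoc query over raw span data. The code below is plain Python over hypothetical span records; the field names `user_segment`, `duration_ms`, and `db_statement` are assumptions for illustration, not a Tempo or OpenTelemetry schema. In practice the same question would be asked with a trace query language rather than a script.

```python
from collections import defaultdict

# Hypothetical raw spans, as they might be exported from a tracing backend.
SPANS = [
    {"trace_id": "a1", "user_segment": "free", "duration_ms": 40,
     "db_statement": "SELECT * FROM orders WHERE id = ?"},
    {"trace_id": "a2", "user_segment": "free", "duration_ms": 55,
     "db_statement": "SELECT * FROM orders WHERE id = ?"},
    {"trace_id": "a3", "user_segment": "enterprise", "duration_ms": 2300,
     "db_statement": "SELECT * FROM orders WHERE tenant_id = ?"},
    {"trace_id": "a4", "user_segment": "enterprise", "duration_ms": 2100,
     "db_statement": "SELECT * FROM orders WHERE tenant_id = ?"},
    {"trace_id": "a5", "user_segment": "trial", "duration_ms": 48,
     "db_statement": "SELECT * FROM orders WHERE id = ?"},
]


def worst_duration_by(spans, attribute):
    """Group spans by any attribute; return (value, max duration) pairs, slowest first."""
    worst = defaultdict(int)
    for span in spans:
        key = span[attribute]
        worst[key] = max(worst[key], span["duration_ms"])
    return sorted(worst.items(), key=lambda item: item[1], reverse=True)


# Step 1: slice latency by an attribute nobody built a dashboard for.
by_segment = worst_duration_by(SPANS, "user_segment")
print(by_segment)  # [('enterprise', 2300), ('free', 55), ('trial', 48)]

# Step 2: drill into the slowest segment and read the SQL from the span details.
slow_segment = by_segment[0][0]
slow_statements = {
    span["db_statement"] for span in SPANS if span["user_segment"] == slow_segment
}
print(slow_statements)  # {'SELECT * FROM orders WHERE tenant_id = ?'}
```

The grouping attribute is chosen at query time, so the same raw data can answer a question that was never anticipated when the service was instrumented.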

Four signals of observability

Observability rests on four complementary signals:

| Signal | Answers | Tool in our stack |
|---|---|---|
| Metrics | How much? How fast? What’s the trend? | Prometheus / Mimir |
| Logs | What happened? What was the context? | Loki |
| Traces | What was the path of this request across services? | Tempo |
| Profiles | Which function is consuming CPU/memory? | Pyroscope |

No single signal is sufficient. Each one fills gaps the others leave:

  • Metrics tell you something is wrong (high error rate).
  • Traces tell you where it’s wrong (which service, which dependency).
  • Logs tell you what went wrong (error message, stack trace).
  • Profiles tell you why the code is slow (which function, which line).

Why is classic monitoring no longer sufficient?

  • CPU and memory are not enough. A service can run at 10% CPU usage and still return errors because downstream calls fail or a connection pool is exhausted.
  • Distributed systems. A single user request can touch 10+ services. A dashboard per service doesn’t show the cross-service picture.
  • Polyglot environments. You need a vendor-neutral standard (OpenTelemetry) to instrument Java, Go, Python, .NET consistently.
  • Ephemeral resources. Pods live minutes, serverless functions live seconds. You can’t SSH in to debug — the telemetry data is all you have.
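The distributed-systems point can be illustrated by how a trace stitches spans from several services into one request path. This is a minimal sketch, assuming hypothetical span records with `span_id`, `parent_id`, `service`, and `duration_ms` fields; real OpenTelemetry spans carry more fields and use different names.

```python
# Hypothetical spans belonging to one user request that crosses five services.
TRACE = [
    {"span_id": "1", "parent_id": None, "service": "gateway", "duration_ms": 950},
    {"span_id": "2", "parent_id": "1", "service": "auth", "duration_ms": 20},
    {"span_id": "3", "parent_id": "1", "service": "checkout", "duration_ms": 900},
    {"span_id": "4", "parent_id": "3", "service": "inventory", "duration_ms": 30},
    {"span_id": "5", "parent_id": "3", "service": "payments", "duration_ms": 850},
]


def slowest_path(spans):
    """Start at the root span and follow the slowest child at each level."""
    children = {}
    root = None
    for span in spans:
        if span["parent_id"] is None:
            root = span
        else:
            children.setdefault(span["parent_id"], []).append(span)

    path = []
    current = root
    while current is not None:
        path.append(current["service"])
        kids = children.get(current["span_id"], [])
        current = max(kids, key=lambda s: s["duration_ms"]) if kids else None
    return path


print(slowest_path(TRACE))  # ['gateway', 'checkout', 'payments']
```

A per-service dashboard would show `gateway` and `checkout` as slow in isolation; only the parent/child links in the trace show that both are waiting on `payments`.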

Monitoring doesn’t go away

Observability does not replace monitoring. You still need uptime checks, threshold alerts, and dashboards. Observability extends monitoring by adding the ability to ask new questions without deploying new code.

In this training we will build both:

  • Monitoring — alerts, dashboards, SLO tracking (Prometheus, Grafana Alerting)
  • Observability — exploratory analysis across all four signals (Drilldown, TraceQL, LogQL)
