🎯 Exercises

1. Basic Syntax

1.1 Fetching a Metric

http_client_request_duration_seconds_sum

1.2 Narrowing Metrics by Label

http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend"}

1.3 Narrowing Metrics by Multiple Labels

Direct comparison:

http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend", http_response_status_code="200"}

Negation:

http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend", http_response_status_code!="200"}

Regex:

http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend", http_response_status_code=~"20.*"}

1.4 Labels Only

Specifying a metric name is not required. You can do this:

{ job="opentelemetry-demo/frontend" }

With further narrowing:

{ job="opentelemetry-demo/frontend", http_response_status_code=~"20.*" }

You can also reference the metric name using label syntax:

{__name__="http_client_request_duration_seconds_sum"}

This allows using regex to search for metrics:

{__name__=~"http_client_request_duration_seconds_.*"}

1.5 Time and Resolution

Note: run the following two queries in the Prometheus UI (http://prometheus.workshop2.indexoutofrange.com/). For the remaining queries, go back to the Grafana UI. Pay attention to the horizontal scroll that may appear.

Values for the last 30 minutes:

http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[30m]

Values for the last 30 minutes with a resolution of 15 minutes (scroll the results all the way to the right):

http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[30m:15m]

1.6 Understanding rate and increase

1.6.1 rate and increase vs original values

❗❗❗ Understanding rate and increase is crucial for correctly querying metrics in Prometheus. Take time to read these articles. ❗❗❗

Run the following queries on a single chart (if needed, normalize by multiplying values by appropriate constants so that changes are visible in the graphs):

⚠️ Prometheus queries return vectors of samples (instant or range vectors). You can apply arithmetic operators to them, such as multiplication by a scalar constant. Vector arithmetic is described in this documentation.
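
For example, a sketch that scales one series so it is visible next to another on the same chart (the constant here is arbitrary; pick whatever makes the lines comparable):

```promql
# Scale the rate by 1000 so it is visible next to the raw counter on one chart
rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[2m]) * 1000
```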

Original values:

http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}

Increase:

increase(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[2m])

Rate:

rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[2m])

🎯⚠️ To-do, tips and tricks:

  • For analyzing how the above queries work, switching the query's Format to Table (available under Options) is helpful.
  • See what happens to the values when you change Min step to different values.
  • A very good explanation of how rate works is in this StackOverflow post.

1.6.2 Interval

Run the queries:

rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"} [3m])

and

rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"} [5s])

🎯 Question:

  • Why does one have results and the other doesn’t?


1.7 Histogram

Run the query:

http_client_request_duration_seconds_bucket{job="opentelemetry-demo/checkout"}

🎯⚠️ To-do, tips and tricks:

  • How do the buckets differ from each other? (Legend under the graph)
  • How are the values arranged?
    • Does the value for le 100 include values for le 10?
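
One way to probe the cumulative nature of buckets is to subtract a smaller cumulative bucket from a larger one (the le values here are assumptions; substitute boundaries you actually see in the legend):

```promql
# Requests that took between 1s and 10s: subtract the smaller cumulative
# bucket from the larger one (ignoring(le) lets the two series match)
http_client_request_duration_seconds_bucket{job="opentelemetry-demo/checkout", le="10"}
  - ignoring(le)
http_client_request_duration_seconds_bucket{job="opentelemetry-demo/checkout", le="1"}
```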

2. Exercises

2.1 How many GB are used on Kubernetes nodes

🎯 Goal:

Create a chart showing how much memory is used in GB on each node (❗not pod❗) in the Kubernetes cluster.

⚠️ To-do and tips and tricks:

  • Metrics are in node_memory_.....
  • Results should be in GB.
  • The legend should be the Kubernetes node name.
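
One possible shape, assuming standard node_exporter metric names (verify them in the metrics browser; this is a sketch, not the only valid answer):

```promql
# Used memory in GB per node (metric names assume node_exporter defaults)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / (1024 * 1024 * 1024)
```

In Grafana, set the legend to the label carrying the node name.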

2.2 Most memory-hungry processes

🎯 Goal:

Create a chart showing the top 5 most memory-hungry processes (grouped by job).

⚠️ To-do and tips and tricks:

  • Group by job
  • Check the topk function
  • Check the by operator
  • Check the process_resident_memory_..... metrics
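
Putting those hints together, one possible shape (a sketch; other groupings are valid too):

```promql
# Top 5 jobs by resident memory
topk(5, sum by (job) (process_resident_memory_bytes))
```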

2.3 SLO Problem

🎯 Goal:

Find the 3 services that have the biggest problems meeting the SLO defined as:

90% of requests complete in under 10ms

⚠️ To-do and tips and tricks:

  • To measure response time you can use http_client_request_duration_seconds_bucket
  • The list of Prometheus functions is here. The bottomk function may be useful.
  • Remember that the SLO is calculated per service (label job), not per instance.
  • How to calculate which services have the biggest problem with this SLO?
    • Check what buckets exist for http_client_request_duration_seconds_bucket. Those values are response times.
    • Dividing the number of requests with time <=10ms by the total number of requests per service gives you the SLO for that service
    • Group results by job

  • See how the results look when you group them
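
One possible shape (the le="0.01" boundary is an assumption; first check which buckets actually exist):

```promql
# Fraction of requests under 10ms per job; bottomk picks the 3 worst services
bottomk(3,
  sum by (job) (rate(http_client_request_duration_seconds_bucket{le="0.01"}[5m]))
    /
  sum by (job) (rate(http_client_request_duration_seconds_bucket{le="+Inf"}[5m]))
)
```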

2.4 [Stretch] Response Time

🎯 Goal: Calculate the response time under which 90% of requests to the opentelemetry-demo/checkout service complete, i.e., the 0.9 quantile (only for this service).

⚠️ To-do and tips and tricks:

  • Use the query from the previous exercise as a starting point.
  • The function for calculating quantiles is histogram_quantile
  • A good interactive percentile visualization is here
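
A sketch of the typical shape such a query takes (not the only valid form):

```promql
# 0.9 quantile of response time for the checkout service
histogram_quantile(0.9,
  sum by (le) (rate(http_client_request_duration_seconds_bucket{job="opentelemetry-demo/checkout"}[5m]))
)
```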

2.5 [Stretch] Cardinality Explosion

🎯 Goal: Find the top 20 metrics that may have a cardinality problem (high number of series).

⚠️ To-do and tips and tricks:

  • How to search metrics by name — was covered in the theory section — Regex may also be useful :)
  • For grouping, check out count by
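
Combining both hints, one possible shape (this query touches every series, so it can be heavy on a large Prometheus):

```promql
# Top 20 metric names by number of series
topk(20, count by (__name__) ({__name__=~".+"}))
```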

2.6 [Stretch] Cardinality Explosion — Analysis

🎯 Goal: Analyze which labels are problematic for the top 3 metrics with the highest cardinality.

⚠️ To-do and tips and tricks:

  • Remember that Metrics browser gives you an interesting overview

P.S.

To see the total number of series, use prometheus_tsdb_head_series

3. Advanced Functions

3.1 irate() vs rate() — instantaneous vs averaged rate

⚠️ This exercise builds on the metric from exercise 1.6 — you should already understand how rate() works.

🎯 Goal: Compare the results of rate() and irate() on a single chart for the metric http_client_request_duration_seconds_sum of the opentelemetry-demo/payment service (status 422). How do they differ?

⚠️ Tips and tricks:

  • irate() takes into account only the last two data points — reacts immediately to spikes
  • rate() averages over the entire range — smooths out noise
  • Rule of thumb: irate() for graphs (you see rapid changes), rate() for alerts (more stable)
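
The two queries side by side might look like this (same range in both, for a fair comparison):

```promql
# Query A: averaged over the whole range
rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[3m])
# Query B: only the last two points in the range
irate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[3m])
```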

3.2 predict_linear() — capacity forecasting

🎯 Goal: Write a query that predicts how much free disk space (in GB) will be on Kubernetes nodes in 4 hours (extrapolating from the last hour of data).

⚠️ Tips and tricks:

  • predict_linear(GAUGE[RANGE], SECONDS_AHEAD) — linearly extrapolates a gauge value
  • Metric: node_filesystem_avail_bytes (free bytes on filesystem)
  • predict_linear works only on gauges — not on counters
  • Result < 0 means the forecast indicates disk will be full before the specified time
  • Such queries are the basis for alerts like “disk full in 4 hours”
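
A sketch of the shape such a forecast can take (labels for filtering by filesystem are omitted here):

```promql
# Predicted free space in GB in 4 hours, extrapolated from the last hour
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) / (1024 * 1024 * 1024)
```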

3.3 absent() — alerting on missing metrics

🎯 Goal: Write a query that returns 1 when the metric app_payment_transactions_total for job opentelemetry-demo/payment stops existing (e.g., service crashed and stopped sending metrics).

⚠️ Tips and tricks:

  • absent(METRIC) returns 1 when the metric doesn’t exist, empty result when it exists
  • Test by changing the job name to a non-existent one (e.g., job="does-not-exist")
  • Compare with absent_over_time() from Loki exercise 5.5 — similar idea, different query language
  • In production, such queries are placed in PrometheusRule as alerts
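
The query itself is short; a sketch:

```promql
# Returns 1 when no series matches the selector; empty result otherwise
absent(app_payment_transactions_total{job="opentelemetry-demo/payment"})
```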

3.4 $__rate_interval — correct interval in Grafana

⚠️ This exercise builds on exercises 1.6.2 (interval problem) and 2.3 (SLO).

🎯 Goal: Rewrite the query from exercise 2.3 (SLO) replacing the hardcoded [5m] with the Grafana variable $__rate_interval. Compare the results of both queries on a single chart.

⚠️ Tips and tricks:

  • $__rate_interval automatically selects a safe range (>=4x scrape interval)
  • Guarantees that rate() always has enough data — eliminates the problem from exercise 1.6.2
  • In production, always use $__rate_interval instead of hardcoded values
  • Documentation
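
In Grafana the replacement looks like this (the selector is shortened here and the le value is an assumption; carry over the full selector from your 2.3 query):

```promql
# Grafana expands $__rate_interval before the query reaches Prometheus
rate(http_client_request_duration_seconds_bucket{le="0.01"}[$__rate_interval])
```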

3.5 histogram_quantile — comparing percentiles across services

⚠️ This exercise extends the concept from exercise 2.4 to multiple services and percentiles.

🎯 Goal: On a single chart, show p50, p95, and p99 of HTTP response time for all opentelemetry-demo services. Which service has the largest discrepancy between p50 and p99?

⚠️ Tips and tricks:

  • Use 3 queries on a single chart (A = p50, B = p95, C = p99)
  • histogram_quantile(QUANTILE, sum(rate(BUCKET[$__rate_interval])) by (le, job))
  • Note the by (le, job): le is required by histogram_quantile, job groups by service
  • Large discrepancy p50 vs p99 = “tail latency” problem
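
A concrete instance of the pattern above for one of the three queries (swap in 0.5 and 0.95 for the other two):

```promql
# Query C (p99) across all opentelemetry-demo services
histogram_quantile(0.99,
  sum by (le, job) (rate(http_client_request_duration_seconds_bucket{job=~"opentelemetry-demo/.*"}[$__rate_interval]))
)
```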

4. Native Histograms

Prometheus supports native histograms — a format where a histogram is stored as one sample instead of many _bucket series. This provides better percentile accuracy and lower resource usage.

Prometheus metrics (job="prometheus-native-histograms") are collected in protobuf format, which includes native histograms.

⚠️ Native histograms are an experimental feature — requires the --enable-feature=native-histograms flag on Prometheus.

4.1 histogram_count() — number of observations

🎯 Goal: Check how many HTTP requests Prometheus handled for each endpoint (handler). Use the metric prometheus_http_request_duration_seconds with job prometheus-native-histograms.

⚠️ Tips and tricks:

  • histogram_count(NATIVE_HISTOGRAM) — returns the number of observations in a native histogram
  • Does not require _bucket, _count, or _sum — works directly on the native histogram
  • Compare with the classic approach: prometheus_http_request_duration_seconds_count
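
A sketch of what the per-endpoint count can look like (assuming the handler label carries the endpoint, as on Prometheus's own metrics):

```promql
# Observation count per endpoint, taken straight from the native histogram
sum by (handler) (histogram_count(prometheus_http_request_duration_seconds{job="prometheus-native-histograms"}))
```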

4.2 histogram_sum() and histogram_avg() — average response time

🎯 Goal: Calculate the average response time of the Prometheus /api/v1/write endpoint using the native histogram.

⚠️ Tips and tricks:

  • histogram_avg(NATIVE_HISTOGRAM) — returns the average value directly
  • Alternatively: histogram_sum() / histogram_count() — same as histogram_avg
  • With classic histograms, you needed separate _sum and _count series
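
A sketch for the /api/v1/write endpoint (the handler label value is assumed to match exactly):

```promql
# Average duration of /api/v1/write from the native histogram
histogram_avg(prometheus_http_request_duration_seconds{job="prometheus-native-histograms", handler="/api/v1/write"})
```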

4.3 histogram_fraction() — what percentage of requests falls within a threshold

🎯 Goal: What percentage of requests to the /api/v1/write endpoint have a response time below 100ms (0.1s)?

⚠️ Tips and tricks:

  • histogram_fraction(LOWER_BOUND, UPPER_BOUND, NATIVE_HISTOGRAM) — returns the fraction of observations in the range [lower, upper]
  • For the threshold “below 100ms” use histogram_fraction(0, 0.1, ...)
  • This is the equivalent of the SLO from exercise 2.3, but without manual bucket division — more precise and simpler
  • Result 1.0 = 100% of requests fall within the threshold
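
Putting the hints together, a sketch:

```promql
# Fraction of observations in [0, 0.1s]; multiply by 100 for a percentage
histogram_fraction(0, 0.1, prometheus_http_request_duration_seconds{job="prometheus-native-histograms", handler="/api/v1/write"})
```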

4.4 histogram_quantile() on native histogram — comparison with classic

🎯 Goal: Calculate the p99 response time of the /api/v1/write endpoint using a native histogram. Compare the simplicity of the query with the classic approach from exercise 3.5.

⚠️ Tips and tricks:

  • On native histogram: histogram_quantile(0.99, NATIVE_HISTOGRAM); no sum by (le) and no rate on buckets needed
  • On classic: histogram_quantile(0.99, sum(rate(..._bucket[5m])) by (le)) — requires _bucket, rate, by (le)
  • Native histogram stores more accurate boundaries — the result is more precise than from classic buckets
  • Use rate() if you want the percentile for the last N minutes instead of the total
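
A sketch combining the last two tips (p99 over a recent window rather than over all time):

```promql
# p99 over the last 5 minutes from the native histogram: no _bucket, no by (le)
histogram_quantile(0.99, rate(prometheus_http_request_duration_seconds{job="prometheus-native-histograms", handler="/api/v1/write"}[5m]))
```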
