🎯Exercises
1. Basic Syntax
1.1 Fetching a Metric
http_client_request_duration_seconds_sum
1.2 Narrowing Metrics by Label
http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend"}
1.3 Narrowing Metrics by Multiple Labels
Direct comparison:
http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend", http_response_status_code="200"}
Negation:
http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend", http_response_status_code!="200"}
Regex:
http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend", http_response_status_code=~"20.*"}
1.4 Labels Only
Specifying a metric name is not required. You can do this:
{ job="opentelemetry-demo/frontend" }
With further narrowing:
{ job="opentelemetry-demo/frontend", http_response_status_code=~"20.*" }
You can also reference the metric name using label syntax:
{__name__="http_client_request_duration_seconds_sum"}
This allows using regex to search for metrics:
{__name__=~"http_client_request_duration_seconds_.*"}
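The matchers from sections 1.2 to 1.4 can be modelled in a few lines of Python. This is only a sketch of the matching semantics, not Prometheus code: the `matches` helper and the sample series are made up for illustration, and Python's `re` module is used as a stand-in for the RE2 syntax that Prometheus uses. The point it illustrates is that regex matchers are fully anchored, so `"20.*"` has to match the whole label value.

```python
import re

def matches(labels, matchers):
    """Return True if a series' labels satisfy every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty string
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # fully anchored match
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True

# Hypothetical series, written as a plain label dictionary.
series = {
    "__name__": "http_client_request_duration_seconds_sum",
    "job": "opentelemetry-demo/frontend",
    "http_response_status_code": "204",
}

print(matches(series, [("http_response_status_code", "=~", "20.*")]))
print(matches(series, [("http_response_status_code", "=", "200")]))
```

The metric name is just another label (`__name__`) in this model, which is why a regex over metric names works the same way as a regex over any other label.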
1.5 Time and Resolution
Note: run the following two queries in the Prometheus UI (http://prometheus.workshop2.indexoutofrange.com/). For the remaining queries, go back to the Grafana UI. Pay attention to the horizontal scroll that may appear.
Values for the last 30 minutes:
http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="400"}[30m]
Values for the last 30 minutes with a resolution of 15 minutes (scroll the results all the way to the right):
http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="400"}[30m:15m]
1.6 Understanding rate and increase
1.6.1 rate and increase vs original values
❗❗❗ Understanding rate and increase is crucial for correctly querying metrics in Prometheus. Take time to read these articles. ❗❗❗
Why can’t you just use the raw counter value?
Counters in Prometheus are monotonically increasing — they only go up (or reset to 0 when a process restarts). The raw value of a counter (e.g., http_requests_total = 14 523 847) tells you almost nothing useful by itself:
- You don’t know when those requests happened — was it over 5 minutes or 5 months?
- When a pod restarts, the counter resets to 0, creating a sudden drop in the graph that has nothing to do with actual traffic.
- Different instances started at different times, so raw values are not comparable between pods.
That’s why you should never graph a raw counter directly. Instead, you need rate() or increase() to extract meaningful information.
increase() — how much did the value grow?
increase(metric[5m]) returns the total increase of the counter over the specified time window. In simple terms:
increase ≈ value_at_end - value_at_start
- Result is in the same unit as the counter (e.g., if the counter counts requests → increase returns the number of requests)
- increase(http_requests_total[5m]) = “how many requests happened in the last 5 minutes”
- Automatically handles counter resets (pod restarts): Prometheus detects the drop and compensates for it
rate() — how fast is the value growing?
rate(metric[5m]) returns the average per-second rate of increase over the specified time window. In simple terms:
rate ≈ increase / duration_in_seconds
- Result is in units per second (e.g., requests/second, bytes/second)
- rate(http_requests_total[5m]) = “how many requests per second, averaged over the last 5 minutes”
- Also handles counter resets automatically
Key differences at a glance
| | increase(m[5m]) | rate(m[5m]) |
|---|---|---|
| Returns | Total growth in the window | Per-second average growth |
| Unit | Same as counter (requests, bytes, errors) | Counter unit per second |
| Relationship | increase = rate × window_seconds | rate = increase / window_seconds |
| Typical use | “How many errors in the last hour?” | “What is the request throughput?” |
| In alerts | “More than 100 errors in 5 minutes” | “Error rate exceeds 10/s” |
A practical example
If a counter grew from 1000 to 1300 over a 5-minute window:
- increase(...[5m]) = 300 (300 requests happened)
- rate(...[5m]) = 1.0 (300 requests / 300 seconds = 1 request per second)
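The arithmetic behind increase and rate, including the reset compensation, can be sketched in Python. This is a simplified model under stated assumptions: samples are hypothetical `(timestamp_seconds, value)` pairs, and the sketch does not extrapolate to the edges of the window the way the real Prometheus functions do, so real results can differ slightly.

```python
def increase(samples):
    """Total counter growth across (timestamp, value) samples, compensating for resets."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur < prev:
            # The counter dropped, so treat it as a restart from 0:
            # everything counted since the restart is new growth.
            total += cur
        else:
            total += cur - prev
    return total

def rate(samples):
    """Average per-second growth over the time span covered by the samples."""
    duration = samples[-1][0] - samples[0][0]
    return increase(samples) / duration

# The example from the text: 1000 -> 1300 over 5 minutes.
steady = [(0, 1000), (300, 1300)]
print(increase(steady), rate(steady))

# A hypothetical restart between t=60 and t=120 (the value drops from 1100 to 20).
with_reset = [(0, 1000), (60, 1100), (120, 20), (180, 120), (240, 200), (300, 220)]
print(increase(with_reset), rate(with_reset))
```

In the second series a naive `last - first` would give a negative number, while the reset-aware sum still reports positive growth.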
💡 Rule of thumb: use rate() when you want to compare metrics across different time ranges or build dashboards (the result is always per second, whatever the window size). Use increase() when you care about absolute numbers (“how many errors happened?”).
Run the following queries on a single chart (if needed, normalize by multiplying values by appropriate constants so that changes are visible in the graphs):
⚠️ Prometheus returns vectors. Arithmetic between a vector and a scalar (such as multiplication by a constant) is applied to every element of the vector. The available arithmetic operations are described in this documentation.
Original values:
http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment"}
Increase:
increase(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment"}[2m])
Rate:
rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment"}[2m])
🎯⚠️ To-do, tips and tricks:
- To analyze how the above work, it helps to switch the format to Table, available in Options for the query.
- See what happens to the values when you change Min step to different values
- A very good explanation of how rate works is in this StackOverflow post
1.6.2 Interval
Run the queries:
rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="400"}[3m])
and
rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="400"}[5s])
🎯 Question:
- Why does one have results and the other doesn’t?
⚠️ Tips and tricks:
- Read the article about choosing an interval
- Also read the article about the introduction of $__rate_interval by Grafana.
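One way to reason about the question above is to count how many scraped samples fall inside the range window, since rate needs at least two samples to compute a slope. The sketch below is a toy model, not Prometheus code: it assumes scrapes land exactly at multiples of a hypothetical 15-second scrape interval (the real interval in the workshop cluster may differ) and treats the window as open on the left.

```python
def samples_in_window(scrape_interval_s, window_s, eval_time_s):
    """Scrape timestamps inside (eval_time - window, eval_time], scrapes at 0, interval, 2*interval, ..."""
    start = eval_time_s - window_s
    # First scrape strictly after the window start.
    first = (start // scrape_interval_s + 1) * scrape_interval_s
    return list(range(first, eval_time_s + 1, scrape_interval_s))

def can_compute_rate(scrape_interval_s, window_s, eval_time_s):
    """rate needs at least two samples in the window."""
    return len(samples_in_window(scrape_interval_s, window_s, eval_time_s)) >= 2

SCRAPE_INTERVAL = 15  # seconds; an assumption for illustration

print(can_compute_rate(SCRAPE_INTERVAL, 3 * 60, 1000))  # window of 3m
print(can_compute_rate(SCRAPE_INTERVAL, 5, 1000))       # window of 5s
```

With these assumed numbers the 5-second window never contains more than one sample, whatever the evaluation time.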
1.7 Histogram
Run the query:
http_client_request_duration_seconds_bucket{job="opentelemetry-demo/checkout"}
🎯⚠️ To-do, tips and tricks:
- How do the buckets differ from each other? (Legend under the graph)
- How are the values arranged?
- Does the value for le 100 include values for le 10?
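To check your answers to the questions above, it can help to build a classic histogram by hand. The sketch below is a toy model with made-up observations and made-up bucket bounds (10 and 100, as in the question); it counts each observation into every bucket whose upper bound it does not exceed, which is how the le ("less than or equal") label on _bucket series is defined.

```python
def cumulative_buckets(observations, bounds):
    """Count observations <= each upper bound, plus a final +Inf bucket."""
    counts = {}
    for le in bounds + [float("inf")]:
        counts[le] = sum(1 for value in observations if value <= le)
    return counts

# Hypothetical observations; the units do not matter for the illustration.
observations = [3, 7, 8, 42, 250]
buckets = cumulative_buckets(observations, [10, 100])

for le, count in buckets.items():
    print(f"le={le}: {count}")
```

The +Inf bucket always equals the total number of observations, and each bucket count is at least as large as the one before it.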
2. Exercises
2.1 How many GB are used on Kubernetes nodes
🎯 Goal:
Create a chart showing how much memory is used in GB on each node (❗not pod❗) in the Kubernetes cluster.
⚠️ To-do and tips and tricks:
- Metrics are in node_memory_.....
- Results should be in GB.
- The legend should be the Kubernetes node name.
2.2 Most memory-hungry processes
🎯 Goal:
Create a chart showing the top 5 most memory-hungry processes (grouped by job).
⚠️ To-do and tips and tricks:
- Group by job
- Check the topk function
- Check the by operator
- Check the process_resident_memory_..... metrics
2.3 SLO Problem
🎯 Goal:
Find the 3 services that have the biggest problems meeting the SLO defined as:
90% of requests complete in under 10ms
⚠️ To-do and tips and tricks:
- To measure response time you can use http_client_request_duration_seconds_bucket
- The list of Prometheus functions is here. The bottomk function may be useful.
- Remember that the SLO is calculated per service (label job), not per instance.
- How to calculate which services have the biggest problem with this SLO?
  - Check what buckets exist for http_client_request_duration_seconds_bucket. The le label values are upper bounds on response time, in seconds.
  - Dividing the number of requests with time <=10ms by the total number of requests per service gives the fraction of requests that meet the SLO for that service
  - Group results by job
See how results look when you group them
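The division described in the tips can be tried out on paper first. The following Python sketch models it on made-up bucket counts; it is not the PromQL answer. The job names, instance names, and numbers are hypothetical, and it assumes a bucket with le="0.01" (10 ms) exists, which you should verify against the real metric. In a real query the bucket counters would also go through rate or increase first.

```python
from collections import defaultdict

def slo_ratio_by_job(series, threshold_le):
    """Fraction of requests at or below threshold_le, summed over instances per job."""
    fast = defaultdict(float)
    total = defaultdict(float)
    for labels, value in series:
        if labels["le"] == threshold_le:
            fast[labels["job"]] += value
        elif labels["le"] == "+Inf":
            total[labels["job"]] += value
    return {job: fast[job] / total[job] for job in total if total[job] > 0}

# Hypothetical bucket counts: two instances of job "a", one instance of job "b".
series = [
    ({"job": "a", "instance": "a-1", "le": "0.01"}, 90),
    ({"job": "a", "instance": "a-1", "le": "+Inf"}, 100),
    ({"job": "a", "instance": "a-2", "le": "0.01"}, 80),
    ({"job": "a", "instance": "a-2", "le": "+Inf"}, 100),
    ({"job": "b", "instance": "b-1", "le": "0.01"}, 50),
    ({"job": "b", "instance": "b-1", "le": "+Inf"}, 100),
]

ratios = slo_ratio_by_job(series, "0.01")
worst = sorted(ratios.items(), key=lambda item: item[1])[:3]  # lowest ratios first
print(worst)
```

Summing per job before dividing matters: averaging per-instance ratios would weight a quiet instance the same as a busy one.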
2.4 [Stretch] Response Time
🎯 Goal:
Calculate the response time below which 90% of the requests of the opentelemetry-demo/checkout service complete (the 0.9 quantile, i.e. the 90th percentile), for this service only.
⚠️ To-do and tips and tricks:
- Use the query from the previous exercise as a starting point.
- The function for calculating quantiles is histogram_quantile
- A good interactive percentile visualization is here
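To get a feel for what histogram_quantile does with classic buckets, here is a simplified Python model of the estimate. It assumes observations are spread evenly inside each bucket and interpolates linearly, which is the idea behind the real function, but the bucket bounds and counts below are made up and the real implementation handles more edge cases.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with float("inf").
    """
    total = buckets[-1][1]
    if not 0 < q <= 1 or total <= 0:
        raise ValueError("this sketch needs 0 < q <= 1 and at least one observation")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Cannot interpolate into an unbounded bucket; fall back to
                # the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound

# Hypothetical cumulative bucket counts (upper bounds in seconds).
buckets = [(0.005, 10), (0.01, 40), (0.025, 80), (0.05, 95), (float("inf"), 100)]
p90 = histogram_quantile(0.9, buckets)
print(p90)
```

The estimate can only be as precise as the bucket layout: every value between 0.025 and 0.05 in this example comes from interpolation, not from measured data.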
2.5 [Stretch] Cardinality Explosion
🎯 Goal: Find the top 20 metrics that may have a cardinality problem (high number of series).
⚠️ To-do and tips and tricks:
- Searching metrics by name was covered in the theory section (see the __name__ matcher in 1.4). Regex may also be useful :)
- For grouping, check out count by
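What "count by" does to a set of series can be pictured with a small Python sketch. It is a model of the grouping only, with made-up series; it counts how many series share each metric name and keeps the largest groups.

```python
from collections import Counter

def top_metrics_by_series(series_labels, k):
    """Return the k metric names with the most series, largest first."""
    counts = Counter(labels["__name__"] for labels in series_labels)
    return counts.most_common(k)

# Hypothetical series: one label set per time series.
series_labels = [
    {"__name__": "http_requests_total", "path": "/a"},
    {"__name__": "http_requests_total", "path": "/b"},
    {"__name__": "http_requests_total", "path": "/c"},
    {"__name__": "process_cpu_seconds_total", "job": "x"},
    {"__name__": "process_cpu_seconds_total", "job": "y"},
    {"__name__": "up", "job": "x"},
]

print(top_metrics_by_series(series_labels, 2))
```

Each distinct combination of label values is its own series, so a label with many possible values multiplies the count for that metric.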
2.6 [Stretch] Cardinality Explosion — Analysis
🎯 Goal: Analyze which labels are problematic for the top 3 metrics with the highest cardinality.
⚠️ To-do and tips and tricks:
- Remember that Metrics browser gives you an interesting overview
P.S.
To see the total number of series, use prometheus_tsdb_head_series
3. Advanced Functions
3.1 irate() vs rate() — instantaneous vs averaged rate
⚠️ This exercise builds on the metric from exercise 1.6, so you should already understand how rate() works.
🎯 Goal: Compare the results of rate() and irate() on a single chart for the metric http_client_request_duration_seconds_sum of the opentelemetry-demo/payment service (status 400). How do they differ?
⚠️ Tips and tricks:
- irate() takes into account only the last two data points in the range, so it reacts immediately to spikes
- rate() averages over the entire range, which smooths out noise
- Rule of thumb: irate() for graphs (you see rapid changes), rate() for alerts (more stable)
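The difference between the two functions can be sketched on a handful of samples. This is a simplified model with made-up data: it ignores counter resets and the window-edge extrapolation that the real rate applies, and only shows which samples each function looks at.

```python
def rate(samples):
    """Average per-second growth between the first and last sample in the range."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

def irate(samples):
    """Per-second growth between the last two samples in the range."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Hypothetical counter: steady growth of 1/s, then a burst in the final 15 seconds.
samples = [(0, 0), (15, 15), (30, 30), (45, 45), (60, 150)]

print(rate(samples))
print(irate(samples))
```

The burst dominates irate but is diluted in rate, which is the trade-off between responsiveness and stability described in the tips.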
3.2 predict_linear() — capacity forecasting
🎯 Goal: Write a query that predicts how much free disk space (in GB) will be on Kubernetes nodes in 4 hours (extrapolating from the last hour of data).
⚠️ Tips and tricks:
- predict_linear(GAUGE[RANGE], SECONDS_AHEAD) linearly extrapolates a gauge value
- Metric: node_filesystem_avail_bytes (free bytes on filesystem)
- predict_linear should only be used with gauges, not with counters
- Result < 0 means the forecast indicates disk will be full before the specified time
- Such queries are the basis for alerts like “disk full in 4 hours”
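The extrapolation behind predict_linear is a simple linear regression. The Python sketch below fits a least-squares line to made-up samples and evaluates it in the future. It is a model under assumptions: the data is hypothetical, and the forecast is measured from the last sample, whereas Prometheus measures it from the query evaluation time.

```python
def predict_linear(samples, seconds_ahead):
    """Fit value = slope * t + intercept and evaluate it seconds_ahead after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept

# Hypothetical free disk space: 10 GB, shrinking by 1 GB every 10 minutes for one hour.
samples = [(t, 10e9 - (t / 600) * 1e9) for t in range(0, 3601, 600)]

forecast_bytes = predict_linear(samples, 4 * 3600)
print(forecast_bytes / 1e9)  # forecast in GB, 4 hours after the last sample
```

A negative forecast, as in this example, means the fitted line crosses zero before the target time.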
3.3 absent() — alerting on missing metrics
🎯 Goal: Write a query that returns 1 when the metric app_payment_transactions_total for job opentelemetry-demo/payment stops existing (e.g., service crashed and stopped sending metrics).
⚠️ Tips and tricks:
- absent(METRIC) returns 1 when the metric doesn’t exist, empty result when it exists
- Test by changing the job name to a non-existent one (e.g., job="does-not-exist")
- Compare with absent_over_time() from Loki exercise 5.5: similar idea, different query language
- In production, such queries are placed in PrometheusRule as alerts
3.4 $__rate_interval — correct interval in Grafana
⚠️ This exercise builds on exercises 1.6.2 (interval problem) and 2.3 (SLO).
🎯 Goal: Rewrite the query from exercise 2.3 (SLO) replacing the hardcoded [5m] with the Grafana variable $__rate_interval. Compare the results of both queries on a single chart.
⚠️ Tips and tricks:
- $__rate_interval automatically selects a safe range (at least 4x the scrape interval)
- Guarantees that rate() always has enough data, which eliminates the problem from exercise 1.6.2
- In production, always use $__rate_interval instead of hardcoded values
- Documentation
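The Grafana documentation describes $__rate_interval as the larger of two values: $__interval plus the scrape interval, and four times the scrape interval. A minimal Python sketch of that formula follows; the function name is made up, and the 15-second scrape interval is an assumption for illustration.

```python
def rate_interval(interval_s, scrape_interval_s):
    """max($__interval + scrape interval, 4 * scrape interval), in seconds."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)

SCRAPE_INTERVAL = 15  # seconds; an assumption for illustration

# Zoomed in: a small $__interval is lifted to 4x the scrape interval.
print(rate_interval(5, SCRAPE_INTERVAL))

# Zoomed out: the range grows with $__interval so no samples are skipped between points.
print(rate_interval(120, SCRAPE_INTERVAL))
```

The lower bound keeps several samples in every window, and the growing term keeps consecutive windows overlapping as the dashboard time range widens.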
3.5 histogram_quantile — comparing percentiles across services
⚠️ This exercise extends the concept from exercise 2.4 to multiple services and percentiles.
🎯 Goal: On a single chart, show p50, p95, and p99 of HTTP response time for all opentelemetry-demo services. Which service has the largest discrepancy between p50 and p99?
⚠️ Tips and tricks:
- Use 3 queries on a single chart (A = p50, B = p95, C = p99)
- histogram_quantile(QUANTILE, sum(rate(BUCKET[$__rate_interval])) by (le, job))
- Note the by (le, job): le is required by histogram_quantile, job groups by service
- Large discrepancy p50 vs p99 = “tail latency” problem
4. Native Histograms
Prometheus supports native histograms: a format where a histogram is stored as one sample instead of many _bucket series. This provides better percentile accuracy and lower resource usage.
Prometheus metrics (job="prometheus-native-histograms") are collected in protobuf format, which includes native histograms.
⚠️ Native histograms are an experimental feature and require the --enable-feature=native-histograms flag on Prometheus.
4.1 histogram_count() — number of observations
🎯 Goal: Check how many HTTP requests Prometheus handled for each endpoint (handler). Use the metric prometheus_http_request_duration_seconds with job prometheus-native-histograms.
⚠️ Tips and tricks:
- histogram_count(NATIVE_HISTOGRAM) returns the number of observations in a native histogram
- Does not require _bucket, _count, or _sum; it works directly on the native histogram
- Compare with the classic approach: prometheus_http_request_duration_seconds_count
4.2 histogram_sum() and histogram_avg() — average response time
🎯 Goal: Calculate the average response time of the Prometheus /api/v1/write endpoint using the native histogram.
⚠️ Tips and tricks:
- histogram_avg(NATIVE_HISTOGRAM) returns the average value directly
- Alternatively: histogram_sum() / histogram_count(), which gives the same result as histogram_avg
- With classic histograms, you needed separate _sum and _count series
4.3 histogram_fraction() — what percentage of requests falls within a threshold
🎯 Goal: What percentage of requests to the /api/v1/write endpoint have a response time below 100ms (0.1s)?
⚠️ Tips and tricks:
- histogram_fraction(LOWER_BOUND, UPPER_BOUND, NATIVE_HISTOGRAM) returns the fraction of observations in the range [lower, upper]
- For the threshold “below 100ms” use histogram_fraction(0, 0.1, ...)
- This is the equivalent of the SLO from exercise 2.3, but without manual bucket division, so it is more precise and simpler
- Result 1.0 = 100% of requests fall within the threshold
4.4 histogram_quantile() on native histogram — comparison with classic
🎯 Goal: Calculate the p99 response time of the /api/v1/write endpoint using a native histogram. Compare the simplicity of the query with the classic approach from exercise 3.5.
⚠️ Tips and tricks:
- On native histogram: histogram_quantile(0.99, NATIVE_HISTOGRAM), without sum by (le) and without rate on buckets
- On classic: histogram_quantile(0.99, sum(rate(..._bucket[5m])) by (le)), which requires _bucket, rate, and by (le)
- Native histogram stores more accurate boundaries, so the result is more precise than from classic buckets
- Use rate() if you want the percentile for the last N minutes instead of the total