🎯 Exercises
1. Basic Syntax
1.1 Fetching a Metric
http_client_request_duration_seconds_sum
1.2 Narrowing Metrics by Label
http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend"}
1.3 Narrowing Metrics by Multiple Labels
Direct comparison:
http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend", http_response_status_code="200"}
Negation:
http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend", http_response_status_code!="200"}
Regex:
http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend", http_response_status_code=~"20.*"}
1.4 Labels Only
Specifying a metric name is not required. You can do this:
{ job="opentelemetry-demo/frontend" }
With further narrowing:
{ job="opentelemetry-demo/frontend", http_response_status_code=~"20.*" }
You can also reference the metric name using label syntax:
{__name__="http_client_request_duration_seconds_sum"}
This allows using regex to search for metrics:
{__name__=~"http_client_request_duration_seconds_.*"}
1.5 Time and Resolution
Note: run the following two queries in the Prometheus UI (http://prometheus.workshop2.indexoutofrange.com/). For the remaining queries, go back to the Grafana UI. Pay attention to the horizontal scroll that may appear.
Values for the last 30 minutes:
http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[30m]
Values for the last 30 minutes with a resolution of 15 minutes (scroll the results all the way to the right):
http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[30m:15m]
1.6 Understanding rate and increase
1.6.1 rate and increase vs original values
❗❗❗ Understanding rate and increase is crucial for correctly querying metrics in Prometheus. Take time to read these articles. ❗❗❗
Run the following queries on a single chart (if needed, normalize by multiplying values by appropriate constants so that changes are visible in the graphs):
⚠️ Prometheus returns vectors (and matrices for range queries). You can perform arithmetic on them, such as multiplying by a scalar constant. A description of these operations is in this documentation.
Original values:
http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}
Increase:
increase(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[2m])
Rate:
rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[2m])
🎯⚠️ To-do, tips and tricks:
- When analyzing how the above work, switching the query's format to `Table` (available in `Options`) is helpful.
- See what happens to the values when you change `Min step` to different values.
- A very good explanation of how `rate` works is in this StackOverflow post.
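If the raw counter dwarfs the rate on the same chart, a scalar division is one way to normalize it. A minimal sketch; the divisor `100` is an arbitrary illustration, not a prescribed value:

```promql
# Scale the raw counter down so it is visible next to rate() on one chart;
# pick whatever constant brings the curves onto a comparable scale.
http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"} / 100
```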
1.6.2 Interval
Run the queries:
rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"} [3m])
and
rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"} [5s])
🎯 Question:
- Why does one have results and the other doesn’t?
⚠️ Tips and tricks:
- Read the article about choosing an interval.
- Also read the article about Grafana's introduction of `$__rate_interval`.
1.7 Histogram
Run the query:
http_client_request_duration_seconds_bucket{job="opentelemetry-demo/checkout"}
🎯⚠️ To-do, tips and tricks:
- How do the buckets differ from each other? (Legend under the graph)
- How are the values arranged?
- Does the value for `le 100` include the values for `le 10`?
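You can verify the answer yourself by subtracting neighboring buckets. A sketch; the `le` values below are assumptions, substitute boundaries that actually exist for this metric:

```promql
# Observations that fall between two neighboring bucket boundaries;
# ignoring(le) lets the two series match despite differing le labels.
http_client_request_duration_seconds_bucket{job="opentelemetry-demo/checkout", le="0.1"}
  - ignoring(le) http_client_request_duration_seconds_bucket{job="opentelemetry-demo/checkout", le="0.05"}
```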
2. Exercises
2.1 How many GB are used on Kubernetes nodes
🎯 Goal:
Create a chart showing how much memory is used in GB on each node (❗not pod❗) in the Kubernetes cluster.
⚠️ To-do and tips and tricks:
- Metrics are in `node_memory_.....`
- Results should be in GB.
- The legend should be the Kubernetes node name.
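One possible sketch, assuming the node_exporter metric names `node_memory_MemTotal_bytes` and `node_memory_MemAvailable_bytes` and that the node name lives in the `instance` label (verify both in your cluster):

```promql
# Used memory per node, converted from bytes to GB
sum by (instance) (
  node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
) / 1024 / 1024 / 1024
```

Setting the Grafana legend to `{{instance}}` then shows the node name under the graph.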
2.2 Most memory-hungry processes
🎯 Goal:
Create a chart showing the top 5 most memory-hungry processes (grouped by job).
⚠️ To-do and tips and tricks:
- Group by `job`.
- Check the `topk` function.
- Check the `by` operator.
- Check the `process_resident_memory_.....` metrics.
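A sketch of the shape such a query can take; `process_resident_memory_bytes` is the usual metric name, but check what actually exists in your setup:

```promql
# Top 5 jobs by total resident memory
topk(5, sum by (job) (process_resident_memory_bytes))
```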
2.3 SLO Problem
🎯 Goal:
Find the 3 services that have the biggest problems meeting the SLO defined as:
90% of requests complete in under 10ms
⚠️ To-do and tips and tricks:
- To measure response time you can use `http_client_request_duration_seconds_bucket`.
- The list of Prometheus functions is here. The `bottomk` function may be useful.
- Remember that the SLO is calculated per service (label `job`), not per instance.
- How to calculate which services have the biggest problem with this SLO?
  - Check what buckets exist for `http_client_request_duration_seconds_bucket`. Those values are response times.
  - Dividing the number of requests with time <= 10ms by the total number of requests per service gives you the SLO for that service.
  - Group results by `job`.
  - See how the results look when you group them.
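One possible shape of the answer, assuming a bucket boundary of `le="0.01"` (10ms) actually exists for this metric; substitute the closest real boundary you find:

```promql
# Fraction of requests under 10ms per service; the 3 lowest fractions
# belong to the services struggling most with the SLO.
bottomk(3,
  sum by (job) (rate(http_client_request_duration_seconds_bucket{le="0.01"}[5m]))
  /
  sum by (job) (rate(http_client_request_duration_seconds_count[5m]))
)
```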
2.4 [Stretch] Response Time
🎯 Goal:
Calculate what the response time of the opentelemetry-demo/checkout service must be to fall within the 0.9 percentile (only for this service).
⚠️ To-do and tips and tricks:
- Use the query from the previous exercise as a starting point.
- The function for calculating quantiles is `histogram_quantile`.
- A good interactive percentile visualization is here.
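A sketch of the usual pattern; the `[5m]` range is an assumption:

```promql
# p90 response time for the checkout service
histogram_quantile(0.9,
  sum by (le) (rate(http_client_request_duration_seconds_bucket{job="opentelemetry-demo/checkout"}[5m]))
)
```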
2.5 [Stretch] Cardinality Explosion
🎯 Goal: Find the top 20 metrics that may have a cardinality problem (high number of series).
⚠️ To-do and tips and tricks:
- Searching metrics by name was covered in the theory section; regex may also be useful :)
- For grouping, check out `count by`.
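A minimal sketch of one way to do this, using the `__name__` label from section 1.4:

```promql
# Series count per metric name; the 20 largest are cardinality suspects.
topk(20, count by (__name__) ({__name__=~".+"}))
```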
2.6 [Stretch] Cardinality Explosion — Analysis
🎯 Goal: Analyze which labels are problematic for the top 3 metrics with the highest cardinality.
⚠️ To-do and tips and tricks:
- Remember that the `Metrics browser` gives you an interesting overview.
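For a single suspect metric, you can count distinct values of one label at a time; `http_url` below is a hypothetical label name, substitute one you see in the Metrics browser:

```promql
# The inner count by collapses everything except http_url;
# the outer count tallies how many distinct values remain.
count(count by (http_url) (http_client_request_duration_seconds_sum))
```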
P.S.
To see the total number of series, use prometheus_tsdb_head_series
3. Advanced Functions
3.1 irate() vs rate() — instantaneous vs averaged rate
⚠️ This exercise builds on the metric from exercise 1.6; you should already understand how `rate()` works.
🎯 Goal: Compare the results of rate() and irate() on a single chart for the metric http_client_request_duration_seconds_sum of the opentelemetry-demo/payment service (status 422). How do they differ?
⚠️ Tips and tricks:
- `irate()` takes into account only the last two data points, so it reacts immediately to spikes.
- `rate()` averages over the entire range, smoothing out noise.
- Rule of thumb: `irate()` for graphs (you see rapid changes), `rate()` for alerts (more stable).
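The comparison can be set up as two queries on one chart; the `[5m]` range is an assumption:

```promql
rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[5m])
```

```promql
irate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[5m])
```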
3.2 predict_linear() — capacity forecasting
🎯 Goal: Write a query that predicts how much free disk space (in GB) will be on Kubernetes nodes in 4 hours (extrapolating from the last hour of data).
⚠️ Tips and tricks:
- `predict_linear(GAUGE[RANGE], SECONDS_AHEAD)` linearly extrapolates a gauge value.
- Metric: `node_filesystem_avail_bytes` (free bytes on the filesystem).
- `predict_linear` works only on gauges, not on counters.
- A result < 0 means the forecast indicates the disk will be full before the specified time.
- Such queries are the basis for alerts like "disk full in 4 hours".
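A sketch; the `fstype` filter excluding pseudo-filesystems is an assumption about your node_exporter labels, check what is actually present:

```promql
# Predicted free space in GB in 4 hours, extrapolated from the last hour
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[1h], 4 * 3600)
  / 1024 / 1024 / 1024
```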
3.3 absent() — alerting on missing metrics
🎯 Goal: Write a query that returns 1 when the metric app_payment_transactions_total for job opentelemetry-demo/payment stops existing (e.g., service crashed and stopped sending metrics).
⚠️ Tips and tricks:
- `absent(METRIC)` returns `1` when the metric doesn't exist and an empty result when it exists.
- Test by changing the job name to a non-existent one (e.g., `job="does-not-exist"`).
- Compare with `absent_over_time()` from Loki exercise 5.5: similar idea, different query language.
- In production, such queries are placed in a `PrometheusRule` as alerts.
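The query described in the goal is essentially:

```promql
# Returns 1 only while the series is missing; empty otherwise
absent(app_payment_transactions_total{job="opentelemetry-demo/payment"})
```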
3.4 $__rate_interval — correct interval in Grafana
⚠️ This exercise builds on exercises 1.6.2 (interval problem) and 2.3 (SLO).
🎯 Goal: Rewrite the query from exercise 2.3 (SLO) replacing the hardcoded [5m] with the Grafana variable $__rate_interval. Compare the results of both queries on a single chart.
⚠️ Tips and tricks:
- `$__rate_interval` automatically selects a safe range (at least 4x the scrape interval).
- It guarantees that `rate()` always has enough data, eliminating the problem from exercise 1.6.2.
- In production, always use `$__rate_interval` instead of hardcoded values.
- Documentation
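Whatever your 2.3 query looked like, the substitution itself is mechanical. A sketch reusing the assumed `le="0.01"` (10ms) boundary from that exercise:

```promql
# SLO fraction per service, with Grafana picking a safe range
sum by (job) (rate(http_client_request_duration_seconds_bucket{le="0.01"}[$__rate_interval]))
/
sum by (job) (rate(http_client_request_duration_seconds_count[$__rate_interval]))
```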
3.5 histogram_quantile — comparing percentiles across services
⚠️ This exercise extends the concept from exercise 2.4 to multiple services and percentiles.
🎯 Goal: On a single chart, show p50, p95, and p99 of HTTP response time for all opentelemetry-demo services. Which service has the largest discrepancy between p50 and p99?
⚠️ Tips and tricks:
- Use 3 queries on a single chart (A = p50, B = p95, C = p99)
- `histogram_quantile(QUANTILE, sum(rate(BUCKET[$__rate_interval])) by (le, job))`
- Note the `by (le, job)`: `le` is required by `histogram_quantile`, while `job` groups by service.
- A large discrepancy between p50 and p99 indicates a "tail latency" problem.
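Each of the three queries follows the same pattern from the tips, differing only in the quantile, for example:

```promql
# Query C: p99 per service
histogram_quantile(0.99,
  sum(rate(http_client_request_duration_seconds_bucket[$__rate_interval])) by (le, job)
)
```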
4. Native Histograms
Prometheus supports native histograms: a format where a whole histogram is stored as one sample instead of many `_bucket` series. This provides better percentile accuracy and lower resource usage.

Prometheus metrics (`job="prometheus-native-histograms"`) are collected in protobuf format, which includes native histograms.

⚠️ Native histograms are an experimental feature and require the `--enable-feature=native-histograms` flag on Prometheus.
4.1 histogram_count() — number of observations
🎯 Goal: Check how many HTTP requests Prometheus handled for each endpoint (handler). Use the metric prometheus_http_request_duration_seconds with job prometheus-native-histograms.
⚠️ Tips and tricks:
- `histogram_count(NATIVE_HISTOGRAM)` returns the number of observations in a native histogram.
- It does not require `_bucket`, `_count`, or `_sum` series; it works directly on the native histogram.
- Compare with the classic approach: `prometheus_http_request_duration_seconds_count`.
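A sketch; summing by `handler` assumes that label carries the endpoint name on this metric:

```promql
# Number of observations per HTTP handler
sum by (handler) (
  histogram_count(prometheus_http_request_duration_seconds{job="prometheus-native-histograms"})
)
```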
4.2 histogram_sum() and histogram_avg() — average response time
🎯 Goal: Calculate the average response time of the Prometheus /api/v1/write endpoint using the native histogram.
⚠️ Tips and tricks:
- `histogram_avg(NATIVE_HISTOGRAM)` returns the average value directly.
- Alternatively: `histogram_sum() / histogram_count()` gives the same result as `histogram_avg`.
- With classic histograms, you needed separate `_sum` and `_count` series.
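Assuming the endpoint appears as `handler="/api/v1/write"` (check the actual label value in your data):

```promql
# Average request duration for the write endpoint
histogram_avg(prometheus_http_request_duration_seconds{job="prometheus-native-histograms", handler="/api/v1/write"})
```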
4.3 histogram_fraction() — what percentage of requests falls within a threshold
🎯 Goal: What percentage of requests to the /api/v1/write endpoint have a response time below 100ms (0.1s)?
⚠️ Tips and tricks:
- `histogram_fraction(LOWER_BOUND, UPPER_BOUND, NATIVE_HISTOGRAM)` returns the fraction of observations in the range [lower, upper].
- For the threshold "below 100ms", use `histogram_fraction(0, 0.1, ...)`.
- This is the equivalent of the SLO from exercise 2.3, but without manual bucket division: more precise and simpler.
- A result of 1.0 means 100% of requests fall within the threshold.
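A sketch, again assuming the endpoint's label value is `handler="/api/v1/write"`:

```promql
# Fraction of write requests completing in under 100ms
histogram_fraction(0, 0.1,
  prometheus_http_request_duration_seconds{job="prometheus-native-histograms", handler="/api/v1/write"}
)
```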
4.4 histogram_quantile() on native histogram — comparison with classic
🎯 Goal: Calculate the p99 response time of the /api/v1/write endpoint using a native histogram. Compare the simplicity of the query with the classic approach from exercise 3.5.
⚠️ Tips and tricks:
- On a native histogram: `histogram_quantile(0.99, NATIVE_HISTOGRAM)`, with no `sum by (le)` and no `rate` on buckets.
- On a classic histogram: `histogram_quantile(0.99, sum(rate(..._bucket[5m])) by (le))`, which requires `_bucket`, `rate`, and `by (le)`.
- A native histogram stores more accurate boundaries, so the result is more precise than with classic buckets.
- Use `rate()` if you want the percentile for the last N minutes instead of the total.
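The native-histogram version, once more assuming the `handler="/api/v1/write"` label value:

```promql
# p99 over the last 5 minutes (drop rate() for the all-time percentile)
histogram_quantile(0.99,
  rate(prometheus_http_request_duration_seconds{job="prometheus-native-histograms", handler="/api/v1/write"}[5m])
)
```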