🎯 Exercises

1. Basic Syntax

1.1 Fetching a Metric

http_client_request_duration_seconds_sum

1.2 Narrowing Metrics by Label

http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend"}

1.3 Narrowing Metrics by Multiple Labels

Direct comparison:

http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend", http_response_status_code="200"}

Negation:

http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend", http_response_status_code!="200"}

Regex:

http_client_request_duration_seconds_sum{job="opentelemetry-demo/frontend", http_response_status_code=~"20.*"}

1.4 Labels Only

Specifying a metric name is not required. You can do this:

{ job="opentelemetry-demo/frontend" }

With further narrowing:

{ job="opentelemetry-demo/frontend", http_response_status_code=~"20.*" }

You can also reference the metric name using label syntax:

{__name__="http_client_request_duration_seconds_sum"}

This allows using regex to search for metrics:

{__name__=~"http_client_request_duration_seconds_.*"}

1.5 Time and Resolution

Note: run the following two queries in the Prometheus UI (http://prometheus.workshop2.indexoutofrange.com/). For the remaining queries, go back to the Grafana UI. Pay attention to the horizontal scroll that may appear.

Values for the last 30 minutes:

http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[30m]

Values for the last 30 minutes with a resolution of 15 minutes (scroll the results all the way to the right):

http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[30m:15m]

1.6 Understanding rate and increase

1.6.1 rate and increase vs original values

❗❗❗ Understanding rate and increase is crucial for correctly querying metrics in Prometheus. Take time to read these articles. ❗❗❗

Run the following queries on a single chart (if needed, normalize by multiplying values by appropriate constants so that changes are visible in the graphs):

⚠️ Prometheus queries return vectors of samples (instant or range vectors). You can apply arithmetic operators to them, such as multiplication by a scalar constant. Vector arithmetic is described in this documentation.
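
For example, a sketch that scales one series so it is visible next to another on the same chart (the constant here is arbitrary; pick whatever makes the lines comparable):

```promql
# Scale the rate by 1000 so it is visible next to the raw counter on one chart
rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[2m]) * 1000
```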

Original values:

http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}

Increase:

increase(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[2m])

Rate:

rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[2m])

🎯⚠️ To-do, tips and tricks:

  • For analyzing how the above queries work, switching the query's Format to Table (available under Options) is helpful.
  • See what happens to the values when you change Min step to different values.
  • A very good explanation of how rate works is in this StackOverflow post.

1.6.2 Interval

Run the queries:

rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"} [3m])

and

rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"} [5s])

🎯 Question:

  • Why does one have results and the other doesn’t?


1.7 Histogram

Run the query:

http_client_request_duration_seconds_bucket{job="opentelemetry-demo/checkout"}

🎯⚠️ To-do, tips and tricks:

  • How do the buckets differ from each other? (Legend under the graph)
  • How are the values arranged?
    • Does the value for le 100 include values for le 10?
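
One way to probe the cumulative nature of buckets is to subtract a smaller cumulative bucket from a larger one (the le values here are assumptions; substitute boundaries you actually see in the legend):

```promql
# Requests that took between 1s and 10s: subtract the smaller cumulative
# bucket from the larger one (ignoring(le) lets the two series match)
http_client_request_duration_seconds_bucket{job="opentelemetry-demo/checkout", le="10"}
  - ignoring(le)
http_client_request_duration_seconds_bucket{job="opentelemetry-demo/checkout", le="1"}
```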

2. Exercises

2.1 How many GB are used on Kubernetes nodes

🎯 Goal:

Create a chart showing how much memory is used in GB on each node (❗not pod❗) in the Kubernetes cluster.

⚠️ To-do and tips and tricks:

  • Metrics are in node_memory_.....
  • Results should be in GB.
  • The legend should be the Kubernetes node name.
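
One possible shape, assuming standard node_exporter metric names (verify them in the metrics browser; this is a sketch, not the only valid answer):

```promql
# Used memory in GB per node (metric names assume node_exporter defaults)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / (1024 * 1024 * 1024)
```

In Grafana, set the legend to the label carrying the node name.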

2.2 Most memory-hungry processes

🎯 Goal:

Create a chart showing the top 5 most memory-hungry processes (grouped by job).

⚠️ To-do and tips and tricks:

  • Group by job
  • Check the topk function
  • Check the by operator
  • Check the process_resident_memory_..... metrics
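
Putting those hints together, one possible shape (a sketch; other groupings are valid too):

```promql
# Top 5 jobs by resident memory
topk(5, sum by (job) (process_resident_memory_bytes))
```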

2.3 SLO Problem

🎯 Goal:

Find the 3 services that have the biggest problems meeting the SLO defined as:

90% of requests complete in under 10ms

⚠️ To-do and tips and tricks:

  • To measure response time you can use http_client_request_duration_seconds_bucket
  • The list of Prometheus functions is here. The bottomk function may be useful.
  • Remember that the SLO is calculated per service (label job), not per instance.
  • How to calculate which services have the biggest problem with this SLO?
    • Check what buckets exist for http_client_request_duration_seconds_bucket. Those values are response times.
    • Dividing the number of requests with time <=10ms by the total number of requests per service gives you the SLO for that service
    • Group results by job

  • See how the results look when you group them
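
One possible shape (the le="0.01" boundary is an assumption; first check which buckets actually exist):

```promql
# Fraction of requests under 10ms per job; bottomk picks the 3 worst services
bottomk(3,
  sum by (job) (rate(http_client_request_duration_seconds_bucket{le="0.01"}[5m]))
    /
  sum by (job) (rate(http_client_request_duration_seconds_bucket{le="+Inf"}[5m]))
)
```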

2.4 [Stretch] Response Time

🎯 Goal: Calculate the response time under which 90% of requests to the opentelemetry-demo/checkout service complete, i.e., the 0.9 quantile (only for this service).

⚠️ To-do and tips and tricks:

  • Use the query from the previous exercise as a starting point.
  • The function for calculating quantiles is histogram_quantile
  • A good interactive percentile visualization is here
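
A sketch of the typical shape such a query takes (not the only valid form):

```promql
# 0.9 quantile of response time for the checkout service
histogram_quantile(0.9,
  sum by (le) (rate(http_client_request_duration_seconds_bucket{job="opentelemetry-demo/checkout"}[5m]))
)
```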

2.5 [Stretch] Cardinality Explosion

🎯 Goal: Find the top 20 metrics that may have a cardinality problem (high number of series).

⚠️ To-do and tips and tricks:

  • How to search metrics by name — was covered in the theory section — Regex may also be useful :)
  • For grouping, check out count by
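
Combining both hints, one possible shape (this query touches every series, so it can be heavy on a large Prometheus):

```promql
# Top 20 metric names by number of series
topk(20, count by (__name__) ({__name__=~".+"}))
```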

2.6 [Stretch] Cardinality Explosion — Analysis

🎯 Goal: Analyze which labels are problematic for the top 3 metrics with the highest cardinality.

⚠️ To-do and tips and tricks:

  • Remember that Metrics browser gives you an interesting overview

P.S.

To see the total number of series, use prometheus_tsdb_head_series

3. Advanced Functions

3.1 irate() vs rate() — instantaneous vs averaged rate

⚠️ This exercise builds on the metric from exercise 1.6 — you should already understand how rate() works.

🎯 Goal: Compare the results of rate() and irate() on a single chart for the metric http_client_request_duration_seconds_sum of the opentelemetry-demo/payment service (status 422). How do they differ?

⚠️ Tips and tricks:

  • irate() takes into account only the last two data points — reacts immediately to spikes
  • rate() averages over the entire range — smooths out noise
  • Rule of thumb: irate() for graphs (you see rapid changes), rate() for alerts (more stable)
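
The two queries side by side might look like this (same range in both, for a fair comparison):

```promql
# Query A: averaged over the whole range
rate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[3m])
# Query B: only the last two points in the range
irate(http_client_request_duration_seconds_sum{job="opentelemetry-demo/payment", http_response_status_code="422"}[3m])
```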

3.2 predict_linear() — capacity forecasting

🎯 Goal: Write a query that predicts how much free disk space (in GB) will be on Kubernetes nodes in 4 hours (extrapolating from the last hour of data).

⚠️ Tips and tricks:

  • predict_linear(GAUGE[RANGE], SECONDS_AHEAD) — linearly extrapolates a gauge value
  • Metric: node_filesystem_avail_bytes (free bytes on filesystem)
  • predict_linear works only on gauges — not on counters
  • Result < 0 means the forecast indicates disk will be full before the specified time
  • Such queries are the basis for alerts like “disk full in 4 hours”
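
A sketch of the shape such a forecast can take (labels for filtering by filesystem are omitted here):

```promql
# Predicted free space in GB in 4 hours, extrapolated from the last hour
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) / (1024 * 1024 * 1024)
```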

3.3 absent() — alerting on missing metrics

🎯 Goal: Write a query that returns 1 when the metric app_payment_transactions_total for job opentelemetry-demo/payment stops existing (e.g., service crashed and stopped sending metrics).

⚠️ Tips and tricks:

  • absent(METRIC) returns 1 when the metric doesn’t exist, empty result when it exists
  • Test by changing the job name to a non-existent one (e.g., job="does-not-exist")
  • Compare with absent_over_time() from Loki exercise 5.5 — similar idea, different query language
  • In production, such queries are placed in PrometheusRule as alerts
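
The query itself is short; a sketch:

```promql
# Returns 1 when no series matches the selector; empty result otherwise
absent(app_payment_transactions_total{job="opentelemetry-demo/payment"})
```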

3.4 $__rate_interval — correct interval in Grafana

⚠️ This exercise builds on exercises 1.6.2 (interval problem) and 2.3 (SLO).

🎯 Goal: Rewrite the query from exercise 2.3 (SLO) replacing the hardcoded [5m] with the Grafana variable $__rate_interval. Compare the results of both queries on a single chart.

⚠️ Tips and tricks:

  • $__rate_interval automatically selects a safe range (>=4x scrape interval)
  • Guarantees that rate() always has enough data — eliminates the problem from exercise 1.6.2
  • In production, always use $__rate_interval instead of hardcoded values
  • Documentation
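
In Grafana the replacement looks like this (the selector is shortened here and the le value is an assumption; carry over the full selector from your 2.3 query):

```promql
# Grafana expands $__rate_interval before the query reaches Prometheus
rate(http_client_request_duration_seconds_bucket{le="0.01"}[$__rate_interval])
```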

3.5 histogram_quantile — comparing percentiles across services

⚠️ This exercise extends the concept from exercise 2.4 to multiple services and percentiles.

🎯 Goal: On a single chart, show p50, p95, and p99 of HTTP response time for all opentelemetry-demo services. Which service has the largest discrepancy between p50 and p99?

⚠️ Tips and tricks:

  • Use 3 queries on a single chart (A = p50, B = p95, C = p99)
  • histogram_quantile(QUANTILE, sum(rate(BUCKET[$__rate_interval])) by (le, job))
  • Note the by (le, job): le is required by histogram_quantile, job groups by service
  • Large discrepancy p50 vs p99 = “tail latency” problem
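
A concrete instance of the pattern above for one of the three queries (swap in 0.5 and 0.95 for the other two):

```promql
# Query C (p99) across all opentelemetry-demo services
histogram_quantile(0.99,
  sum by (le, job) (rate(http_client_request_duration_seconds_bucket{job=~"opentelemetry-demo/.*"}[$__rate_interval]))
)
```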

4. Native Histograms

Prometheus supports native histograms — a format where a histogram is stored as one sample instead of many _bucket series. This provides better percentile accuracy and lower resource usage.

Prometheus metrics (job="prometheus-native-histograms") are collected in protobuf format, which includes native histograms.

⚠️ Native histograms are an experimental feature — requires the --enable-feature=native-histograms flag on Prometheus.

4.1 histogram_count() — number of observations

🎯 Goal: Check how many HTTP requests Prometheus handled for each endpoint (handler). Use the metric prometheus_http_request_duration_seconds with job prometheus-native-histograms.

⚠️ Tips and tricks:

  • histogram_count(NATIVE_HISTOGRAM) — returns the number of observations in a native histogram
  • Does not require _bucket, _count, or _sum — works directly on the native histogram
  • Compare with the classic approach: prometheus_http_request_duration_seconds_count
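
A sketch of what the per-endpoint count can look like (assuming the handler label carries the endpoint, as on Prometheus's own metrics):

```promql
# Observation count per endpoint, taken straight from the native histogram
sum by (handler) (histogram_count(prometheus_http_request_duration_seconds{job="prometheus-native-histograms"}))
```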

4.2 histogram_sum() and histogram_avg() — average response time

🎯 Goal: Calculate the average response time of the Prometheus /api/v1/write endpoint using the native histogram.

⚠️ Tips and tricks:

  • histogram_avg(NATIVE_HISTOGRAM) — returns the average value directly
  • Alternatively: histogram_sum() / histogram_count() — same as histogram_avg
  • With classic histograms, you needed separate _sum and _count series
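
A sketch for the /api/v1/write endpoint (the handler label value is assumed to match exactly):

```promql
# Average duration of /api/v1/write from the native histogram
histogram_avg(prometheus_http_request_duration_seconds{job="prometheus-native-histograms", handler="/api/v1/write"})
```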

4.3 histogram_fraction() — what percentage of requests falls within a threshold

🎯 Goal: What percentage of requests to the /api/v1/write endpoint have a response time below 100ms (0.1s)?

⚠️ Tips and tricks:

  • histogram_fraction(LOWER_BOUND, UPPER_BOUND, NATIVE_HISTOGRAM) — returns the fraction of observations in the range [lower, upper]
  • For the threshold “below 100ms” use histogram_fraction(0, 0.1, ...)
  • This is the equivalent of the SLO from exercise 2.3, but without manual bucket division — more precise and simpler
  • Result 1.0 = 100% of requests fall within the threshold
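
Putting the hints together, a sketch:

```promql
# Fraction of observations in [0, 0.1s]; multiply by 100 for a percentage
histogram_fraction(0, 0.1, prometheus_http_request_duration_seconds{job="prometheus-native-histograms", handler="/api/v1/write"})
```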

4.4 histogram_quantile() on native histogram — comparison with classic

🎯 Goal: Calculate the p99 response time of the /api/v1/write endpoint using a native histogram. Compare the simplicity of the query with the classic approach from exercise 3.5.

⚠️ Tips and tricks:

  • On native histogram: histogram_quantile(0.99, NATIVE_HISTOGRAM); no sum by (le) and no rate on buckets needed
  • On classic: histogram_quantile(0.99, sum(rate(..._bucket[5m])) by (le)) — requires _bucket, rate, by (le)
  • Native histogram stores more accurate boundaries — the result is more precise than from classic buckets
  • Use rate() if you want the percentile for the last N minutes instead of the total
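
A sketch combining the last two tips (p99 over a recent window rather than over all time):

```promql
# p99 over the last 5 minutes from the native histogram: no _bucket, no by (le)
histogram_quantile(0.99, rate(prometheus_http_request_duration_seconds{job="prometheus-native-histograms", handler="/api/v1/write"}[5m]))
```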
