Recording Rules

Recording Rules

Recording Rules allow precalculating expensive queries and saving results as new metrics for faster access.

Basics

  • Purpose: Precalculation of frequently used or expensive PromQL expressions
  • Benefits: Faster dashboards and alerts, reduced server load
  • Format: YAML files, reloaded with SIGHUP signal
  • Configuration: rule_files field in Prometheus config

Syntax

File structure:

groups:
  - name: <group_name>
    interval: <duration>        # optional
    rules:
    - record: <metric_name>
      expr: <promql_expression>
      labels:                   # optional
        <label>: <value>

Example:

groups:
  - name: cpu_rules
    interval: 30s
    rules:
    - record: instance:cpu_usage:rate5m
      expr: 1 - avg by (instance) (rate(cpu_seconds_total{mode="idle"}[5m]))
    - record: job:cpu_usage:rate5m
      expr: avg by (job) (instance:cpu_usage:rate5m)

Key Group Parameters

  • name: Unique group name
  • interval: Evaluation frequency (default global.evaluation_interval)
  • limit: Maximum number of series/alerts (0 = no limit)
  • query_offset: Query time offset (for distributed systems)
  • labels: Global labels for all rules in group

Tools

Syntax validation:

promtool check rules /path/to/rules.yml
  • Status 0: File valid
  • Status 1: Syntax errors

Best practices

Naming convention: level:metric:operations

  • level: instance, job, cluster
  • metric: Base metric name
  • operations: rate5m, sum, ratio

Good name examples:

  • instance:http_requests:rate5m
  • job:memory_usage:avg
  • cluster:cpu_utilization:ratio

Hierarchical aggregation:

# Level 1: Instance
- record: instance:requests:rate5m
  expr: sum by (instance) (rate(http_requests_total[5m]))

# Level 2: Service
- record: service:requests:rate5m
  expr: sum by (service) (instance:requests:rate5m)

# Level 3: Global
- record: global:requests:rate5m
  expr: sum(service:requests:rate5m)

Use cases

  • Dashboards: Fast loading of complex graphs
  • Alerts: Accelerated condition evaluations
  • SLA/SLI: Precompiled quality metrics
  • Capacity planning: Aggregated resource metrics

Monitoring

  • Metric: rule_group_iterations_missed_total - missed evaluations
  • Problem: Group doesn’t finish before next evaluation → skips iterations
  • Effect: Gaps in recording rule data

results matching ""

    No results matching ""