Metric Cardinality

Metric cardinality problem

Metric cardinality is one of the most important operational concerns in time-series monitoring systems. Improper cardinality management can lead to serious performance and cost problems.

What is cardinality?

Cardinality is the number of unique combinations of all label key-value pairs for a given metric.

Basic example:

http_requests_total{method="GET", status="200", path="/api/users"}
http_requests_total{method="POST", status="201", path="/api/users"}
http_requests_total{method="GET", status="404", path="/api/orders"}

Cardinality = number of unique combinations of {method, status, path}

For:

  • 4 HTTP methods (GET, POST, PUT, DELETE)
  • 10 different status codes
  • 100 different paths

Maximum cardinality = 4 × 10 × 100 = 4,000 time series
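The arithmetic above can be sketched as a small helper. This assumes label values are independent (the worst case); the value counts are the hypothetical figures from the example.

```python
from math import prod

def max_cardinality(label_value_counts):
    """Upper bound on series count: the product of the number of
    possible values for each label (assumes labels are independent)."""
    return prod(label_value_counts.values())

# Hypothetical value counts from the example above
labels = {"method": 4, "status": 10, "path": 100}
print(max_cardinality(labels))  # 4000
```

In practice real cardinality is usually lower, because not every combination actually occurs (e.g. DELETE rarely returns 201).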

Why is cardinality a problem?

Performance impact

  1. Memory consumption: Each unique time series requires:
    • Memory to store label metadata
    • Memory to buffer current values
    • Indexes for fast lookup
  2. Disk consumption:
    • Each series is written separately
    • Increase in number of files and data blocks
    • Slower compaction
  3. Query performance:
    • More data to search
    • Longer response times
    • Higher CPU usage during aggregations

Cost formula

total_cardinality = metric_1_cardinality + metric_2_cardinality + ... + metric_n_cardinality

With a typical on-disk footprint of 1-2 bytes per sample and roughly 3 KB of RAM per active series:

ram_usage ≈ number_of_series × 3KB (metadata + buffers)
disk_usage ≈ number_of_series × samples_per_day × bytes_per_sample × retention_days
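The formulas above can be turned into a back-of-the-envelope calculator. The constants (3 KB per series, 1.5 bytes per sample) are the rough figures from the text, not measured values:

```python
def resource_estimate(num_series, samples_per_day, bytes_per_sample, retention_days):
    """Rough RAM/disk estimate using the back-of-the-envelope
    constants from the formulas above."""
    ram_bytes = num_series * 3 * 1024  # ~3 KB metadata + buffers per active series
    disk_bytes = num_series * samples_per_day * bytes_per_sample * retention_days
    return ram_bytes, disk_bytes

# 100k series, 15s scrape interval (86400 / 15 = 5760 samples/day),
# ~1.5 bytes per compressed sample, 30-day retention
ram, disk = resource_estimate(100_000, 5760, 1.5, 30)
print(f"RAM ≈ {ram / 1024**2:.0f} MiB, disk ≈ {disk / 1024**3:.1f} GiB")
```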

Causes of High Cardinality

1. Labels with Unlimited Number of Values

❌ VERY BAD PRACTICES:

# User identifiers
http_requests_total{user_id="12345"}
http_requests_total{user_id="67890"}
# Potentially millions of unique user_id!

# Email addresses
login_attempts{email="user@example.com"}

# Session tokens
api_calls{session_token="abc123xyz"}

# Full URLs with parameters
requests_total{url="/api/search?q=prometheus&page=1&limit=50"}

# Timestamps
events_total{timestamp="2025-11-16T10:30:45Z"}

# User IPs
connections_total{client_ip="192.168.1.100"}

2. Overly Detailed Labels

❌ BAD PRACTICE:

# Full paths with parameters
http_requests{path="/api/users/123/orders/456/items/789"}

# Versions with build number
app_version{version="1.2.3-build-20231116-abc123"}

✅ GOOD PRACTICE:

# Path patterns
http_requests{path="/api/users/:id/orders/:id/items/:id"}

# Simplified version
app_version{version="1.2.3"}

3. Combination of Multiple Labels

# 5 labels with high cardinality
http_requests_total{
  region="us-east-1",        # 20 regions
  az="us-east-1a",          # 60 availability zones
  instance="i-abc123",       # 1000 instances
  container="web-1",         # 500 containers
  pod="web-deployment-xyz"   # 2000 pods
}
# Theoretical maximum cardinality: 20 × 60 × 1000 × 500 × 2000 = 1.2 trillion!

How to Manage Cardinality?

1. Use Aggregates Instead of Details

❌ BAD:

requests_total{user_id="123", endpoint="/api/users"}

✅ GOOD:

# Metric without user_id
requests_total{endpoint="/api/users"}

# Separate metric for unique users (aggregate)
active_users_total{endpoint="/api/users"}
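The split above can be sketched in application code: count requests per endpoint, and track unique users in app memory so that only the aggregate count is ever exported. The data structures and names here are illustrative assumptions, not a specific exporter API:

```python
from collections import defaultdict

request_counts = defaultdict(int)  # exported as requests_total{endpoint=...}
unique_users = defaultdict(set)    # user ids stay in app memory, never become labels

def record_request(endpoint, user_id):
    request_counts[endpoint] += 1
    unique_users[endpoint].add(user_id)  # only len() is exported as the aggregate

record_request("/api/users", "123")
record_request("/api/users", "123")
record_request("/api/users", "456")

print(request_counts["/api/users"])     # 3
print(len(unique_users["/api/users"]))  # 2 — value of active_users_total
```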

2. Group Values into Buckets

❌ BAD:

http_response_time{duration_ms="1234"}

✅ GOOD:

# Use histogram with predefined buckets
http_response_time_bucket{le="0.1"}
http_response_time_bucket{le="0.5"}
http_response_time_bucket{le="1.0"}
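Prometheus histogram buckets are cumulative: each observation increments every bucket whose `le` upper bound covers it. A minimal sketch of that semantics (bucket bounds taken from the example above, in seconds):

```python
# Cumulative "le" bucket bounds, as in the example above (seconds)
BUCKETS = [0.1, 0.5, 1.0, float("inf")]

def observe(bucket_counts, duration_s):
    """Increment every bucket whose upper bound covers the observation,
    mimicking Prometheus's cumulative histogram semantics."""
    for i, le in enumerate(BUCKETS):
        if duration_s <= le:
            bucket_counts[i] += 1

counts = [0] * len(BUCKETS)
for d in (0.05, 0.3, 0.7, 2.0):
    observe(counts, d)
print(counts)  # [1, 2, 3, 4] — cumulative counts per le bucket
```

Whatever the observed values, cardinality stays fixed at the number of buckets.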

3. Use Patterns Instead of Specific Values

❌ BAD:

api_requests{path="/api/users/123"}
api_requests{path="/api/users/456"}
api_requests{path="/api/orders/789"}

✅ GOOD:

api_requests{path="/api/users/:id"}
api_requests{path="/api/orders/:id"}
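Path normalization like this is usually done in the instrumentation layer before the value is used as a label. A minimal sketch, with hypothetical regex rules that you would adapt to your routing scheme:

```python
import re

# Hypothetical normalization rules; adapt to your actual routes
PATTERNS = [
    (re.compile(r"^/api/users/\d+$"), "/api/users/:id"),
    (re.compile(r"^/api/orders/\d+$"), "/api/orders/:id"),
]

def normalize_path(path):
    """Collapse concrete IDs into a bounded set of path templates
    before using the path as a label value."""
    for pattern, template in PATTERNS:
        if pattern.match(path):
            return template
    return "other"  # fallback keeps cardinality bounded even for unknown paths

print(normalize_path("/api/users/123"))   # /api/users/:id
print(normalize_path("/api/orders/789"))  # /api/orders/:id
```

The `"other"` fallback is important: without it, an unexpected or malicious URL can still blow up cardinality.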

4. Limit Cardinality via Relabeling

In the Prometheus configuration:

scrape_configs:
  - job_name: 'api'
    metric_relabel_configs:
      # Remove high-cardinality labels
      - regex: 'user_id'
        action: labeldrop

      # Replace detailed values with general ones
      - source_labels: [http_status]
        regex: '2..'
        replacement: '2xx'
        target_label: http_status_class

      # Remove entire metrics with problematic labels
      - source_labels: [__name__, user_email]
        regex: 'user_activity;.*'
        action: drop
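The second relabel rule above (collapsing concrete status codes into classes) is also commonly done in application code. A small illustrative equivalent, not part of any Prometheus API:

```python
import re

def status_class(http_status):
    """Map a concrete status code (e.g. "201") to its class ("2xx"),
    mirroring the relabeling rule above; purely illustrative."""
    match = re.fullmatch(r"([1-5])\d\d", http_status)
    return f"{match.group(1)}xx" if match else "unknown"

print(status_class("201"))  # 2xx
print(status_class("404"))  # 4xx
```

This shrinks the label from dozens of possible codes to at most six values.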

5. Monitor Cardinality

Metrics to Track:

# Size of the symbol table (interned label strings), in bytes
prometheus_tsdb_symbol_table_size_bytes

# Number of active series
prometheus_tsdb_head_series

# Series per metric
count by (__name__) ({__name__=~".+"})

# Top 10 metrics by cardinality
topk(10, count by (__name__) ({__name__=~".+"}))

# Cardinality growth over time (use delta, since head_series is a gauge)
delta(prometheus_tsdb_head_series[1h])

Cardinality Alerts:

groups:
  - name: cardinality
    rules:
    - alert: HighCardinality
      expr: prometheus_tsdb_head_series > 1000000
      for: 10m
      annotations:
        summary: "Metric cardinality too high"
        description: "Number of time series exceeded 1 million"

    - alert: CardinalityGrowth
      expr: delta(prometheus_tsdb_head_series[1h]) > 1000
      for: 15m
      annotations:
        summary: "Rapid cardinality growth"
        description: "Cardinality is growing by more than 1000 series/h"

Cardinality Analysis Tools

1. Promtool

# TSDB data analysis
promtool tsdb analyze /path/to/prometheus/data

# Top metrics by series count
promtool tsdb analyze /path/to/prometheus/data | grep "Highest cardinality"

2. Diagnostic Queries

# Most problematic metrics
topk(10,
  count by (__name__) ({__name__=~".+"})
)

# Series per value of a suspect label (replace label_name with the actual label)
topk(10,
  count by (label_name) ({__name__="your_metric"})
)

# Metrics with the highest number of unique instance values
sort_desc(
  count by (__name__) (
    count by (__name__, instance) ({__name__=~".+"})
  )
)

Best Practices

DO:

✅ Use low-cardinality labels (status codes, HTTP methods, operation types)
✅ Predefine possible label values in code
✅ Use patterns for URL paths
✅ Aggregate data at the application level before exporting
✅ Regularly monitor cardinality
✅ Document the maximum expected cardinality for each metric

DON’T:

❌ Don’t use user identifiers as labels
❌ Don’t use email addresses, tokens, or IPs as labels
❌ Don’t use timestamps as labels
❌ Don’t use full URLs with query parameters
❌ Don’t create dynamic metric names
❌ Don’t use UUIDs or hash sums as label values

Refactoring Example

BEFORE (bad cardinality):

# 1,000,000 users × 10 endpoints = 10,000,000 series
user_api_requests_total{
  user_id="123456",
  email="user@example.com",
  endpoint="/api/profile",
  full_url="/api/profile?tab=settings&lang=en"
}

Memory usage: ~30 GB RAM
Disk usage: ~200 GB/month with a 15s scrape interval

AFTER (good cardinality):

# Metric without user-specific data: 10 endpoints = 10 series
api_requests_total{
  endpoint="/api/profile",
  status="200"
}

# Aggregate for unique users: 10 endpoints = 10 series
api_unique_users_total{
  endpoint="/api/profile"
}

# Additional logs/traces for detailed user analysis
# (outside Prometheus, in logging system)

Memory usage: ~60 KB RAM
Disk usage: ~400 MB/month with a 15s scrape interval

Savings: 99.9%+ fewer resources!
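The savings claim can be checked with the series counts from the refactoring above:

```python
# Back-of-the-envelope check of the before/after numbers above
before_series = 1_000_000 * 10  # users × endpoints, user_id as a label
after_series = 10 + 10          # two metrics × 10 endpoints, no user_id

reduction = 1 - after_series / before_series
print(before_series, after_series)
print(f"{reduction:.4%}")  # well above 99.9%
```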

Summary

Cardinality is a key factor affecting:

  • Performance: query time, CPU usage
  • Costs: RAM, disk, infrastructure
  • Stability: OOM errors, slow responses

Golden rule: If the number of possible label values is unlimited or very large (>100), DON’T use it as a label.

Instead:

  • Use logs for detailed data
  • Use traces for transaction flow
  • Use metrics for aggregates and statistics
