Metric Cardinality

Metric cardinality problem

Metric cardinality is one of the most important operational concerns in time-series monitoring systems. Improper cardinality management can lead to serious performance and cost problems.

What is cardinality?

Cardinality is the number of unique combinations of all label key-value pairs for a given metric.

Basic example:

http_requests_total{method="GET", status="200", path="/api/users"}
http_requests_total{method="POST", status="201", path="/api/users"}
http_requests_total{method="GET", status="404", path="/api/orders"}

Cardinality = number of unique combinations of {method, status, path}

For:

  • 4 HTTP methods (GET, POST, PUT, DELETE)
  • 10 different status codes
  • 100 different paths

Maximum cardinality = 4 × 10 × 100 = 4,000 time series
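The arithmetic above can be sketched as a small helper. This assumes label values are independent (the worst case); the value counts are the hypothetical figures from the example.

```python
from math import prod

def max_cardinality(label_value_counts):
    """Upper bound on series count: the product of the number of
    possible values for each label (assumes labels are independent)."""
    return prod(label_value_counts.values())

# Hypothetical value counts from the example above
labels = {"method": 4, "status": 10, "path": 100}
print(max_cardinality(labels))  # 4000
```

In practice real cardinality is usually lower, because not every combination actually occurs (e.g. DELETE rarely returns 201).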

Why is cardinality a problem?

Performance impact

  1. Memory consumption: Each unique time series requires:
    • Memory to store label metadata
    • Memory to buffer current values
    • Indexes for fast lookup
  2. Disk consumption:
    • Each series is written separately
    • Increase in number of files and data blocks
    • Slower compaction
  3. Query performance:
    • More data to search
    • Longer response times
    • Higher CPU usage during aggregations

Cost formula

total_cardinality = metric_1_cardinality + metric_2_cardinality + ... + metric_n_cardinality

With a typical on-disk footprint of 1-2 bytes per sample and roughly 3 KB of RAM per active series:

ram_usage ≈ number_of_series × 3KB (metadata + buffers)
disk_usage ≈ number_of_series × samples_per_day × bytes_per_sample × retention_days
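The formulas above can be turned into a back-of-the-envelope calculator. The constants (3 KB per series, 1.5 bytes per sample) are the rough figures from the text, not measured values:

```python
def resource_estimate(num_series, samples_per_day, bytes_per_sample, retention_days):
    """Rough RAM/disk estimate using the back-of-the-envelope
    constants from the formulas above."""
    ram_bytes = num_series * 3 * 1024  # ~3 KB metadata + buffers per active series
    disk_bytes = num_series * samples_per_day * bytes_per_sample * retention_days
    return ram_bytes, disk_bytes

# 100k series, 15s scrape interval (86400 / 15 = 5760 samples/day),
# ~1.5 bytes per compressed sample, 30-day retention
ram, disk = resource_estimate(100_000, 5760, 1.5, 30)
print(f"RAM ≈ {ram / 1024**2:.0f} MiB, disk ≈ {disk / 1024**3:.1f} GiB")
```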

Causes of High Cardinality

1. Labels with Unlimited Number of Values

❌ VERY BAD PRACTICES:

# User identifiers
http_requests_total{user_id="12345"}
http_requests_total{user_id="67890"}
# Potentially millions of unique user_id!

# Email addresses
login_attempts{email="user@example.com"}

# Session tokens
api_calls{session_token="abc123xyz"}

# Full URLs with parameters
requests_total{url="/api/search?q=prometheus&page=1&limit=50"}

# Timestamps
events_total{timestamp="2025-11-16T10:30:45Z"}

# User IPs
connections_total{client_ip="192.168.1.100"}

2. Overly Detailed Labels

❌ BAD PRACTICE:

# Full paths with parameters
http_requests{path="/api/users/123/orders/456/items/789"}

# Versions with build number
app_version{version="1.2.3-build-20231116-abc123"}

✅ GOOD PRACTICE:

# Path patterns
http_requests{path="/api/users/:id/orders/:id/items/:id"}

# Simplified version
app_version{version="1.2.3"}

3. Combination of Multiple Labels

# 5 labels with high cardinality
http_requests_total{
  region="us-east-1",        # 20 regions
  az="us-east-1a",          # 60 availability zones
  instance="i-abc123",       # 1000 instances
  container="web-1",         # 500 containers
  pod="web-deployment-xyz"   # 2000 pods
}
# Theoretical maximum cardinality: 20 × 60 × 1000 × 500 × 2000 = 1.2 trillion!

How to Manage Cardinality?

1. Use Aggregates Instead of Details

❌ BAD:

requests_total{user_id="123", endpoint="/api/users"}

✅ GOOD:

# Metric without user_id
requests_total{endpoint="/api/users"}

# Separate metric for unique users (aggregate)
active_users_total{endpoint="/api/users"}
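The split above can be sketched in application code: count requests per endpoint, and track unique users in app memory so that only the aggregate count is ever exported. The data structures and names here are illustrative assumptions, not a specific exporter API:

```python
from collections import defaultdict

request_counts = defaultdict(int)  # exported as requests_total{endpoint=...}
unique_users = defaultdict(set)    # user ids stay in app memory, never become labels

def record_request(endpoint, user_id):
    request_counts[endpoint] += 1
    unique_users[endpoint].add(user_id)  # only len() is exported as the aggregate

record_request("/api/users", "123")
record_request("/api/users", "123")
record_request("/api/users", "456")

print(request_counts["/api/users"])     # 3
print(len(unique_users["/api/users"]))  # 2 — value of active_users_total
```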

2. Group Values into Buckets

❌ BAD:

http_response_time{duration_ms="1234"}

✅ GOOD:

# Use histogram with predefined buckets
http_response_time_bucket{le="0.1"}
http_response_time_bucket{le="0.5"}
http_response_time_bucket{le="1.0"}
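Prometheus histogram buckets are cumulative: each observation increments every bucket whose `le` upper bound covers it. A minimal sketch of that semantics (bucket bounds taken from the example above, in seconds):

```python
# Cumulative "le" bucket bounds, as in the example above (seconds)
BUCKETS = [0.1, 0.5, 1.0, float("inf")]

def observe(bucket_counts, duration_s):
    """Increment every bucket whose upper bound covers the observation,
    mimicking Prometheus's cumulative histogram semantics."""
    for i, le in enumerate(BUCKETS):
        if duration_s <= le:
            bucket_counts[i] += 1

counts = [0] * len(BUCKETS)
for d in (0.05, 0.3, 0.7, 2.0):
    observe(counts, d)
print(counts)  # [1, 2, 3, 4] — cumulative counts per le bucket
```

Whatever the observed values, cardinality stays fixed at the number of buckets.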

3. Use Patterns Instead of Specific Values

❌ BAD:

api_requests{path="/api/users/123"}
api_requests{path="/api/users/456"}
api_requests{path="/api/orders/789"}

✅ GOOD:

api_requests{path="/api/users/:id"}
api_requests{path="/api/orders/:id"}
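Path normalization like this is usually done in the instrumentation layer before the value is used as a label. A minimal sketch, with hypothetical regex rules that you would adapt to your routing scheme:

```python
import re

# Hypothetical normalization rules; adapt to your actual routes
PATTERNS = [
    (re.compile(r"^/api/users/\d+$"), "/api/users/:id"),
    (re.compile(r"^/api/orders/\d+$"), "/api/orders/:id"),
]

def normalize_path(path):
    """Collapse concrete IDs into a bounded set of path templates
    before using the path as a label value."""
    for pattern, template in PATTERNS:
        if pattern.match(path):
            return template
    return "other"  # fallback keeps cardinality bounded even for unknown paths

print(normalize_path("/api/users/123"))   # /api/users/:id
print(normalize_path("/api/orders/789"))  # /api/orders/:id
```

The `"other"` fallback is important: without it, an unexpected or malicious URL can still blow up cardinality.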

4. Limit Cardinality via Relabeling

In the Prometheus configuration:

scrape_configs:
  - job_name: 'api'
    metric_relabel_configs:
      # Remove high-cardinality labels
      - regex: 'user_id'
        action: labeldrop

      # Replace detailed values with general ones
      - source_labels: [http_status]
        regex: '2..'
        replacement: '2xx'
        target_label: http_status_class

      # Remove entire metrics with problematic labels
      - source_labels: [__name__, user_email]
        regex: 'user_activity;.*'
        action: drop
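The second relabel rule above (collapsing concrete status codes into classes) is also commonly done in application code. A small illustrative equivalent, not part of any Prometheus API:

```python
import re

def status_class(http_status):
    """Map a concrete status code (e.g. "201") to its class ("2xx"),
    mirroring the relabeling rule above; purely illustrative."""
    match = re.fullmatch(r"([1-5])\d\d", http_status)
    return f"{match.group(1)}xx" if match else "unknown"

print(status_class("201"))  # 2xx
print(status_class("404"))  # 4xx
```

This shrinks the label from dozens of possible codes to at most six values.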

5. Monitor Cardinality

Metrics to Track:

# Size of the symbol table (interned label strings), in bytes
prometheus_tsdb_symbol_table_size_bytes

# Number of active series
prometheus_tsdb_head_series

# Series per metric
count by (__name__) ({__name__=~".+"})

# Top 10 metrics by cardinality
topk(10, count by (__name__) ({__name__=~".+"}))

# Cardinality growth over time (use delta, since head_series is a gauge)
delta(prometheus_tsdb_head_series[1h])

Cardinality Alerts:

groups:
  - name: cardinality
    rules:
    - alert: HighCardinality
      expr: prometheus_tsdb_head_series > 1000000
      for: 10m
      annotations:
        summary: "Metric cardinality too high"
        description: "Number of time series exceeded 1 million"

    - alert: CardinalityGrowth
      expr: delta(prometheus_tsdb_head_series[1h]) > 1000
      for: 15m
      annotations:
        summary: "Rapid cardinality growth"
        description: "Cardinality is growing by more than 1000 series/h"

Cardinality Analysis Tools

1. Promtool

# TSDB data analysis
promtool tsdb analyze /path/to/prometheus/data

# Top metrics by series count
promtool tsdb analyze /path/to/prometheus/data | grep "Highest cardinality"

2. Diagnostic Queries

# Most problematic metrics
topk(10,
  count by (__name__) ({__name__=~".+"})
)

# Series per value of a suspect label (replace label_name with the actual label)
topk(10,
  count by (label_name) ({__name__="your_metric"})
)

# Metrics with the highest number of unique instance values
sort_desc(
  count by (__name__) (
    count by (__name__, instance) ({__name__=~".+"})
  )
)

Best Practices

DO:

✅ Use low-cardinality labels (status codes, HTTP methods, operation types)
✅ Predefine possible label values in code
✅ Use patterns for URL paths
✅ Aggregate data at the application level before exporting
✅ Regularly monitor cardinality
✅ Document the maximum expected cardinality for each metric

DON’T:

❌ Don’t use user identifiers as labels
❌ Don’t use email addresses, tokens, or IPs as labels
❌ Don’t use timestamps as labels
❌ Don’t use full URLs with query parameters
❌ Don’t create dynamic metric names
❌ Don’t use UUIDs or hash sums as label values

Refactoring Example

BEFORE (bad cardinality):

# 1,000,000 users × 10 endpoints = 10,000,000 series
user_api_requests_total{
  user_id="123456",
  email="user@example.com",
  endpoint="/api/profile",
  full_url="/api/profile?tab=settings&lang=en"
}

Memory usage: ~30 GB RAM
Disk usage: ~200 GB/month with a 15s scrape interval

AFTER (good cardinality):

# Metric without user-specific data: 10 endpoints = 10 series
api_requests_total{
  endpoint="/api/profile",
  status="200"
}

# Aggregate for unique users: 10 endpoints = 10 series
api_unique_users_total{
  endpoint="/api/profile"
}

# Additional logs/traces for detailed user analysis
# (outside Prometheus, in logging system)

Memory usage: ~60 KB RAM
Disk usage: ~400 MB/month with a 15s scrape interval

Savings: 99.9%+ fewer resources!
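The savings claim can be checked with the series counts from the refactoring above:

```python
# Back-of-the-envelope check of the before/after numbers above
before_series = 1_000_000 * 10  # users × endpoints, user_id as a label
after_series = 10 + 10          # two metrics × 10 endpoints, no user_id

reduction = 1 - after_series / before_series
print(before_series, after_series)
print(f"{reduction:.4%}")  # well above 99.9%
```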

Summary

Cardinality is a key factor affecting:

  • Performance: query time, CPU usage
  • Costs: RAM, disk, infrastructure
  • Stability: OOM errors, slow responses

Golden rule: If the number of possible label values is unlimited or very large (>100), DON’T use it as a label.

Instead:

  • Use logs for detailed data
  • Use traces for transaction flow
  • Use metrics for aggregates and statistics
