Metric Cardinality
Metric cardinality problem
Metric cardinality is one of the most important concerns in time series-based monitoring systems. Improper cardinality management can lead to serious performance and cost issues.
What is cardinality?
Cardinality is the number of unique combinations of all label key-value pairs for a given metric.
Basic example:
http_requests_total{method="GET", status="200", path="/api/users"}
http_requests_total{method="POST", status="201", path="/api/users"}
http_requests_total{method="GET", status="404", path="/api/orders"}
Cardinality = number of unique combinations of {method, status, path}
For:
- 4 HTTP methods (GET, POST, PUT, DELETE)
- 10 different status codes
- 100 different paths
Maximum cardinality = 4 × 10 × 100 = 4,000 time series
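This worst-case calculation is just the product of the distinct value counts of each label. A minimal sketch (the helper name is made up for illustration):

```python
def max_cardinality(*label_value_counts: int) -> int:
    """Worst-case series count: product of distinct values per label."""
    result = 1
    for n in label_value_counts:
        result *= n
    return result

# 4 methods x 10 status codes x 100 paths
print(max_cardinality(4, 10, 100))  # 4000
```

Real cardinality is usually lower than this maximum, because not every combination of label values actually occurs.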
Why is cardinality a problem?
Performance impact
- Memory consumption: each unique time series requires:
  - Memory to store label metadata
  - Memory to buffer current values
  - Indexes for fast lookup
- Disk consumption:
  - Each series is written separately
  - More files and data blocks
  - Slower compaction
- Query performance:
  - More data to search
  - Longer response times
  - Higher CPU usage during aggregations
Cost formula
total_cardinality = metric_1_cardinality + metric_2_cardinality + ... + metric_n_cardinality
With Prometheus's typical compressed storage of 1-2 bytes per sample:
ram_usage ≈ number_of_series × 3KB (metadata + buffers)
disk_usage = number_of_series × samples_per_day × bytes_per_sample × retention_days
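The two formulas above can be plugged in directly. A rough estimator as a sketch (the 3KB-per-series figure is the rule of thumb from the formula, not an exact number):

```python
def ram_usage_bytes(series: int, bytes_per_series: int = 3_000) -> int:
    """~3KB of RAM per active series for metadata and buffers (rule of thumb)."""
    return series * bytes_per_series

def disk_usage_bytes(series: int, samples_per_day: int,
                     bytes_per_sample: float, retention_days: int) -> float:
    """disk_usage = series x samples/day x bytes/sample x retention days."""
    return series * samples_per_day * bytes_per_sample * retention_days

# 4,000 series scraped every 15s (5,760 samples/day), ~1.5 B/sample, 30-day retention
print(disk_usage_bytes(4_000, 5_760, 1.5, 30) / 1e9)  # ~1.04 GB
```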
Causes of High Cardinality
1. Labels with Unlimited Number of Values
❌ VERY BAD PRACTICES:
# User identifiers
http_requests_total{user_id="12345"}
http_requests_total{user_id="67890"}
# Potentially millions of unique user_id!
# Email addresses
login_attempts{email="user@example.com"}
# Session tokens
api_calls{session_token="abc123xyz"}
# Full URLs with parameters
requests_total{url="/api/search?q=prometheus&page=1&limit=50"}
# Timestamps
events_total{timestamp="2025-11-16T10:30:45Z"}
# User IPs
connections_total{client_ip="192.168.1.100"}
2. Overly Detailed Labels
❌ BAD PRACTICE:
# Full paths with parameters
http_requests{path="/api/users/123/orders/456/items/789"}
# Versions with build number
app_version{version="1.2.3-build-20231116-abc123"}
✅ GOOD PRACTICE:
# Path patterns
http_requests{path="/api/users/:id/orders/:id/items/:id"}
# Simplified version
app_version{version="1.2.3"}
3. Combination of Multiple Labels
# 5 labels with high cardinality
http_requests_total{
region="us-east-1", # 20 regions
az="us-east-1a", # 60 availability zones
instance="i-abc123", # 1000 instances
container="web-1", # 500 containers
pod="web-deployment-xyz" # 2000 pods
}
# Theoretical maximum cardinality: 20 × 60 × 1000 × 500 × 2000 = 1.2 trillion!
How to Manage Cardinality?
1. Use Aggregates Instead of Details
❌ BAD:
requests_total{user_id="123", endpoint="/api/users"}
✅ GOOD:
# Metric without user_id
requests_total{endpoint="/api/users"}
# Separate metric for unique users (aggregate)
active_users_total{endpoint="/api/users"}
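One way to feed such an aggregate is to count unique users inside the application and export only the count, so `user_id` never becomes a label. A minimal in-process sketch (function names are hypothetical):

```python
from collections import defaultdict

_seen_users = defaultdict(set)   # endpoint -> set of user ids (in-process state)

def record_request(endpoint: str, user_id: str) -> None:
    _seen_users[endpoint].add(user_id)

def active_users_total(endpoint: str) -> int:
    """Value to export as active_users_total{endpoint=...}."""
    return len(_seen_users[endpoint])

record_request("/api/users", "123")
record_request("/api/users", "456")
record_request("/api/users", "123")  # repeat visit, not double-counted
print(active_users_total("/api/users"))  # 2
```

In production you would bound the sets (e.g. reset per window or use a sketch like HyperLogLog), but the principle is the same: aggregate before exporting.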
2. Group Values into Buckets
❌ BAD:
http_response_time{duration_ms="1234"}
✅ GOOD:
# Use histogram with predefined buckets
http_response_time_bucket{le="0.1"}
http_response_time_bucket{le="0.5"}
http_response_time_bucket{le="1.0"}
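Histogram buckets are cumulative: each `le` (less-or-equal) bound counts every observation at or below it. A pure-Python sketch of that counting logic (the bucket bounds are the example's, not a recommendation):

```python
BUCKETS = [0.1, 0.5, 1.0]          # seconds; fixed ahead of time, low cardinality

def bucket_counts(durations):
    """Return cumulative counts per 'le' bound plus the total, Prometheus-style."""
    counts = {le: 0 for le in BUCKETS}
    total = 0
    for d in durations:
        for le in BUCKETS:
            if d <= le:
                counts[le] += 1
        total += 1               # corresponds to le="+Inf"
    return counts, total

counts, total = bucket_counts([0.05, 0.3, 0.7, 2.4])
print(counts)   # {0.1: 1, 0.5: 2, 1.0: 3}
print(total)    # 4
```

The key point for cardinality: no matter how many distinct durations are observed, the number of series stays fixed at the number of buckets.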
3. Use Patterns Instead of Specific Values
❌ BAD:
api_requests{path="/api/users/123"}
api_requests{path="/api/users/456"}
api_requests{path="/api/orders/789"}
✅ GOOD:
api_requests{path="/api/users/:id"}
api_requests{path="/api/orders/:id"}
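Path templating like this is usually done in the instrumentation layer before the path is used as a label value. A minimal sketch that collapses numeric path segments (the regex is an assumption; real apps often use their router's route template instead):

```python
import re

# Replace purely numeric path segments with :id before labeling
_ID_SEGMENT = re.compile(r"/\d+(?=/|$)")

def normalize_path(path: str) -> str:
    return _ID_SEGMENT.sub("/:id", path)

print(normalize_path("/api/users/123"))              # /api/users/:id
print(normalize_path("/api/users/123/orders/456"))   # /api/users/:id/orders/:id
```

If your web framework exposes the matched route pattern (e.g. `/api/users/{id}`), prefer that over regex normalization.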
4. Limit Cardinality via Relabeling
In the Prometheus configuration:
scrape_configs:
- job_name: 'api'
metric_relabel_configs:
# Remove high-cardinality labels (labeldrop matches label *names*, not values)
- regex: 'user_id'
action: labeldrop
# Replace detailed values with general ones
- source_labels: [http_status]
regex: '2..'
replacement: '2xx'
target_label: http_status_class
# Remove entire metrics with problematic labels
- source_labels: [__name__, user_email]
regex: 'user_activity;.*'
action: drop
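The status-class rule above can also live in the application itself, generalizing the `2..` → `2xx` mapping to all classes so the label has at most a handful of values. A sketch:

```python
import re

def status_class(status: str) -> str:
    """Map an exact HTTP status code to its class: '200' -> '2xx'."""
    return status[0] + "xx" if re.fullmatch(r"\d{3}", status) else status

print(status_class("200"))  # 2xx
print(status_class("404"))  # 4xx
```

Doing this at instrumentation time is cheaper than relabeling at scrape time, since the high-cardinality values are never produced at all.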
5. Monitor Cardinality
Metrics to Track:
# Symbol table size (label metadata held in memory)
prometheus_tsdb_symbol_table_size_bytes
# Number of active series
prometheus_tsdb_head_series
# Series per metric
count by (__name__) ({__name__=~".+"})
# Top 10 metrics by cardinality
topk(10, count by (__name__) ({__name__=~".+"}))
# Cardinality growth over time (head_series is a gauge, so use delta, not rate)
delta(prometheus_tsdb_head_series[1h])
Cardinality Alerts:
groups:
- name: cardinality
rules:
- alert: HighCardinality
expr: prometheus_tsdb_head_series > 1000000
for: 10m
annotations:
summary: "Metric cardinality too high"
description: "Number of time series exceeded 1 million"
- alert: CardinalityGrowth
expr: delta(prometheus_tsdb_head_series[1h]) > 1000
for: 15m
annotations:
summary: "Rapid cardinality growth"
description: "Cardinality is growing by more than 1000 series/h"
Cardinality Analysis Tools
1. Promtool
# TSDB data analysis
promtool tsdb analyze /path/to/prometheus/data
# Top metrics by series count
promtool tsdb analyze /path/to/prometheus/data | grep "Highest cardinality"
2. Diagnostic Queries
# Most problematic metrics
topk(10,
count by (__name__) ({__name__=~".+"})
)
# Series count per value of a chosen label (replace label_name with a real label)
topk(10,
count by (label_name) ({__name__="your_metric"})
)
# Metrics with the most unique values of a given label (here: instance)
sort_desc(
count by (__name__) (
count by (__name__, instance) ({__name__=~".+"})
)
)
Best Practices
DO:
✅ Use low-cardinality labels (status codes, HTTP methods, operation types)
✅ Predefine possible label values in code
✅ Use patterns for URL paths
✅ Aggregate data at the application level before exporting
✅ Regularly monitor cardinality
✅ Document the maximum expected cardinality for each metric
DON’T:
❌ Don’t use user identifiers as labels
❌ Don’t use email addresses, tokens, or IPs as labels
❌ Don’t use timestamps as labels
❌ Don’t use full URLs with query parameters
❌ Don’t create dynamic metric names
❌ Don’t use UUIDs or hash sums as label values
Refactoring Example
BEFORE (bad cardinality):
# 1,000,000 users × 10 endpoints = 10,000,000 series
user_api_requests_total{
user_id="123456",
email="user@example.com",
endpoint="/api/profile",
full_url="/api/profile?tab=settings&lang=en"
}
Memory usage: ~30GB RAM
Disk usage: ~200GB/month with a 15s scrape interval
AFTER (good cardinality):
# Metric without user-specific data: 10 endpoints = 10 series
api_requests_total{
endpoint="/api/profile",
status="200"
}
# Aggregate for unique users: 10 endpoints = 10 series
api_unique_users_total{
endpoint="/api/profile"
}
# Additional logs/traces for detailed user analysis
# (outside Prometheus, in logging system)
Memory usage: ~60KB RAM
Disk usage: ~400MB/month with a 15s scrape interval
Savings: 99.9%+ fewer resources!
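Recomputing the before/after numbers from this example confirms the claim:

```python
# Series counts from the refactoring example above
before = 1_000_000 * 10          # user_id x endpoint = 10,000,000 series
after = 10 + 10                  # api_requests_total + api_unique_users_total
savings = 1 - after / before
print(f"{savings:.5%}")          # 99.99980%
```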
Summary
Cardinality is a key factor affecting:
- Performance: query time, CPU usage
- Costs: RAM, disk, infrastructure
- Stability: OOM errors, slow responses
Golden rule: If the number of possible label values is unlimited or very large (>100), DON’T use it as a label.
Instead:
- Use logs for detailed data
- Use traces for transaction flow
- Use metrics for aggregates and statistics