Metrics
Introduction
What are they for?
They provide numerical context about what happened and how it is trending
- Numerical data over time: CPU, latency, number of requests
- Key to alerting and SLA/SLO
- Low-cost and efficient
Tools: Prometheus, Grafana Mimir, InfluxDB
Use of Metrics vs. Metrics in Logs
Why dedicated metrics systems?
Metrics are numerical time-series data optimized for:
- Fast aggregations (sum, average, percentiles)
- Efficient storage (compression, downsampling)
- Lightning-fast queries (time-based indexing)
- Real-time alerting
The problem with metrics in logs
❌ Bad approach - metrics as logs:
{"timestamp": "2025-10-27T10:00:00Z", "level": "info", "message": "Request processed", "response_time": 250, "status": 200}
{"timestamp": "2025-10-27T10:00:01Z", "level": "info", "message": "Request processed", "response_time": 180, "status": 200}
{"timestamp": "2025-10-27T10:00:02Z", "level": "info", "message": "Request processed", "response_time": 420, "status": 500}
✅ Good approach - dedicated metrics:
# Prometheus format
http_request_duration_seconds{method="GET",status="200"} 0.250
http_request_duration_seconds{method="GET",status="200"} 0.180
http_request_duration_seconds{method="GET",status="500"} 0.420
Storage Comparison
Scenario: 1000 requests/second for 1 hour
Metrics in logs (JSON):
{"ts":"2025-10-27T10:00:00Z","msg":"Request","duration":250,"status":200,"method":"GET","endpoint":"/api/users"}
- Single entry size: ~110 bytes
- 1000 req/s × 3600 s × 110 B ≈ 396 MB/hour
- 396 MB × 24 h ≈ 9.5 GB/day
Dedicated metrics (Prometheus):
http_request_duration_seconds{method="GET",endpoint="/api/users",status="200"} 0.250 1698412800
- Single entry size: ~85 bytes
- But with downsampling: 1000 raw points/s → ~15 aggregated points/minute
- 15 × 60 minutes × 85 B ≈ 76.5 KB/hour
- 76.5 KB × 24 h ≈ 1.8 MB/day
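The arithmetic above can be reproduced with a short sketch (the per-entry sizes of ~110 B and ~85 B and the 15 points/minute downsampling rate are the rough estimates used in this section):

```python
# Rough storage estimates: metrics-in-logs vs. downsampled dedicated metrics.

def logs_bytes_per_day(req_per_s=1000, entry_bytes=110):
    # Every request emits one log line, around the clock.
    return req_per_s * 3600 * 24 * entry_bytes

def metrics_bytes_per_day(points_per_min=15, entry_bytes=85):
    # Downsampled: ~15 aggregated points per minute regardless of traffic.
    return points_per_min * 60 * 24 * entry_bytes

logs = logs_bytes_per_day()        # 9_504_000_000 B ≈ 9.5 GB
metrics = metrics_bytes_per_day()  # 1_836_000 B ≈ 1.8 MB
print(f"logs: {logs / 1e9:.1f} GB/day, metrics: {metrics / 1e6:.1f} MB/day")
```

The ratio between the two exact figures is a bit over 5000x, which is where the "less disk space" row in the table below comes from.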
Efficiency Difference
| Aspect | Metrics in logs | Dedicated metrics | Improvement |
|---|---|---|---|
| Disk space | 9.5 GB/day | 1.8 MB/day | ~5000x less! |
| Query time | 5-10 seconds | 50-100ms | 50-100x faster |
When to use logs vs. metrics?
| Data | Logs | Metrics | Reason |
|---|---|---|---|
| Application errors | ✅ | ❌ | Context and stack trace needed |
| Response times | ❌ | ✅ | Aggregations, percentiles, alerting |
| Request count | ❌ | ✅ | Sums, trends, dashboards |
| Business events | ✅ | ✅ | Logs for context, metrics for KPIs |
| User actions | ✅ | ❌ | Detailed behavior tracking |
| System resources | ❌ | ✅ | Monitoring, alerting, autoscaling |
Metric Types
1. Counter
Definition: A value that only increases (or resets to zero)
Examples:
http_requests_total{method="GET", status="200"} 1547
errors_total{service="payment"} 23
bytes_sent_total{endpoint="/api/users"} 2048576
Characteristics:
- ✅ Monotonic (always goes up)
- ✅ Ideal for counting events
- ✅ Can calculate rate (increase per second)
- ❌ Does not show current value
Use cases:
- Number of HTTP requests
- Number of errors
- Number of processed tasks
- Bytes sent over network
PromQL examples:
# Rate - requests per second
rate(http_requests_total[5m])
# Increase in the last 5 minutes
increase(http_requests_total[5m])
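What rate() computes between two counter samples can be illustrated with a small sketch (the sample values are made up; the reset handling mirrors the idea that a counter only ever increases, so a drop means a restart):

```python
# Per-second rate between two counter samples, tolerating a counter reset.

def counter_rate(v_prev, v_curr, seconds):
    if v_curr < v_prev:           # reset detected: counter started again at 0
        return v_curr / seconds   # only the post-reset increase is known
    return (v_curr - v_prev) / seconds

print(counter_rate(1547, 1847, 60))  # 5.0 requests/second
print(counter_rate(1847, 120, 60))   # 2.0 requests/second after a restart
```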
2. Gauge
Definition: A value that can increase and decrease - shows current state
Examples:
cpu_usage_percent{host="web01"} 85.2
memory_available_bytes{host="web01"} 2147483648
active_connections{service="database"} 42
queue_size{queue="orders"} 156
Characteristics:
- ✅ Can increase and decrease
- ✅ Shows current value
- ✅ Ideal for alerting
- ❌ rate() is not meaningful on gauges; use the *_over_time() functions for trends
Use cases:
- CPU/RAM usage
- Temperature
- Number of active connections
- Queue size
- Number of active users
PromQL examples:
# Average CPU usage in the last 5 minutes
avg_over_time(cpu_usage_percent[5m])
# Maximum memory usage
max_over_time(memory_usage_percent[1h])
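What avg_over_time and max_over_time do can be sketched over a window of scraped gauge samples (the sample values below are made up for illustration):

```python
# A gauge is a series of point-in-time samples; the *_over_time functions
# aggregate every sample that falls inside the query window.

cpu_samples = [82.1, 85.2, 79.8, 91.0, 84.4]  # hypothetical 5m of scrapes

avg_over_time = sum(cpu_samples) / len(cpu_samples)
max_over_time = max(cpu_samples)

print(f"avg: {avg_over_time:.1f}%, max: {max_over_time:.1f}%")
```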
3. Histogram

Definition: Counts observations in predefined buckets (ranges)
Example structure:
http_request_duration_seconds_bucket{le="0.1"} 2450
http_request_duration_seconds_bucket{le="0.5"} 4321
http_request_duration_seconds_bucket{le="1.0"} 4890
http_request_duration_seconds_bucket{le="2.0"} 4950
http_request_duration_seconds_bucket{le="+Inf"} 5000
http_request_duration_seconds_sum 2847.3
http_request_duration_seconds_count 5000

What you get:
- Buckets - number of observations in each range
- Sum - sum of all values
- Count - total number of observations
Use cases:
- Response time
- Request size
- Processing duration
- SLA/percentile monitoring
PromQL examples:
# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# Fraction of requests below 500ms (multiply by 100 for a percentage)
rate(http_request_duration_seconds_bucket{le="0.5"}[5m]) / rate(http_request_duration_seconds_count[5m])
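histogram_quantile estimates a percentile by finding the bucket that contains the target rank and interpolating linearly between that bucket's bounds. A minimal sketch of that logic, using the cumulative bucket counts from the example structure above:

```python
# Approximate a quantile from cumulative histogram buckets, the way
# histogram_quantile() does: locate the bucket containing the target rank,
# then interpolate linearly between its lower and upper bound.

def histogram_quantile(q, buckets):
    # buckets: list of (upper_bound, cumulative_count), ending with +Inf
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

buckets = [(0.1, 2450), (0.5, 4321), (1.0, 4890), (2.0, 4950), (float("inf"), 5000)]
print(round(histogram_quantile(0.95, buckets), 3))  # ~0.877 seconds
```

This is also why classic-histogram percentiles are approximations: the answer depends on how well the bucket boundaries fit the data.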
3a. Native Histogram
Definition: Native histograms (also known as sparse histograms) are a new generation of histograms in Prometheus (since v2.40) that automatically select buckets based on observed values, eliminating the need for manual configuration.
Key differences vs classic histogram:
| Aspect | Classic Histogram | Native Histogram |
|---|---|---|
| Buckets | Manually predefined | Automatic (exponential) |
| Time series | One per bucket (le="…") | One series per metric |
| Data size | Grows with number of buckets | Constant, compact |
| Accuracy | Depends on bucket selection | Controlled by resolution schema |
| Configuration | Requires choosing boundaries | Minimal β works out-of-the-box |
How it works:
- Buckets based on powers of 2 (exponential boundaries)
- Schema parameter (from -4 to 8) controls resolution → higher schema = more buckets = greater accuracy
- Stored as a single time series instead of multiple _bucket series
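The exponential boundary rule above can be made concrete: with schema s, the growth factor is base = 2^(2^-s), and positive bucket i has upper bound base^i. A small sketch of that rule (the helper names are ours, not Prometheus API):

```python
# Native-histogram bucket boundaries grow exponentially.
# Higher schema => base closer to 1 => more, narrower buckets.

def bucket_base(schema):
    return 2 ** (2 ** -schema)

def upper_bound(schema, index):
    # Upper bound of positive bucket `index` for the given schema.
    return bucket_base(schema) ** index

print(bucket_base(0))     # 2.0 -> plain power-of-two buckets
print(bucket_base(3))     # ~1.09 -> much finer resolution
print(upper_bound(0, 4))  # 16.0
```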
Example structure (text format):
# Classic histogram: 5+ time series
http_request_duration_seconds_bucket{le="0.1"} 2450
http_request_duration_seconds_bucket{le="0.5"} 4321
http_request_duration_seconds_bucket{le="1.0"} 4890
http_request_duration_seconds_bucket{le="+Inf"} 5000
http_request_duration_seconds_sum 2847.3
http_request_duration_seconds_count 5000
# Native histogram: 1 time series contains all buckets!
http_request_duration_seconds → {schema:5, count:5000, sum:2847.3,
  positive_spans:[...], positive_deltas:[...]}
Enabling in Prometheus:
# prometheus.yml - global enablement
global:
  scrape_protocols:
    - PrometheusProto  # required for native histograms
    - OpenMetricsText1.0.0
    - OpenMetricsText0.0.1
    - PrometheusText1.0.0
    - PrometheusText0.0.4
# Enable feature flag when starting Prometheus:
# --enable-feature=native-histograms
PromQL - quantile queries keep the same syntax:
# 95th percentile - same function, but pass the series itself (no _bucket suffix)
histogram_quantile(0.95, rate(http_request_duration_seconds[5m]))
# Average response time - separate _sum/_count series don't exist for native histograms
histogram_sum(rate(http_request_duration_seconds[5m])) / histogram_count(rate(http_request_duration_seconds[5m]))
When to use Native Histogram:
- ✅ Prometheus >= 2.40 and you want to reduce cardinality (fewer time series)
- ✅ You don't know what buckets to choose - native histograms adapt automatically
- ✅ You need more accurate percentiles without increasing the number of buckets
- ✅ You have many histogram metrics and want to save storage/memory
- ❌ Not yet supported by all tools (e.g., older versions of Grafana, Thanos)
Tip: You can configure dual scraping - Prometheus collects both classic and native histograms simultaneously, making migration easier.
4. Summary

Definition: Similar to histogram, but with pre-calculated quantiles
Example structure:
http_request_duration_seconds{quantile="0.5"} 0.235
http_request_duration_seconds{quantile="0.9"} 0.821
http_request_duration_seconds{quantile="0.95"} 1.234
http_request_duration_seconds{quantile="0.99"} 2.156
http_request_duration_seconds_sum 2847.3
http_request_duration_seconds_count 5000
Characteristics:
- ✅ Pre-calculated quantiles (fast queries)
- ✅ Accurate percentile values
- ❌ Cannot aggregate across instances
- ❌ Quantiles fixed at application level
Use cases:
- Response time (when you need accurate percentiles)
- Processing time
- Queue wait time
Histogram vs Summary
| Aspect | Histogram | Summary |
|---|---|---|
| Percentiles | Approximated | Exact |
| Aggregation | ✅ Possible across instances | ❌ Not possible |
| Overhead | Lower | Higher |
| Flexibility | ✅ Quantiles in PromQL | ❌ Fixed upfront |
| Usage | Recommended for most cases | When you need exact percentiles |
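The "not possible" cell deserves a demonstration: the average of per-instance quantiles is not the quantile of the combined data, which is why summary quantiles cannot be meaningfully aggregated across instances. A sketch with made-up latency values:

```python
# Two instances with different latency distributions. The true p90 of the
# merged data differs from the average of the per-instance p90s.

def quantile(values, q):
    # Simple nearest-rank quantile on sorted data (illustrative only).
    s = sorted(values)
    return s[min(len(s) - 1, int(q * len(s)))]

a = [0.1] * 90 + [2.0] * 10   # instance A: mostly fast, a few slow requests
b = [1.0] * 100               # instance B: uniformly 1.0 s

avg_of_p90s = (quantile(a, 0.9) + quantile(b, 0.9)) / 2
true_p90 = quantile(a + b, 0.9)
print(avg_of_p90s, true_p90)  # 1.5 vs. 1.0 - averaging quantiles misleads
```

Histograms avoid this problem because raw bucket counts can be summed across instances before the quantile is estimated.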
Metric Naming Best Practices
Two closely related standards apply here: the Prometheus naming conventions and the OpenMetrics specification.
Conventions
# Counter - ends with _total
http_requests_total
errors_total
bytes_sent_total
# Gauge - describes current state
cpu_usage_percent
memory_available_bytes
active_connections
# Histogram/Summary - ends with unit + _bucket/_sum/_count
response_time_seconds_bucket
request_size_bytes_bucket
# Base units (SI)
_seconds (not _milliseconds)
_bytes (not _kilobytes)
_total (for counters)
Labels
# Good
http_requests_total{method="GET", status="200", endpoint="/api/user"}
# Bad - too high cardinality
http_requests_total{user_id="123456", session="abc-def-ghi"}
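The cardinality problem is multiplicative: each unique label combination creates a separate time series, so the series count is the product of distinct values per label. A quick sketch (the per-label counts are illustrative):

```python
from math import prod

# Each unique label combination becomes its own time series.

def series_count(label_cardinalities):
    return prod(label_cardinalities.values())

good = {"method": 5, "status": 8, "endpoint": 50}  # bounded value sets
bad = dict(good, user_id=100_000)                  # unbounded label

print(series_count(good))  # 2000 series - fine
print(series_count(bad))   # 200,000,000 series - cardinality explosion
```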
Practical Tips
✅ DO:
- Use metrics for everything that can be counted, measured, aggregated
- Log context, errors, unusual events
- Implement metrics at both application and infrastructure level
- Set alerts on metrics, not on logs
❌ DON'T:
- Don't log numerical data that repeats regularly
- Don't use logs for performance monitoring
- Avoid real-time alerting on logs
- Don't mix business metrics with diagnostic logs
Formats
Prometheus - Exposition Format
Format:
# HELP metric_name Description of the metric
# TYPE metric_name metric_type
metric_name{label1="value1",label2="value2"} metric_value timestamp
Example:
# HELP cpu_usage_percent Current CPU usage percentage
# TYPE cpu_usage_percent gauge
cpu_usage_percent{host="server01",region="us-west",service="web-app"} 85.2 1698412800000
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200",endpoint="/api/users"} 1500 1698412800000
# HELP memory_usage_bytes Memory usage in bytes
# TYPE memory_usage_bytes gauge
memory_usage_bytes{host="server01",region="us-west",type="available"} 2048000000 1698412800000
memory_usage_bytes{host="server01",region="us-west",type="used"} 6144000000 1698412800000
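Producing this text format needs no client library; a minimal sketch that renders one sample as an exposition-format line (metric, labels, and timestamp match the examples above; label sorting is our choice for deterministic output):

```python
# Render a single sample in the Prometheus text exposition format:
#   name{label="value",...} value [timestamp_ms]

def render_sample(name, labels, value, timestamp_ms=None):
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    line = f"{name}{{{label_str}}} {value}"
    if timestamp_ms is not None:
        line += f" {timestamp_ms}"
    return line

print(render_sample("cpu_usage_percent",
                    {"host": "server01", "region": "us-west"},
                    85.2, 1698412800000))
```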
Prometheus Structure:
- Metric Name - the metric's name (e.g., cpu_usage_percent)
- Labels - key-value pairs in {} (e.g., {host="server01"})
- Value - single numerical value
- Timestamp - Unix timestamp in milliseconds (optional)
Prometheus Advantages:
- ✅ Wide ecosystem and adoption (CNCF)
- ✅ Built-in alerting (Alertmanager)
- ✅ Pull model - better for service discovery
- ✅ PromQL - powerful query language
- ✅ Federation and hierarchical deployment
Prometheus Disadvantages:
- ❌ One value per metric (requires multiple metrics for complex data types)
- ❌ Limited long-term retention capabilities
- ❌ Issues with high label cardinality
OpenTelemetry - OTLP Metrics Format
Format (JSON):
{
  "resourceMetrics": [{
    "resource": {
      "attributes": [{
        "key": "service.name",
        "value": {"stringValue": "web-app"}
      }]
    },
    "scopeMetrics": [{
      "metrics": [{
        "name": "http_request_duration",
        "unit": "s",
        "gauge": {
          "dataPoints": [{
            "timeUnixNano": "1698412800000000000",
            "asDouble": 0.250,
            "attributes": [{
              "key": "method",
              "value": {"stringValue": "GET"}
            }]
          }]
        }
      }]
    }]
  }]
}
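The nesting above (Resource → Scope → Metric → DataPoint) is straightforward to build as plain dictionaries; a sketch producing the same gauge payload (the helper function is ours, values copied from the example):

```python
import json

# Build an OTLP-style gauge payload as nested dicts, following the
# Resource -> Scope -> Metric -> DataPoint layering shown above.

def gauge_payload(service, metric, unit, value, time_ns, attrs):
    return {"resourceMetrics": [{
        "resource": {"attributes": [
            {"key": "service.name", "value": {"stringValue": service}}]},
        "scopeMetrics": [{"metrics": [{
            "name": metric,
            "unit": unit,
            "gauge": {"dataPoints": [{
                "timeUnixNano": str(time_ns),
                "asDouble": value,
                "attributes": [{"key": k, "value": {"stringValue": v}}
                               for k, v in attrs.items()],
            }]},
        }]}],
    }]}

payload = gauge_payload("web-app", "http_request_duration", "s",
                        0.250, 1698412800000000000, {"method": "GET"})
print(json.dumps(payload, indent=2))
```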
Format (Protobuf - binary):
message ResourceMetrics {
  Resource resource = 1;
  repeated ScopeMetrics scope_metrics = 2;
}
Practical example (JSON):
{
  "resourceMetrics": [{
    "resource": {
      "attributes": [
        {"key": "service.name", "value": {"stringValue": "payment-service"}},
        {"key": "service.version", "value": {"stringValue": "1.2.3"}},
        {"key": "host.name", "value": {"stringValue": "server01"}}
      ]
    },
    "scopeMetrics": [{
      "scope": {
        "name": "payment-instrumentation",
        "version": "0.1.0"
      },
      "metrics": [
        {
          "name": "http_requests_total",
          "description": "Total HTTP requests",
          "unit": "1",
          "sum": {
            "aggregationTemporality": 2,
            "isMonotonic": true,
            "dataPoints": [{
              "timeUnixNano": "1698412800000000000",
              "asInt": "1500",
              "attributes": [ /* labels */
                {"key": "method", "value": {"stringValue": "GET"}},
                {"key": "status_code", "value": {"intValue": "200"}}
              ]
            }]
          }
        },
        {
          "name": "response_time_histogram",
          "description": "HTTP response time distribution",
          "unit": "s", /* unit */
          "histogram": { /* type */
            "aggregationTemporality": 2,
            "dataPoints": [{
              "timeUnixNano": "1698412800000000000",
              "count": "100",
              "sum": 25.0,
              "bucketCounts": ["10", "30", "40", "20", "0"], /* one more count than bounds: last is the (2.0, +Inf] bucket */
              "explicitBounds": [0.1, 0.5, 1.0, 2.0]
            }]
          }
        }
      ]
    }]
  }]
}
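Reading such a payload back is a matter of walking the same nesting. A sketch that finds the histogram datapoint (the count/sum values mirror the example above) and derives the average latency:

```python
# Walk an OTLP-style payload and compute the average from the first
# histogram datapoint (sum divided by count).

def histogram_average(payload):
    for rm in payload["resourceMetrics"]:
        for sm in rm["scopeMetrics"]:
            for metric in sm["metrics"]:
                if "histogram" in metric:
                    dp = metric["histogram"]["dataPoints"][0]
                    return dp["sum"] / int(dp["count"])

# Trimmed-down payload with just the fields this sketch needs:
payload = {"resourceMetrics": [{"scopeMetrics": [{"metrics": [{
    "name": "response_time_histogram",
    "histogram": {"dataPoints": [{"count": "100", "sum": 25.0}]},
}]}]}]}
print(histogram_average(payload))  # 0.25 seconds average response time
```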
OpenTelemetry Structure:
- Resource - resource metadata (service.name, host.name)
- Scope - instrumentation scope (library, version)
- Metrics - list of metrics with data
- DataPoints - data points with timestamps and attributes
OpenTelemetry Metric Types:
- Gauge - current value (like Prometheus gauge)
- Sum - cumulative sum (like Prometheus counter)
- Histogram - value distribution in buckets
- ExponentialHistogram - histogram with exponential buckets (Native histogram)
OpenTelemetry Advantages:
- β Vendor-neutral - works with many backends
- β Standardization across traces, logs, and metrics
- β Rich data model (Resource + Scope + Attributes)
- β Support for both push and pull
- β Automatic instrumentation for many languages
- β Support for sampling and batching
OpenTelemetry Disadvantages:
- β Higher overhead compared to simpler formats
- β Complexity - more layers of abstraction
- β Newer standard - less operational experience
- β Requires OTel Collector for full functionality
This layering shows the power of OpenTelemetry: a single OTel Collector can receive metrics in both OTLP and Prometheus formats and then export them to different backends!