Metrics

πŸ“ˆ Introduction

What are they for?

They provide numerical context for what happened and how it is trending over time

  • Numerical data over time: CPU, latency, number of requests
  • Key to alerting and SLA/SLO
  • Low-cost and efficient

Tools: Prometheus, Grafana Mimir, InfluxDB

🎯 Use of Metrics vs. Metrics in Logs

πŸ“Š Why dedicated metrics systems?

Metrics are numerical time-series data optimized for:

  • Fast aggregations (sum, average, percentiles)
  • Efficient storage (compression, downsampling)
  • Lightning-fast queries (time-based indexing)
  • Real-time alerting

⚑ The problem with metrics in logs

❌ Bad approach - metrics as logs:

{"timestamp": "2025-10-27T10:00:00Z", "level": "info", "message": "Request processed", "response_time": 250, "status": 200}
{"timestamp": "2025-10-27T10:00:01Z", "level": "info", "message": "Request processed", "response_time": 180, "status": 200}
{"timestamp": "2025-10-27T10:00:02Z", "level": "info", "message": "Request processed", "response_time": 420, "status": 500}

βœ… Good approach - dedicated metrics:

# Prometheus format
http_request_duration_seconds{method="GET",status="200"} 0.250
http_request_duration_seconds{method="GET",status="200"} 0.180
http_request_duration_seconds{method="GET",status="500"} 0.420

πŸ’Ύ Storage Comparison

Scenario: 1000 requests/second for 1 hour

πŸ“‹ Metrics in logs (JSON):

{"ts":"2025-10-27T10:00:00Z","msg":"Request","duration":250,"status":200,"method":"GET","endpoint":"/api/users"}
  • Single entry size: ~110 bytes
  • 1000 req/s Γ— 3600s Γ— 110B = 396 MB/hour
  • 396 MB Γ— 24h = 9.5 GB/day

πŸ“Š Dedicated metrics (Prometheus):

http_request_duration_seconds{method="GET",endpoint="/api/users",status="200"} 0.250 1698412800
  • Single entry size: ~85 bytes
  • But with aggregation and time compression: 1000 raw points/second reduced to ~15 stored points per minute
  • 15 Γ— 60 minutes Γ— 85B = 76.5 KB/hour
  • 76.5 KB Γ— 24h = 1.8 MB/day
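The back-of-the-envelope arithmetic above can be reproduced in a few lines (an illustrative sketch using the same assumed entry sizes and aggregation rate):

```python
# Rough storage math from the scenario above: 1000 req/s for 24 hours.
log_entry_bytes = 110          # assumed size of one JSON log line
req_per_sec = 1000
log_per_hour = req_per_sec * 3600 * log_entry_bytes   # bytes/hour
log_per_day_gb = log_per_hour * 24 / 1e9              # ~9.5 GB/day

metric_point_bytes = 85        # assumed size of one stored sample
points_per_minute = 15         # after aggregation/downsampling
metric_per_hour = points_per_minute * 60 * metric_point_bytes
metric_per_day_mb = metric_per_hour * 24 / 1e6        # ~1.8 MB/day

print(round(log_per_day_gb, 1), round(metric_per_day_mb, 1))
```

Note the improvement factor in the table below is computed from the rounded per-day figures (9.5 GB vs 1.8 MB).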

πŸ“ˆ Efficiency Difference

| Aspect | Metrics in logs | Dedicated metrics | Improvement |
|---|---|---|---|
| Disk space | 9.5 GB/day | 1.8 MB/day | ~5277Γ— less! |
| Query time | 5-10 seconds | 50-100 ms | 50-100Γ— faster |

βš–οΈ When to use logs vs metrics?

| Data | Logs | Metrics | Reason |
|---|---|---|---|
| Application errors | βœ… | ❌ | Context and stack trace needed |
| Response times | ❌ | βœ… | Aggregations, percentiles, alerting |
| Request count | ❌ | βœ… | Sums, trends, dashboards |
| Business events | βœ… | βœ… | Logs for context, metrics for KPIs |
| User actions | βœ… | ❌ | Detailed behavior tracking |
| System resources | ❌ | βœ… | Monitoring, alerting, autoscaling |

πŸ“Š Metric Types

πŸ”’ 1. Counter

Definition: A value that only increases (or resets to zero)

Examples:

http_requests_total{method="GET", status="200"} 1547
errors_total{service="payment"} 23
bytes_sent_total{endpoint="/api/users"} 2048576

Characteristics:

  • βœ… Monotonic (always goes up)
  • βœ… Ideal for counting events
  • βœ… Can calculate rate (increase per second)
  • ❌ Does not show current value

Use cases:

  • Number of HTTP requests
  • Number of errors
  • Number of processed tasks
  • Bytes sent over network

PromQL examples:

# Rate - requests per second
rate(http_requests_total[5m])

# Increase in the last 5 minutes
increase(http_requests_total[5m])
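As an illustrative sketch (not a real client library), the counter semantics above and the rate calculation that PromQL's `rate()` performs between two scrapes can be modeled like this:

```python
# Minimal sketch of a monotonic counter.
class Counter:
    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

requests = Counter()
for _ in range(120):
    requests.inc()                     # one increment per handled request

# Two scrapes 60 s apart: rate = increase / elapsed seconds
(t0, v0), (t1, v1) = (0, 0), (60, requests.value)
rate = (v1 - v0) / (t1 - t0)
print(rate)                            # 2.0 requests/second
```

The key property is monotonicity: the absolute value (120) is rarely interesting on its own, but the per-second rate derived from it is.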

πŸ“ 2. Gauge

Definition: A value that can increase and decrease - shows current state

Examples:

cpu_usage_percent{host="web01"} 85.2
memory_available_bytes{host="web01"} 2147483648
active_connections{service="database"} 42
queue_size{queue="orders"} 156

Characteristics:

  • βœ… Can increase and decrease
  • βœ… Shows current value
  • βœ… Ideal for alerting
  • ❌ The historical trend is not inherently meaningful; each sample is just a snapshot

Use cases:

  • CPU/RAM usage
  • Temperature
  • Number of active connections
  • Queue size
  • Number of active users

PromQL examples:

# Average CPU usage in the last 5 minutes
avg_over_time(cpu_usage_percent[5m])

# Maximum memory usage
max_over_time(memory_usage_percent[1h])
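An illustrative sketch of gauge semantics: unlike a counter it moves in both directions, and `avg_over_time` is conceptually just the mean of the sampled values in the window (real TSDBs do this over stored samples, not an in-memory list):

```python
# Minimal sketch of a gauge: a value that goes up and down.
class Gauge:
    def __init__(self):
        self.value = 0.0

    def set(self, v):
        self.value = v

    def inc(self, delta=1):
        self.value += delta

connections = Gauge()
samples = []
for delta in (+5, +3, -2, +1, -4):     # connections opening and closing
    connections.inc(delta)
    samples.append(connections.value)  # one "scrape" per change

avg_over_window = sum(samples) / len(samples)
print(connections.value, avg_over_window)
```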

πŸ“ˆ 3. Histogram

Definition: Counts observations in predefined buckets (ranges)

Example structure:

http_request_duration_seconds_bucket{le="0.1"} 2450
http_request_duration_seconds_bucket{le="0.5"} 4321
http_request_duration_seconds_bucket{le="1.0"} 4890
http_request_duration_seconds_bucket{le="2.0"} 4950
http_request_duration_seconds_bucket{le="+Inf"} 5000
http_request_duration_seconds_sum 2847.3
http_request_duration_seconds_count 5000

What you get:

  • Buckets - number of observations in each range
  • Sum - sum of all values
  • Count - total number of observations

Use cases:

  • Response time
  • Request size
  • Processing duration
  • SLA/percentile monitoring

PromQL examples:

# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# Fraction of requests below 500ms (multiply by 100 for a percentage)
rate(http_request_duration_seconds_bucket{le="0.5"}[5m]) / rate(http_request_duration_seconds_count[5m])
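The cumulative `le` bucket layout above can be sketched with stdlib Python (illustrative only; real client libraries maintain these structures internally):

```python
import bisect

# "le" boundaries as in the example above; the +Inf bucket is implicit.
bounds = [0.1, 0.5, 1.0, 2.0]
counts = [0] * (len(bounds) + 1)       # per-bucket (non-cumulative) counts
total, running_sum = 0, 0.0

for observation in (0.05, 0.25, 0.25, 0.7, 1.5, 3.0):
    # bisect_left finds the first boundary >= observation, matching
    # the "less than or equal" semantics of le buckets.
    counts[bisect.bisect_left(bounds, observation)] += 1
    total += 1
    running_sum += observation

# Cumulative counts, as exposed in *_bucket{le="..."}
cumulative, acc = [], 0
for c in counts:
    acc += c
    cumulative.append(acc)

print(cumulative, total, running_sum)
```

From these three pieces (buckets, count, sum) PromQL can derive rates, averages, and approximate percentiles, which is exactly why histograms aggregate well across instances.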

πŸ“ˆ 3a. Native Histogram

Definition: Native histograms (also known as sparse histograms) are a new generation of histograms in Prometheus (since v2.40) that automatically select buckets based on observed values, eliminating the need for manual configuration.

Key differences vs classic histogram:

| Aspect | Classic Histogram | Native Histogram |
|---|---|---|
| Buckets | Manually predefined | Automatic (exponential) |
| Time series | One per bucket (`le="…"`) | One series per metric |
| Data size | Grows with number of buckets | Constant, compact |
| Accuracy | Depends on bucket selection | Controlled by resolution schema |
| Configuration | Requires choosing boundaries | Minimal; works out of the box |

How it works:

  • Buckets based on powers of 2 (exponential boundaries)
  • Schema parameter (from -4 to 8) controls resolution β€” higher schema = more buckets = greater accuracy
  • Stored as a single time series instead of multiple _bucket series
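The exponential-bucket rule above means consecutive bucket boundaries grow by a factor of 2^(2^-schema); a small sketch shows how the growth factor shrinks (i.e., resolution improves) as the schema rises:

```python
# Growth factor between adjacent exponential-bucket boundaries for a
# given resolution schema: base = 2 ** (2 ** -schema).
def bucket_growth_factor(schema: int) -> float:
    return 2.0 ** (2.0 ** -schema)

for schema in (-1, 0, 3, 8):
    print(schema, bucket_growth_factor(schema))
# schema 0 doubles each bucket; schema 8 grows by only ~0.27% per bucket
```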

Example structure (text format):

# Classic histogram: many time series (one per bucket, plus _sum and _count)
http_request_duration_seconds_bucket{le="0.1"} 2450
http_request_duration_seconds_bucket{le="0.5"} 4321
http_request_duration_seconds_bucket{le="1.0"} 4890
http_request_duration_seconds_bucket{le="+Inf"} 5000
http_request_duration_seconds_sum 2847.3
http_request_duration_seconds_count 5000

# Native histogram: 1 time series contains all buckets!
http_request_duration_seconds  β†’  {schema:5, count:5000, sum:2847.3,
                                    positive_spans:[...], positive_deltas:[...]}

Enabling in Prometheus:

# prometheus.yml - global enablement
global:
  scrape_protocols:
    - PrometheusProto        # required for native histograms
    - OpenMetricsText1.0.0
    - OpenMetricsText0.0.1
    - PrometheusText1.0.0
    - PrometheusText0.0.4

# Enable feature flag when starting Prometheus:
# --enable-feature=native-histograms

PromQL β€” queries work the same:

# 95th percentile β€” identical syntax as for classic histogram
histogram_quantile(0.95, rate(http_request_duration_seconds[5m]))

# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

When to use Native Histogram:

  • βœ… Prometheus >= 2.40 and you want to reduce cardinality (fewer time series)
  • βœ… You don’t know what buckets to choose β€” native histogram adapts automatically
  • βœ… You need more accurate percentiles without increasing the number of buckets
  • βœ… You have many histogram metrics and want to save storage/memory
  • ❌ Not yet supported by all tools (e.g., older versions of Grafana, Thanos)

πŸ’‘ Tip: You can configure dual scraping β€” Prometheus collects both classic and native histograms simultaneously, making migration easier.

πŸ“ 4. Summary

Definition: Similar to histogram, but with pre-calculated quantiles

Example structure:

http_request_duration_seconds{quantile="0.5"} 0.235
http_request_duration_seconds{quantile="0.9"} 0.821
http_request_duration_seconds{quantile="0.95"} 1.234
http_request_duration_seconds{quantile="0.99"} 2.156
http_request_duration_seconds_sum 2847.3
http_request_duration_seconds_count 5000

Characteristics:

  • βœ… Pre-calculated quantiles (fast queries)
  • βœ… Accurate percentile values
  • ❌ Cannot aggregate across instances
  • ❌ Quantiles fixed at application level

Use cases:

  • Response time (when you need accurate percentiles)
  • Processing time
  • Queue wait time
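An illustrative sketch of what a summary tracks client-side. Real clients use streaming quantile algorithms (e.g., CKMS) rather than sorting a full window; this naive version only shows the shape of the data:

```python
# Naive sketch of a summary: client-side quantiles plus sum and count.
class Summary:
    def __init__(self, quantiles=(0.5, 0.9, 0.99)):
        self.quantiles = quantiles
        self.window = []               # real clients use a streaming sketch
        self.sum = 0.0
        self.count = 0

    def observe(self, v):
        self.window.append(v)
        self.sum += v
        self.count += 1

    def snapshot(self):
        data = sorted(self.window)
        return {q: data[int(q * (len(data) - 1))] for q in self.quantiles}

latency = Summary()
for ms in range(1, 101):               # observations 0.001s .. 0.100s
    latency.observe(ms / 1000)
print(latency.snapshot(), latency.count)
```

Because the quantiles are computed on the client, the server only ever sees the final numbers, which is why summaries from different instances cannot be meaningfully aggregated.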

βš–οΈ Histogram vs Summary

| Aspect | Histogram | Summary |
|---|---|---|
| Percentiles | Approximated | Exact |
| Aggregation | βœ… Possible across instances | ❌ Not possible |
| Overhead | Lower | Higher |
| Flexibility | βœ… Quantiles in PromQL | ❌ Fixed upfront |
| Usage | Recommended for most cases | When you need exact percentiles |

🎨 Metric Naming Best Practices

Two things matter here: naming conventions and label design.

Conventions

# Counter - ends with _total
http_requests_total
errors_total
bytes_sent_total

# Gauge - describes current state
cpu_usage_percent
memory_available_bytes
active_connections

# Histogram/Summary - ends with unit + _bucket/_sum/_count
response_time_seconds_bucket
request_size_bytes_bucket

# Base units (SI)
_seconds (not _milliseconds)
_bytes (not _kilobytes)
_total (for counters)

Labels

# Good
http_requests_total{method="GET", status="200", endpoint="/api/user"}

# Bad - too high cardinality
http_requests_total{user_id="123456", session="abc-def-ghi"}
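The cardinality warning is easy to quantify: the number of time series is the product of the distinct value counts of all labels. A quick sketch with assumed value counts:

```python
# Series count = product of label cardinalities (values are assumptions
# for illustration, not measurements).
methods, statuses, endpoints = 5, 8, 30
good = methods * statuses * endpoints            # bounded label sets

user_ids, sessions = 100_000, 3
bad = methods * statuses * user_ids * sessions   # unbounded label sets

print(good, bad)   # a few thousand series vs. millions
```

This is why identifiers like user or session IDs belong in logs or traces, not in metric labels.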

πŸ› οΈ Practical Tips

βœ… DO:

  • Use metrics for everything that can be counted, measured, aggregated
  • Log context, errors, unusual events
  • Implement metrics at both application and infrastructure level
  • Set alerts on metrics, not on logs

❌ DON’T:

  • Don’t log numerical data that repeats regularly
  • Don’t use logs for performance monitoring
  • Avoid real-time alerting on logs
  • Don’t mix business metrics with diagnostic logs

Formats

🎯 Prometheus - Exposition Format

Format:

# HELP metric_name Description of the metric
# TYPE metric_name metric_type
metric_name{label1="value1",label2="value2"} metric_value timestamp

Example:

# HELP cpu_usage_percent Current CPU usage percentage
# TYPE cpu_usage_percent gauge
cpu_usage_percent{host="server01",region="us-west",service="web-app"} 85.2 1698412800000

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200",endpoint="/api/users"} 1500 1698412800000

# HELP memory_usage_bytes Memory usage in bytes
# TYPE memory_usage_bytes gauge
memory_usage_bytes{host="server01",region="us-west",type="available"} 2048000000 1698412800000
memory_usage_bytes{host="server01",region="us-west",type="used"} 6144000000 1698412800000

Prometheus Structure:

  • Metric Name - metric name (e.g., cpu_usage_percent)
  • Labels - key-value pairs in {} (e.g., {host="server01"})
  • Value - single numerical value
  • Timestamp - Unix timestamp in milliseconds (optional)
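For illustration, the text format described above can be emitted with plain string formatting. A real exporter would use an official client library; the `render_metric` helper here is a hypothetical sketch:

```python
# Hypothetical helper that renders one metric family in the
# Prometheus exposition format shown above.
def render_metric(name, mtype, help_text, samples):
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

text = render_metric(
    "cpu_usage_percent", "gauge", "Current CPU usage percentage",
    [({"host": "server01", "region": "us-west"}, 85.2)],
)
print(text)
```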

Prometheus Advantages:

  • βœ… Wide ecosystem and adoption (CNCF)
  • βœ… Built-in alerting (Alertmanager)
  • βœ… Pull model - better for service discovery
  • βœ… PromQL - powerful query language
  • βœ… Federation and hierarchical deployment

Prometheus Disadvantages:

  • ❌ One value per metric (requires multiple metrics for complex data types)
  • ❌ Limited long-term retention capabilities
  • ❌ Issues with high label cardinality

🌐 OpenTelemetry - OTLP Metrics Format

Format (JSON):

{
  "resourceMetrics": [{
    "resource": {
      "attributes": [{
        "key": "service.name",
        "value": {"stringValue": "web-app"}
      }]
    },
    "scopeMetrics": [{
      "metrics": [{
        "name": "http_request_duration",
        "unit": "s",
        "gauge": {
          "dataPoints": [{
            "timeUnixNano": "1698412800000000000",
            "asDouble": 0.250,
            "attributes": [{
              "key": "method",
              "value": {"stringValue": "GET"}
            }]
          }]
        }
      }]
    }]
  }]
}

Format (Protobuf - binary):

message ResourceMetrics {
  Resource resource = 1;
  repeated ScopeMetrics scope_metrics = 2;
}

Practical example (JSON):

{
  "resourceMetrics": [{
    "resource": {
      "attributes": [
        {"key": "service.name", "value": {"stringValue": "payment-service"}},
        {"key": "service.version", "value": {"stringValue": "1.2.3"}},
        {"key": "host.name", "value": {"stringValue": "server01"}}
      ]
    },
    "scopeMetrics": [{
      "scope": {
        "name": "payment-instrumentation",
        "version": "0.1.0"
      },
      "metrics": [
        {
          "name": "http_requests_total",
          "description": "Total HTTP requests",
          "unit": "1",
          "sum": {
            "aggregationTemporality": 2,
            "isMonotonic": true,
            "dataPoints": [{
              "timeUnixNano": "1698412800000000000",
              "asInt": "1500",
              "attributes": [ /*Labels*/
                {"key": "method", "value": {"stringValue": "GET"}},
                {"key": "status_code", "value": {"intValue": "200"}}
              ]
            }]
          }
        },
        {
          "name": "response_time_histogram",
          "description": "HTTP response time distribution",
          "unit": "s", /*Unit*/
          "histogram": { /*Type*/
            "aggregationTemporality": 2,
            "dataPoints": [{
              "timeUnixNano": "1698412800000000000",
              "count": "100",
              "sum": 25.0,
              "bucketCounts": ["10", "30", "40", "20"],
              "explicitBounds": [0.1, 0.5, 1.0, 2.0]
            }]
          }
        }
      ]
    }]
  }]
}

OpenTelemetry Structure:

  • Resource - resource metadata (service.name, host.name)
  • Scope - instrumentation scope (library, version)
  • Metrics - list of metrics with data
  • DataPoints - data points with timestamps and attributes
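Reading such a payload back means walking the Resource β†’ Scope β†’ Metrics β†’ DataPoints layers in order. A minimal stdlib sketch over a trimmed-down version of the payload above:

```python
import json

# Trimmed OTLP JSON payload (subset of the example above).
payload = json.loads("""
{"resourceMetrics": [{
   "resource": {"attributes": [
     {"key": "service.name", "value": {"stringValue": "payment-service"}}]},
   "scopeMetrics": [{
     "metrics": [{
       "name": "http_requests_total",
       "sum": {"isMonotonic": true, "dataPoints": [
         {"timeUnixNano": "1698412800000000000", "asInt": "1500"}]}}]}]}]}
""")

# Walk Resource -> Scope -> Metrics -> DataPoints.
points = []
for rm in payload["resourceMetrics"]:
    service = rm["resource"]["attributes"][0]["value"]["stringValue"]
    for sm in rm["scopeMetrics"]:
        for metric in sm["metrics"]:
            for dp in metric["sum"]["dataPoints"]:
                points.append((service, metric["name"], int(dp["asInt"])))

print(points)
```

Note the layering: the service identity lives on the Resource, not on each data point, so it is stated once per payload instead of once per sample.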

OpenTelemetry Metric Types:

  • Gauge - current value (like Prometheus gauge)
  • Sum - cumulative sum (like Prometheus counter)
  • Histogram - value distribution in buckets
  • ExponentialHistogram - histogram with exponential buckets (Native histogram)

OpenTelemetry Advantages:

  • βœ… Vendor-neutral - works with many backends
  • βœ… Standardization across traces, logs, and metrics
  • βœ… Rich data model (Resource + Scope + Attributes)
  • βœ… Support for both push and pull
  • βœ… Automatic instrumentation for many languages
  • βœ… Support for sampling and batching

OpenTelemetry Disadvantages:

  • ❌ Higher overhead compared to simpler formats
  • ❌ Complexity - more layers of abstraction
  • ❌ Newer standard - less operational experience
  • ❌ Requires OTel Collector for full functionality

This demonstrates the power of OpenTelemetry: a single OTel Collector can receive metrics in both OTLP and Prometheus formats, and then export them to different backends!
