Cost Optimization

πŸ’° The Problem

Observability at production scale generates massive amounts of data. Without conscious cost management, the bill for storage and compute can quickly spiral out of control.

Where do costs come from?

| Signal | Main Cost Driver | Scale of the Problem |
| --- | --- | --- |
| Traces | Number of spans Γ— retention | Most expensive signal without sampling: one request is 5-20+ spans, each with attributes, events, and status |
| Logs | Data volume (GB/day) | Second largest: easy to generate TB/day with verbose logging |
| Metrics | Cardinality (number of unique time series) | Explodes with dynamic labels |
| Profiles | Sampling frequency Γ— number of pods | Relatively low cost |

Note: Without sampling, traces generate more data than logs β€” one HTTP request is one log line, but many spans (each hop between services, database query, external API call). That’s why trace sampling is crucial for cost control.

Traces β€” Most Expensive Signal Without Sampling

Problem: Without sampling, traces account for the largest data volume. Each request in a microservices architecture generates 5-20+ spans, and each span is β€œheavier” than a typical log line (attributes, events, timestamps, links).

Reduction strategies:

  • Tail-based sampling β€” keep 100% of errors and slow requests, 1-5% of normal traffic
  • Head-based sampling β€” simpler, but discards errors before seeing them
  • Adaptive sampling β€” dynamically adjust percentage based on load
  • Span filtering β€” discard spans from health checks and readiness probes
```yaml
# OTel Collector: drop health check spans
processors:
  filter/health:
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'
```
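
The tail-based sampling strategy above can be sketched with the Collector's `tail_sampling` processor. The policy names and exact thresholds below are illustrative, not prescriptive:

```yaml
# OTel Collector: tail-based sampling β€” keep all errors and slow
# requests, sample 5% of the remaining traffic (thresholds are examples)
processors:
  tail_sampling:
    decision_wait: 10s        # buffer spans until the whole trace arrives
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
      - name: sample-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Note that tail-based sampling requires all spans of a trace to reach the same Collector instance, which affects how you scale the Collector layer.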

Logs β€” Second Largest by Volume

Problem: Logs are the second largest signal, especially with verbose logging (debug/trace level in production).

Reduction strategies:

  • Filtering at the Collector level β€” drop debug/trace logs in production before they reach the backend
    ```alloy
    // Alloy: drop debug logs before sending to Loki
    loki.process "filter" {
      stage.drop {
        expression = ".*level=debug.*"
      }

      forward_to = [loki.write.default.receiver]
    }
    ```
    
  • Structured logs β€” JSON instead of plain text, better compression ratio in Loki
  • Retention β€” short retention for debug logs (3-7 days), longer for error/audit (30-90 days)
  • Parsing at ingestion β€” extract only needed fields, drop the rest
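
Tiered retention can be expressed in Loki's limits configuration. The selectors and periods below are illustrative and assume the Compactor is running with retention enabled:

```yaml
# Loki: short retention for debug streams, longer default
# (assumes compactor retention_enabled: true; values are examples)
limits_config:
  retention_period: 720h        # 30-day default
  retention_stream:
    - selector: '{level="debug"}'
      priority: 1
      period: 168h              # 7 days for debug logs
```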

Metrics β€” Cardinality Under Control

Problem: Each unique label combination creates a new time series. A user_id label across 1M users = 1M series.

Reduction strategies:

  • Avoid high-cardinality labels β€” never use user_id, request_id, trace_id as metric labels
  • Aggregation in the Collector β€” sum/average before sending to the backend
  • Native histograms (Prometheus) β€” one series instead of many buckets
  • Recording rules β€” pre-aggregate frequent queries, reduce query load
```yaml
# Prometheus recording rule β€” pre-aggregation
groups:
  - name: cost-optimization
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, status_code)
```
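
Aggregation in the Collector can be sketched with the `metricstransform` processor, which sums away high-cardinality labels before export. The metric and label names here are examples:

```yaml
# OTel Collector: drop high-cardinality labels by aggregating over them
processors:
  metricstransform:
    transforms:
      - include: http.server.duration
        action: update
        operations:
          - action: aggregate_labels
            label_set: [http.route, http.status_code]  # keep only these labels
            aggregation_type: sum
```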

Data Retention

| Signal | Recommended Retention | Justification |
| --- | --- | --- |
| Logs (debug) | 3-7 days | Only needed for current debugging |
| Logs (error/audit) | 30-90 days | Compliance and trend analysis |
| Metrics | 90-365 days (in Mimir) | Capacity planning, long-term trends |
| Traces | 7-30 days | Debugging: older ones rarely needed |
| Profiles | 7-14 days | Performance regression analysis between deployments |
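
These windows map onto backend settings. As a sketch, trace retention in Tempo and metric retention in Mimir (values mirror the table; check the option names against your deployed versions):

```yaml
# Tempo (tempo.yaml): keep trace blocks for 30 days
compactor:
  compaction:
    block_retention: 720h
---
# Mimir (mimir.yaml): keep metric blocks for one year
limits:
  compactor_blocks_retention_period: 1y
```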

Architecture and Costs

  • Object storage (Azure Blob / S3) instead of local disks β€” 10-100x cheaper per GB
  • Tiered storage β€” hot data on SSD, cold on object storage (Mimir, Loki, Tempo support this natively)
  • Compression β€” Loki and Tempo compress data automatically (snappy/zstd), but structured data compresses better
  • Dedicated Collectors per signal β€” easier scaling and cost control per pipeline
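
As a minimal sketch of the object-storage point, Loki can write chunks and indexes directly to an S3-compatible bucket instead of local disk (bucket name and region below are placeholders):

```yaml
# Loki: object storage backend instead of local disks
# (bucket name and region are placeholders)
common:
  storage:
    s3:
      bucketnames: loki-chunks
      region: eu-west-1
```

Tempo and Mimir accept analogous object-storage backends (S3, GCS, Azure Blob), so the same cost argument applies across all three.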
