Cost Optimization
💰 The Problem
Observability at production scale generates massive amounts of data. Without conscious cost management, the bill for storage and compute can quickly spiral out of control.
Where do costs come from?
| Signal | Main Cost Driver | Scale of the Problem |
|---|---|---|
| Traces | Number of spans × retention | Most expensive signal without sampling – one request is 5-20+ spans, each with attributes, events, and status |
| Logs | Data volume (GB/day) | Second largest – easy to generate TB/day with verbose logging |
| Metrics | Cardinality (number of unique time series) | Explodes with dynamic labels |
| Profiles | Sampling frequency × number of pods | Relatively low cost |
Note: Without sampling, traces generate more data than logs – one HTTP request is one log line, but many spans (each hop between services, database query, external API call). That's why trace sampling is crucial for cost control.
Traces – Most Expensive Signal Without Sampling
Problem: Without sampling, traces account for the largest data volume. Each request in a microservices architecture generates 5-20+ spans, and each span is "heavier" than a typical log line (attributes, events, timestamps, links).
Reduction strategies:
- Tail-based sampling – keep 100% of errors and slow requests, 1-5% of normal traffic
- Head-based sampling – simpler, but the keep/drop decision is made at the start of the trace, so errors can be discarded before they happen
- Adaptive sampling – dynamically adjust the sampling percentage based on load
- Span filtering – discard spans from health checks and readiness probes
```yaml
# OTel Collector: drop health check spans
processors:
  filter/health:
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'
```
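The tail-based strategy above can be sketched with the Collector's tail_sampling processor (contrib distribution). The decision wait, latency threshold, and sampling percentage below are illustrative assumptions, not universal values:

```yaml
# OTel Collector (contrib): keep all errors and slow requests,
# sample 5% of the remaining traffic (values are illustrative)
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans until the whole trace arrives
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Note that tail-based sampling requires all spans of a trace to reach the same Collector instance, which usually means a load-balancing tier in front of it.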
Logs – Second Largest by Volume
Problem: Logs are the second largest signal, especially with verbose logging (debug/trace level in production).
Reduction strategies:
- Filtering at the Collector level – drop debug/trace logs in production before they reach the backend

```alloy
// Alloy: drop debug logs before sending to Loki
loki.process "filter" {
  forward_to = [loki.write.default.receiver]

  stage.drop {
    expression = ".*level=debug.*"
  }
}
```

- Structured logs – JSON instead of plain text, better compression ratio in Loki
- Retention – short retention for debug logs (3-7 days), longer for error/audit (30-90 days)
- Parsing at ingestion – extract only needed fields, drop the rest
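The split-retention idea can be expressed directly in Loki, assuming logs carry a level label; a minimal sketch (the periods mirror the ranges above, and per-stream retention requires the compactor with retention enabled):

```yaml
# Loki: per-stream retention (assumes a `level` label; needs the
# compactor running with retention_enabled: true)
limits_config:
  retention_period: 720h          # 30 days default
  retention_stream:
    - selector: '{level="debug"}'
      priority: 1
      period: 168h                # 7 days for debug logs
```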
Metrics – Cardinality Under Control
Problem: Each unique label combination creates a new time series. A `user_id` label with 1M users = 1M series.
Reduction strategies:
- Avoid high-cardinality labels – never use `user_id`, `request_id`, `trace_id` as metric labels
- Aggregation in the Collector – sum/average before sending to the backend
- Native histograms (Prometheus) – one series instead of many buckets
- Recording rules – pre-aggregate frequent queries, reduce query load
```yaml
# Prometheus recording rule – pre-aggregation
groups:
  - name: cost-optimization
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, status_code)
```
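High-cardinality attributes can also be stripped in the Collector before they ever become labels. A sketch using the transform processor (contrib) with OTTL; the attribute names are the examples from this section:

```yaml
# OTel Collector (contrib): remove high-cardinality attributes
# from metric data points before export
processors:
  transform/drop_high_cardinality:
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "user_id")
          - delete_key(attributes, "request_id")
```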
Data Retention
| Signal | Recommended Retention | Justification |
|---|---|---|
| Logs (debug) | 3-7 days | Only needed for current debugging |
| Logs (error/audit) | 30-90 days | Compliance and trend analysis |
| Metrics | 90-365 days (in Mimir) | Capacity planning, long-term trends |
| Traces | 7-30 days | Debugging – older traces are rarely needed |
| Profiles | 7-14 days | Performance regression analysis between deployments |
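These policies are configured per backend. For example, trace retention in Tempo is enforced by the compactor; a minimal sketch using an illustrative 14 days, within the 7-30 day range above:

```yaml
# Tempo: delete trace blocks older than 14 days (illustrative value)
compactor:
  compaction:
    block_retention: 336h
```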
Architecture and Costs
- Object storage (Azure Blob / S3) instead of local disks – 10-100x cheaper per GB
- Tiered storage – hot data on SSD, cold on object storage (Mimir, Loki, Tempo support this natively)
- Compression – Loki and Tempo compress data automatically (snappy/zstd), but structured data compresses better
- Dedicated Collectors per signal – easier scaling and cost control per pipeline
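As an example of object storage as the primary backend, Mimir's block storage can point straight at an S3-compatible bucket; the endpoint and bucket name below are placeholders:

```yaml
# Mimir: store metric blocks in S3-compatible object storage
# (endpoint and bucket_name are placeholder values)
blocks_storage:
  backend: s3
  s3:
    endpoint: s3.eu-west-1.amazonaws.com
    bucket_name: mimir-blocks
```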