Cost Optimization
The Problem
Observability at production scale generates massive amounts of data. Without conscious cost management, the bill for storage and compute can quickly spiral out of control.
Where do costs come from?
| Signal | Main Cost Driver | Scale of the Problem |
|---|---|---|
| Traces | Number of spans × retention | Most expensive signal without sampling: one request is 5-20+ spans, each with attributes, events, and status |
| Logs | Data volume (GB/day) | Second largest: easy to generate TB/day with verbose logging |
| Metrics | Cardinality (number of unique time series) | Explodes with dynamic labels |
| Profiles | Sampling frequency × number of pods | Relatively low cost |
Note: Without sampling, traces generate more data than logs. One HTTP request is one log line, but many spans (each hop between services, database query, external API call). That's why trace sampling is crucial for cost control.
Traces: Most Expensive Signal Without Sampling
Problem: Without sampling, traces account for the largest data volume. Each request in a microservices architecture generates 5-20+ spans, and each span is "heavier" than a typical log line (attributes, events, timestamps, links).
Reduction strategies:
- Tail-based sampling: keep 100% of errors and slow requests, 1-5% of normal traffic
- Head-based sampling: simpler, but the keep/drop decision is made before the request finishes, so errors are discarded at the same rate as everything else
- Adaptive sampling: dynamically adjust the percentage based on load
- Span filtering: discard spans from health checks and readiness probes

# OTel Collector: drop health check spans
processors:
  filter/health:
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'
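To get a feel for how much a tail-based policy reduces trace volume, the expected kept fraction can be estimated with simple arithmetic. This is a rough sketch: the function name and the traffic mix (1% errors, 2% slow requests) are hypothetical, and it assumes the error and slow groups do not overlap.

```python
def kept_fraction(error_rate: float, slow_rate: float, baseline_pct: float) -> float:
    """Expected fraction of traces kept by a tail-sampling policy.

    Assumes errors and slow requests are always kept, the two groups do not
    overlap, and baseline_pct percent of the remaining traffic is sampled.
    """
    always_kept = error_rate + slow_rate
    return always_kept + (1.0 - always_kept) * baseline_pct / 100.0


# Hypothetical traffic mix: 1% errors, 2% slow requests, 5% baseline sampling.
fraction = kept_fraction(error_rate=0.01, slow_rate=0.02, baseline_pct=5)
print(f"kept: {fraction:.2%}")  # kept: 7.85%
```

Under these assumptions roughly 8% of traces are stored, so the always-keep rules, not the baseline percentage, can dominate the bill when error or latency rates rise.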
Logs: Second Largest by Volume
Problem: Logs are the second largest signal, especially with verbose logging (debug/trace level in production).
Reduction strategies:
- Filtering at the Collector level: drop debug/trace logs in production before they reach the backend

    // Alloy: drop debug logs before sending to Loki
    loki.process "filter" {
      forward_to = [loki.write.default.receiver]

      stage.drop {
        expression = ".*level=debug.*"
      }
    }

- Structured logs: JSON instead of plain text, better compression ratio in Loki
- Retention: short retention for debug logs (3-7 days), longer for error/audit (30-90 days)
- Parsing at ingestion: extract only needed fields, drop the rest
Metrics: Cardinality Under Control
Problem: Each unique label combination creates a new time series. A label with user_id with 1M users = 1M series.
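The multiplication behind that number is worth making explicit. The sketch below computes an upper bound on series for one metric as the product of distinct values per label; the label names and counts are made up for illustration, and real series counts are usually lower because not every combination occurs.

```python
from math import prod


def series_upper_bound(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label set for an HTTP request counter.
base = {"method": 5, "status_code": 6, "pod": 20}
print(series_upper_bound(base))                            # 600
print(series_upper_bound({**base, "user_id": 1_000_000}))  # 600000000
```

A single unbounded label multiplies everything that was already there, which is why it has to be removed rather than offset by trimming other labels.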
Reduction strategies:
- Avoid high-cardinality labels: never use user_id, request_id, or trace_id as metric labels
- Aggregation in the Collector: sum/average before sending to the backend
- Native histograms (Prometheus): one series instead of many buckets
- Recording rules: pre-aggregate frequent queries, reduce query load

# Prometheus recording rule: pre-aggregation
groups:
  - name: cost-optimization
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, status_code)
Data Retention
| Signal | Recommended Retention | Justification |
|---|---|---|
| Logs (debug) | 3-7 days | Only needed for current debugging |
| Logs (error/audit) | 30-90 days | Compliance and trend analysis |
| Metrics | 90-365 days (in Mimir) | Capacity planning, long-term trends |
| Traces | 7-30 days | Debugging: older ones rarely needed |
| Profiles | 7-14 days | Performance regression analysis between deployments |
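Retention multiplies ingestion: at steady state, stored volume is roughly daily ingestion times retention days, divided by whatever compression the backend achieves. A minimal sketch, with hypothetical volumes and a hypothetical compression ratio:

```python
def steady_state_gb(daily_ingest_gb: float, retention_days: int,
                    compression_ratio: float = 1.0) -> float:
    """Approximate stored volume once the retention window is full."""
    return daily_ingest_gb * retention_days / compression_ratio


# Hypothetical: 100 GB/day of debug logs, 5x compression.
print(steady_state_gb(100, 7, compression_ratio=5))   # 140.0
print(steady_state_gb(100, 30, compression_ratio=5))  # 600.0
```

Moving debug logs from 30 days to 7 days cuts their stored footprint by about 4x in this example without touching ingestion at all.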
Architecture and Costs
- Object storage (Azure Blob / S3) instead of local disks: 10-100x cheaper per GB
- Tiered storage: hot data on SSD, cold on object storage (Mimir, Loki, Tempo support this natively)
- Compression: Loki and Tempo compress data automatically (snappy/zstd), but structured data compresses better
- Dedicated Collectors per signal: easier scaling and cost control per pipeline
Measuring Costs: What to Monitor
Before you can optimize, you need to measure. Each backend exposes internal metrics that tell you exactly where your data (and money) goes.
| Signal | Key Metric | What It Tells You |
|---|---|---|
| Metrics | prometheus_tsdb_head_series | Total active time series: the main cardinality indicator |
| Metrics | prometheus_remote_storage_bytes_total | Volume sent to remote write (Mimir, Grafana Cloud) |
| Metrics | scrape_series_added per target | Which scrape targets contribute the most series |
| Logs | loki_distributor_bytes_received_total | Ingested bytes: break down by tenant or namespace label |
| Logs | loki_ingester_streams_created_total | Streams created: a high creation rate means stream churn and an expensive index |
| Traces | tempo_distributor_spans_received_total | Spans ingested per second: the main cost driver |
| Traces | tempo_ingester_bytes_received_total | Raw bytes ingested into Tempo |
| Collector | otelcol_exporter_sent_spans_total vs otelcol_receiver_accepted_spans_total | Pipeline efficiency: a large gap means spans are being dropped, either deliberately (sampling, filters) or through data loss |
| Collector | otelcol_processor_dropped_log_records_total | How much your filters are actually dropping |
Cardinality analysis (metrics)
Monitor prometheus_tsdb_head_series over time: a sudden spike means a cardinality explosion (usually a new high-cardinality label). Find the top offenders:
# Top 10 metric names by number of series
topk(10, count by (__name__) ({__name__=~".+"}))
Tools: mimirtool (analyze prometheus), Grafana Cardinality Management dashboard.
Forecasting
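Use predict_linear() to project ingestion 30 days ahead. Apply it to the ingestion rate rather than the raw counter, because the counter's absolute value depends on process restarts and is not a volume:
# Projected daily log volume 30 days from now (GB/day)
predict_linear(
  sum(rate(loki_distributor_bytes_received_total[1h]))[7d:1h],
  30 * 86400
) * 86400 / 1e9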
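predict_linear() fits a straight line to the samples by simple linear regression and extrapolates it. The sketch below reproduces that idea in plain Python to make the arithmetic visible. It is a simplification (it extrapolates from the last sample rather than from the query evaluation time), and the function name and sample data are hypothetical.

```python
def linear_forecast(samples: list[tuple[float, float]], seconds_ahead: float) -> float:
    """Least-squares line through (timestamp, value) samples, extrapolated
    seconds_ahead past the last timestamp."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


# Hypothetical week of daily volume: 100 GB/day, growing by 2 GB each day.
DAY = 86_400.0
week = [(day * DAY, 100.0 + 2.0 * day) for day in range(7)]
print(round(linear_forecast(week, 30 * DAY), 1))  # 172.0
```

A linear fit is only as good as the assumption of steady growth; a one-off spike inside the 7-day window tilts the line and inflates the forecast.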
Cost Attribution β Who Consumes What
Attribution is not about reducing costs; it's about visibility. Before you can have a conversation about budgets, you need to answer: which team, service, and environment is responsible for which portion of the observability bill?
What to attribute
Cost attribution has four dimensions. Most teams only measure the first one and miss the rest:
| Dimension | What it measures | Why it matters |
|---|---|---|
| Ingestion volume | How much data a team sends in (GB, series, spans) | The primary cost driver: this is what backends charge for |
| Query cost | How often and how expensively a team queries data | Heavy dashboards and alerts can cost as much as ingestion |
| Storage footprint | How much space a team's data occupies after compaction | Retention policies multiply ingestion cost over time |
| Compute consumption | CPU/memory used by ingesters, queriers, compactors for a team's data | Relevant in self-hosted setups where you pay for the infrastructure |
In managed platforms (Grafana Cloud, Datadog, New Relic), ingestion volume is the dominant cost and the easiest to attribute. In self-hosted setups, compute and storage matter equally.
Labeling strategy β the foundation
Attribution only works if every telemetry signal carries consistent labels that map data to its owner. This must be enforced at the infrastructure level, not left to individual teams.
Required resource attributes (OTel)
Set these in the OTel SDK or Collector. Because they are resource attributes, they are attached to all three signals automatically:
| Attribute | Example | Maps to |
|---|---|---|
| service.name | payment-api | Individual service |
| service.namespace | checkout | Team / domain |
| deployment.environment | production | Environment |
| service.version | 2.4.1 | Useful for detecting cost changes after deployments |
How to enforce labels
Option 1: OTel Collector resource processor. Add or override attributes centrally:
processors:
  resource/attribution:
    attributes:
      # Copy the Kubernetes namespace (already present as a resource
      # attribute, e.g. set by k8sattributes) into service.namespace
      - key: service.namespace
        from_attribute: k8s.namespace.name
        action: upsert
      # Set a fixed environment for everything passing through this Collector
      - key: deployment.environment
        value: production
        action: upsert
Option 2: Kubernetes labels to OTel resource attributes. Use the k8sattributes processor:
processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
      labels:
        - tag_name: team
          key: app.kubernetes.io/team
          from: pod
        - tag_name: cost_center
          key: billing/cost-center
          from: namespace
This way, Kubernetes labels like app.kubernetes.io/team: checkout on pods automatically become telemetry attributes, with no code changes needed.
Option 3: Grafana Alloy labels. If using Alloy for log collection, labels come from Kubernetes discovery:
discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "add_team" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_team"]
    target_label  = "team"
  }
}
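The long source label in the second rule follows the Prometheus-style discovery convention: characters in a Kubernetes label name that are not valid in a label name are replaced with underscores, and the result is prefixed with __meta_kubernetes_pod_label_. A small sketch of that mapping (the helper name is hypothetical) helps when deriving the source label for other Kubernetes labels:

```python
import re


def meta_label(kind: str, k8s_label: str) -> str:
    """Build the discovery meta label for a Kubernetes label on a pod, node, etc."""
    sanitized = re.sub(r"[^a-zA-Z0-9_]", "_", k8s_label)
    return f"__meta_kubernetes_{kind}_label_{sanitized}"


print(meta_label("pod", "app.kubernetes.io/team"))
# __meta_kubernetes_pod_label_app_kubernetes_io_team
```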
Label consistency across signals
A common problem: metrics use label namespace, logs use namespace, but traces use service.namespace. Your attribution queries break because the same data has different keys.
| Signal | Typical label | Normalize to |
|---|---|---|
| Metrics (Prometheus) | namespace (from kubernetes_sd) | namespace |
| Logs (Loki) | namespace (from discovery) | namespace |
| Traces (Tempo) | service.namespace (OTel resource) | Map via Collector resource processor |
Use the Collector to normalize before data reaches backends:
processors:
  resource/normalize:
    attributes:
      - key: namespace
        from_attribute: service.namespace
        action: upsert
Attribution per signal: what to measure and how
Metrics (Prometheus / Mimir)
What to measure: number of active time series per team/service.
# Active series per namespace
count by (namespace) ({__name__=~".+"})
# Active series per job (more granular)
count by (namespace, job) ({__name__=~".+"})
# Scrape volume per target (samples per scrape, after metric relabeling)
sum by (namespace, job) (scrape_samples_post_metric_relabeling)
Mimir-specific: Mimir tracks per-tenant usage out of the box. If you use multi-tenancy (one tenant per team), you get attribution for free:
# Per-tenant active series in Mimir (the tenant ID is carried in the user label)
sum by (user) (cortex_ingester_active_series)
What often gets missed:
- Recording rules: a team's recording rules consume compute but don't show up in ingestion metrics. Track with cortex_ruler_queries_total by tenant.
- Alert evaluation: similar to recording rules. Track with cortex_ruler_ring_check_errors_total.
Logs (Loki)
What to measure: ingested bytes and stream count per team.
# Ingested bytes per namespace (GB/day)
sum by (namespace) (
  rate(loki_distributor_bytes_received_total[24h])
) * 86400 / 1e9
# Ingested lines per namespace (lines/day)
sum by (namespace) (
  rate(loki_distributor_lines_received_total[24h])
) * 86400
# Active streams per tenant: high stream count = expensive indexing
sum by (tenant) (loki_ingester_memory_streams)
What often gets missed:
- Query cost: some teams have dashboards that run expensive full-scan queries every 30 seconds. Track with:

    # Bytes scanned per query (Loki query-frontend)
    sum by (tenant) (rate(loki_query_frontend_bytes_processed_per_second[1h]))

- Log volume vs. log value: a team may ingest 50 GB/day but only query 1% of it. Cross-reference ingestion with query frequency to find "write-only" log streams.
Traces (Tempo)
What to measure: spans per second and bytes ingested per service.
# Spans per service (per day)
sum by (service_name) (
  rate(tempo_distributor_spans_received_total[24h])
) * 86400
# Bytes per service (GB/day)
sum by (service_name) (
  rate(tempo_distributor_bytes_received_total[24h])
) * 86400 / 1e9
What often gets missed:
- Span size variance: two services may send the same number of spans, but one attaches 50 attributes per span (including SQL queries and request bodies) and costs 10x more in storage. Track average span size:

    # Average span size per service (bytes)
    sum by (service_name) (rate(tempo_distributor_bytes_received_total[1h]))
      / sum by (service_name) (rate(tempo_distributor_spans_received_total[1h]))

- Trace depth: one request from service A may generate 5 spans, while service B generates 200 (deep call chains, loops). This is visible in span count per trace but harder to query. Consider recording it as a custom metric via the spanmetrics connector.
Attributing shared infrastructure costs
Not all costs map cleanly to a single team. Shared components need a fair split.
| Shared component | Attribution strategy |
|---|---|
| Ingress / API gateway | Attribute to the upstream service that receives the request (use service.name from the first downstream span) |
| Message queues (Kafka, RabbitMQ) | Split by topic/queue β each topic is usually owned by one team |
| Databases | Attribute to the service that issues the query (from span db.system + service.name) |
| OTel Collector infra | Allocate proportionally to ingestion volume per tenant |
| Kubernetes system components | Treat as platform overhead β split evenly or by node usage |
| Grafana / query infrastructure | Attribute by dashboard ownership or query origin (harder; see below) |
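The "allocate proportionally" strategy from the table can be sketched in a few lines. The function name is hypothetical, the shared cost and per-team ingestion figures are illustrative, and falling back to an even split when nobody ingested anything is one possible convention, not the only one.

```python
def allocate_shared_cost(shared_cost: float, ingestion_gb: dict[str, float]) -> dict[str, float]:
    """Split a shared cost across teams in proportion to their ingestion volume."""
    total = sum(ingestion_gb.values())
    if total == 0:
        # Nothing ingested: fall back to an even split.
        return {team: shared_cost / len(ingestion_gb) for team in ingestion_gb}
    return {team: shared_cost * gb / total for team, gb in ingestion_gb.items()}


# Illustrative: $1,200 of shared Collector cost, split by log ingestion (GB/day).
shares = allocate_shared_cost(1200, {"checkout": 45, "platform": 12, "search": 95, "mobile": 8})
print(shares)
# {'checkout': 337.5, 'platform': 90.0, 'search': 712.5, 'mobile': 60.0}
```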
Query cost attribution
This is the hardest dimension. Most backends don't expose "who ran this query." Approaches:
- Grafana Cloud: Usage Insights shows query cost per dashboard, per user
- Self-hosted Grafana: enable query audit logging ([log] filters = rendering:debug) and parse the logs to extract dashboard UID and user
- Mimir / Loki: if using multi-tenant mode, query cost is automatically per-tenant via cortex_query_frontend_queries_total{tenant="..."}
- Convention: assign each dashboard to an owning team via folder structure (e.g., Grafana folder = team name), then attribute query cost by folder
Building the attribution report
Dashboard structure
┌─────────────────────────────────────────────────────────────────┐
│ COST ATTRIBUTION REPORT - April 2026                            │
├───────────┬──────────┬──────────┬──────────┬───────────┬────────┤
│ Team      │ Metrics  │ Logs     │ Traces   │ Storage   │ Total  │
│           │ (series) │ (GB/day) │ (M spans)│ (GB)      │ ($/mo) │
├───────────┼──────────┼──────────┼──────────┼───────────┼────────┤
│ checkout  │ 120k     │ 45 GB    │ 12M      │ 890 GB    │ $2,340 │
│ platform  │ 350k     │ 12 GB    │ 3M       │ 420 GB    │ $1,870 │
│ search    │ 80k      │ 95 GB    │ 28M      │ 1,200 GB  │ $3,150 │
│ mobile    │ 45k      │ 8 GB     │ 45M      │ 650 GB    │ $2,890 │
│ shared    │ 200k     │ 30 GB    │ -        │ 600 GB    │ $1,200 │
├───────────┼──────────┼──────────┼──────────┼───────────┼────────┤
│ TOTAL     │ 795k     │ 190 GB   │ 88M      │ 3,760 GB  │$11,450 │
└───────────┴──────────┴──────────┴──────────┴───────────┴────────┘
Grafana implementation
Use a table panel with transformations:
- Multiple queries (one per signal, grouped by namespace)
- Outer join on namespace
- Add field from calculation: multiply each volume by its unit price
- Sort by total cost descending
Add Grafana variables for time range and team filter so managers can drill down.
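The transformation chain above (outer join, price each column, sort) is easy to prototype outside Grafana to validate the numbers. The sketch below does the same in plain Python; the function name, signal keys, usage figures, and unit prices are all hypothetical.

```python
def build_report(usage_by_signal: dict[str, dict[str, float]],
                 unit_price: dict[str, float]) -> list[tuple[str, float]]:
    """Outer-join per-signal usage on namespace, price each signal, sort by total cost."""
    namespaces = set().union(*(usage.keys() for usage in usage_by_signal.values()))
    rows = []
    for namespace in namespaces:
        # A namespace missing from a signal contributes zero for that signal.
        total = sum(usage_by_signal[signal].get(namespace, 0.0) * unit_price[signal]
                    for signal in usage_by_signal)
        rows.append((namespace, total))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical monthly usage and prices.
usage = {
    "series_k": {"checkout": 120, "search": 80},                       # thousands of series
    "logs_gb_month": {"checkout": 1350, "search": 2850, "mobile": 240},
}
prices = {"series_k": 8.0, "logs_gb_month": 0.5}
for namespace, cost in build_report(usage, prices):
    print(f"{namespace:10s} ${cost:,.2f}")
```

The outer join matters: a team that sends logs but no metrics (mobile here) would silently disappear from the report with an inner join.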
Automated reporting
- Grafana Reporting (Enterprise/Cloud): schedule PDF delivery to Slack or email monthly
- Self-hosted: use Grafana HTTP API to render dashboard as PNG/PDF via a cron job:

    curl -H "Authorization: Bearer $GRAFANA_TOKEN" \
      "https://grafana.internal/render/d/cost-attribution?width=1200&height=800" \
      -o report.png

- Alertmanager: send monthly summary via webhook to Slack with top 5 consumers
Common pitfalls
| Pitfall | Why it happens | How to avoid |
|---|---|---|
| Missing labels on some signals | Team added OTel SDK to app but didn't configure resource attributes | Enforce via Collector resource processor: inject from K8s metadata |
| Label cardinality in attribution | Using pod as attribution label: pods are ephemeral, which creates thousands of "teams" | Attribute by namespace or deployment, never by pod |
| Sampling distorts attribution | Team A samples at 1%, team B at 100%, so raw span counts don't reflect true traffic | Normalize by sampling rate: spans * (1 / sampling_rate) |
| Shared services skew numbers | API gateway or message bus shows as top consumer, but it's proxying for other teams | Use downstream service.name or split by route/topic |
| Ignoring query cost | Team ingests little data but runs 50 heavy dashboards refreshing every 10s | Track query metrics alongside ingestion |
| Point-in-time vs. cumulative | Report shows "series count now" but team deleted services mid-month | Use avg_over_time or integrate rate over the billing period |
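The sampling normalization from the table, spans * (1 / sampling_rate), is shown below with hypothetical numbers. Note that it is only an approximation under tail-based sampling, where errors and slow requests are kept at 100% and the effective rate is not uniform.

```python
def estimated_true_spans(stored_spans: int, sampling_rate: float) -> float:
    """Scale stored span counts back up to the traffic they represent."""
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    return stored_spans / sampling_rate


# Hypothetical: team A stores fewer spans but samples at 1%.
team_a = estimated_true_spans(120_000, sampling_rate=0.01)
team_b = estimated_true_spans(3_000_000, sampling_rate=1.0)
print(f"team A: ~{team_a / 1e6:.0f}M spans, team B: ~{team_b / 1e6:.0f}M spans")
```

Use the raw stored counts for billing (that is what the backend charges for) and the normalized counts when comparing how much traffic each team actually generates.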
Cost Dashboard β What to Include
A cost dashboard should answer five questions at a glance:
1. How much are we spending? (Executive summary)
Stat panels showing total estimated cost per signal per month, with month-over-month trend. Use Grafana dashboard variables for unit prices ($cost_per_gb_logs, $cost_per_1k_series, $cost_per_m_spans) so the dashboard adapts to your pricing model.
# Estimated monthly log cost
sum(rate(loki_distributor_bytes_received_total[24h])) * 86400 * 30 / 1e9 * $cost_per_gb_logs
# Estimated monthly metrics cost
prometheus_tsdb_head_series / 1000 * $cost_per_1k_series
# Estimated monthly traces cost
sum(rate(tempo_distributor_spans_received_total[24h])) * 86400 * 30 / 1e6 * $cost_per_m_spans
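The log and trace queries above share the same unit conversion: a per-second rate, scaled to a 30-day month, divided by the billing unit, times the unit price. A quick sanity check in Python, with a hypothetical ingestion rate and price, helps confirm that a dashboard panel is in the right order of magnitude:

```python
SECONDS_PER_DAY = 86_400


def monthly_cost(rate_per_second: float, unit_size: float, unit_price: float,
                 days: int = 30) -> float:
    """Convert a per-second ingestion rate into an estimated monthly cost.

    unit_size is the billing unit expressed in the rate's base unit
    (1e9 bytes for GB, 1e6 spans for millions of spans).
    """
    units_per_month = rate_per_second * SECONDS_PER_DAY * days / unit_size
    return units_per_month * unit_price


# Hypothetical: 2 MB/s of logs at $0.50 per GB.
log_cost = monthly_cost(rate_per_second=2e6, unit_size=1e9, unit_price=0.50)
print(f"${log_cost:,.0f}")  # $2,592
```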
2. Who generates the most? (Breakdown per team/service)
Stacked bar chart of ingestion volume grouped by namespace or service_name, split by signal type (metrics, logs, traces).
3. What is wasted? (Waste detection)
- Metrics nobody queries (Grafana Cloud: Adaptive Metrics; self-hosted: check Grafana query logs)
- DEBUG/TRACE logs in production: often 60-80% of total log volume
- 100% sampled traces on non-critical services
- Health-check and readiness probe data across all signals
4. What will it cost next month? (Forecast)
Time series panel with predict_linear() projecting ingestion 30 days ahead, overlaid with a budget threshold line.
5. Storage backend costs
Table panel showing object storage breakdown: Loki chunks, Tempo blocks, Mimir blocks, with size in TB and estimated cost per month.
OTel Collector as a Cost Gateway
The Collector pipeline is the single best place to control costs. All data flows through it before reaching backends.
Tail-based sampling (traces)
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always keep errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Always keep slow requests
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 2000}
      # Drop health checks entirely
      - name: drop-health
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/healthz", "/readyz", "/livez"]
          invert_match: true
      # Sample 5% of the rest
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
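The intended outcome of these four policies can be written down as a small decision function. This is a sketch of the desired behavior, not of how the Collector evaluates policies internally; the Trace type and the hash-based sampling are assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass

HEALTH_ROUTES = {"/healthz", "/readyz", "/livez"}


@dataclass
class Trace:
    trace_id: str
    has_error: bool
    duration_ms: float
    route: str


def keep_trace(trace: Trace, baseline_pct: float = 5.0, slow_ms: float = 2000.0) -> bool:
    """Return True if the trace should be stored."""
    if trace.route in HEALTH_ROUTES:
        return False  # health checks are dropped even if slow or failing
    if trace.has_error or trace.duration_ms > slow_ms:
        return True   # errors and slow requests are always kept
    # Deterministic baseline sampling: hash the trace ID into 10,000 buckets.
    bucket = zlib.crc32(trace.trace_id.encode()) % 10_000
    return bucket < baseline_pct * 100


print(keep_trace(Trace("abc", has_error=True, duration_ms=40, route="/pay")))      # True
print(keep_trace(Trace("def", has_error=True, duration_ms=40, route="/healthz")))  # False
```

Hashing the trace ID instead of drawing a random number gives the same decision for the same trace on every Collector replica, which is useful when testing a policy against recorded traffic.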
Log filtering and attribute reduction
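processors:
  filter/drop_debug:
    logs:
      log_record:
        # Drop TRACE and DEBUG records (severity numbers 1-8);
        # records with no severity set are kept
        - 'severity_number != SEVERITY_NUMBER_UNSPECIFIED and severity_number < SEVERITY_NUMBER_INFO'
  transform/reduce_attributes:
    log_statements:
      - context: log
        statements:
          - delete_key(attributes, "http.request.body")
          - delete_key(attributes, "http.response.body")
          - truncate_all(attributes, 256)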
Span attribute control
processors:
  attributes/reduce_spans:
    actions:
      - key: db.statement
        action: hash  # Hash SQL queries instead of storing full text
      - key: http.request.body
        action: delete
      - key: http.response.body
        action: delete
Routing by environment
Use different pipelines per environment: aggressive filtering for dev, careful sampling for production:
connectors:
  routing:
    default_pipelines: [traces/dev]
    table:
      # Conditions are evaluated against resource attributes
      - statement: route() where attributes["deployment.environment"] == "production"
        pipelines: [traces/prod]
      - statement: route() where attributes["deployment.environment"] == "staging"
        pipelines: [traces/staging]
# Dev: aggressive sampling (1%), short retention
# Staging: moderate sampling (10%)
# Production: tail-based sampling (100% errors, 5% normal)
Cost Alerts
Set up alerts to catch cost anomalies before they hit your invoice.
groups:
  - name: observability_cost_alerts
    rules:
      - alert: LogVolumeSpike
        expr: |
          sum(rate(loki_distributor_bytes_received_total[1h])) * 3600 / 1e9
            > 1.5 * sum(rate(loki_distributor_bytes_received_total[7d])) * 3600 / 1e9
        for: 30m
        annotations:
          summary: "Log ingestion 50%+ above 7-day average"
      - alert: CardinalityExplosion
        expr: |
          deriv(prometheus_tsdb_head_series[1h]) * 3600 > 50000
        for: 15m
        annotations:
          summary: "Gaining >50k new series per hour"
      - alert: TraceVolumeAnomaly
        expr: |
          sum(rate(tempo_distributor_spans_received_total[1h]))
            > 2 * sum(rate(tempo_distributor_spans_received_total[24h]))
        for: 15m
        annotations:
          summary: "Trace ingestion 2x above 24h average"
      - alert: ProjectedCostOverBudget
        # Projected GB/day in 30 days, times 30 days, times $0.50 per GB
        expr: |
          predict_linear(
            sum(rate(loki_distributor_bytes_received_total[1h]))[7d:1h],
            30 * 86400
          ) * 86400 * 30 / 1e9 * 0.50 > 5000
        for: 1h
        annotations:
          summary: "Projected monthly log cost exceeds $5000"
Governance and Chargeback
Technical controls only work long-term with organizational process around them.
Monthly cycle
- Cost dashboard generates automated report per team (Grafana Reporting or screenshot to Slack)
- Compare actual vs budget per team
- Top 3 waste items per team β ticket to the owning team with specific recommendations
- Teams have 2 weeks to address or justify
Quarterly review
- Audit: what are we monitoring? What is unused? What costs the most?
- Update sampling policies, retention periods, and filters
- Review new services β are they instrumented with cost-aware defaults?
- Update unit cost variables in the dashboard
Cost-aware defaults for new services
Define org-wide defaults so new services don't start with expensive configurations:
| Setting | Default | Override requires |
|---|---|---|
| Log level in prod | INFO | Team lead approval |
| Trace sampling | 5% (tail-based) | SRE approval |
| Metric scrape interval | 30s | Justification in PR |
| Span attributes | Max 20, no bodies | Automatic enforcement in Collector |
| Retention | Per signal table above | Finance approval for extensions |