Cost Optimization

💰 The Problem

Observability at production scale generates massive amounts of data. Without conscious cost management, the bill for storage and compute can quickly spiral out of control.

Where do costs come from?

Signal	Main Cost Driver	Scale of the Problem
Traces	Number of spans × retention	Most expensive signal without sampling — one request is 5-20+ spans, each with attributes, events, and status
Logs	Data volume (GB/day)	Second largest — easy to generate TB/day with verbose logging
Metrics	Cardinality (number of unique time series)	Explodes with dynamic labels
Profiles	Sampling frequency × number of pods	Relatively low cost

Note: Without sampling, traces generate more data than logs. One HTTP request often produces a single log line but many spans (one for each hop between services, database query, and external API call). That’s why trace sampling is crucial for cost control.

Traces — Most Expensive Signal Without Sampling

Problem: Without sampling, traces account for the largest data volume. Each request in a microservices architecture generates 5-20+ spans, and each span is “heavier” than a typical log line (attributes, events, timestamps, links).

Reduction strategies:

Tail-based sampling — keep 100% of errors and slow requests, 1-5% of normal traffic
Head-based sampling — simpler, but the keep/drop decision is made when the trace starts, so errors that occur later can be discarded unseen
Adaptive sampling — dynamically adjust percentage based on load
Span filtering — discard spans from health checks and readiness probes

# OTel Collector: drop health check spans
processors:
  filter/health:
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'
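
To get a feel for how much the tail-based strategy from the list above saves, the retained share of traces can be estimated from the error rate, the slow-request rate, and the sampling percentage for normal traffic. The sketch below is a back-of-the-envelope calculation only: the function name and the example traffic mix (2% errors, 3% slow requests) are illustrative assumptions, and it assumes error traces and slow traces do not overlap.

```python
def retained_fraction(error_rate: float, slow_rate: float, normal_sample_pct: float) -> float:
    """Estimate the share of traces kept by tail-based sampling.

    Assumes errors and slow requests are disjoint and kept at 100%,
    while the remaining traffic is sampled at normal_sample_pct percent.
    """
    if not 0 <= error_rate + slow_rate <= 1:
        raise ValueError("error_rate + slow_rate must be between 0 and 1")
    normal_rate = 1.0 - error_rate - slow_rate
    return error_rate + slow_rate + normal_rate * normal_sample_pct / 100.0


# Hypothetical traffic mix: 2% errors, 3% slow requests, 5% sampling of the rest.
kept = retained_fraction(error_rate=0.02, slow_rate=0.03, normal_sample_pct=5)
print(f"kept {kept:.2%} of traces")
```

With this mix roughly one trace in ten is stored, and every error and slow request is among them.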

Logs — Second Largest by Volume

Problem: Logs are the second largest signal, especially with verbose logging (debug/trace level in production).

Reduction strategies:

Filtering at the Collector level — drop debug/trace logs in production before they reach the backend

// Alloy: drop debug logs before sending to Loki
loki.process "filter" {
  forward_to = [loki.write.default.receiver]

  stage.drop {
    expression = ".*level=debug.*"
  }
}

Structured logs — JSON instead of plain text, better compression ratio in Loki
Retention — short retention for debug logs (3-7 days), longer for error/audit (30-90 days)
Parsing at ingestion — extract only needed fields, drop the rest

Metrics — Cardinality Under Control

Problem: Each unique label combination creates a new time series. A single user_id label with 1M users creates at least 1M series for every metric that carries it.
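
The multiplication behind this can be made explicit. The sketch below computes the worst-case series count for one metric as the product of the distinct values of each label. The label names and cardinalities are hypothetical, and real series counts are usually lower because not every combination occurs.

```python
from math import prod


def series_upper_bound(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for an HTTP request counter.
safe = series_upper_bound({"method": 5, "status_code": 10, "route": 50})
risky = series_upper_bound(
    {"method": 5, "status_code": 10, "route": 50, "user_id": 1_000_000}
)
print(safe)   # 2500
print(risky)  # 2500000000
```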

Reduction strategies:

Avoid high-cardinality labels — never use user_id, request_id, trace_id as metric labels
Aggregation in the Collector — sum/average before sending to the backend
Native histograms (Prometheus) — one series instead of many buckets
Recording rules — pre-aggregate frequent queries, reduce query load

# Prometheus recording rule — pre-aggregation
groups:
  - name: cost-optimization
    rules:
      - record: job_status_code:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, status_code)

Data Retention

Signal	Recommended Retention	Justification
Logs (debug)	3-7 days	Only needed for current debugging
Logs (error/audit)	30-90 days	Compliance and trend analysis
Metrics	90-365 days (in Mimir)	Capacity planning, long-term trends
Traces	7-30 days	Debugging — older ones rarely needed
Profiles	7-14 days	Performance regression analysis between deployments
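
Retention multiplies ingestion: once the retention window is full, the stored volume is roughly daily ingestion times retention days, divided by whatever compression the backend achieves. The sketch below shows that arithmetic. The 100 GB/day figure and the function name are assumptions for illustration, and compaction and index overhead are ignored.

```python
def steady_state_storage_gb(
    daily_ingest_gb: float, retention_days: int, compression_ratio: float = 1.0
) -> float:
    """Storage held once the retention window is full."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return daily_ingest_gb * retention_days / compression_ratio


# Hypothetical: 100 GB/day of debug logs, uncompressed.
week = steady_state_storage_gb(100, 7)
quarter = steady_state_storage_gb(100, 90)
print(week, quarter)  # 700.0 9000.0
```

Keeping the same stream for 90 days instead of 7 holds almost 13 times as much data, which is why short retention for debug logs pays off quickly.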

Architecture and Costs

Object storage (Azure Blob / S3) instead of local disks — 10-100x cheaper per GB
Tiered storage — hot data on SSD, cold on object storage (Mimir, Loki, Tempo support this natively)
Compression — Loki and Tempo compress data automatically (snappy/zstd), but structured data compresses better
Dedicated Collectors per signal — easier scaling and cost control per pipeline

Measuring Costs — What to Monitor

Before you can optimize, you need to measure. Each backend exposes internal metrics that tell you exactly where your data (and money) goes.

Signal	Key Metric	What It Tells You
Metrics	`prometheus_tsdb_head_series`	Total active time series — the main cardinality indicator
Metrics	`prometheus_remote_storage_bytes_total`	Volume sent to remote write (Mimir, Grafana Cloud)
Metrics	`scrape_series_added` per target	Which scrape targets contribute the most series
Logs	`loki_distributor_bytes_received_total`	Ingested bytes per `tenant` (the metric carries no stream labels such as `namespace`)
Logs	`loki_ingester_memory_streams`	Number of active streams per tenant; too many streams means an expensive index
Traces	`tempo_distributor_spans_received_total`	Spans ingested per second — the main cost driver
Traces	`tempo_ingester_bytes_received_total`	Raw bytes ingested into Tempo
Collector	`otelcol_exporter_sent_spans_total` vs `otelcol_receiver_accepted_spans_total`	Pipeline throughput: the gap is what sampling and filters removed, and an unexpectedly large gap can indicate data loss
Collector	`otelcol_processor_dropped_log_records_total`	How much your filters are actually dropping

Cardinality analysis (metrics)

Monitor prometheus_tsdb_head_series over time — a sudden spike means a cardinality explosion (usually a new high-cardinality label). Find the top offenders:

# Top 10 metric names by number of series
topk(10, count by (__name__) ({__name__=~".+"}))

Tools: mimirtool (analyze prometheus), Grafana Cardinality Management dashboard.

Forecasting

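Use predict_linear() to project ingestion 30 days ahead. Apply it to the ingestion rate rather than to the raw counter, which only ever grows and resets on restarts:

# Projected log ingestion 30 days from now (GB/day),
# extrapolated from the hourly ingestion rate over the last 7 days
predict_linear(
  sum(rate(loki_distributor_bytes_received_total[1h]))[7d:1h],
  30 * 86400
) * 86400 / 1e9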
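
For intuition, predict_linear() fits a least-squares line through the samples in the window and extrapolates it. The sketch below reproduces that idea in plain Python. It is a simplified illustration, not Prometheus's implementation: it extrapolates from the last sample's timestamp instead of the query evaluation time, and the sample data is made up.

```python
def predict_linear(samples: list[tuple[float, float]], seconds_ahead: float) -> float:
    """Fit a least-squares line through (timestamp, value) samples and
    extrapolate it seconds_ahead past the last sample's timestamp."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Hypothetical daily ingestion: 100 GB/day, growing by 10 GB/day for a week.
DAY = 86_400
history = [(day * DAY, 100.0 + 10.0 * day) for day in range(7)]
projected = predict_linear(history, 30 * DAY)
print(round(projected, 1))  # 460.0
```

A linear fit is only a rough guide: it ignores weekly seasonality and one-off spikes, so treat the projection as an early warning rather than a budget figure.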

Cost Attribution — Who Consumes What

Attribution is not about reducing costs — it’s about visibility. Before you can have a conversation about budgets, you need to answer: which team, service, and environment is responsible for which portion of the observability bill?

What to attribute

Cost attribution has four dimensions — most teams only measure the first one and miss the rest:

Dimension	What it measures	Why it matters
Ingestion volume	How much data a team sends in (GB, series, spans)	The primary cost driver — this is what backends charge for
Query cost	How often and how expensively a team queries data	Heavy dashboards and alerts can cost as much as ingestion
Storage footprint	How much space a team’s data occupies after compaction	Retention policies multiply ingestion cost over time
Compute consumption	CPU/memory used by ingesters, queriers, compactors for a team’s data	Relevant in self-hosted setups where you pay for the infrastructure

In managed platforms (Grafana Cloud, Datadog, New Relic), ingestion volume is the dominant cost and the easiest to attribute. In self-hosted setups, compute and storage matter equally.

Labeling strategy — the foundation

Attribution only works if every telemetry signal carries consistent labels that map data to its owner. This must be enforced at the infrastructure level, not left to individual teams.

Required resource attributes (OTel)

Set these in the OTel SDK or Collector — they propagate to all three signals automatically:

Attribute	Example	Maps to
`service.name`	`payment-api`	Individual service
`service.namespace`	`checkout`	Team / domain
`deployment.environment`	`production`	Environment
`service.version`	`2.4.1`	Useful for detecting cost changes after deployments

How to enforce labels

Option 1: OTel Collector resource processor — add/override attributes centrally:

processors:
  resource/attribution:
    attributes:
      # Copy from an existing resource attribute (for example, one set by the k8sattributes processor)
      - key: service.namespace
        from_attribute: k8s.namespace.name
        action: upsert
      - key: deployment.environment
        value: production
        action: upsert

Option 2: Kubernetes labels → OTel resource attributes — use the k8sattributes processor:

processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
      labels:
        - tag_name: team
          key: app.kubernetes.io/team
          from: pod
        - tag_name: cost_center
          key: billing/cost-center
          from: namespace

This way, Kubernetes labels like app.kubernetes.io/team: checkout on pods automatically become telemetry attributes — no code changes needed.

Option 3: Grafana Alloy labels — if using Alloy for log collection, labels come from Kubernetes discovery:

discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "add_team" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_team"]
    target_label  = "team"
  }
}

Label consistency across signals

A common problem: metrics use label namespace, logs use namespace, but traces use service.namespace. Your attribution queries break because the same data has different keys.

Signal	Typical label	Normalize to
Metrics (Prometheus)	`namespace` (from `kubernetes_sd`)	`namespace`
Logs (Loki)	`namespace` (from discovery)	`namespace`
Traces (Tempo)	`service.namespace` (OTel resource)	Map via Collector `resource` processor

Use the Collector to normalize before data reaches backends:

processors:
  resource/normalize:
    attributes:
      - key: namespace
        from_attribute: service.namespace
        action: upsert

Attribution per signal — what to measure and how

Metrics (Prometheus / Mimir)

What to measure: number of active time series per team/service.

# Active series per namespace
count by (namespace) ({__name__=~".+"})

# Active series per job (more granular)
count by (namespace, job) ({__name__=~".+"})

# Samples ingested per scrape, summed per job (after metric relabeling)
sum by (namespace, job) (scrape_samples_post_metric_relabeling)

Mimir-specific: Mimir tracks per-tenant usage out of the box. If you use multi-tenancy (one tenant per team), you get attribution for free:

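# Per-tenant active series in Mimir (the tenant ID is in the `user` label;
# divide by the replication factor to get unique series)
sum by (user) (cortex_ingester_active_series)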

What often gets missed:

Recording rules — a team’s recording rules consume compute but don’t show up in ingestion metrics. Track with cortex_ruler_queries_total by tenant.
Alert evaluation — similar to recording rules. Track with cortex_prometheus_rule_evaluations_total by tenant.

Logs (Loki)

What to measure: ingested bytes and stream count per team.

# Ingested bytes per tenant (GB/day)
# The distributor metrics are labeled by tenant only. For a per-namespace split
# inside one tenant, run the LogQL query
#   sum by (namespace) (bytes_over_time({namespace=~".+"}[24h]))
sum by (tenant) (
  rate(loki_distributor_bytes_received_total[24h])
) * 86400 / 1e9

# Ingested lines per tenant (lines/day)
sum by (tenant) (
  rate(loki_distributor_lines_received_total[24h])
) * 86400

# Active streams per tenant (a high stream count means expensive indexing)
sum by (tenant) (loki_ingester_memory_streams)

What often gets missed:

Query cost — some teams have dashboards that run expensive full-scan queries every 30 seconds. Track with:

# Bytes scanned per query (Loki query-frontend)
sum by (tenant) (rate(loki_query_frontend_bytes_processed_per_second[1h]))

Log volume vs. log value — a team may ingest 50 GB/day but only query 1% of it. Cross-reference ingestion with query frequency to find “write-only” log streams.

Traces (Tempo)

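What to measure: spans per second and bytes ingested per tenant. With one tenant per team or service, this gives attribution directly.

# Spans per tenant (per day)
# The distributor metrics carry only a tenant label; a per-service split
# needs span metrics, for example from the spanmetrics connector
sum by (tenant) (
  rate(tempo_distributor_spans_received_total[24h])
) * 86400

# Bytes per tenant (GB/day)
sum by (tenant) (
  rate(tempo_distributor_bytes_received_total[24h])
) * 86400 / 1e9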

What often gets missed:

Span size variance — two services may send the same number of spans, but one attaches 50 attributes per span (including SQL queries and request bodies) and costs 10x more in storage. Track average span size (per tenant, because the distributor metrics carry no service label):
```
# Average span size per tenant (bytes)
sum by (tenant) (rate(tempo_distributor_bytes_received_total[1h]))
/
sum by (tenant) (rate(tempo_distributor_spans_received_total[1h]))
```
Trace depth — one request from service A may generate 5 spans, while service B generates 200 (deep call chains, loops). This is visible in span count per trace but harder to query — consider recording it as a custom metric via the spanmetrics connector.

Attributing shared infrastructure costs

Not all costs map cleanly to a single team. Shared components need a fair split.

Shared component	Attribution strategy
Ingress / API gateway	Attribute to the backend service that receives the request (use `service.name` from the first span downstream of the gateway)
Message queues (Kafka, RabbitMQ)	Split by topic/queue — each topic is usually owned by one team
Databases	Attribute to the service that issues the query (from span `db.system` + `service.name`)
OTel Collector infra	Allocate proportionally to ingestion volume per tenant
Kubernetes system components	Treat as platform overhead — split evenly or by node usage
Grafana / query infrastructure	Attribute by dashboard ownership or query origin (harder — see below)
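
The "allocate proportionally to ingestion volume" strategy in the table above comes down to a weighted split. The sketch below shows one way to compute it. The function name is hypothetical, and the figures reuse the example log volumes and the $1,200 shared line from the sample report in this chapter purely for illustration.

```python
def allocate_shared_cost(
    shared_cost: float, ingestion_gb: dict[str, float]
) -> dict[str, float]:
    """Split a shared cost across teams in proportion to their ingestion volume."""
    total = sum(ingestion_gb.values())
    if total <= 0:
        raise ValueError("total ingestion must be positive")
    return {team: shared_cost * gb / total for team, gb in ingestion_gb.items()}


# Illustrative input: daily log volume per team (GB) and a $1,200 shared bill.
shares = allocate_shared_cost(
    1200.0, {"checkout": 45, "platform": 12, "search": 95, "mobile": 8}
)
for team, cost in shares.items():
    print(f"{team:<10} ${cost:,.2f}")
```

The shares always add up to the shared cost, so the report total stays consistent regardless of how many teams there are.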

Query cost attribution

This is the hardest dimension. Most backends don’t expose “who ran this query.” Approaches:

Grafana Cloud — Usage Insights shows query cost per dashboard, per user
Self-hosted Grafana — enable query audit logging ([log] filters = rendering:debug) and parse the logs to extract dashboard UID and user
Mimir / Loki — if using multi-tenant mode, query cost is automatically per-tenant via cortex_query_frontend_queries_total{tenant="..."}
Convention — assign each dashboard to an owning team via folder structure (e.g., Grafana folder = team name), then attribute query cost by folder

Building the attribution report

Dashboard structure

┌─────────────────────────────────────────────────────────────────┐
│  COST ATTRIBUTION REPORT — April 2026                           │
├───────────┬──────────┬──────────┬──────────┬───────────┬────────┤
│ Team      │ Metrics  │ Logs     │ Traces   │ Storage   │ Total  │
│           │ (series) │ (GB/day) │ (M spans)│ (GB)      │ ($/mo) │
├───────────┼──────────┼──────────┼──────────┼───────────┼────────┤
│ checkout  │ 120k     │ 45 GB    │ 12M      │ 890 GB    │ $2,340 │
│ platform  │ 350k     │ 12 GB    │ 3M       │ 420 GB    │ $1,870 │
│ search    │ 80k      │ 95 GB    │ 28M      │ 1,200 GB  │ $3,150 │
│ mobile    │ 45k      │ 8 GB     │ 45M      │ 650 GB    │ $2,890 │
│ shared    │ 200k     │ 30 GB    │ —        │ 600 GB    │ $1,200 │
├───────────┼──────────┼──────────┼──────────┼───────────┼────────┤
│ TOTAL     │ 795k     │ 190 GB   │ 88M      │ 3,760 GB  │$11,450 │
└───────────┴──────────┴──────────┴──────────┴───────────┴────────┘

Grafana implementation

Use a table panel with transformations:

Multiple queries (one per signal, grouped by namespace)
Outer join on namespace
Add field from calculation — multiply each volume by its unit price
Sort by total cost descending

Add Grafana variables for time range and team filter so managers can drill down.
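
The same join-and-price logic can be prototyped outside Grafana, which helps when validating the dashboard numbers. The sketch below mirrors the transformation steps: outer join on team, multiply by unit prices, sort by total. All names, volumes, and prices are hypothetical assumptions, and volumes are taken per billing period.

```python
def build_cost_report(series, logs_gb, spans_m, prices):
    """Outer-join per-team volumes on team name and price each signal.

    series:  team -> active series
    logs_gb: team -> GB of logs ingested in the billing period
    spans_m: team -> millions of spans ingested in the billing period
    prices:  unit prices (hypothetical keys mirroring dashboard variables)
    """
    teams = set(series) | set(logs_gb) | set(spans_m)  # outer join: keep every team
    rows = []
    for team in teams:
        total = (
            series.get(team, 0) / 1000 * prices["per_1k_series"]
            + logs_gb.get(team, 0) * prices["per_gb_logs"]
            + spans_m.get(team, 0) * prices["per_m_spans"]
        )
        rows.append({"team": team, "total": total})
    # Most expensive team first
    return sorted(rows, key=lambda row: row["total"], reverse=True)


report = build_cost_report(
    series={"checkout": 120_000, "platform": 350_000},
    logs_gb={"checkout": 1350, "search": 2850},
    spans_m={"checkout": 12, "mobile": 45},
    prices={"per_1k_series": 8.0, "per_gb_logs": 0.5, "per_m_spans": 5.0},
)
for row in report:
    print(f'{row["team"]:<10} ${row["total"]:,.2f}')
```

Teams missing from one signal still appear in the report with a zero for that signal, which is the behavior the outer join is meant to give.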

Automated reporting

Grafana Reporting (Enterprise/Cloud) — schedule PDF delivery to Slack or email monthly

Self-hosted — use the Grafana HTTP API to render the dashboard as a PNG from a cron job (requires the Grafana Image Renderer plugin):

curl -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "https://grafana.internal/render/d/cost-attribution?width=1200&height=800" \
  -o report.png

Alertmanager — send monthly summary via webhook to Slack with top 5 consumers

Common pitfalls

Pitfall	Why it happens	How to avoid
Missing labels on some signals	Team added OTel SDK to app but didn’t configure resource attributes	Enforce via Collector `resource` processor — inject from K8s metadata
Label cardinality in attribution	Using `pod` as attribution label — pods are ephemeral, creates thousands of “teams”	Attribute by `namespace` or `deployment`, never by `pod`
Sampling distorts attribution	Team A samples at 1%, team B at 100% — raw span counts don’t reflect true traffic	Normalize by sampling rate: `spans * (1 / sampling_rate)`
Shared services skew numbers	API gateway or message bus shows as top consumer, but it’s proxying for other teams	Use downstream `service.name` or split by route/topic
Ignoring query cost	Team ingests little data but runs 50 heavy dashboards refreshing every 10s	Track query metrics alongside ingestion
Point-in-time vs. cumulative	Report shows “series count now” but team deleted services mid-month	Use `avg_over_time` or integrate rate over the billing period
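
The sampling normalization from the table above, `spans * (1 / sampling_rate)`, is sketched below. The function name and team figures are made up, and the correction only holds for uniform probabilistic sampling: with tail-based policies that keep all errors, the stored spans are not a uniform sample and a single rate does not apply.

```python
def estimated_true_spans(stored_spans: int, sampling_rate: float) -> float:
    """Scale stored span counts back up to the estimated pre-sampling volume."""
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    return stored_spans / sampling_rate


# Hypothetical: team A samples at 1%, team B keeps everything.
team_a = estimated_true_spans(1_000_000, 0.01)
team_b = estimated_true_spans(20_000_000, 1.0)
print(f"team A: ~{team_a:,.0f} spans generated")
print(f"team B: ~{team_b:,.0f} spans generated")
```

Team B stores twenty times more spans, yet team A generates about five times more traffic. Which of the two numbers to bill on depends on whether the chargeback reflects storage cost or produced load.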

Cost Dashboard — What to Include

A cost dashboard should answer five questions at a glance:

1. How much are we spending? (Executive summary)

Stat panels showing total estimated cost per signal per month, with month-over-month trend. Use Grafana dashboard variables for unit prices ($cost_per_gb_logs, $cost_per_1k_series, $cost_per_m_spans) so the dashboard adapts to your pricing model.

# Estimated monthly log cost
sum(rate(loki_distributor_bytes_received_total[24h])) * 86400 * 30 / 1e9 * $cost_per_gb_logs

# Estimated monthly metrics cost
sum(prometheus_tsdb_head_series) / 1000 * $cost_per_1k_series

# Estimated monthly traces cost
sum(rate(tempo_distributor_spans_received_total[24h])) * 86400 * 30 / 1e6 * $cost_per_m_spans
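
The unit conversion in the log cost query is easy to get wrong, so it helps to check it by hand once. The sketch below repeats the arithmetic: bytes per second, times seconds per day, times 30 days, divided by 1e9, times the price per GB. The ingestion rate and the $0.50 price are assumptions for illustration.

```python
SECONDS_PER_DAY = 86_400


def monthly_log_cost(bytes_per_second: float, cost_per_gb: float, days: int = 30) -> float:
    """Convert an average ingestion rate into an estimated monthly bill."""
    gb_per_month = bytes_per_second * SECONDS_PER_DAY * days / 1e9
    return gb_per_month * cost_per_gb


# Hypothetical: 2 MB/s average ingestion at $0.50 per GB.
cost = monthly_log_cost(bytes_per_second=2e6, cost_per_gb=0.50)
print(f"${cost:,.2f} per month")
```

A steady 2 MB/s is 5,184 GB over 30 days, so small per-second rates translate into four-figure monthly bills.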

2. Who generates the most? (Breakdown per team/service)

Stacked bar chart of ingestion volume grouped by namespace or service_name, split by signal type (metrics, logs, traces).

3. What is wasted? (Waste detection)

Metrics nobody queries (Grafana Cloud: Adaptive Metrics; self-hosted: check Grafana query logs)
DEBUG/TRACE logs in production — often 60-80% of total log volume
100% sampled traces on non-critical services
Health-check and readiness probe data across all signals

4. What will it cost next month? (Forecast)

Time series panel with predict_linear() projecting ingestion 30 days ahead, overlaid with a budget threshold line.

5. Storage backend costs

Table panel showing object storage breakdown: Loki chunks, Tempo blocks, Mimir blocks — size in TB and estimated cost per month.

OTel Collector as a Cost Gateway

The Collector pipeline is the single best place to control costs. All data flows through it before reaching backends.

Tail-based sampling (traces)

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always keep errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}

      # Always keep slow requests
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 2000}

      # Drop health checks entirely
      - name: drop-health
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/healthz", "/readyz", "/livez"]
          invert_match: true

      # Sample 5% of the rest
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
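
The intent of these four policies can be written as a single decision function, which is handy for unit-testing the expected outcome before rolling out a config change. This is a simplified model of the intended result, not the processor's actual evaluation code. The TraceSummary type, the hash-based bucketing, and the strict `>` comparison for latency are assumptions.

```python
import hashlib
from dataclasses import dataclass

HEALTH_ROUTES = {"/healthz", "/readyz", "/livez"}


@dataclass
class TraceSummary:
    trace_id: str
    has_error: bool
    duration_ms: float
    route: str


def keep_trace(
    trace: TraceSummary, slow_threshold_ms: float = 2000, sample_pct: float = 5
) -> bool:
    """Return True if the trace should be kept under the intended policy set."""
    if trace.route in HEALTH_ROUTES:
        return False  # health checks are always dropped
    if trace.has_error or trace.duration_ms > slow_threshold_ms:
        return True  # errors and slow requests are always kept
    # Deterministic bucket in [0, 10000) derived from the trace ID, so every
    # evaluation of the same trace reaches the same decision.
    bucket = int(hashlib.sha256(trace.trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < sample_pct * 100


print(keep_trace(TraceSummary("a1", True, 120, "/checkout")))   # True: error
print(keep_trace(TraceSummary("a2", False, 3500, "/search")))   # True: slow
print(keep_trace(TraceSummary("a3", True, 50, "/healthz")))     # False: health check
```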

Log filtering and attribute reduction

processors:
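  filter/drop_debug:
    error_mode: ignore
    logs:
      log_record:
        # Drop TRACE (1-4) and DEBUG (5-8); keep records with no severity set
        - 'severity_number != SEVERITY_NUMBER_UNSPECIFIED and severity_number < SEVERITY_NUMBER_INFO'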

  transform/reduce_attributes:
    log_statements:
      - context: log
        statements:
          - delete_key(attributes, "http.request.body")
          - delete_key(attributes, "http.response.body")
          - truncate_all(attributes, 256)

Span attribute control

processors:
  attributes/reduce_spans:
    actions:
      - key: db.statement
        action: hash          # Hash SQL queries instead of storing full text
      - key: http.request.body
        action: delete
      - key: http.response.body
        action: delete

Routing by environment

Use different pipelines per environment — aggressive filtering for dev, careful sampling for production:

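connectors:
  routing:
    default_pipelines: [traces/dev]
    table:
      # Statements are evaluated against resource attributes
      - statement: route() where attributes["deployment.environment"] == "production"
        pipelines: [traces/prod]
      - statement: route() where attributes["deployment.environment"] == "staging"
        pipelines: [traces/staging]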

# Dev: aggressive sampling (1%), short retention
# Staging: moderate sampling (10%)
# Production: tail-based sampling (100% errors, 5% normal)

Cost Alerts

Set up alerts to catch cost anomalies before they hit your invoice.

groups:
  - name: observability_cost_alerts
    rules:
      - alert: LogVolumeSpike
        expr: |
          sum(rate(loki_distributor_bytes_received_total[1h])) * 3600 / 1e9
          > 1.5 * sum(rate(loki_distributor_bytes_received_total[7d])) * 3600 / 1e9
        for: 30m
        annotations:
          summary: "Log ingestion 50%+ above 7-day average"

      - alert: CardinalityExplosion
        expr: |
          deriv(prometheus_tsdb_head_series[1h]) * 3600 > 50000
        for: 15m
        annotations:
          summary: "Gaining >50k new series per hour"

      - alert: TraceVolumeAnomaly
        expr: |
          sum(rate(tempo_distributor_spans_received_total[1h]))
          > 2 * sum(rate(tempo_distributor_spans_received_total[24h]))
        for: 15m
        annotations:
          summary: "Trace ingestion 2x above 24h average"

      - alert: ProjectedCostOverBudget
        expr: |
          predict_linear(
            sum(rate(loki_distributor_bytes_received_total[1h]))[7d:1h],
            30 * 86400
          ) * 86400 * 30 / 1e9 * 0.50 > 5000
        for: 1h
        annotations:
          summary: "Projected monthly log cost exceeds $5000"

Governance and Chargeback

Technical controls only work long-term with organizational process around them.

Monthly cycle

Cost dashboard generates automated report per team (Grafana Reporting or screenshot to Slack)
Compare actual vs budget per team
Top 3 waste items per team → ticket to the owning team with specific recommendations
Teams have 2 weeks to address or justify

Quarterly review

Audit: what are we monitoring? What is unused? What costs the most?
Update sampling policies, retention periods, and filters
Review new services — are they instrumented with cost-aware defaults?
Update unit cost variables in the dashboard

Cost-aware defaults for new services

Define org-wide defaults so new services don’t start with expensive configurations:

Setting	Default	Override requires
Log level in prod	INFO	Team lead approval
Trace sampling	5% (tail-based)	SRE approval
Metric scrape interval	30s	Justification in PR
Span attributes	Max 20, no bodies	Automatic enforcement in Collector
Retention	Per signal table above	Finance approval for extensions
