Cost Optimization

💰 The Problem

Observability at production scale generates massive amounts of data. Without conscious cost management, the bill for storage and compute can quickly spiral out of control.

Where do costs come from?

Signal | Main Cost Driver | Scale of the Problem
Traces | Number of spans × retention | Most expensive signal without sampling: one request is 5-20+ spans, each with attributes, events, and status
Logs | Data volume (GB/day) | Second largest: easy to generate TB/day with verbose logging
Metrics | Cardinality (number of unique time series) | Explodes with dynamic labels
Profiles | Sampling frequency × number of pods | Relatively low cost

Note: Without sampling, traces usually generate more data than logs. One HTTP request often produces a single log line but many spans (one per hop between services, database query, or external API call). That’s why trace sampling is crucial for cost control.

Traces: Most Expensive Signal Without Sampling

Problem: Without sampling, traces account for the largest data volume. Each request in a microservices architecture generates 5-20+ spans, and each span is "heavier" than a typical log line (attributes, events, timestamps, links).

Reduction strategies:

  • Tail-based sampling: keep 100% of errors and slow requests, 1-5% of normal traffic
  • Head-based sampling: simpler, but the decision is made at the start of the request, so error traces are discarded at the same rate as everything else
  • Adaptive sampling: dynamically adjust the percentage based on load
  • Span filtering: discard spans from health checks and readiness probes
# OTel Collector: drop health check spans
processors:
  filter/health:
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'
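
To get a feel for what the tail-based policy above does to volume, the retained fraction can be estimated from the share of error and slow traces. This is a minimal sketch: the function name and the traffic mix (2% errors, 3% slow requests) are illustrative assumptions, not measurements.

```python
def retained_fraction(error_share: float, slow_share: float, baseline_rate: float) -> float:
    """Fraction of traces kept when every error and slow trace is kept
    and the remaining traffic is sampled at baseline_rate.

    Simplification: error and slow traces are assumed not to overlap.
    """
    interesting = error_share + slow_share
    if not 0.0 <= interesting <= 1.0:
        raise ValueError("error_share + slow_share must be between 0 and 1")
    if not 0.0 <= baseline_rate <= 1.0:
        raise ValueError("baseline_rate must be between 0 and 1")
    return interesting + (1.0 - interesting) * baseline_rate


# Hypothetical traffic mix: 2% errors, 3% slow requests, 5% baseline sampling
kept = retained_fraction(error_share=0.02, slow_share=0.03, baseline_rate=0.05)
print(f"kept {kept:.2%} of traces")
```

Under these assumptions a bit under 10% of traces is stored, so the share of errors and slow requests matters as much as the baseline percentage.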

Logs: Second Largest by Volume

Problem: Logs are the second largest signal, especially with verbose logging (debug/trace level in production).

Reduction strategies:

  • Filtering at the Collector level: drop debug/trace logs in production before they reach the backend
    // Alloy: drop debug logs before sending to Loki
    loki.process "filter" {
      forward_to = [loki.write.default.receiver]
    
      stage.drop {
        expression = ".*level=debug.*"
      }
    }
    
  • Structured logs: JSON instead of plain text, which typically gives a better compression ratio in Loki
  • Retention: short retention for debug logs (3-7 days), longer for error/audit (30-90 days)
  • Parsing at ingestion: extract only needed fields, drop the rest

Metrics: Cardinality Under Control

Problem: Each unique label combination creates a new time series. A user_id label with 1M users produces at least 1M series for every metric that carries it.
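
The multiplication behind this is easy to underestimate, so here is a small worked example. It is only a sketch: the label names and value counts are hypothetical, and the product is a worst-case bound, since not every combination occurs in practice.

```python
from math import prod


def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single HTTP request counter
bounded = worst_case_series({"method": 5, "status_code": 10, "route": 40})
with_user = worst_case_series(
    {"method": 5, "status_code": 10, "route": 40, "user_id": 1_000_000}
)

print(bounded)    # 2000
print(with_user)  # 2000000000
```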

Reduction strategies:

  • Avoid high-cardinality labels: never use user_id, request_id, trace_id as metric labels
  • Aggregation in the Collector: sum/average before sending to the backend
  • Native histograms (Prometheus): one series instead of many buckets
  • Recording rules: pre-aggregate frequent queries, reduce query load
# Prometheus recording rule: pre-aggregation
groups:
  - name: cost-optimization
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, status_code)

Data Retention

Signal | Recommended Retention | Justification
Logs (debug) | 3-7 days | Only needed for current debugging
Logs (error/audit) | 30-90 days | Compliance and trend analysis
Metrics | 90-365 days (in Mimir) | Capacity planning, long-term trends
Traces | 7-30 days | Debugging; older ones are rarely needed
Profiles | 7-14 days | Performance regression analysis between deployments
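
Retention turns daily ingestion into stored volume: at steady state, storage is roughly daily ingestion multiplied by retention days. The sketch below shows that arithmetic; the 10 GB/day figure and the optional compression ratio are illustrative assumptions, and the function name is hypothetical.

```python
def steady_state_storage_gb(
    ingest_gb_per_day: float,
    retention_days: int,
    compression_ratio: float = 1.0,
) -> float:
    """Approximate stored volume once the oldest data starts expiring.

    compression_ratio is raw size divided by stored size (1.0 = no compression).
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return ingest_gb_per_day * retention_days / compression_ratio


# Hypothetical debug log stream of 10 GB/day
short = steady_state_storage_gb(10, retention_days=7)
long = steady_state_storage_gb(10, retention_days=30)
print(f"7-day retention: {short:.0f} GB, 30-day retention: {long:.0f} GB")
```

Keeping the same stream for 30 days instead of 7 stores more than four times as much data, which is why short retention for debug logs pays off quickly.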

Architecture and Costs

  • Object storage (Azure Blob / S3) instead of local disks: roughly 10-100x cheaper per GB
  • Tiered storage: hot data on SSD, cold on object storage (Mimir, Loki, Tempo support this natively)
  • Compression: Loki and Tempo compress data automatically (snappy/zstd), but structured data compresses better
  • Dedicated Collectors per signal: easier scaling and cost control per pipeline

Measuring Costs: What to Monitor

Before you can optimize, you need to measure. Each backend exposes internal metrics that tell you exactly where your data (and money) goes.

Signal | Key Metric | What It Tells You
Metrics | prometheus_tsdb_head_series | Total active time series; the main cardinality indicator
Metrics | prometheus_remote_storage_bytes_total | Volume sent to remote write (Mimir, Grafana Cloud)
Metrics | scrape_series_added per target | Which scrape targets contribute the most new series
Logs | loki_distributor_bytes_received_total | Ingested bytes, labeled by tenant
Logs | loki_ingester_memory_streams | Number of active streams; too many means an expensive index (loki_ingester_streams_created_total only counts creations)
Traces | tempo_distributor_spans_received_total | Spans ingested (use rate() for spans per second); the main cost driver
Traces | tempo_ingester_bytes_received_total | Raw bytes ingested into Tempo
Collector | otelcol_exporter_sent_spans_total vs otelcol_receiver_accepted_spans_total | Pipeline efficiency; a gap larger than your sampling and filtering explain means data loss
Collector | otelcol_processor_dropped_log_records_total | How much your filters are actually dropping

Cardinality analysis (metrics)

Monitor prometheus_tsdb_head_series over time. A sudden spike means a cardinality explosion (usually a new high-cardinality label). Find the top offenders:

# Top 10 metric names by number of series
topk(10, count by (__name__) ({__name__=~".+"}))

Tools: mimirtool (analyze prometheus), Grafana Cardinality Management dashboard.

Forecasting

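Use predict_linear() to project ingestion 30 days ahead. Project the daily ingestion rate rather than the raw counter: the counter is cumulative and resets on restart, so extrapolating it does not give a usable volume figure.

# Projected log ingestion rate in 30 days (GB/day)
predict_linear(
  sum(rate(loki_distributor_bytes_received_total[1d]))[7d:1h],
  30 * 86400
) * 86400 / 1e9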
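
For intuition, predict_linear() fits a least-squares line through the samples in the window and extrapolates it. The sketch below reproduces that idea in plain Python; project_linear is a hypothetical helper, not Prometheus code, and it extrapolates from the last sample rather than from the query evaluation time. The sample data (100 GB/day growing by 2 GB/day) is made up.

```python
def project_linear(samples: list[tuple[float, float]], seconds_ahead: float) -> float:
    """Fit value = intercept + slope * t by least squares and evaluate it
    seconds_ahead after the timestamp of the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


DAY = 86400.0
# Hypothetical week of daily ingestion readings in GB/day
history = [(day * DAY, 100.0 + 2.0 * day) for day in range(7)]
projected = project_linear(history, seconds_ahead=30 * DAY)
print(f"projected ingestion in 30 days: {projected:.1f} GB/day")
```

A straight line is a coarse model: a one-off spike inside the window tilts the whole projection, so treat the result as a trend indicator rather than a budget figure.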

Cost Attribution: Who Consumes What

Attribution is not about reducing costs; it’s about visibility. Before you can have a conversation about budgets, you need to answer: which team, service, and environment is responsible for which portion of the observability bill?

What to attribute

Cost attribution has four dimensions. Most teams only measure the first one and miss the rest:

Dimension | What it measures | Why it matters
Ingestion volume | How much data a team sends in (GB, series, spans) | The primary cost driver; this is what backends charge for
Query cost | How often and how expensively a team queries data | Heavy dashboards and alerts can cost as much as ingestion
Storage footprint | How much space a team’s data occupies after compaction | Retention policies multiply ingestion cost over time
Compute consumption | CPU/memory used by ingesters, queriers, compactors for a team’s data | Relevant in self-hosted setups where you pay for the infrastructure

In managed platforms (Grafana Cloud, Datadog, New Relic), ingestion volume is the dominant cost and the easiest to attribute. In self-hosted setups, compute and storage matter equally.

Labeling strategy: the foundation

Attribution only works if every telemetry signal carries consistent labels that map data to its owner. This must be enforced at the infrastructure level, not left to individual teams.

Required resource attributes (OTel)

Set these in the OTel SDK or Collector. They propagate to all three signals automatically:

Attribute | Example | Maps to
service.name | payment-api | Individual service
service.namespace | checkout | Team / domain
deployment.environment | production | Environment
service.version | 2.4.1 | Useful for detecting cost changes after deployments

How to enforce labels

Option 1: OTel Collector resource processor, which adds or overrides attributes centrally:

processors:
  resource/attribution:
    attributes:
      # Copy the Kubernetes namespace (e.g. set by the k8sattributes processor) into service.namespace
      - key: service.namespace
        from_attribute: k8s.namespace.name
        action: upsert
      - key: deployment.environment
        value: production
        action: upsert

Option 2: Kubernetes labels to OTel resource attributes, using the k8sattributes processor:

processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
      labels:
        - tag_name: team
          key: app.kubernetes.io/team
          from: pod
        - tag_name: cost_center
          key: billing/cost-center
          from: namespace

This way, Kubernetes labels like app.kubernetes.io/team: checkout on pods automatically become telemetry attributes, with no code changes needed.

Option 3: Grafana Alloy labels. If you use Alloy for log collection, labels come from Kubernetes discovery:

discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "add_team" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_team"]
    target_label  = "team"
  }
}

Label consistency across signals

A common problem: metrics use label namespace, logs use namespace, but traces use service.namespace. Your attribution queries break because the same data has different keys.

Signal | Typical label | Normalize to
Metrics (Prometheus) | namespace (from kubernetes_sd) | namespace
Logs (Loki) | namespace (from discovery) | namespace
Traces (Tempo) | service.namespace (OTel resource) | Map via Collector resource processor

Use the Collector to normalize before data reaches backends:

processors:
  resource/normalize:
    attributes:
      - key: namespace
        from_attribute: service.namespace
        action: upsert

Attribution per signal: what to measure and how

Metrics (Prometheus / Mimir)

What to measure: number of active time series per team/service.

# Active series per namespace
count by (namespace) ({__name__=~".+"})

# Active series per job (more granular)
count by (namespace, job) ({__name__=~".+"})

# Samples per scrape after metric relabeling (a proxy for scrape volume, not bytes/sec)
sum by (namespace, job) (scrape_samples_post_metric_relabeling)

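Mimir-specific: Mimir tracks per-tenant usage out of the box. If you use multi-tenancy (one tenant per team), you get attribution for free:

# Per-tenant active series in Mimir (the tenant ID is in the user label;
# the sum counts every replica, so divide by the replication factor)
sum by (user) (cortex_ingester_active_series)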

What often gets missed:

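  • Recording rules: a team’s recording rules consume compute but don’t show up in ingestion metrics. Track with cortex_ruler_queries_total by tenant.
  • Alert evaluation: similar to recording rules. Track rule evaluations per tenant with cortex_prometheus_rule_evaluations_total (cortex_ruler_ring_check_errors_total only counts ring ownership check failures, not evaluation load).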

Logs (Loki)

What to measure: ingested bytes and stream count per team.

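# Ingested bytes per tenant (GB/day); the distributor metrics are labeled by tenant, not namespace
sum by (tenant) (
  rate(loki_distributor_bytes_received_total[24h])
) * 86400 / 1e9

# Ingested lines per tenant (lines/day)
sum by (tenant) (
  rate(loki_distributor_lines_received_total[24h])
) * 86400

# Per-namespace volume: a LogQL query against Loki itself (GB over the last 24h)
sum by (namespace) (bytes_over_time({namespace=~".+"}[24h])) / 1e9

# Active streams per tenant; high stream count = expensive indexing
sum by (tenant) (loki_ingester_memory_streams)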

What often gets missed:

  • Query cost: some teams have dashboards that run expensive full-scan queries every 30 seconds. Track with:
    # Bytes scanned by queries (Loki query-frontend); metric names vary between Loki versions, so check your /metrics endpoint
    sum by (tenant) (rate(loki_query_frontend_bytes_processed_per_second[1h]))
    
  • Log volume vs. log value: a team may ingest 50 GB/day but only query 1% of it. Cross-reference ingestion with query frequency to find "write-only" log streams.

Traces (Tempo)

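What to measure: spans per second and bytes ingested per service. Tempo's distributor metrics are labeled per tenant by default, so the service_name grouping below assumes a per-service label is available (for example, one tenant per service); otherwise group by tenant.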

# Spans per service (per day)
sum by (service_name) (
  rate(tempo_distributor_spans_received_total[24h])
) * 86400

# Bytes per service (GB/day)
sum by (service_name) (
  rate(tempo_distributor_bytes_received_total[24h])
) * 86400 / 1e9

What often gets missed:

  • Span size variance: two services may send the same number of spans, but one attaches 50 attributes per span (including SQL queries and request bodies) and costs 10x more in storage. Track average span size:
    # Average span size per service (bytes)
    sum by (service_name) (rate(tempo_distributor_bytes_received_total[1h]))
    /
    sum by (service_name) (rate(tempo_distributor_spans_received_total[1h]))
    
  • Trace depth: one request from service A may generate 5 spans, while service B generates 200 (deep call chains, loops). This is visible in span count per trace but harder to query; consider recording it as a custom metric via the spanmetrics connector.

Attributing shared infrastructure costs

Not all costs map cleanly to a single team. Shared components need a fair split.

Shared component | Attribution strategy
Ingress / API gateway | Attribute to the backend service that receives the request (use service.name from the first span after the gateway)
Message queues (Kafka, RabbitMQ) | Split by topic/queue; each topic is usually owned by one team
Databases | Attribute to the service that issues the query (from span db.system + service.name)
OTel Collector infra | Allocate proportionally to ingestion volume per tenant
Kubernetes system components | Treat as platform overhead; split evenly or by node usage
Grafana / query infrastructure | Attribute by dashboard ownership or query origin (harder; see below)
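
Proportional allocation, as suggested for the Collector infrastructure, is a simple pro-rata split. The sketch below assumes log ingestion (GB/day) as the allocation key and a $1,200 shared bill; both the numbers and the function name are illustrative.

```python
def allocate_shared_cost(
    shared_cost: float, ingestion_by_team: dict[str, float]
) -> dict[str, float]:
    """Split shared_cost across teams in proportion to their ingestion volume."""
    total = sum(ingestion_by_team.values())
    if total <= 0:
        raise ValueError("total ingestion must be positive")
    return {
        team: shared_cost * volume / total
        for team, volume in ingestion_by_team.items()
    }


# Hypothetical log ingestion per team in GB/day
ingestion = {"checkout": 45, "platform": 12, "search": 95, "mobile": 8}
shares = allocate_shared_cost(1200.0, ingestion)
for team, cost in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{team:<10} ${cost:,.2f}")
```

Whichever key you pick (bytes, spans, series), use the same one every month so that changes in a team's share reflect changes in behavior rather than in method.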

Query cost attribution

This is the hardest dimension. Most backends don’t expose "who ran this query." Approaches:

  1. Grafana Cloud: Usage Insights shows query cost per dashboard, per user
  2. Self-hosted Grafana: enable request logging (for example [server] router_logging = true, or [dataproxy] logging = true for proxied data sources) and parse the logs to map query requests to users and, where the log fields allow it, to dashboards
  3. Mimir / Loki: in multi-tenant mode, query counts are tracked per tenant, e.g. cortex_query_frontend_queries_total{user="..."} in Mimir (the tenant ID is in the user label)
  4. Convention: assign each dashboard to an owning team via folder structure (e.g., Grafana folder = team name), then attribute query cost by folder

Building the attribution report

Dashboard structure

┌─────────────────────────────────────────────────────────────────┐
│  COST ATTRIBUTION REPORT - April 2026                           │
├───────────┬──────────┬──────────┬──────────┬───────────┬────────┤
│ Team      │ Metrics  │ Logs     │ Traces   │ Storage   │ Total  │
│           │ (series) │ (GB/day) │ (M spans)│ (GB)      │ ($/mo) │
├───────────┼──────────┼──────────┼──────────┼───────────┼────────┤
│ checkout  │ 120k     │ 45 GB    │ 12M      │ 890 GB    │ $2,340 │
│ platform  │ 350k     │ 12 GB    │ 3M       │ 420 GB    │ $1,870 │
│ search    │ 80k      │ 95 GB    │ 28M      │ 1,200 GB  │ $3,150 │
│ mobile    │ 45k      │ 8 GB     │ 45M      │ 650 GB    │ $2,890 │
│ shared    │ 200k     │ 30 GB    │ -        │ 600 GB    │ $1,200 │
├───────────┼──────────┼──────────┼──────────┼───────────┼────────┤
│ TOTAL     │ 795k     │ 190 GB   │ 88M      │ 3,760 GB  │$11,450 │
└───────────┴──────────┴──────────┴──────────┴───────────┴────────┘

Grafana implementation

Use a table panel with transformations:

  1. Multiple queries (one per signal, grouped by namespace)
  2. Outer join on namespace
  3. Add field from calculation: multiply each volume by its unit price
  4. Sort by total cost descending

Add Grafana variables for time range and team filter so managers can drill down.
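
The same join-and-multiply logic can be prototyped outside Grafana to sanity-check the numbers. The sketch below is an assumption-laden illustration: the signal keys, volumes, and unit prices are all hypothetical, and build_report is not part of any Grafana API.

```python
def build_report(
    volumes_by_signal: dict[str, dict[str, float]],
    unit_prices: dict[str, float],
) -> list[dict]:
    """Outer-join per-team volumes across signals, price them, sort by total cost."""
    teams = set().union(*(per_team.keys() for per_team in volumes_by_signal.values()))
    rows = []
    for team in teams:
        # A team missing from a signal contributes zero for that signal
        costs = {
            signal: per_team.get(team, 0.0) * unit_prices[signal]
            for signal, per_team in volumes_by_signal.items()
        }
        rows.append({"team": team, **costs, "total": sum(costs.values())})
    return sorted(rows, key=lambda row: row["total"], reverse=True)


# Hypothetical volumes (thousand series, GB/day) and unit prices in $ per unit
volumes = {
    "metrics_k_series": {"checkout": 120, "platform": 350},
    "logs_gb_day": {"checkout": 45, "platform": 12, "search": 95},
}
prices = {"metrics_k_series": 2.0, "logs_gb_day": 10.0}

report = build_report(volumes, prices)
for row in report:
    print(f"{row['team']:<10} ${row['total']:,.2f}")
```

The outer join matters: a team that sends only logs still has to appear in the report, with zero in the other columns.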

Automated reporting

  • Grafana Reporting (Enterprise/Cloud): schedule PDF delivery to Slack or email monthly
  • Self-hosted: use the Grafana render endpoint (requires the image renderer plugin) to export the dashboard as a PNG via a cron job:
    # cost-attribution is the dashboard UID; -f makes curl fail on HTTP errors
    curl -fsS -H "Authorization: Bearer $GRAFANA_TOKEN" \
      "https://grafana.internal/render/d/cost-attribution?width=1200&height=800" \
      -o report.png
    
  • Alertmanager: send monthly summary via webhook to Slack with top 5 consumers

Common pitfalls

Pitfall | Why it happens | How to avoid
Missing labels on some signals | Team added OTel SDK to app but didn’t configure resource attributes | Enforce via Collector resource processor; inject from K8s metadata
Label cardinality in attribution | Using pod as attribution label; pods are ephemeral, which creates thousands of "teams" | Attribute by namespace or deployment, never by pod
Sampling distorts attribution | Team A samples at 1%, team B at 100%, so raw span counts don’t reflect true traffic | Normalize by sampling rate: spans * (1 / sampling_rate)
Shared services skew numbers | API gateway or message bus shows as top consumer, but it’s proxying for other teams | Use the backend service's service.name or split by route/topic
Ignoring query cost | Team ingests little data but runs 50 heavy dashboards refreshing every 10s | Track query metrics alongside ingestion
Point-in-time vs. cumulative | Report shows "series count now" but team deleted services mid-month | Use avg_over_time or integrate rate over the billing period
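
The sampling normalization from the table, spans * (1 / sampling_rate), looks like this in practice. The span counts are made up, the helper name is hypothetical, and the formula only holds for uniform probabilistic sampling; tail-based policies that keep all errors need the rate of the probabilistic portion.

```python
def estimated_true_spans(stored_spans: int, sampling_rate: float) -> float:
    """Scale stored span counts back to estimated pre-sampling traffic."""
    if not 0.0 < sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be in (0, 1]")
    return stored_spans / sampling_rate


# Hypothetical monthly stored span counts
team_a = estimated_true_spans(100_000, sampling_rate=0.01)   # samples at 1%
team_b = estimated_true_spans(2_000_000, sampling_rate=1.0)  # samples at 100%

print(f"team A: ~{team_a:,.0f} spans generated")
print(f"team B: ~{team_b:,.0f} spans generated")
```

By stored volume team B looks twenty times larger, yet team A generates about five times more traffic. Report both numbers: stored spans drive the bill, estimated spans describe the workload.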

Cost Dashboard: What to Include

A cost dashboard should answer five questions at a glance:

1. How much are we spending? (Executive summary)

Stat panels showing total estimated cost per signal per month, with month-over-month trend. Use Grafana dashboard variables for unit prices ($cost_per_gb_logs, $cost_per_1k_series, $cost_per_m_spans) so the dashboard adapts to your pricing model.

# Estimated monthly log cost
sum(rate(loki_distributor_bytes_received_total[24h])) * 86400 * 30 / 1e9 * $cost_per_gb_logs

# Estimated monthly metrics cost
prometheus_tsdb_head_series / 1000 * $cost_per_1k_series

# Estimated monthly traces cost
sum(rate(tempo_distributor_spans_received_total[24h])) * 86400 * 30 / 1e6 * $cost_per_m_spans

2. Who generates the most? (Breakdown per team/service)

Stacked bar chart of ingestion volume grouped by namespace or service_name, split by signal type (metrics, logs, traces).

3. What is wasted? (Waste detection)

  • Metrics nobody queries (Grafana Cloud: Adaptive Metrics; self-hosted: check Grafana query logs)
  • DEBUG/TRACE logs in production, often 60-80% of total log volume
  • 100% sampled traces on non-critical services
  • Health-check and readiness probe data across all signals

4. What will it cost next month? (Forecast)

Time series panel with predict_linear() projecting ingestion 30 days ahead, overlaid with a budget threshold line.

5. Storage backend costs

Table panel showing object storage breakdown: Loki chunks, Tempo blocks, Mimir blocks β€” size in TB and estimated cost per month.

OTel Collector as a Cost Gateway

The Collector pipeline is the single best place to control costs. All data flows through it before reaching backends.

Tail-based sampling (traces)

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always keep errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}

      # Always keep slow requests
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 2000}

      # Drop health checks entirely
      - name: drop-health
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/healthz", "/readyz", "/livez"]
          invert_match: true

      # Sample 5% of the rest
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

Log filtering and attribute reduction

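processors:
  filter/drop_debug:
    error_mode: ignore
    logs:
      log_record:
        # Drop TRACE (1-4) and DEBUG (5-8); records with no severity set are kept
        - 'severity_number != SEVERITY_NUMBER_UNSPECIFIED and severity_number < SEVERITY_NUMBER_INFO'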

  transform/reduce_attributes:
    log_statements:
      - context: log
        statements:
          - delete_key(attributes, "http.request.body")
          - delete_key(attributes, "http.response.body")
          - truncate_all(attributes, 256)

Span attribute control

processors:
  attributes/reduce_spans:
    actions:
      - key: db.statement
        action: hash          # Hash SQL queries instead of storing full text
      - key: http.request.body
        action: delete
      - key: http.response.body
        action: delete

Routing by environment

Use different pipelines per environment, with aggressive filtering for dev and careful sampling for production:

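connectors:
  routing:
    default_pipelines: [traces/dev]
    table:
      # The routing connector matches with OTTL statements evaluated on the resource
      - statement: route() where attributes["deployment.environment"] == "production"
        pipelines: [traces/prod]
      - statement: route() where attributes["deployment.environment"] == "staging"
        pipelines: [traces/staging]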

# Dev: aggressive sampling (1%), short retention
# Staging: moderate sampling (10%)
# Production: tail-based sampling (100% errors, 5% normal)

Cost Alerts

Set up alerts to catch cost anomalies before they hit your invoice.

groups:
  - name: observability_cost_alerts
    rules:
      - alert: LogVolumeSpike
        expr: |
          sum(rate(loki_distributor_bytes_received_total[1h])) * 3600 / 1e9
          > 1.5 * sum(rate(loki_distributor_bytes_received_total[7d])) * 3600 / 1e9
        for: 30m
        annotations:
          summary: "Log ingestion 50%+ above 7-day average"

      - alert: CardinalityExplosion
        expr: |
          deriv(prometheus_tsdb_head_series[1h]) * 3600 > 50000
        for: 15m
        annotations:
          summary: "Gaining >50k new series per hour"

      - alert: TraceVolumeAnomaly
        expr: |
          sum(rate(tempo_distributor_spans_received_total[1h]))
          > 2 * sum(rate(tempo_distributor_spans_received_total[24h]))
        for: 15m
        annotations:
          summary: "Trace ingestion 2x above 24h average"

      - alert: ProjectedCostOverBudget
        expr: |
          # Projected GB/day in 30 days * 30 days * $0.50 per GB
          predict_linear(
            sum(rate(loki_distributor_bytes_received_total[1d]))[7d:1h],
            30 * 86400
          ) * 86400 * 30 / 1e9 * 0.50 > 5000
        for: 1h
        annotations:
          summary: "Projected monthly log cost exceeds $5000"

Governance and Chargeback

Technical controls only work long-term with organizational process around them.

Monthly cycle

  1. Cost dashboard generates automated report per team (Grafana Reporting or screenshot to Slack)
  2. Compare actual vs budget per team
  3. Top 3 waste items per team, filed as a ticket to the owning team with specific recommendations
  4. Teams have 2 weeks to address or justify

Quarterly review

  • Audit: what are we monitoring? What is unused? What costs the most?
  • Update sampling policies, retention periods, and filters
  • Review new services: are they instrumented with cost-aware defaults?
  • Update unit cost variables in the dashboard

Cost-aware defaults for new services

Define org-wide defaults so new services don’t start with expensive configurations:

Setting | Default | Override requires
Log level in prod | INFO | Team lead approval
Trace sampling | 5% (tail-based) | SRE approval
Metric scrape interval | 30s | Justification in PR
Span attributes | Max 20, no bodies | Automatic enforcement in Collector
Retention | Per signal table above | Finance approval for extensions
