Prometheus

Prometheus serves as the short-term metrics hub in our stack. It does not scrape targets directly — Alloy handles all scraping and sends metrics to Prometheus via remote write.

Role in the Stack

| Function | Details |
| --- | --- |
| Short-term storage | Holds recent metrics (hours) for fast queries |
| PromQL engine | Serves as a query engine for Grafana dashboards and Drilldown |
| Remote write receiver | Accepts metrics from Alloy and Tempo metrics generator |
| Remote write sender | Forwards all metrics to Mimir for long-term retention |
| Exemplar storage | Links metric data points to trace IDs for cross-signal navigation |

Versions

Prometheus, Alertmanager, Grafana, kube-state-metrics, and node-exporter are all installed together via the kube-prometheus-stack Helm chart.

| Component | Version |
| --- | --- |
| Chart | prometheus-community/kube-prometheus-stack 85.0.1 |
| Prometheus | v3.11.3 (distroless image variant) |
| Alertmanager | v0.32.1 |
| prometheus-operator | v0.90.1 |
| Grafana | 13.0.1 (see grafana.md) |
| kube-state-metrics | v2.18.0 |
| node-exporter | v1.11.1 (distroless image variant) |

Configuration Highlights

  • Remote write receiver: Enabled — accepts metrics pushed by Alloy
  • Native histograms: Enabled — supports the new histogram format for more efficient percentile queries
  • ServiceMonitor scraping: Disabled — Alloy handles all scraping
  • Remote write to Mimir: http://mimir-gateway.monitoring.svc.cluster.local:80/api/v1/push
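The two remote-write hops can be written down as plain URLs. The sketch below is a minimal illustration, not part of the stack's tooling. It assumes Prometheus serves its receiver on the standard `/api/v1/write` path once the remote write receiver is enabled. The Prometheus base URL is the one listed under Grafana Datasource, and the helper name `remote_write_receiver_url` is hypothetical.

```python
from urllib.parse import urljoin, urlparse

# In-cluster Prometheus service URL (taken from the Grafana Datasource section).
PROMETHEUS_BASE = "http://prometheus-and-grafana-kub-prometheus.monitoring.svc.cluster.local:9090"
# Mimir push endpoint, as listed in the configuration highlights.
MIMIR_PUSH_URL = "http://mimir-gateway.monitoring.svc.cluster.local:80/api/v1/push"


def remote_write_receiver_url(base_url: str) -> str:
    """Return the URL a remote-write client would push to (hypothetical helper)."""
    # Assumption: /api/v1/write is the receiver path when the feature is enabled.
    return urljoin(base_url.rstrip("/") + "/", "api/v1/write")


# The two hops described above: senders -> Prometheus -> Mimir.
hops = [
    ("Alloy / Tempo -> Prometheus", remote_write_receiver_url(PROMETHEUS_BASE)),
    ("Prometheus -> Mimir", MIMIR_PUSH_URL),
]

for name, url in hops:
    print(f"{name}: {url} (host={urlparse(url).hostname})")
```

Keeping both endpoints in one place makes it easier to spot a mismatch between what Alloy pushes to and what Prometheus forwards to.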

What Feeds Into Prometheus

| Source | Signal | Path |
| --- | --- | --- |
| Alloy | Application metrics (OTLP) | Alloy OTLP receiver → remote write to Prometheus |
| Alloy | Kubernetes metrics (kube-state-metrics, node-exporter, kubelet) | Alloy scraper → remote write to Prometheus |
| Alloy | Exporter metrics (PostgreSQL, Redis) | Alloy scraper → remote write to Prometheus |
| Tempo | Span metrics (RED: rate, errors, duration) | Tempo metrics generator → remote write to Prometheus |
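To make the RED span metrics concrete, here is a minimal sketch that builds PromQL strings for one service. It assumes the default names used by Tempo's span-metrics processor (`traces_spanmetrics_calls_total`, `traces_spanmetrics_latency_bucket`) and the `service` and `status_code` labels. These names depend on the metrics generator configuration and may differ in our stack. The function name `red_queries` is hypothetical.

```python
def red_queries(service: str, window: str = "5m") -> dict[str, str]:
    """Build PromQL strings for rate, errors, and duration of one service."""
    selector = f'service="{service}"'
    # Assumed default metric names from Tempo's span-metrics processor.
    calls = "traces_spanmetrics_calls_total"
    bucket = "traces_spanmetrics_latency_bucket"
    return {
        # Rate: requests per second over the window.
        "rate": f"sum(rate({calls}{{{selector}}}[{window}]))",
        # Errors: requests per second whose span status is an error.
        "errors": (
            f'sum(rate({calls}{{{selector}, status_code="STATUS_CODE_ERROR"}}[{window}]))'
        ),
        # Duration: 95th percentile latency from the histogram buckets.
        "duration_p95": (
            f"histogram_quantile(0.95, sum by (le) (rate({bucket}{{{selector}}}[{window}])))"
        ),
    }


queries = red_queries("checkout")
for name, promql in queries.items():
    print(f"{name}: {promql}")
```

The service name `checkout` is only an example. Any of the generated strings can be pasted into Grafana's query editor against the Prometheus datasource.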

What Prometheus Feeds

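| Destination | Purpose |
| --- | --- |
| Mimir | Long-term storage via remote write |
| Grafana | Short-term queries, dashboards, exemplars |
| Alertmanager | Receives firing alerts from Prometheus rule evaluation for deduplication, grouping, and routing |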

Integration with Other Components

Exemplars: When an application emits a metric sample with a trace ID attached (via the OTel SDK), Prometheus stores that trace ID as an exemplar alongside the sample. In Grafana, clicking an exemplar point on a metric graph jumps directly to the corresponding trace in Tempo.
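Exemplars can also be read directly from the Prometheus HTTP API, which is useful when checking whether trace IDs are arriving at all. The following is a minimal sketch: it builds a request URL for the `/api/v1/query_exemplars` endpoint and extracts trace IDs from a response shaped like the one that endpoint returns. The label name `trace_id`, the sample payload, and both function names are assumptions for illustration.

```python
from urllib.parse import urlencode


def exemplar_query_url(base_url: str, query: str, start: float, end: float) -> str:
    """Build the URL for an exemplar query over a time range (Unix seconds)."""
    params = urlencode({"query": query, "start": start, "end": end})
    return f"{base_url.rstrip('/')}/api/v1/query_exemplars?{params}"


def trace_ids(response: dict, label: str = "trace_id") -> list[str]:
    """Collect trace IDs from an exemplar query response, in response order."""
    ids = []
    for series in response.get("data", []):
        for exemplar in series.get("exemplars", []):
            trace_id = exemplar.get("labels", {}).get(label)
            if trace_id is not None:
                ids.append(trace_id)
    return ids


# Hypothetical payload in the shape the exemplar endpoint returns.
sample_response = {
    "status": "success",
    "data": [
        {
            "seriesLabels": {"__name__": "http_server_duration_bucket", "le": "0.5"},
            "exemplars": [
                {"labels": {"trace_id": "abc123"}, "value": "0.21", "timestamp": 1700000000.0},
                {"labels": {"trace_id": "def456"}, "value": "0.34", "timestamp": 1700000015.0},
            ],
        }
    ],
}

print(trace_ids(sample_response))
```

If the application attaches the trace ID under a different label name, pass it through the `label` parameter.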

Service Map: Grafana’s Tempo datasource uses Prometheus as the source for the Service Map (node graph) visualization, querying the service graph and span metrics written by Tempo’s metrics generator to render service-to-service dependencies.
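As a rough illustration of the data behind the Service Map, the sketch below turns an instant-query result into a list of service-to-service edges. It assumes Tempo's service-graph processor is enabled and emits `traces_service_graph_request_total` with `client` and `server` labels. The sample payload and the function name `edges_from_vector` are hypothetical.

```python
# PromQL that yields one series per client/server pair (assumed metric name).
SERVICE_GRAPH_QUERY = (
    "sum by (client, server) (rate(traces_service_graph_request_total[5m]))"
)


def edges_from_vector(response: dict) -> list[tuple[str, str, float]]:
    """Convert an instant-query vector result into sorted (client, server, rps) edges."""
    edges = []
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        _timestamp, value = sample["value"]  # value arrives as a string
        edges.append(
            (labels.get("client", "unknown"), labels.get("server", "unknown"), float(value))
        )
    return sorted(edges)


# Hypothetical response in the shape returned by /api/v1/query.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"client": "frontend", "server": "checkout"}, "value": [1700000000, "4.0"]},
            {"metric": {"client": "checkout", "server": "payments"}, "value": [1700000000, "2.5"]},
        ],
    },
}

edges = edges_from_vector(sample_response)
for client, server, rps in edges:
    print(f"{client} -> {server}: {rps:.1f} req/s")
```

Grafana performs its own queries for the node graph. This sketch only shows the kind of series those queries rely on.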

Alertmanager: Prometheus evaluates alert rules and sends firing alerts to Alertmanager, which handles deduplication, grouping, silencing, and routing to notification channels.
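The deduplication and grouping steps can be pictured with a toy model. This is not Alertmanager's implementation. It only illustrates the idea that alerts with identical label sets collapse into one, and that the remaining alerts are bucketed by the labels named in a route's `group_by` setting. The alert data and the function name are made up for the example.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict], group_by: list[str]) -> dict[tuple, list[dict]]:
    """Toy model: drop alerts with identical label sets, then bucket by group_by labels."""
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        # An alert is identified by its full label set.
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            continue  # duplicate of an alert already received
        seen.add(fingerprint)
        # Alerts sharing the same group_by label values land in one notification group.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alerts: the first two are identical and should be deduplicated.
alerts = [
    {"alertname": "HighLatency", "service": "checkout", "pod": "checkout-a"},
    {"alertname": "HighLatency", "service": "checkout", "pod": "checkout-a"},
    {"alertname": "HighLatency", "service": "checkout", "pod": "checkout-b"},
    {"alertname": "DiskFull", "service": "db", "pod": "db-0"},
]

groups = group_alerts(alerts, ["alertname", "service"])
for key, members in groups.items():
    print(key, len(members))
```

With these inputs, the two distinct `HighLatency` alerts for `checkout` end up in one group and `DiskFull` in another, so each group would produce a single notification.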

Grafana Datasource

  • Type: prometheus
  • URL: http://prometheus-and-grafana-kub-prometheus.monitoring.svc.cluster.local:9090
  • Exemplars: Enabled — links to Tempo by TraceID
  • Access: NodePort 30090, Ingress at prometheus.<domain>
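The same URL can be queried outside Grafana through the Prometheus HTTP API. The sketch below is a minimal example using only the standard library. It assumes the caller runs inside the cluster (or substitutes the NodePort or Ingress address), and the function names are hypothetical. Nothing is sent over the network until `instant_query` is called.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Datasource URL as configured in Grafana.
PROMETHEUS_URL = "http://prometheus-and-grafana-kub-prometheus.monitoring.svc.cluster.local:9090"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    params = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def instant_query(base_url: str, promql: str, timeout: float = 10.0) -> dict:
    """Run an instant query and return the decoded JSON response."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as response:
        return json.load(response)


# Only builds and prints the URL; call instant_query() to actually send it.
print(instant_query_url(PROMETHEUS_URL, "vector(1)"))
```

Calling `instant_query(PROMETHEUS_URL, "vector(1)")` from a pod in the cluster is a quick way to confirm the service is reachable, since `vector(1)` does not depend on any stored series.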

Alternatives to Prometheus

Prometheus is the de facto standard for metrics in the cloud-native world, but several alternatives exist — each with different trade-offs.

| Feature | Prometheus | Mimir | Thanos | VictoriaMetrics | InfluxDB | Datadog |
| --- | --- | --- | --- | --- | --- | --- |
| License | Apache 2.0 | AGPLv3 | Apache 2.0 | Apache 2.0 (single-node) | MIT (OSS) / Commercial | Commercial (SaaS) |
| Architecture | Single binary | Distributed microservices | Sidecar + Store Gateway | Single binary or cluster | Single binary or cluster | Fully managed SaaS |
| Horizontal scalability | No (single node) | Yes (native) | Yes (via sidecars + object storage) | Yes (cluster mode) | Yes (Enterprise/cluster) | Yes (managed) |
| Long-term storage | Local disk only | Object storage (S3, GCS, Azure Blob) | Object storage (S3, GCS, Azure Blob) | Local disk + optional object storage | Local disk / cloud | Managed |
| Query language | PromQL | PromQL | PromQL | MetricsQL (PromQL-compatible) | InfluxQL / Flux | Proprietary |
| Multi-tenancy | No | Yes (native) | Partial (per-tenant label) | Yes (native in enterprise) | Yes | Yes |
| High availability | Manual (2 replicas + Thanos/Mimir) | Built-in replication | Built-in deduplication | Built-in replication | Built-in (Enterprise) | Built-in |
| Resource usage | Moderate | Higher (multiple components) | Moderate + sidecar overhead | Low (very efficient) | Moderate | N/A (SaaS) |
| Setup complexity | Low | High | Medium–High | Low | Low–Medium | Low (managed) |
| Best for | Single cluster, short-term metrics | Large-scale, multi-tenant Grafana stack | Adding HA/long-term to existing Prometheus | Cost-efficient, high-volume metrics | IoT, time-series beyond metrics | Full SaaS observability |

Mimir

Natural choice when you are already in the Grafana ecosystem and need long-term storage + multi-tenancy. This is exactly how our stack uses it: Prometheus for short-term, Mimir for long-term.

Strengths:

  • Native PromQL — no compatibility layer, no subtle query differences
  • Built for multi-tenancy from day one — tenant isolation, per-tenant limits, and billing-ready
  • Seamless Grafana integration — same team builds both, so new features (e.g., native histograms) land here first
  • Horizontally scalable with object storage (S3, GCS, Azure Blob) — no local disk bottleneck
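Mimir's tenant isolation is driven by the `X-Scope-OrgID` request header. The sketch below shows what a per-tenant query request might look like. It assumes multi-tenancy is enabled and that the gateway exposes the query API under Mimir's default `/prometheus` prefix. The tenant name `team-a` and the function name are hypothetical, and the gateway address is the one listed for remote write.

```python
from urllib.parse import urlencode

# Gateway address as used for remote write (scheme, host, and port only).
MIMIR_GATEWAY = "http://mimir-gateway.monitoring.svc.cluster.local:80"


def tenant_query_request(gateway: str, tenant: str, promql: str) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for an instant query scoped to one tenant."""
    params = urlencode({"query": promql})
    # Assumption: the query API lives under the default /prometheus prefix.
    url = f"{gateway.rstrip('/')}/prometheus/api/v1/query?{params}"
    # The tenant is selected by header, not by URL or label.
    headers = {"X-Scope-OrgID": tenant}
    return url, headers


url, headers = tenant_query_request(MIMIR_GATEWAY, "team-a", "vector(1)")
print(url)
print(headers)
```

Because the tenant is chosen per request, the same PromQL can be run against different tenants by changing only the header value.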

Weaknesses:

  • Operational complexity — runs as multiple microservices (ingester, compactor, store-gateway, querier, distributor). More moving parts = more things to debug and tune
  • Resource hungry — needs significantly more CPU and memory than Prometheus or VictoriaMetrics for the same workload
  • AGPLv3 license — may be a legal blocker for some organizations, especially if embedding or offering as a service
  • Overkill for small setups — if you have one cluster and weeks of retention, Mimir adds complexity without proportional benefit
  • Grafana Labs dependency — Mimir is primarily driven by one company. Less community governance than CNCF projects

Thanos

A good option if you want to extend existing Prometheus installations with HA and long-term storage without replacing them.

Strengths:

  • Non-invasive — attaches to existing Prometheus instances via sidecars, no need to rearchitect
  • Apache 2.0 license — no AGPL concerns
  • CNCF project with broad community support
  • Deduplication handles the HA Prometheus replica problem cleanly
  • Proven at scale at companies like Improbable, Red Hat, and others

Weaknesses:

  • Sidecar model adds latency — queries span multiple Prometheus + Store Gateway components, which can be slower than a centralized system
  • Compaction can be painful — the compactor is a single point of contention and can struggle with very high cardinality datasets
  • Partial multi-tenancy — relies on external labels rather than native tenant isolation, which is less robust than Mimir’s approach
  • More operational overhead than Mimir — despite being “just sidecars”, running Thanos in production requires tuning Store Gateway, compactor, and query frontend separately
  • Losing momentum — many teams that started with Thanos are migrating to Mimir. Fewer new deployments, slower feature development

VictoriaMetrics

Excellent performance and lower resource consumption than Prometheus. Largely PromQL-compatible via MetricsQL, with some behavioral differences. Popular choice when cost-efficiency matters.

Strengths:

  • High resource efficiency: often reported to use several times less RAM and disk than Prometheus for the same data volume (figures of 5–10x are cited, but results depend on the workload)
  • Simple single-binary deployment (like Prometheus) — low operational overhead
  • MetricsQL adds useful functions on top of PromQL (e.g., range_median, rollup_rate)
  • Built-in long-term storage on local disk, with object storage available for backups, so a separate system like Mimir/Thanos is not required
  • Supports multiple ingestion protocols (Prometheus remote write, InfluxDB line protocol, OpenTSDB, Graphite)

Weaknesses:

  • MetricsQL ≠ PromQL — compatible but not identical. Subtle differences in rate(), increase(), extrapolation behavior can produce different results. Dashboards and alerts written for Prometheus may silently return wrong numbers
  • Enterprise features are proprietary: the single-node and cluster versions are Apache 2.0, but features such as downsampling and managed enterprise backups require a commercial license
  • Smaller ecosystem — Prometheus is the CNCF standard. Operators, Helm charts, kube-prometheus-stack, and most documentation assume native Prometheus. VM requires extra integration work
  • Single-vendor risk — one company controls the project. If they change licensing direction (as others have done — Redis, Elasticsearch, HashiCorp), there is no CNCF governance safety net
  • Alerting is less mature — vmalert works but is not Alertmanager. Fewer notification integrations, less battle-tested routing and silencing. Most production setups still run Alertmanager alongside
  • Hiring and knowledge — Prometheus is universally known among SREs. VictoriaMetrics expertise is rarer, meaning more onboarding time and fewer community answers to edge-case problems

InfluxDB

Better suited for IoT and general time-series workloads where PromQL is not a requirement.

Strengths:

  • Purpose-built for time-series data with a rich data model (tags, fields, measurements) — more flexible than Prometheus’s label model
  • Strong in IoT, industrial, and sensor data use cases
  • InfluxQL is SQL-like and easy to learn for teams coming from relational databases
  • Good standalone tool with built-in dashboarding (Chronograf) and alerting (Kapacitor) in the TICK stack
  • Large community and mature project (since 2013)

Weaknesses:

  • No PromQL — InfluxQL and Flux are completely different query languages. Migrating from Prometheus means rewriting every dashboard and alert from scratch
  • Flux is in maintenance mode: InfluxDB 3.0 dropped Flux in favor of SQL and InfluxQL, creating uncertainty for teams that invested in Flux queries
  • Weak Kubernetes integration: no native ServiceMonitor/PodMonitor support, no kube-state-metrics equivalent. Requires custom Telegraf configurations for Kubernetes metrics
  • Cluster mode is commercial only — open-source InfluxDB is single-node. High availability and horizontal scaling require InfluxDB Enterprise or InfluxDB Cloud (paid)
  • Grafana integration is second-class — works, but no exemplar support, no native service map integration, no Explore Metrics/Logs correlation. The Grafana ecosystem assumes Prometheus
  • Cardinality limits — InfluxDB OSS struggles with high-cardinality data (many unique tag combinations), which is common in Kubernetes environments with dynamic pod names

Datadog

Fully managed SaaS, zero operational overhead, but vendor lock-in and significantly higher cost at scale.

Strengths:

  • Zero ops — no clusters to manage, no storage to provision, no upgrades to plan. Just install the agent
  • Unified platform — metrics, logs, traces, profiling, RUM, synthetics, security all in one place with built-in correlation
  • Excellent out-of-the-box dashboards and integrations (500+) — immediate value with minimal configuration
  • Strong alerting with anomaly detection, forecasting, and composite monitors
  • Good onboarding experience — polished UI, extensive documentation, responsive support

Weaknesses:

  • Cost — pricing is per host, per metric, per log GB, per span. At scale (hundreds of hosts, millions of custom metrics) costs grow dramatically and unpredictably. Bills of $50k–$500k+/year are common
  • Vendor lock-in: proprietary query language, proprietary data format, limited bulk data export. Migrating away means rebuilding everything: dashboards, alerts, SLOs, runbooks
  • No self-hosted option — data leaves your infrastructure. May be a blocker for regulated industries, data residency requirements, or air-gapped environments
  • Custom metrics pricing model — Prometheus-style instrumentation that freely creates labels can generate millions of custom metric series, each billed individually. Teams often have to reduce observability to control costs
  • No PromQL — teams must learn Datadog’s proprietary query syntax. Knowledge does not transfer to other tools, and hiring for “Datadog expertise” is narrower than hiring for “Prometheus/Grafana expertise”
  • Alert fatigue at scale — while monitors are powerful individually, managing hundreds of monitors across many services often requires significant investment in monitor-as-code tooling (Terraform provider, Datadog Operator)
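The custom-metrics concern above comes down to simple arithmetic: the number of series one metric can produce is bounded by the product of its label cardinalities. The sketch below works through a made-up example. The label names and counts are hypothetical, and the result is an upper bound on series count, not a pricing figure.

```python
import math


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return math.prod(label_cardinalities.values())


# Hypothetical metric with three labels.
example = {"endpoint": 50, "status_code": 5, "pod": 200}

print(max_series(example))  # 50 * 5 * 200 = 50000

# Dropping the per-pod label shrinks the bound by a factor of 200.
without_pod = {name: count for name, count in example.items() if name != "pod"}
print(max_series(without_pod))  # 50 * 5 = 250
```

This is why per-series billing tends to push teams toward removing high-cardinality labels such as pod names, which is the loss of observability the list above describes.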
