Prometheus
Prometheus serves as the short-term metrics hub in our stack. It does not scrape targets directly — Alloy handles all scraping and sends metrics to Prometheus via remote write.
Role in the Stack
| Function | Details |
|---|---|
| Short-term storage | Holds recent metrics (hours) for fast queries |
| PromQL engine | Serves as a query engine for Grafana dashboards and Drilldown |
| Remote write receiver | Accepts metrics from Alloy and the Tempo metrics generator |
| Remote write sender | Forwards all metrics to Mimir for long-term retention |
| Exemplar storage | Links metric data points to trace IDs for cross-signal navigation |
Versions
Prometheus, Alertmanager, Grafana, kube-state-metrics, and node-exporter are all installed together via the kube-prometheus-stack Helm chart.
| Component | Version |
|---|---|
| Chart | prometheus-community/kube-prometheus-stack 85.0.1 |
| Prometheus | v3.11.3 (distroless image variant) |
| Alertmanager | v0.32.1 |
| prometheus-operator | v0.90.1 |
| Grafana | 13.0.1 (see grafana.md) |
| kube-state-metrics | v2.18.0 |
| node-exporter | v1.11.1 (distroless image variant) |
Configuration Highlights
- Remote write receiver: Enabled — accepts metrics pushed by Alloy
- Native histograms: Enabled — supports the new histogram format for more efficient percentile queries
- ServiceMonitor scraping: Disabled — Alloy handles all scraping
- Remote write to Mimir: `http://mimir-gateway.monitoring.svc.cluster.local:80/api/v1/push`
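As a rough sketch, the remote write receiver and the Mimir forwarding could be expressed as kube-prometheus-stack values along these lines. The keys `prometheus.prometheusSpec.enableRemoteWriteReceiver` and `prometheus.prometheusSpec.remoteWrite` are chart options, but the helper name `build_values` is hypothetical, and the actual values file used in this stack is not shown here, so treat the shape as an assumption. Helm reads values files as YAML, and JSON is valid YAML, so the printed output could be saved and passed with `-f`.

```python
import json

# URL taken from the configuration highlights above.
MIMIR_PUSH_URL = "http://mimir-gateway.monitoring.svc.cluster.local:80/api/v1/push"


def build_values(mimir_url: str = MIMIR_PUSH_URL) -> dict:
    """Return a values override (hypothetical helper) for the two remote write settings."""
    return {
        "prometheus": {
            "prometheusSpec": {
                # Lets Prometheus accept metrics pushed via remote write.
                "enableRemoteWriteReceiver": True,
                # Forwards ingested metrics to Mimir for long-term retention.
                "remoteWrite": [{"url": mimir_url}],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_values(), indent=2))
```

Native histograms and the disabled ServiceMonitor scraping are left out on purpose, since the way they are configured in this stack is not stated above.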
What Feeds Into Prometheus
| Source | Signal | Path |
|---|---|---|
| Alloy | Application metrics (OTLP) | Alloy OTLP receiver → remote write to Prometheus |
| Alloy | Kubernetes metrics (kube-state-metrics, node-exporter, kubelet) | Alloy scraper → remote write to Prometheus |
| Alloy | Exporter metrics (PostgreSQL, Redis) | Alloy scraper → remote write to Prometheus |
| Tempo | Span metrics (RED: rate, errors, duration) | Tempo metrics generator → remote write to Prometheus |
What Prometheus Feeds
| Destination | Purpose |
|---|---|
| Mimir | Long-term storage via remote write |
| Grafana | Short-term queries, dashboards, exemplars |
| Alertmanager | Receives firing alerts from Prometheus rule evaluation for routing |
Integration with Other Components
Exemplars — When an application emits a metric with a trace ID (via OTel SDK), Prometheus stores it as an exemplar. In Grafana, clicking an exemplar point on a metric graph jumps directly to the corresponding trace in Tempo.
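To make the exemplar link concrete, here is a minimal sketch of reading exemplars through the Prometheus HTTP API (`/api/v1/query_exemplars`) and pulling out the trace IDs. The label name `trace_id` is an assumption (it depends on how the SDK and Alloy are configured), and the sample response is hand-written in the shape the API returns rather than captured from this stack.

```python
from urllib.parse import urlencode


def exemplar_query_url(base_url: str, promql: str, start: float, end: float) -> str:
    """Build a query_exemplars request URL for a PromQL selector and a time range."""
    params = urlencode({"query": promql, "start": start, "end": end})
    return f"{base_url}/api/v1/query_exemplars?{params}"


def trace_ids(response: dict, label: str = "trace_id") -> list[str]:
    """Collect trace IDs from a query_exemplars response, skipping exemplars without one."""
    ids = []
    for series in response.get("data", []):
        for exemplar in series.get("exemplars", []):
            trace_id = exemplar.get("labels", {}).get(label)
            if trace_id is not None:
                ids.append(trace_id)
    return ids


# Illustrative response body; metric name and values are made up.
sample = {
    "status": "success",
    "data": [
        {
            "seriesLabels": {"__name__": "http_request_duration_seconds_bucket", "le": "0.5"},
            "exemplars": [
                {"labels": {"trace_id": "abc123"}, "value": "0.42", "timestamp": 1700000000.0},
                {"labels": {}, "value": "0.10", "timestamp": 1700000015.0},
            ],
        }
    ],
}

print(trace_ids(sample))  # ['abc123']
```

Each returned ID is what Grafana uses to open the matching trace in Tempo.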
Service Map — Grafana’s Tempo datasource uses Prometheus as the source for the Service Map (node graph) visualization, querying span metrics to render service-to-service dependencies.
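The kind of query behind the Service Map can be sketched as follows. This assumes Tempo's service-graphs processor is enabled and writes `traces_service_graph_request_total` with `client` and `server` labels; the 5-minute window and the sample response are illustrative, and the function name `edges` is hypothetical.

```python
# PromQL for per-edge request rate; the [5m] window is an arbitrary choice.
SERVICE_GRAPH_QUERY = "sum by (client, server) (rate(traces_service_graph_request_total[5m]))"


def edges(response: dict) -> list[tuple[str, str, float]]:
    """Turn an instant-query vector result into (client, server, requests_per_second) edges."""
    result = []
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        _timestamp, value = sample["value"]  # Prometheus returns the value as a string
        result.append((labels["client"], labels["server"], float(value)))
    return result


# Hand-written response in the shape /api/v1/query returns for a vector.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"client": "frontend", "server": "checkout"}, "value": [1700000000, "12.5"]},
            {"metric": {"client": "checkout", "server": "postgres"}, "value": [1700000000, "30"]},
        ],
    },
}

print(edges(sample_response))
# [('frontend', 'checkout', 12.5), ('checkout', 'postgres', 30.0)]
```

Each tuple corresponds to one arrow in the node graph, with the rate available as the edge weight.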
Alertmanager: Prometheus evaluates alert rules and sends the resulting firing alerts to Alertmanager, which handles deduplication, grouping, silencing, and routing to notification channels.
Grafana Datasource
- Type: `prometheus`
- URL: `http://prometheus-and-grafana-kub-prometheus.monitoring.svc.cluster.local:9090`
- Exemplars: Enabled, links to Tempo by TraceID
- Access: NodePort 30090, Ingress at `prometheus.<domain>`
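For illustration, the datasource settings above could be provisioned with a structure like the one below. `exemplarTraceIdDestinations` is the Grafana provisioning field that links exemplars to a tracing datasource; the Tempo datasource UID `tempo`, the exemplar label name `trace_id`, and the helper name are assumptions, not values confirmed by this document. Grafana provisioning files are YAML, so the JSON output is usable as-is.

```python
import json

PROMETHEUS_URL = "http://prometheus-and-grafana-kub-prometheus.monitoring.svc.cluster.local:9090"


def prometheus_datasource(url: str, tempo_uid: str = "tempo", trace_label: str = "trace_id") -> dict:
    """Build a provisioning document for a Prometheus datasource with exemplar links."""
    return {
        "apiVersion": 1,
        "datasources": [
            {
                "name": "Prometheus",
                "type": "prometheus",
                "access": "proxy",  # Grafana's backend performs the queries
                "url": url,
                "jsonData": {
                    # Exemplar label -> tracing datasource used for the jump to a trace.
                    "exemplarTraceIdDestinations": [
                        {"name": trace_label, "datasourceUid": tempo_uid}
                    ]
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(prometheus_datasource(PROMETHEUS_URL), indent=2))
```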
Alternatives to Prometheus
Prometheus is the de facto standard for metrics in the cloud-native world, but several alternatives exist — each with different trade-offs.
| Feature | Prometheus | Mimir | Thanos | VictoriaMetrics | InfluxDB | Datadog |
|---|---|---|---|---|---|---|
| License | Apache 2.0 | AGPLv3 | Apache 2.0 | Apache 2.0 (Enterprise features commercial) | MIT (OSS) / Commercial | Commercial (SaaS) |
| Architecture | Single binary | Distributed microservices | Sidecar + Store Gateway | Single binary or cluster | Single binary or cluster | Fully managed SaaS |
| Horizontal scalability | No (single node) | Yes (native) | Yes (via sidecars + object storage) | Yes (cluster mode) | Yes (Enterprise/cluster) | Yes (managed) |
| Long-term storage | Local disk only | Object storage (S3, GCS, Azure Blob) | Object storage (S3, GCS, Azure Blob) | Local disk + optional object storage | Local disk / cloud | Managed |
| Query language | PromQL | PromQL | PromQL | MetricsQL (PromQL-compatible) | InfluxQL / Flux | Proprietary |
| Multi-tenancy | No | Yes (native) | Partial (per-tenant label) | Yes (cluster version) | Yes | Yes |
| High availability | Manual (2 replicas + Thanos/Mimir) | Built-in replication | Built-in deduplication | Built-in replication | Built-in (Enterprise) | Built-in |
| Resource usage | Moderate | Higher (multiple components) | Moderate + sidecar overhead | Low (very efficient) | Moderate | N/A (SaaS) |
| Setup complexity | Low | High | Medium–High | Low | Low–Medium | Low (managed) |
| Best for | Single cluster, short-term metrics | Large-scale, multi-tenant Grafana stack | Adding HA/long-term to existing Prometheus | Cost-efficient, high-volume metrics | IoT, time-series beyond metrics | Full SaaS observability |
Mimir
Natural choice when you are already in the Grafana ecosystem and need long-term storage + multi-tenancy. This is exactly how our stack uses it: Prometheus for short-term, Mimir for long-term.
Strengths:
- Native PromQL — no compatibility layer, no subtle query differences
- Built for multi-tenancy from day one — tenant isolation, per-tenant limits, and billing-ready
- Seamless Grafana integration — same team builds both, so new features (e.g., native histograms) land here first
- Horizontally scalable with object storage (S3, GCS, Azure Blob) — no local disk bottleneck
Weaknesses:
- Operational complexity — runs as multiple microservices (ingester, compactor, store-gateway, querier, distributor). More moving parts = more things to debug and tune
- Resource hungry — needs significantly more CPU and memory than Prometheus or VictoriaMetrics for the same workload
- AGPLv3 license — may be a legal blocker for some organizations, especially if embedding or offering as a service
- Overkill for small setups — if you have one cluster and weeks of retention, Mimir adds complexity without proportional benefit
- Grafana Labs dependency — Mimir is primarily driven by one company. Less community governance than CNCF projects
Thanos
A good option if you want to extend existing Prometheus installations with HA and long-term storage without replacing them.
Strengths:
- Non-invasive — attaches to existing Prometheus instances via sidecars, no need to rearchitect
- Apache 2.0 license — no AGPL concerns
- CNCF project with broad community support
- Deduplication handles the HA Prometheus replica problem cleanly
- Proven at scale at companies like Improbable, Red Hat, and others
Weaknesses:
- Sidecar model adds latency — queries span multiple Prometheus + Store Gateway components, which can be slower than a centralized system
- Compaction can be painful — the compactor is a single point of contention and can struggle with very high cardinality datasets
- Partial multi-tenancy — relies on external labels rather than native tenant isolation, which is less robust than Mimir’s approach
- More operational overhead than Mimir — despite being “just sidecars”, running Thanos in production requires tuning Store Gateway, compactor, and query frontend separately
- Losing momentum — many teams that started with Thanos are migrating to Mimir. Fewer new deployments, slower feature development
VictoriaMetrics
Excellent performance and lower resource consumption than Prometheus. Mostly drop-in compatible via MetricsQL, with the caveats listed under Weaknesses. Popular choice when cost-efficiency matters.
Strengths:
- Very high resource efficiency: reported to use 5–10x less RAM and disk than Prometheus for the same data volume, though actual savings depend on the workload
- Simple single-binary deployment (like Prometheus) — low operational overhead
- MetricsQL adds useful functions on top of PromQL (e.g., `range_median`, `rollup_rate`)
- Built-in long-term storage with optional object storage, so no need for a separate system like Mimir/Thanos
- Supports multiple ingestion protocols (Prometheus remote write, InfluxDB line protocol, OpenTSDB, Graphite)
Weaknesses:
- MetricsQL ≠ PromQL: compatible but not identical. Subtle differences in `rate()`, `increase()`, and extrapolation behavior can produce different results. Dashboards and alerts written for Prometheus may silently return wrong numbers
- Some features are commercial: the single-node and cluster versions are both Apache 2.0, but downsampling, enterprise backups, and other Enterprise features require a commercial license
- Smaller ecosystem — Prometheus is the CNCF standard. Operators, Helm charts, kube-prometheus-stack, and most documentation assume native Prometheus. VM requires extra integration work
- Single-vendor risk — one company controls the project. If they change licensing direction (as others have done — Redis, Elasticsearch, HashiCorp), there is no CNCF governance safety net
- Alerting is less mature — vmalert works but is not Alertmanager. Fewer notification integrations, less battle-tested routing and silencing. Most production setups still run Alertmanager alongside
- Hiring and knowledge — Prometheus is universally known among SREs. VictoriaMetrics expertise is rarer, meaning more onboarding time and fewer community answers to edge-case problems
InfluxDB
Better suited for IoT and general time-series workloads where PromQL is not a requirement.
Strengths:
- Purpose-built for time-series data with a rich data model (tags, fields, measurements) — more flexible than Prometheus’s label model
- Strong in IoT, industrial, and sensor data use cases
- InfluxQL is SQL-like and easy to learn for teams coming from relational databases
- Good standalone tool with built-in dashboarding (Chronograf) and alerting (Kapacitor) in the TICK stack
- Large community and mature project (since 2013)
Weaknesses:
- No PromQL — InfluxQL and Flux are completely different query languages. Migrating from Prometheus means rewriting every dashboard and alert from scratch
- Flux is being deprecated — InfluxDB 3.0 dropped Flux in favor of SQL and InfluxQL, creating uncertainty for teams that invested in Flux queries
- Weak Kubernetes integration: no native ServiceMonitor/PodMonitor support, no kube-state-metrics equivalent. Requires custom Telegraf configurations for Kubernetes metrics
- Cluster mode is commercial only — open-source InfluxDB is single-node. High availability and horizontal scaling require InfluxDB Enterprise or InfluxDB Cloud (paid)
- Grafana integration is second-class — works, but no exemplar support, no native service map integration, no Explore Metrics/Logs correlation. The Grafana ecosystem assumes Prometheus
- Cardinality limits — InfluxDB OSS struggles with high-cardinality data (many unique tag combinations), which is common in Kubernetes environments with dynamic pod names
Datadog
Fully managed SaaS, zero operational overhead, but vendor lock-in and significantly higher cost at scale.
Strengths:
- Zero ops — no clusters to manage, no storage to provision, no upgrades to plan. Just install the agent
- Unified platform — metrics, logs, traces, profiling, RUM, synthetics, security all in one place with built-in correlation
- Excellent out-of-the-box dashboards and integrations (500+) — immediate value with minimal configuration
- Strong alerting with anomaly detection, forecasting, and composite monitors
- Good onboarding experience — polished UI, extensive documentation, responsive support
Weaknesses:
- Cost — pricing is per host, per metric, per log GB, per span. At scale (hundreds of hosts, millions of custom metrics) costs grow dramatically and unpredictably. Bills of $50k–$500k+/year are common
- Vendor lock-in: proprietary query language, proprietary data format, and limited bulk data export. Migrating away means rebuilding everything: dashboards, alerts, SLOs, runbooks
- No self-hosted option — data leaves your infrastructure. May be a blocker for regulated industries, data residency requirements, or air-gapped environments
- Custom metrics pricing model — Prometheus-style instrumentation that freely creates labels can generate millions of custom metric series, each billed individually. Teams often have to reduce observability to control costs
- No PromQL — teams must learn Datadog’s proprietary query syntax. Knowledge does not transfer to other tools, and hiring for “Datadog expertise” is narrower than hiring for “Prometheus/Grafana expertise”
- Alert fatigue at scale — while monitors are powerful individually, managing hundreds of monitors across many services often requires significant investment in monitor-as-code tooling (Terraform provider, Datadog Operator)
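The custom-metrics concern above comes down to simple arithmetic: the number of series for one metric is bounded by the product of the distinct values of each label. The sketch below uses made-up label counts purely for illustration; it says nothing about actual Datadog prices, and real series counts are usually lower because not every label combination occurs.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label cardinalities for a single HTTP request counter.
labels = {"service": 40, "endpoint": 25, "status_code": 8, "pod": 50}

print(series_count(labels))  # 400000

# Dropping the per-pod label shrinks the bound by a factor of 50.
without_pod = {name: n for name, n in labels.items() if name != "pod"}
print(series_count(without_pod))  # 8000
```

The same calculation applies to the high-cardinality limits mentioned for other backends, which is why per-pod labels are often the first thing teams drop.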