Prometheus

Prometheus serves as the short-term metrics hub in our stack. It does not scrape targets directly — Alloy handles all scraping and sends metrics to Prometheus via remote write.

Role in the Stack

Function	Details
Short-term storage	Holds recent metrics (hours) for fast queries
PromQL engine	Serves as a query engine for Grafana dashboards and Drilldown
Remote write receiver	Accepts metrics from Alloy and the Tempo metrics generator
Remote write sender	Forwards all metrics to Mimir for long-term retention
Exemplar storage	Links metric data points to trace IDs for cross-signal navigation

Versions

Prometheus, Alertmanager, Grafana, kube-state-metrics, and node-exporter are all installed together via the kube-prometheus-stack Helm chart.


Component	Version
Chart	`prometheus-community/kube-prometheus-stack` 85.0.1
Prometheus	v3.11.3 (distroless image variant)
Alertmanager	v0.32.1
prometheus-operator	v0.90.1
Grafana	13.0.1 (see grafana.md)
kube-state-metrics	v2.18.0
node-exporter	v1.11.1 (distroless image variant)

Configuration Highlights

Remote write receiver: Enabled. Accepts metrics pushed by Alloy and the Tempo metrics generator
Native histograms: Enabled. Supports the native (sparse, exponential-bucket) histogram format for more accurate and efficient quantile queries
ServiceMonitor scraping: Disabled. Alloy handles all scraping
Remote write to Mimir: http://mimir-gateway.monitoring.svc.cluster.local:80/api/v1/push

What Feeds Into Prometheus

Source	Signal	Path
Alloy	Application metrics (OTLP)	Alloy OTLP receiver → remote write to Prometheus
Alloy	Kubernetes metrics (kube-state-metrics, node-exporter, kubelet)	Alloy scraper → remote write to Prometheus
Alloy	Exporter metrics (PostgreSQL, Redis)	Alloy scraper → remote write to Prometheus
Tempo	Span metrics (RED: rate, errors, duration)	Tempo metrics generator → remote write to Prometheus
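
To make the span-metrics row concrete, here is a minimal sketch that builds RED queries as PromQL strings. The metric names (`traces_spanmetrics_calls_total`, `traces_spanmetrics_latency_bucket`) and the `service` and `status_code` labels are assumed to follow the Tempo metrics-generator defaults and may differ in this deployment. The `checkout` service name is hypothetical.

```python
"""Build RED (rate, errors, duration) PromQL queries over Tempo span metrics.

Assumptions (not confirmed by this page): metric and label names follow the
Tempo metrics-generator defaults. Adjust them to match the actual series.
"""

CALLS = "traces_spanmetrics_calls_total"        # assumed metric name
LATENCY = "traces_spanmetrics_latency_bucket"   # assumed metric name


def red_queries(service: str, window: str = "5m") -> dict[str, str]:
    """Return one PromQL string per RED signal for the given service."""
    sel = f'service="{service}"'
    err = f'{sel}, status_code="STATUS_CODE_ERROR"'  # assumed label value
    return {
        # Rate: requests per second across all spans of the service.
        "rate": f"sum(rate({CALLS}{{{sel}}}[{window}]))",
        # Errors: the same rate, restricted to spans marked as errors.
        "errors": f"sum(rate({CALLS}{{{err}}}[{window}]))",
        # Duration: 95th percentile latency from the classic histogram buckets.
        "duration_p95": (
            f"histogram_quantile(0.95, "
            f"sum by (le) (rate({LATENCY}{{{sel}}}[{window}])))"
        ),
    }


if __name__ == "__main__":
    for name, query in red_queries("checkout").items():
        print(f"{name}: {query}")
```

The resulting strings can be pasted into Grafana Explore against the Prometheus datasource to check which series actually exist.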

What Prometheus Feeds

Destination	Purpose
Mimir	Long-term storage via remote write
Grafana	Short-term queries, dashboards, exemplars
Alertmanager	Receives firing alerts from rules evaluated in Prometheus; handles grouping and routing

Integration with Other Components

Exemplars — When an application emits a metric with a trace ID (via OTel SDK), Prometheus stores it as an exemplar. In Grafana, clicking an exemplar point on a metric graph jumps directly to the corresponding trace in Tempo.
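
The exemplar link described above can also be inspected outside Grafana. The following is a sketch, assuming the in-cluster URL from the Grafana Datasource section and an exemplar label named `trace_id` (the actual label name depends on the SDK and pipeline, and may be `traceID` instead). It uses the Prometheus `query_exemplars` HTTP endpoint.

```python
"""Fetch exemplars from the Prometheus HTTP API and extract trace IDs."""
import json
import urllib.parse
import urllib.request

# Datasource URL from this page; reachable only from inside the cluster.
PROM_URL = "http://prometheus-and-grafana-kub-prometheus.monitoring.svc.cluster.local:9090"


def exemplars_url(base: str, query: str, start: float, end: float) -> str:
    """Build the query_exemplars URL for a PromQL selector and a time range."""
    params = urllib.parse.urlencode({"query": query, "start": start, "end": end})
    return f"{base}/api/v1/query_exemplars?{params}"


def trace_ids(payload: dict, label: str = "trace_id") -> list[str]:
    """Collect trace IDs from a query_exemplars response body.

    `label` is an assumption: check which exemplar label your SDK emits.
    """
    ids = []
    for series in payload.get("data", []):
        for exemplar in series.get("exemplars", []):
            tid = exemplar.get("labels", {}).get(label)
            if tid:
                ids.append(tid)
    return ids


def fetch_trace_ids(query: str, start: float, end: float) -> list[str]:
    """Call Prometheus and return the trace IDs attached to matching series."""
    url = exemplars_url(PROM_URL, query, start, end)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return trace_ids(json.load(resp))
```

Calling `fetch_trace_ids` with a histogram bucket selector and Unix timestamps for `start` and `end` returns the IDs that Grafana would use to jump to Tempo.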

Service Map — Grafana’s Tempo datasource uses Prometheus as the source for the Service Map (node graph) visualization, querying span metrics to render service-to-service dependencies.

Alertmanager: Prometheus evaluates alert rules and sends firing alerts to Alertmanager, which handles deduplication, grouping, silencing, and routing to notification channels.
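
As a rough illustration of what deduplication and grouping mean here, the toy model below drops alerts with identical label sets and then buckets the rest by a set of grouping labels. It is not Alertmanager's implementation, and the alert names and labels are hypothetical.

```python
"""Toy model of Alertmanager-style deduplication and grouping."""
from collections import defaultdict


def group_alerts(alerts: list[dict], group_by: tuple[str, ...]) -> dict[tuple, list[dict]]:
    """Drop duplicate alerts, then group the rest by the given label names."""
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts with an identical label set are treated as the same alert.
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # Alerts sharing the group_by label values land in one notification group.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HighLatency", "service": "api", "pod": "api-1"},
    {"alertname": "HighLatency", "service": "api", "pod": "api-1"},  # duplicate
    {"alertname": "HighLatency", "service": "api", "pod": "api-2"},
    {"alertname": "DiskFull", "service": "db", "pod": "db-0"},
]
for key, members in group_alerts(alerts, ("alertname", "service")).items():
    print(key, len(members))
```

Grouping by `alertname` and `service` means one notification per affected service rather than one per pod.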

Grafana Datasource

Type: prometheus
URL: http://prometheus-and-grafana-kub-prometheus.monitoring.svc.cluster.local:9090
Exemplars: Enabled — links to Tempo by TraceID
Access: NodePort 30090, Ingress at prometheus.<domain>

Alternatives to Prometheus

Prometheus is the de facto standard for metrics in the cloud-native world, but several alternatives exist — each with different trade-offs.

Feature	Prometheus	Mimir	Thanos	VictoriaMetrics	InfluxDB	Datadog
License	Apache 2.0	AGPLv3	Apache 2.0	Apache 2.0 (enterprise features are commercial)	MIT (OSS) / Commercial	Commercial (SaaS)
Architecture	Single binary	Distributed microservices	Sidecar + Store Gateway	Single binary or cluster	Single binary or cluster	Fully managed SaaS
Horizontal scalability	No (single node)	Yes (native)	Yes (via sidecars + object storage)	Yes (cluster mode)	Yes (Enterprise/cluster)	Yes (managed)
Long-term storage	Local disk only	Object storage (S3, GCS, Azure Blob)	Object storage (S3, GCS, Azure Blob)	Local disk (object storage for backups)	Local disk / cloud	Managed
Query language	PromQL	PromQL	PromQL	MetricsQL (PromQL-compatible)	InfluxQL / Flux	Proprietary
Multi-tenancy	No	Yes (native)	Partial (per-tenant label)	Yes (cluster version)	Yes	Yes
High availability	Manual (2 replicas + Thanos/Mimir)	Built-in replication	Built-in deduplication	Built-in replication	Built-in (Enterprise)	Built-in
Resource usage	Moderate	Higher (multiple components)	Moderate + sidecar overhead	Low (very efficient)	Moderate	N/A (SaaS)
Setup complexity	Low	High	Medium–High	Low	Low–Medium	Low (managed)
Best for	Single cluster, short-term metrics	Large-scale, multi-tenant Grafana stack	Adding HA/long-term to existing Prometheus	Cost-efficient, high-volume metrics	IoT, time-series beyond metrics	Full SaaS observability

Mimir

Natural choice when you are already in the Grafana ecosystem and need long-term storage + multi-tenancy. This is exactly how our stack uses it: Prometheus for short-term, Mimir for long-term.

Strengths:

Native PromQL — no compatibility layer, no subtle query differences
Built for multi-tenancy from day one — tenant isolation, per-tenant limits, and billing-ready
Seamless Grafana integration — same team builds both, so new features (e.g., native histograms) land here first
Horizontally scalable with object storage (S3, GCS, Azure Blob) — no local disk bottleneck

Weaknesses:

Operational complexity — runs as multiple microservices (ingester, compactor, store-gateway, querier, distributor). More moving parts = more things to debug and tune
Resource hungry — needs significantly more CPU and memory than Prometheus or VictoriaMetrics for the same workload
AGPLv3 license — may be a legal blocker for some organizations, especially if embedding or offering as a service
Overkill for small setups — if you have one cluster and weeks of retention, Mimir adds complexity without proportional benefit
Grafana Labs dependency — Mimir is primarily driven by one company. Less community governance than CNCF projects

Thanos

A good option if you want to extend existing Prometheus installations with HA and long-term storage without replacing them.

Strengths:

Non-invasive — attaches to existing Prometheus instances via sidecars, no need to rearchitect
Apache 2.0 license — no AGPL concerns
CNCF project with broad community support
Deduplication handles the HA Prometheus replica problem cleanly
Proven at scale at companies like Improbable, Red Hat, and others

Weaknesses:

Sidecar model adds latency — queries span multiple Prometheus + Store Gateway components, which can be slower than a centralized system
Compaction can be painful — the compactor is a single point of contention and can struggle with very high cardinality datasets
Partial multi-tenancy — relies on external labels rather than native tenant isolation, which is less robust than Mimir’s approach
More operational overhead than Mimir — despite being “just sidecars”, running Thanos in production requires tuning Store Gateway, compactor, and query frontend separately
Losing momentum — many teams that started with Thanos are migrating to Mimir. Fewer new deployments, slower feature development

VictoriaMetrics

Excellent performance and lower resource consumption than Prometheus. Largely PromQL-compatible via MetricsQL, though not identical. Popular choice when cost-efficiency matters.

Strengths:

Best-in-class resource efficiency — uses 5–10x less RAM and disk than Prometheus for the same data volume
Simple single-binary deployment (like Prometheus) — low operational overhead
MetricsQL adds useful functions on top of PromQL (e.g., range_median, rollup_rate)
Built-in long-term storage on local disk, with backups to object storage, so there is no need for a separate system like Mimir/Thanos
Supports multiple ingestion protocols (Prometheus remote write, InfluxDB line protocol, OpenTSDB, Graphite)

Weaknesses:

MetricsQL ≠ PromQL: compatible but not identical. Subtle differences in how rate() and increase() handle extrapolation can produce different results. Dashboards and alerts written for Prometheus may silently return different numbers
Some features are enterprise-only: the single-node and cluster versions are both Apache 2.0, but features such as downsampling and automated backup management require a commercial license
Smaller ecosystem — Prometheus is the CNCF standard. Operators, Helm charts, kube-prometheus-stack, and most documentation assume native Prometheus. VM requires extra integration work
Single-vendor risk — one company controls the project. If they change licensing direction (as others have done — Redis, Elasticsearch, HashiCorp), there is no CNCF governance safety net
Alerting depends on Alertmanager: vmalert evaluates alerting and recording rules but has no notification routing or silencing of its own, so production setups still run Alertmanager alongside it
Hiring and knowledge — Prometheus is universally known among SREs. VictoriaMetrics expertise is rarer, meaning more onboarding time and fewer community answers to edge-case problems
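
The increase() difference listed among the weaknesses is easier to see with numbers. The sketch below compares two simplified models: a Prometheus-style increase that extrapolates the observed delta toward the window edges, and a MetricsQL-style increase that uses the last sample before the window as the baseline. Both are illustrative approximations under stated assumptions (no counter resets, no staleness handling), not either engine's actual code, and the sample data is made up.

```python
"""Compare two simplified models of increase() over one lookbehind window."""

Sample = tuple[float, float]  # (timestamp in seconds, counter value)


def increase_extrapolated(samples: list[Sample], start: float, end: float) -> float:
    """Prometheus-style model: scale the observed delta toward the window edges."""
    window = [(t, v) for t, v in samples if start < t <= end]
    if len(window) < 2:
        return float("nan")
    (t0, v0), (t1, v1) = window[0], window[-1]
    delta = v1 - v0
    sampled = t1 - t0
    avg_gap = sampled / (len(window) - 1)
    threshold = avg_gap * 1.1
    to_start = t0 - start
    to_end = end - t1
    # Far from an edge: extend by half a sample interval instead of all the way.
    if to_start >= threshold:
        to_start = avg_gap / 2
    if to_end >= threshold:
        to_end = avg_gap / 2
    # A counter cannot be extrapolated below zero.
    if delta > 0 and v0 >= 0:
        to_start = min(to_start, sampled * (v0 / delta))
    return delta * (sampled + to_start + to_end) / sampled


def increase_with_previous(samples: list[Sample], start: float, end: float) -> float:
    """MetricsQL-style model: last value in the window minus the last value before it."""
    before = [v for t, v in samples if t <= start]
    inside = [v for t, v in samples if start < t <= end]
    if not inside:
        return float("nan")
    base = before[-1] if before else inside[0]
    return inside[-1] - base


# Hypothetical counter scraped every 15 s that increments once, at t=35.
samples = [(-10, 100), (5, 100), (20, 100), (35, 101), (50, 101)]
print(increase_extrapolated(samples, 0, 60))
print(increase_with_previous(samples, 0, 60))
```

For this sample set the first model returns about 1.33 and the second returns exactly 1. An alert threshold such as `increase(...) > 1` would therefore fire under one model and not the other.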

InfluxDB

Better suited for IoT and general time-series workloads where PromQL is not a requirement.

Strengths:

Purpose-built for time-series data with a rich data model (tags, fields, measurements) — more flexible than Prometheus’s label model
Strong in IoT, industrial, and sensor data use cases
InfluxQL is SQL-like and easy to learn for teams coming from relational databases
Good standalone tool with built-in dashboarding (Chronograf) and alerting (Kapacitor) in the TICK stack
Large community and mature project (since 2013)

Weaknesses:

No PromQL — InfluxQL and Flux are completely different query languages. Migrating from Prometheus means rewriting every dashboard and alert from scratch
Flux is being deprecated — InfluxDB 3.0 dropped Flux in favor of SQL and InfluxQL, creating uncertainty for teams that invested in Flux queries
Weak Kubernetes integration: no native ServiceMonitor/PodMonitor support, no kube-state-metrics equivalent. Requires custom Telegraf configurations for Kubernetes metrics
Cluster mode is commercial only — open-source InfluxDB is single-node. High availability and horizontal scaling require InfluxDB Enterprise or InfluxDB Cloud (paid)
Grafana integration is second-class — works, but no exemplar support, no native service map integration, no Explore Metrics/Logs correlation. The Grafana ecosystem assumes Prometheus
Cardinality limits — InfluxDB OSS struggles with high-cardinality data (many unique tag combinations), which is common in Kubernetes environments with dynamic pod names

Datadog

Fully managed SaaS, zero operational overhead, but vendor lock-in and significantly higher cost at scale.

Strengths:

Zero ops — no clusters to manage, no storage to provision, no upgrades to plan. Just install the agent
Unified platform — metrics, logs, traces, profiling, RUM, synthetics, security all in one place with built-in correlation
Excellent out-of-the-box dashboards and integrations (500+) — immediate value with minimal configuration
Strong alerting with anomaly detection, forecasting, and composite monitors
Good onboarding experience — polished UI, extensive documentation, responsive support

Weaknesses:

Cost — pricing is per host, per metric, per log GB, per span. At scale (hundreds of hosts, millions of custom metrics) costs grow dramatically and unpredictably. Bills of $50k–$500k+/year are common
Vendor lock-in — proprietary query language, proprietary data format, no data export. Migrating away means rebuilding everything: dashboards, alerts, SLOs, runbooks
No self-hosted option — data leaves your infrastructure. May be a blocker for regulated industries, data residency requirements, or air-gapped environments
Custom metrics pricing model — Prometheus-style instrumentation that freely creates labels can generate millions of custom metric series, each billed individually. Teams often have to reduce observability to control costs
No PromQL — teams must learn Datadog’s proprietary query syntax. Knowledge does not transfer to other tools, and hiring for “Datadog expertise” is narrower than hiring for “Prometheus/Grafana expertise”
Alert fatigue at scale — while monitors are powerful individually, managing hundreds of monitors across many services often requires significant investment in monitor-as-code tooling (Terraform provider, Datadog Operator)
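
The custom-metrics point above comes down to simple multiplication: each distinct combination of label values is a separate series. The sketch below computes that upper bound for one metric. The label names and cardinalities are hypothetical, and no pricing is modeled.

```python
"""Estimate how many time series one metric can produce from its labels."""
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound: every combination of label values becomes its own series."""
    return prod(label_cardinalities.values())


# Hypothetical label cardinalities for one request-latency metric.
labels = {"service": 20, "endpoint": 50, "status_code": 5, "pod": 30}
print(series_count(labels))  # 20 * 50 * 5 * 30 = 150000
```

Dropping the `pod` label in this example cuts the bound from 150,000 to 5,000 series, which is the kind of trade-off between detail and cost that per-series billing forces.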
