Overview
Architecture
Prometheus follows a pull-based architecture with a multi-dimensional data model built for modern, dynamic infrastructure.
Core Components
┌─────────────────────────────────────────────────────────────────┐
│                        Prometheus Server                        │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │  Retrieval   │  │   Storage    │  │     HTTP Server      │   │
│  │              │  │              │  │                      │   │
│  │ - Scraping   │  │ - TSDB       │  │ - PromQL API         │   │
│  │ - Service    │  │ - WAL        │  │ - Web UI             │   │
│  │   Discovery  │  │ - Retention  │  │ - /metrics endpoint  │   │
│  └──────┬───────┘  └───────┬──────┘  └──────────────────────┘   │
│         │                  │                                    │
└─────────┼──────────────────┼────────────────────────────────────┘
          │                  │
          ↓                  ↓
  ┌───────────────┐   ┌──────────────┐
  │    Targets    │   │ Alertmanager │
  │               │   │              │
  │ - /metrics    │   │ - Grouping   │
  │ - Exporters   │   │ - Routing    │
  │ - Pushgateway │   │ - Silencing  │
  └───────────────┘   └──────┬───────┘
                             │
                             ↓
                      ┌──────────────┐
                      │ Notifications│
                      │              │
                      │ - Email      │
                      │ - PagerDuty  │
                      │ - Slack      │
                      └──────────────┘
1. Prometheus Server (core component)
- Retrieval: scrapes metrics from targets via HTTP
- Storage: local time-series database (TSDB)
- Query engine: executes PromQL queries
2. Targets (monitored applications/services)
- Instrumented applications: expose a /metrics endpoint
- Exporters: translate third-party metrics to Prometheus format
- Node Exporter (hardware and OS metrics)
- Blackbox Exporter (probing endpoints)
- Custom exporters (databases, message queues, etc.)
- Pushgateway: for short-lived batch jobs (not recommended for regular use)
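
To make the idea of a target concrete, here is a minimal sketch of what an instrumented application might return from its /metrics endpoint, rendered in the Prometheus text exposition format. The helper names (render_counter, escape_label_value) and the sample values are hypothetical; a real application would normally use an official client library rather than hand-rolled formatting.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter family as text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to a number.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample data for illustration only.
body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    {
        (("method", "GET"), ("status", "200")): 1027,
        (("method", "POST"), ("status", "500")): 3,
    },
)
print(body)
```

Each non-comment line is one time series sample: a metric name, a label set, and a value.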
3. Service Discovery
- Static configuration: manually defined targets
- Dynamic discovery: automatic target detection
- Kubernetes
- Consul
- EC2
- Azure
- File-based SD
- Custom integrations
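
File-based SD is often the simplest hook for a custom integration: an external process writes a JSON (or YAML) file listing target groups, and Prometheus watches that file. The sketch below builds such a document as a list of objects with "targets" and "labels" keys. The host addresses, label values, and the build_file_sd helper are assumptions made for illustration.

```python
import json


def build_file_sd(groups):
    """Turn (hosts, labels) pairs into the list-of-target-groups structure
    that file-based service discovery reads."""
    return [
        {"targets": sorted(hosts), "labels": dict(labels)}
        for hosts, labels in groups
    ]


# Hypothetical inventory; in practice this might come from a CMDB or cloud API.
groups = [
    (["10.0.0.5:9100", "10.0.0.4:9100"], {"env": "prod", "team": "infra"}),
]

document = json.dumps(build_file_sd(groups), indent=2)
print(document)
```

The resulting text would be written to a path matched by a file_sd_configs entry in the Prometheus configuration.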
4. Alertmanager (separate component)
- Receives alerts from Prometheus server
- Grouping: combines similar alerts
- Routing: sends alerts to appropriate receivers
- Silencing: temporary muting of alerts
- Inhibition: suppresses alerts based on other alerts
- Deduplication: prevents duplicate notifications
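
As a rough illustration of grouping and deduplication, the sketch below drops alerts with identical label sets and then buckets the rest by a chosen list of labels, similar in spirit to the group_by setting of an Alertmanager route. It is a simplified model, not Alertmanager's actual implementation; the function name and the alert data are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Deduplicate alerts by full label set, then group them by `group_by` labels."""
    groups = defaultdict(list)
    seen = set()
    for alert in alerts:
        fingerprint = tuple(sorted(alert.items()))
        if fingerprint in seen:
            continue  # identical label set already recorded
        seen.add(fingerprint)
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts, each represented only by its labels.
alerts = [
    {"alertname": "HighLatency", "cluster": "eu-1", "instance": "api-1"},
    {"alertname": "HighLatency", "cluster": "eu-1", "instance": "api-2"},
    {"alertname": "HighLatency", "cluster": "eu-1", "instance": "api-1"},
    {"alertname": "DiskFull", "cluster": "us-1", "instance": "db-1"},
]

grouped = group_alerts(alerts, group_by=["alertname", "cluster"])
for key, members in grouped.items():
    print(key, len(members))
```

One notification per group, rather than one per alert, is what keeps a large outage from flooding receivers.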
5. Visualization & Querying
- Built-in Web UI: basic graphs and expression browser
- Grafana: most popular visualization tool
- API clients: custom dashboards and integrations
6. Remote Storage (optional)
- Remote Write: send metrics to long-term storage
- Remote Read: query historical data from external systems
- Integrations: Thanos, Cortex, VictoriaMetrics, Mimir
Data Flow
1. Service Discovery → Prometheus discovers targets
2. Scraping → Prometheus pulls metrics at each scrape_interval (default 1m; 15s is a common setting)
3. Storage → Metrics stored in local TSDB
4. Evaluation → Recording rules and alerting rules evaluated
5. Alertmanager → Triggered alerts sent to Alertmanager
6. Notification → Users receive alerts via configured channels
7. Query → Users/dashboards query metrics via PromQL
Scraping process:
Target (/metrics endpoint)
↓
Prometheus HTTP GET
↓
Parsing (Prometheus text format or protobuf)
↓
Relabeling (metric_relabel_configs)
↓
Ingestion into TSDB
↓
Available for queries
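
The parse, relabel, and ingest steps above can be sketched in a few lines. This is a deliberately simplified model: the parser handles only plain `name{labels} value` lines (no timestamps, no unescaping), the drop rule mimics a metric_relabel_configs entry with action drop on the metric name, and the "TSDB" is just a dict. All function names and the payload are hypothetical.

```python
import re

SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Parse a simple subset of the text format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks, HELP and TYPE lines
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # outside the subset this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def drop_by_name(samples, pattern):
    """Drop samples whose metric name fully matches `pattern`."""
    regex = re.compile(pattern)
    return [s for s in samples if not regex.fullmatch(s[0])]


def ingest(tsdb, samples, timestamp):
    """Append each sample to its series, keyed by name plus sorted labels."""
    for name, labels, value in samples:
        series_key = (name, tuple(sorted(labels.items())))
        tsdb.setdefault(series_key, []).append((timestamp, value))


payload = """# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
go_gc_duration_seconds_count 42
"""

tsdb = {}
ingest(tsdb, drop_by_name(parse_exposition(payload), r"go_.*"), timestamp=1700000000.0)
print(tsdb)
```

Dropping at this stage means the unwanted series never reach storage, which is why metric relabeling is a common tool for controlling cardinality.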
Key Architectural Principles
1. Pull-based model
- Prometheus actively scrapes targets
- Advantages:
- Better control over scrape frequency and timeouts
- Easy to detect if target is down (up metric)
- No need to configure each target with server address
- Targets behind NAT/firewall can still be scraped through a proxy such as PushProx
- Trade-offs:
- Requires network access to targets
- Short-lived jobs need Pushgateway (with caveats)
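
One way to see why the pull model makes down-detection cheap: the scraper itself knows whether each attempt succeeded, so it can record that outcome as a sample. The sketch below returns 1 or 0 alongside the body, in the spirit of the up metric. The function names are hypothetical, the fetcher is injected so the logic can be exercised without a network, and the 10-second timeout is used here as an assumed value.

```python
import urllib.request


def http_fetcher(url):
    """Build a fetch function for a real target URL (hypothetical helper)."""
    def fetch(timeout_seconds):
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.read().decode("utf-8")
    return fetch


def scrape_once(fetch, timeout_seconds=10.0):
    """Return (up, body): up is 1 if the scrape succeeded, otherwise 0."""
    try:
        body = fetch(timeout_seconds)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or undecodable body.
        return 0, ""
    return 1, body


def healthy_target(timeout_seconds):
    return "demo_metric 1\n"


def dead_target(timeout_seconds):
    raise TimeoutError("target did not respond")


print(scrape_once(healthy_target))
print(scrape_once(dead_target))
```

With a push model, silence from a client is ambiguous; here a failed scrape is itself a data point that alerts can be built on.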
2. Multi-dimensional data model
http_requests_total{method="GET", status="200", handler="/api/users"}
- Metric name: identifies what is being measured
- Labels: dimensions for filtering and aggregating
- Flexibility: aggregate across any label dimension
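
Aggregating across a label dimension amounts to discarding the labels you do not care about and summing what remains. The sketch below mimics what `sum by (method) (...)` does over an instant vector; the sum_by helper and the sample values are hypothetical.

```python
from collections import defaultdict


def sum_by(samples, by):
    """Sum sample values, keeping only the labels listed in `by`."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] += value
    return dict(totals)


# Hypothetical samples of http_requests_total at one instant.
samples = [
    ({"method": "GET", "status": "200", "handler": "/api/users"}, 120.0),
    ({"method": "GET", "status": "500", "handler": "/api/users"}, 3.0),
    ({"method": "POST", "status": "200", "handler": "/api/users"}, 40.0),
]

by_method = sum_by(samples, ["method"])
by_status = sum_by(samples, ["status"])
print(by_method)
print(by_status)
```

The same three series answer both "requests per method" and "requests per status" without any change to instrumentation.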
3. Local storage (no clustering)
- TSDB optimized for time series data
- No distributed storage required (simpler operations)
- Retention policy: configurable data retention
- Horizontal scaling: federation or remote storage integrations
4. Powerful query language (PromQL)
rate(http_requests_total[5m])
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
sum by (instance) (rate(cpu_seconds_total{mode!="idle"}[5m]))
- Functional language for time series manipulation
- Built-in functions and aggregation operators: rate, sum, avg, quantile, etc.
- Range vectors: operate on time ranges
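
To illustrate what a range-vector function does, here is a simplified per-second rate over the samples inside a window, including basic counter-reset handling. It is only an approximation of the idea: the real rate() also extrapolates toward the window boundaries, which this sketch omits. The simple_rate name and the sample data are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    `samples` is a time-ordered list of (timestamp_seconds, counter_value).
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # The counter went backwards, so assume it restarted from zero.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical 5-minute window scraped every 60s, with a restart before t=180.
window = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70), (300, 130)]
result = simple_rate(window)
print(result)
```

Because resets are absorbed this way, counters can be restarted freely without producing negative rates.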
5. Pushgateway as exception
- Only for service-level batch jobs
- Not recommended for regular applications
- Reasons: loses automatic health checking, introduces a single point of failure (SPOF)
6. Autonomous operation
- Single binary: easy deployment
- No external dependencies: runs standalone
- Configuration via YAML: simple and declarative
- Reloadable: SIGHUP signal reloads configuration
7. Service discovery integration
- Kubernetes: pod, service, endpoint, node discovery
- Cloud providers: AWS, Azure, GCP auto-discovery
- DNS-SD: DNS-based service discovery
- File-SD: for custom integrations
8. Alert separation
- Prometheus: evaluates alert rules, fires alerts
- Alertmanager: handles alert routing and notification
- Separation of concerns: monitoring and alert management decoupled