Overview

Architecture

Prometheus follows a pull-based architecture with a multi-dimensional data model built for modern, dynamic infrastructure.

Core Components

┌─────────────────────────────────────────────────────────────────┐
│                      Prometheus Server                          │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │  Retrieval   │  │   Storage    │  │     HTTP Server      │   │
│  │              │  │              │  │                      │   │
│  │  - Scraping  │  │  - TSDB      │  │  - PromQL API        │   │
│  │  - Service   │  │  - WAL       │  │  - Web UI            │   │
│  │    Discovery │  │  - Retention │  │  - /metrics endpoint │   │
│  └──────┬───────┘  └───────┬──────┘  └──────────────────────┘   │
│         │                  │                                    │
└─────────┼──────────────────┼────────────────────────────────────┘
          │                  │
          ↓                  ↓
  ┌───────────────┐   ┌──────────────┐
  │   Targets     │   │  Alertmanager│
  │               │   │              │
  │ - /metrics    │   │ - Grouping   │
  │ - Exporters   │   │ - Routing    │
  │ - Pushgateway │   │ - Silencing  │
  └───────────────┘   └──────┬───────┘
                             │
                             ↓
                      ┌──────────────┐
                      │ Notifications│
                      │              │
                      │ - Email      │
                      │ - PagerDuty  │
                      │ - Slack      │
                      └──────────────┘

1. Prometheus Server (core component)

  • Retrieval: scrapes metrics from targets via HTTP
  • Storage: local time-series database (TSDB)
  • Query engine: executes PromQL queries

2. Targets (monitored applications/services)

  • Instrumented applications: expose /metrics endpoint
  • Exporters: translate third-party metrics to Prometheus format
    • Node Exporter (hardware and OS metrics)
    • Blackbox Exporter (probing endpoints)
    • Custom exporters (databases, message queues, etc.)
  • Pushgateway: for short-lived batch jobs (not recommended for regular use)
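
As a rough illustration of what an instrumented application exposes, the sketch below serves a /metrics endpoint in the text exposition format using only the Python standard library. The REQUESTS counter, the port, and the helper names are hypothetical; a real service would normally rely on an official client library instead of formatting the output by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter keyed by (method, status).
REQUESTS = {("GET", "200"): 0}


def render_metrics(counters):
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counters.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve(8000) would make http://localhost:8000/metrics scrapeable; the function is not invoked here so the module can be imported safely.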

3. Service Discovery

  • Static configuration: manually defined targets
  • Dynamic discovery: automatic target detection
    • Kubernetes
    • Consul
    • EC2
    • Azure
    • File-based SD
    • Custom integrations
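
File-based SD is the usual hook for custom integrations: an external program writes a JSON (or YAML) file of target groups and Prometheus picks up the changes. A minimal sketch of such a generator follows, assuming a hypothetical inventory mapping and a hypothetical "service" label; only the "targets" and "labels" keys come from the file_sd format itself.

```python
import json
import os
import tempfile


def build_target_groups(inventory):
    """Convert {service: [host:port, ...]} into file_sd target groups."""
    return [
        {"targets": sorted(addresses), "labels": {"service": service}}
        for service, addresses in sorted(inventory.items())
    ]


def write_file_sd(path, inventory):
    """Write the target file via a temp file and rename to avoid partial reads."""
    data = json.dumps(build_target_groups(inventory), indent=2)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(data)
    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
```

Writing to a temporary name first means a file_sd glob such as *.json never matches a half-written file.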

4. Alertmanager (separate component)

  • Receives alerts from Prometheus server
  • Grouping: combines similar alerts
  • Routing: sends alerts to appropriate receivers
  • Silencing: temporary muting of alerts
  • Inhibition: suppresses alerts based on other alerts
  • Deduplication: prevents duplicate notifications
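
Grouping can be pictured as bucketing alerts by a chosen set of labels, which is what the group_by setting on an Alertmanager route controls. The sketch below is a simplified model of that idea, not Alertmanager's implementation; the alert dictionaries and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels named in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts sharing the same values for every group_by label land together.
        key = tuple((name, alert["labels"].get(name, "")) for name in group_by)
        groups[key].append(alert)
    return dict(groups)
```

With group_by set to alertname and cluster, two HighLatency alerts from different instances in the same cluster would end up in one group and therefore one notification.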

5. Visualization & Querying

  • Built-in Web UI: basic graphs and expression browser
  • Grafana: widely used dashboarding tool with a built-in Prometheus data source
  • API clients: custom dashboards and integrations
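
An API client is, at its simplest, an HTTP GET against the server's /api/v1/query endpoint. The following sketch shows one way to issue an instant query with the standard library; the base URL, timeout, and function names are assumptions, and error handling is kept deliberately small.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql, time=None):
    """Build the URL for an instant query against the HTTP API."""
    params = {"query": promql}
    if time is not None:
        params["time"] = str(time)  # optional evaluation timestamp
    return f"{base_url.rstrip('/')}/api/v1/query?{urllib.parse.urlencode(params)}"


def instant_query(base_url, promql, timeout=10.0):
    """Run an instant query and return the list under data.result."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return payload["data"]["result"]
```

For example, instant_query("http://localhost:9090", "up") would return one entry per scraped target, assuming a server is listening at that address.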

6. Remote Storage (optional)

  • Remote Write: send metrics to long-term storage
  • Remote Read: query historical data from external systems
  • Integrations: Thanos, Cortex, VictoriaMetrics, Mimir

Data Flow

1. Service Discovery → Prometheus discovers targets
2. Scraping → Prometheus pulls metrics at each scrape_interval (default 1m; 15s is a common setting)
3. Storage → Metrics stored in local TSDB
4. Evaluation → Recording rules and alerting rules evaluated
5. Alerting → Firing alerts sent to Alertmanager
6. Notification → Users receive alerts via configured channels
7. Query → Users/dashboards query metrics via PromQL

Scraping process:

Target (/metrics endpoint)
    ↓
Prometheus HTTP GET
    ↓
Parsing (Prometheus text format or protobuf)
    ↓
Metric relabeling (metric_relabel_configs, applied per sample before ingestion)
    ↓
Ingestion into TSDB
    ↓
Available for queries
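
The parsing and relabeling steps can be sketched in a few lines. The parser below handles only a simplified subset of the text format (no timestamps, no escaped quotes in label values), and drop_by_name imitates a single metric_relabel_configs rule with action: drop matching on __name__. Both function names are hypothetical.

```python
import re

SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text):
    """Parse a simplified subset of the text format into (name, labels, value)."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"unparseable line: {line!r}")
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def drop_by_name(samples, pattern):
    """Drop samples whose metric name fully matches the regex pattern."""
    regex = re.compile(pattern)
    return [s for s in samples if regex.fullmatch(s[0]) is None]
```

fullmatch is used because relabeling regexes in Prometheus are anchored at both ends.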

Key Architectural Principles

1. Pull-based model

  • Prometheus actively scrapes targets
  • Advantages:
    • Better control over scrape frequency and timeouts
    • Easy to detect when a target is down (its up metric drops to 0)
    • No need to configure each target with the server address
    • Targets behind NAT or a firewall can still be scraped through a proxy such as PushProx
  • Trade-offs:
    • Requires network access to targets
    • Short-lived jobs need Pushgateway (with caveats)
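
To make the health-check advantage concrete, here is a minimal sketch of a single scrape that also derives an up value: 1 when the HTTP GET succeeds, 0 when it fails or times out. The function names are hypothetical and the 10-second timeout mirrors the default scrape_timeout; real Prometheus does considerably more (parsing, staleness handling, extra scrape_* series).

```python
import urllib.request


def http_fetch(target, timeout=10.0):
    """GET http://<target>/metrics; raises an OSError subclass on failure."""
    with urllib.request.urlopen(f"http://{target}/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def scrape_once(target, fetch=http_fetch):
    """Return (up, body): up is 1 if the scrape succeeded, otherwise 0."""
    try:
        body = fetch(target)
    except OSError:
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return 0, ""
    return 1, body
```

Because the server initiates every scrape, a failed fetch is itself a signal, which is why no separate heartbeat from the target is needed.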

2. Multi-dimensional data model

http_requests_total{method="GET", status="200", handler="/api/users"}
  • Metric name: identifies what is being measured
  • Labels: dimensions for filtering and aggregating
  • Flexibility: aggregate across any label dimension
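
Aggregating across a label dimension amounts to grouping series by the labels that are kept and summing the rest away. The sketch below mimics the behavior of sum by (...) over an in-memory list of (labels, value) pairs; the data layout and function name are assumptions for illustration only.

```python
from collections import defaultdict


def sum_by(samples, *by):
    """Sum sample values grouped by the given label names."""
    totals = defaultdict(float)
    for labels, value in samples:
        # Labels not listed in `by` are discarded, merging their series.
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] += value
    return dict(totals)
```

Calling sum_by(samples, "method") collapses the status and handler dimensions, while sum_by(samples) with no label names yields a single grand total.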

3. Local storage (no clustering)

  • TSDB optimized for time series data
  • No distributed storage required (simpler operations)
  • Retention policy: configurable data retention
  • Horizontal scaling: federation or remote storage integrations

4. Powerful query language (PromQL)

rate(http_requests_total[5m])
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  • Functional language for time series manipulation
  • Built-in functions and aggregation operators: rate, histogram_quantile, sum, avg, quantile, etc.
  • Range vectors: operate on time ranges
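
The idea behind rate() over a range vector can be approximated as "total counter increase divided by elapsed time, with resets accounted for". The sketch below implements only that simplified version; the real function additionally extrapolates to the edges of the window, so its results differ slightly. The function name and input layout are hypothetical.

```python
def simple_rate(points):
    """Per-second increase over (timestamp, value) points sorted by time."""
    if len(points) < 2:
        return None  # a rate needs at least two samples in the window
    increase = 0.0
    for (_, prev), (_, curr) in zip(points, points[1:]):
        # A drop means the counter was reset; count the new value from zero.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = points[-1][0] - points[0][0]
    return increase / elapsed
```

For three samples taken 15 seconds apart with values 100, 130, and 160, the increase is 60 over 30 seconds, giving 2.0 per second.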

5. Pushgateway as the exception

  • Only for service-level batch jobs (work not tied to a specific machine or instance)
  • Not recommended for regular applications
  • Reasons: loses automatic health checking via the up metric, introduces a single point of failure
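
For the batch-job case, a push is an HTTP request to the gateway's /metrics/job/<job> path carrying a text-format body. The sketch below only builds that request so it can be inspected without a running gateway; the gateway URL, job name, and function name are assumptions.

```python
import urllib.parse
import urllib.request


def build_push_request(gateway_url, job, body):
    """Build a PUT request that replaces all metrics in the job's group."""
    job_path = urllib.parse.quote(job, safe="")
    url = f"{gateway_url.rstrip('/')}/metrics/job/{job_path}"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",  # PUT replaces the whole group; POST replaces same-name metrics only
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
```

Passing the returned object to urllib.request.urlopen would send it. The pushed values then stay on the gateway until they are overwritten or deleted, which is part of why this path is kept as an exception.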

6. Autonomous operation

  • Single binary: easy deployment
  • No external dependencies: runs standalone
  • Configuration via YAML: simple and declarative
  • Reloadable: a SIGHUP signal (or an HTTP POST to /-/reload when --web.enable-lifecycle is set) reloads configuration

7. Service discovery integration

  • Kubernetes: pod, service, endpoint, node discovery
  • Cloud providers: AWS, Azure, GCP auto-discovery
  • DNS-SD: DNS-based service discovery
  • File-SD: for custom integrations

8. Alert separation

  • Prometheus: evaluates alert rules, fires alerts
  • Alertmanager: handles alert routing and notification
  • Separation of concerns: monitoring and alert management decoupled
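
On the Prometheus side of this split, an alerting rule with a for duration moves from inactive to pending to firing as its expression stays true across evaluations. The sketch below models just that state machine under simplifying assumptions (no keep_firing_for, no per-label-set tracking); the function name and input layout are hypothetical.

```python
def alert_state(evaluations, for_seconds):
    """Return the alert state after a series of (timestamp, is_true) evaluations."""
    active_since = None
    state = "inactive"
    for timestamp, is_true in evaluations:
        if not is_true:
            # The condition cleared, so the pending timer starts over.
            active_since, state = None, "inactive"
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        state = "firing" if held >= for_seconds else "pending"
    return state
```

Only alerts that reach the firing state are handed to Alertmanager, which then applies its own grouping, routing, and silencing.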
