# Federation
Federation allows one Prometheus server to scrape selected time series from another Prometheus server. This enables hierarchical monitoring architectures and cross-datacenter metric aggregation.
## Why Federation?
Use cases:
- Hierarchical monitoring - aggregate metrics from multiple datacenters
- Scalability - partition monitoring across multiple Prometheus servers
- Cross-cluster visibility - monitor metrics from different Kubernetes clusters
- Separation of concerns - team-level Prometheus → organization-level Prometheus
- Long-term storage - central Prometheus with longer retention
## Types of Federation

### 1. Hierarchical Federation

The most common pattern: metrics flow upward through the hierarchy.
```
┌─────────────────────────────────────────┐
│           Global Prometheus             │
│      (datacenter-wide aggregates)       │
│            Retention: 1 year            │
└───────────────────┬─────────────────────┘
                    │ Federate
        ┌───────────┼───────────┐
        │           │           │
   ┌────▼───┐  ┌────▼───┐  ┌────▼───┐
   │ Prom 1 │  │ Prom 2 │  │ Prom 3 │
   │  DC-1  │  │  DC-2  │  │  DC-3  │
   │ 30 days│  │ 30 days│  │ 30 days│
   └────────┘  └────────┘  └────────┘
```
Configuration example:

Lower-level Prometheus (scrapes actual targets):

```yaml
# Standard scrape config
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    # ... relabeling rules ...

# Recording rules for aggregation
rule_files:
  - 'aggregation_rules.yml'
```
Recording rules (`aggregation_rules.yml`):

```yaml
groups:
  - name: datacenter_aggregates
    interval: 30s
    rules:
      # Aggregate HTTP requests by datacenter
      - record: datacenter:http_requests:rate5m
        expr: sum by (datacenter, job) (rate(http_requests_total[5m]))

      # Aggregate CPU usage
      - record: datacenter:cpu_usage:avg
        expr: avg by (datacenter) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))

      # Aggregate memory usage
      - record: datacenter:memory_usage:ratio
        expr: |
          sum by (datacenter) (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          /
          sum by (datacenter) (node_memory_MemTotal_bytes)
```
Global Prometheus (federates from the lower-level servers):

```yaml
scrape_configs:
  - job_name: 'federate-dc1'
    scrape_interval: 60s
    scrape_timeout: 50s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Federate only aggregated metrics
        - '{__name__=~"datacenter:.*"}'
        # And critical alerts
        - '{__name__=~"ALERTS.*"}'
    static_configs:
      - targets:
          - 'prometheus-dc1.example.com:9090'
        labels:
          datacenter: 'dc1'

  - job_name: 'federate-dc2'
    scrape_interval: 60s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"datacenter:.*"}'
    static_configs:
      - targets:
          - 'prometheus-dc2.example.com:9090'
        labels:
          datacenter: 'dc2'
```
### 2. Cross-Service Federation

Multiple Prometheus servers monitor different services, with one central instance providing a global view:
```
┌────────────────────────────────────────────┐
│             Central Prometheus             │
│    (cross-service queries & dashboards)    │
└───────┬──────────────┬──────────────┬──────┘
        │              │              │
   ┌────▼───┐     ┌────▼───┐     ┌────▼───┐
   │ Team A │     │ Team B │     │ Team C │
   │  Prom  │     │  Prom  │     │  Prom  │
   └────────┘     └────────┘     └────────┘
```
Configuration:

```yaml
scrape_configs:
  - job_name: 'federate-team-a'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="team-a-services"}'
        - '{job="team-a-databases"}'
    static_configs:
      - targets: ['team-a-prometheus:9090']
        labels:
          team: 'team-a'

  - job_name: 'federate-team-b'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="team-b-services"}'
    static_configs:
      - targets: ['team-b-prometheus:9090']
        labels:
          team: 'team-b'
```
## Federation Configuration Details

### The `/federate` Endpoint

URL: `http://prometheus:9090/federate`

Query parameters:

- `match[]` - a repeated parameter containing PromQL series selectors; only series matching at least one selector are returned
Example request:

```bash
curl -G http://prometheus:9090/federate \
  --data-urlencode 'match[]={job="prometheus"}' \
  --data-urlencode 'match[]={__name__=~"job:.*"}'
```
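The endpoint replies in the Prometheus text exposition format, with an explicit millisecond timestamp on every sample. The series, values, and timestamps below are illustrative only:

```
up{instance="localhost:9090",job="prometheus"} 1 1700000000000
job:http_requests:rate5m{job="api"} 42.5 1700000000000
```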
### `honor_labels` Parameter

A critical setting for federation:

```yaml
honor_labels: true  # Keep labels from the source Prometheus
```

Without `honor_labels: true`:

- Conflicting source labels are prefixed with `exported_`
- The original `job` becomes `exported_job`
- The original `instance` becomes `exported_instance`

With `honor_labels: true`:

- Original labels are preserved as-is
- Essential for maintaining label consistency across the hierarchy
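For example, a series exposed by the source Prometheus as `up{job="api", instance="10.0.0.5:8080"}` would be ingested differently depending on the setting (the federation job name and addresses here are hypothetical):

```
# honor_labels: false -> conflicting source labels renamed
up{job="federate-dc1", instance="prometheus-dc1:9090",
   exported_job="api", exported_instance="10.0.0.5:8080"} 1

# honor_labels: true -> source labels kept as-is
up{job="api", instance="10.0.0.5:8080"} 1
```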
### Match Patterns

Match all metrics:

```yaml
params:
  'match[]':
    - '{__name__=~".+"}'
```

Match specific jobs:

```yaml
params:
  'match[]':
    - '{job="node-exporter"}'
    - '{job="blackbox-exporter"}'
```

Match by metric name pattern:

```yaml
params:
  'match[]':
    - '{__name__=~"node_.*"}'
    - '{__name__=~"container_.*"}'
```

Match pre-aggregated metrics:

```yaml
params:
  'match[]':
    - '{__name__=~".*:.*:.*"}'  # Matches the recording rule naming convention
```

Match multiple conditions:

```yaml
params:
  'match[]':
    - '{job="api",environment="production"}'
    - '{__name__=~"up|http_requests_total"}'
```
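When calling the endpoint directly, each selector goes into its own repeated, URL-encoded `match[]` parameter. A small Python sketch of building such a request URL (the `federate_url` helper and host name are hypothetical, for illustration only):

```python
from urllib.parse import urlencode

def federate_url(base: str, selectors: list[str]) -> str:
    """Build a /federate URL with one match[] parameter per selector."""
    # urlencode on a list of tuples repeats the key for each value,
    # which is exactly what the match[] parameter requires.
    query = urlencode([("match[]", s) for s in selectors])
    return f"{base}/federate?{query}"

url = federate_url(
    "http://prometheus:9090",
    ['{job="node-exporter"}', '{__name__=~"node_.*"}'],
)
print(url)
```

This mirrors what `curl -G --data-urlencode` does under the hood: the brackets in `match[]` and the braces in the selectors are percent-encoded, and the parameter appears once per selector.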
## Federation Best Practices

### 1. Federate Only What You Need

❌ BAD - federate everything:

```yaml
params:
  'match[]':
    - '{__name__=~".+"}'  # Millions of series!
```

✅ GOOD - federate aggregates:

```yaml
params:
  'match[]':
    - '{__name__=~"datacenter:.*"}'  # Only pre-aggregated metrics
    - '{__name__="up"}'              # Health checks
```
### 2. Use Recording Rules for Aggregation

Lower-level Prometheus:

```yaml
groups:
  - name: federation_prep
    interval: 30s
    rules:
      # Pre-aggregate for federation
      - record: cluster:http_requests:rate5m
        expr: sum by (cluster, method, status) (rate(http_requests_total[5m]))
      - record: cluster:cpu_usage:avg
        expr: avg by (cluster) (instance:cpu_usage:rate5m)
```
Benefits:
- Reduced data volume
- Faster federation scrapes
- Lower network bandwidth
- Consistent aggregation logic
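A back-of-the-envelope sketch of the volume reduction (all counts below are hypothetical, for illustration only):

```python
# Rough estimate of how pre-aggregation shrinks the federated series count.
instances = 200              # hypothetical targets per cluster
metrics_per_instance = 500   # hypothetical series per target
raw_series = instances * metrics_per_instance

# After `sum by (cluster, method, status)`, cardinality is bounded by the
# number of label combinations, not by the number of instances.
methods, statuses = 5, 8
aggregated_series = methods * statuses

print(raw_series)                        # 100000
print(aggregated_series)                 # 40
print(raw_series // aggregated_series)   # 2500
```

Under these assumptions, federating aggregates instead of raw series transfers 2500x fewer series per scrape.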
### 3. Adjust Scrape Intervals

Federation scrapes can be slower than normal scrapes:

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 60s  # Longer than normal (usually 15s)
    scrape_timeout: 50s   # Must be < scrape_interval
    metrics_path: '/federate'
```
### 4. Monitor Federation Health

Metrics to track:

```promql
# Federation scrape duration
scrape_duration_seconds{job="federate"}

# Federation scrape success
up{job="federate"}

# Samples scraped via federation
scrape_samples_scraped{job="federate"}

# Federation scrapes rejected for exceeding the sample limit
increase(prometheus_target_scrapes_exceeded_sample_limit_total[5m])
```
Alerts:

```yaml
groups:
  - name: federation
    rules:
      - alert: FederationDown
        expr: up{job=~"federate.*"} == 0
        for: 5m
        annotations:
          summary: "Federation target is down"

      - alert: FederationSlow
        expr: scrape_duration_seconds{job=~"federate.*"} > 50
        for: 10m
        annotations:
          summary: "Federation scrape is slow"
```
### 5. Plan Your Hierarchy

Good hierarchy design:

```
Global  (1 instance)
   ↑
Region  (3-5 instances)
   ↑
Cluster (10-50 instances)
   ↑
Targets (scrape actual services)
```
Each level aggregates for the level above:
- Cluster: raw metrics → cluster aggregates
- Region: cluster aggregates → region aggregates
- Global: region aggregates → global aggregates
### 6. Label Consistency

Ensure consistent labels across levels:

```yaml
# Lower-level Prometheus
global:
  external_labels:
    datacenter: 'us-east-1'
    cluster: 'prod-k8s-1'
    region: 'us-east'
```

```yaml
# Mid-level Prometheus
global:
  external_labels:
    datacenter: 'us-east-1'
    region: 'us-east'
```

```yaml
# Global Prometheus
global:
  external_labels:
    environment: 'production'
```
### 7. Consider Alternatives

Modern alternatives to federation:
Thanos:
- Unlimited retention via object storage
- Global query view across multiple Prometheus
- Deduplication and downsampling
- No federation needed
Cortex/Mimir:
- Horizontally scalable
- Multi-tenancy support
- Long-term storage
- High availability
VictoriaMetrics:
- High performance
- Lower resource usage
- Compatible with Prometheus
- Built-in clustering
When to use federation:
- ✅ Simple hierarchies (2-3 levels)
- ✅ Low data volume
- ✅ Already have multiple Prometheus instances
- ✅ Need cross-cluster aggregates
When to use alternatives:
- ❌ Complex hierarchies (4+ levels)
- ❌ High data volume (millions of series)
- ❌ Need long-term storage (>1 year)
- ❌ Multi-tenant requirements
## Federation Anti-Patterns

❌ Don't federate raw high-cardinality metrics:

```yaml
# BAD - federating all container metrics
params:
  'match[]':
    - '{__name__=~"container_.*"}'
```

❌ Don't create deep hierarchies:

```
Global → Region → Zone → Cluster → Namespace → Service
# Too many levels! 3 levels max recommended.
```

❌ Don't use federation for long-term storage:

```
# BAD - central Prometheus with 5-year retention
# USE - remote write to a dedicated TSDB
```

❌ Don't federate without aggregation:

```
# BAD  - no recording rules, just raw metrics
# GOOD - pre-aggregate, then federate the aggregates
```
## Federation Example: Multi-Datacenter Setup

Scenario: 3 datacenters, 1 global view.

DC-1 Prometheus (`dc1-prometheus.yaml`):

```yaml
global:
  external_labels:
    datacenter: 'dc1'
    region: 'us-east'

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    # ... standard pod scraping ...

rule_files:
  - 'federation_aggregates.yml'
```
Federation aggregates (`federation_aggregates.yml`):

```yaml
groups:
  - name: dc_aggregates
    interval: 30s
    rules:
      - record: dc:http_requests:rate5m
        expr: sum by (datacenter, service, status) (rate(http_requests_total[5m]))
      - record: dc:http_request_duration:p95
        expr: histogram_quantile(0.95, sum by (datacenter, service, le) (rate(http_request_duration_seconds_bucket[5m])))
      - record: dc:cpu_usage:avg
        expr: avg by (datacenter) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))
```
Global Prometheus (`global-prometheus.yaml`):

```yaml
global:
  external_labels:
    environment: 'production'

scrape_configs:
  - job_name: 'federate-dc1'
    honor_labels: true
    metrics_path: '/federate'
    scrape_interval: 60s
    params:
      'match[]':
        - '{__name__=~"dc:.*"}'
        - '{__name__="up"}'
    static_configs:
      - targets: ['dc1-prometheus:9090']

  - job_name: 'federate-dc2'
    honor_labels: true
    metrics_path: '/federate'
    scrape_interval: 60s
    params:
      'match[]':
        - '{__name__=~"dc:.*"}'
        - '{__name__="up"}'
    static_configs:
      - targets: ['dc2-prometheus:9090']

  - job_name: 'federate-dc3'
    honor_labels: true
    metrics_path: '/federate'
    scrape_interval: 60s
    params:
      'match[]':
        - '{__name__=~"dc:.*"}'
        - '{__name__="up"}'
    static_configs:
      - targets: ['dc3-prometheus:9090']

rule_files:
  - 'global_aggregates.yml'
```
Global aggregates (`global_aggregates.yml`):

```yaml
groups:
  - name: global_aggregates
    interval: 60s
    rules:
      - record: global:http_requests:rate5m
        expr: sum by (service, status) (dc:http_requests:rate5m)
      - record: global:cpu_usage:avg
        expr: avg(dc:cpu_usage:avg)
```
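On the global Prometheus, dashboards can then query the federated and re-aggregated series directly, for example:

```promql
# Global request rate per service (all datacenters combined)
sum by (service) (global:http_requests:rate5m)

# Compare request rates across datacenters
sum by (datacenter) (dc:http_requests:rate5m)

# Fleet-wide average CPU usage
global:cpu_usage:avg
```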
This setup provides:
- ✅ Datacenter-level granularity at DC Prometheus
- ✅ Global cross-datacenter view
- ✅ Efficient data transfer (only aggregates)
- ✅ Independent DC operations (DC failures don’t affect others)
- ✅ Scalable architecture