Remote Write

Remote Write is Prometheus’s mechanism for sending collected metrics to external storage systems. This enables long-term storage, cross-datacenter replication, and integration with other monitoring systems.

Why Remote Write?

Prometheus’s local storage limitations:

  • Limited retention - typically 15-30 days (disk space constraints)
  • Single node - no built-in high availability
  • No replication - data loss if server fails
  • Vertical scaling only - can’t distribute load across multiple servers

Remote Write benefits:

  • Unlimited retention - long-term storage in dedicated systems
  • High availability - replicate to multiple endpoints
  • Horizontal scaling - distribute metrics across multiple backends
  • Cross-datacenter replication - disaster recovery
  • Cost optimization - use cheaper storage for historical data
  • Integration - connect to various observability platforms

How Remote Write Works

┌─────────────────────────────────────────────────────┐
│              Prometheus Server                      │
│                                                     │
│  ┌──────────┐    ┌──────────┐    ┌──────────────┐ │
│  │ Scraping │───→│ Local    │───→│ Remote Write │ │
│  │          │    │ TSDB     │    │ Queue        │ │
│  └──────────┘    └──────────┘    └──────┬───────┘ │
└─────────────────────────────────────────┼─────────┘
                                          │
                                          │ HTTP POST
                                          ↓
                        ┌─────────────────────────────┐
                        │   Remote Storage System     │
                        │   (Thanos, Mimir, etc.)     │
                        └─────────────────────────────┘

Process:

  1. Prometheus scrapes metrics and stores locally
  2. Metrics are queued for remote write
  3. Queue batches metrics and compresses (Snappy)
  4. Sends via HTTP POST to remote endpoint(s)
  5. Remote system acknowledges receipt
  6. Queue marks data as sent
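
The batching behavior in steps 3–4 can be sketched in Python. This is a simplified model, not Prometheus's actual implementation: a batch is flushed when it reaches `max_samples_per_send` or when `batch_send_deadline` elapses, whichever comes first.

```python
import time

class RemoteWriteQueue:
    """Simplified model of remote write batching:
    flush when the batch is full OR the deadline has passed."""

    def __init__(self, max_samples_per_send=500, batch_send_deadline=5.0, send=print):
        self.max_samples = max_samples_per_send
        self.deadline = batch_send_deadline
        self.send = send                  # stand-in for the snappy-compressed HTTP POST
        self.batch = []
        self.batch_started = None

    def enqueue(self, sample):
        if not self.batch:
            self.batch_started = time.monotonic()
        self.batch.append(sample)
        if len(self.batch) >= self.max_samples:
            self.flush()

    def tick(self):
        """Called periodically: flush a partial batch once the deadline expires."""
        if self.batch and time.monotonic() - self.batch_started >= self.deadline:
            self.flush()

    def flush(self):
        self.send(self.batch)             # real impl: compress + POST, retry on failure
        self.batch = []

# A full batch is sent immediately, without waiting for the deadline:
sent = []
q = RemoteWriteQueue(max_samples_per_send=3, send=sent.append)
for s in [("up", 1), ("up", 1), ("up", 0), ("up", 1)]:
    q.enqueue(s)
# → sent holds one batch of 3 samples; the 4th sample is still queued
```

The deadline matters for low-traffic series: without it, a half-full batch could wait indefinitely.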

Configuration

Basic Remote Write Setup

# prometheus.yml

# External labels are configured globally, not per remote_write endpoint;
# they are attached to every series sent via remote write
global:
  external_labels:
    cluster: 'prod-cluster-1'
    region: 'us-east-1'

remote_write:
  - url: "https://remote-storage.example.com/api/v1/write"

    # Optional: Write relabeling
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop

Multiple Remote Write Endpoints

remote_write:
  # Primary long-term storage
  - url: "https://thanos.example.com/api/v1/receive"
    queue_config:
      capacity: 10000
      max_shards: 50
      min_shards: 1
      max_samples_per_send: 5000
      batch_send_deadline: 5s

    metadata_config:
      send: true
      send_interval: 1m

  # Secondary analytics platform
  - url: "https://analytics.example.com/write"
    basic_auth:
      username: 'prometheus'
      password_file: /etc/prometheus/password

    # Only send specific metrics
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '(http_requests_total|http_request_duration_seconds_.*)'
        action: keep

  # Third-party monitoring service
  - url: "https://monitoring-service.com/api/prom/push"
    bearer_token_file: /etc/prometheus/bearer_token

    remote_timeout: 30s
    queue_config:
      max_samples_per_send: 1000
      batch_send_deadline: 10s

Configuration Parameters

Connection Settings

remote_write:
  - url: "https://remote-storage.example.com/api/v1/write"

    # Timeout for HTTP requests
    remote_timeout: 30s

    # HTTP headers
    headers:
      X-Custom-Header: "value"

    # Proxy URL
    proxy_url: "http://proxy.example.com:8080"

    # Follow HTTP redirects
    follow_redirects: true

    # HTTP protocol version
    enable_http2: true  # Use HTTP/2 when the server supports it (default: true)

Authentication

Basic Auth:

remote_write:
  - url: "https://remote-storage.example.com/write"
    basic_auth:
      username: 'prometheus'
      password: 'secret'
      # OR use password_file:
      password_file: /etc/prometheus/password

Bearer Token:

remote_write:
  - url: "https://remote-storage.example.com/write"
    # bearer_token / bearer_token_file still work but are deprecated
    # in favour of the generic authorization block:
    authorization:
      type: Bearer
      credentials: "your-token-here"
      # OR use credentials_file:
      credentials_file: /etc/prometheus/bearer_token

OAuth2:

remote_write:
  - url: "https://remote-storage.example.com/write"
    oauth2:
      client_id: "prometheus"
      client_secret: "secret"
      token_url: "https://auth.example.com/oauth/token"
      scopes:
        - "metrics.write"
      endpoint_params:
        audience: "monitoring"

TLS:

remote_write:
  - url: "https://remote-storage.example.com/write"
    tls_config:
      ca_file: /etc/prometheus/ca.pem
      cert_file: /etc/prometheus/client-cert.pem
      key_file: /etc/prometheus/client-key.pem
      insecure_skip_verify: false
      server_name: "remote-storage.example.com"

Sigv4 (AWS):

remote_write:
  - url: "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-xxx/api/v1/remote_write"
    sigv4:
      region: us-east-1
      access_key: "AKIAIOSFODNN7EXAMPLE"
      secret_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      # OR use AWS profile:
      profile: "default"
      role_arn: "arn:aws:iam::123456789012:role/PrometheusRole"

Queue Configuration

Critical for performance and reliability:

remote_write:
  - url: "https://remote-storage.example.com/write"
    queue_config:
      # Samples each shard can buffer before reads from the WAL block
      capacity: 10000  # Default: 10000 (2500 in older releases)

      # Maximum number of concurrent shards
      max_shards: 200  # Default: 50 (200 in older releases)

      # Minimum number of shards
      min_shards: 1    # Default: 1

      # Maximum samples per request
      max_samples_per_send: 5000  # Default: 2000 (500 in older releases)

      # Time to wait before sending (even if batch not full)
      batch_send_deadline: 5s     # Default: 5s

      # Initial retry delay (doubles after each failed attempt)
      min_backoff: 30ms           # Default: 30ms

      # Maximum retry delay
      max_backoff: 5s             # Default: 5s

      # Retry when the endpoint responds with HTTP 429 (Too Many Requests)
      retry_on_http_429: true     # Default: true
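
The min_backoff/max_backoff pair drives exponential retry: the delay roughly doubles after each failed send, starting at min_backoff and capped at max_backoff. A sketch of that schedule (illustrative, not Prometheus's exact code):

```python
def backoff_schedule(min_backoff=0.03, max_backoff=5.0, attempts=10):
    """Exponential backoff between remote write retries:
    double the delay each attempt, capped at max_backoff (seconds)."""
    delay, out = min_backoff, []
    for _ in range(attempts):
        out.append(round(delay, 3))
        delay = min(delay * 2, max_backoff)
    return out

print(backoff_schedule(attempts=6))  # → [0.03, 0.06, 0.12, 0.24, 0.48, 0.96]
```

With the defaults, retries settle at one attempt every 5 seconds during a prolonged outage, while the queue continues to buffer incoming samples.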

Queue tuning guidelines:

High throughput (millions of samples/sec):

queue_config:
  capacity: 100000
  max_shards: 500
  max_samples_per_send: 10000
  batch_send_deadline: 10s

Low latency (real-time streaming):

queue_config:
  capacity: 5000
  max_shards: 50
  max_samples_per_send: 500
  batch_send_deadline: 1s

Resource-constrained (limited CPU/memory):

queue_config:
  capacity: 1000
  max_shards: 10
  max_samples_per_send: 100
  batch_send_deadline: 10s
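
Behind these profiles, Prometheus reshards automatically: it periodically estimates how many shards it needs from the incoming sample rate and the observed per-shard send throughput, clamped between min_shards and max_shards. A rough sketch of that estimate (simplified; the real algorithm also weighs pending samples and smooths the result):

```python
def desired_shards(samples_in_per_sec, samples_out_per_sec_per_shard,
                   min_shards=1, max_shards=200):
    """Simplified resharding estimate: enough shards to drain the
    incoming rate, clamped to the configured bounds."""
    if samples_out_per_sec_per_shard <= 0:
        return max_shards  # nothing is getting through: scale out to the limit
    needed = samples_in_per_sec / samples_out_per_sec_per_shard
    return max(min_shards, min(max_shards, round(needed)))

# 66,666 samples/sec coming in; each shard observed sending ~2,000 samples/sec:
print(desired_shards(66_666, 2_000))  # → 33
```

This is why a slow remote endpoint drives the shard count up: the per-shard throughput drops, so more shards are needed to keep pace.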

Metadata Configuration

remote_write:
  - url: "https://remote-storage.example.com/write"
    metadata_config:
      # Send metric metadata (TYPE, HELP)
      send: true

      # How often to send metadata
      send_interval: 1m

      # Maximum samples per metadata request
      max_samples_per_send: 500

Write Relabeling

Filter metrics before sending:

remote_write:
  - url: "https://remote-storage.example.com/write"
    write_relabel_configs:
      # Drop metrics by name
      - source_labels: [__name__]
        regex: 'go_.*|process_.*'
        action: drop

      # Keep only specific jobs
      - source_labels: [job]
        regex: 'kubernetes-.*|node-exporter'
        action: keep

      # Drop high-cardinality labels
      - regex: 'pod_uid|container_id'
        action: labeldrop

      # Rename labels
      - source_labels: [__name__]
        regex: 'old_metric_name'
        replacement: 'new_metric_name'
        target_label: __name__

      # Add labels
      - target_label: environment
        replacement: 'production'
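
The drop/keep/labeldrop/replace semantics above can be illustrated with a tiny evaluator. This is a sketch of the rule semantics, not Prometheus's implementation (which also supports separators, regex group expansion in replacements, hashmod, and more):

```python
import re

def apply_relabel(labels, configs):
    """Apply write_relabel_configs-style rules to one series' label dict.
    Returns the (possibly modified) labels, or None if the series is dropped."""
    labels = dict(labels)
    for cfg in configs:
        action = cfg.get("action", "replace")
        regex = re.compile(cfg.get("regex", "(.*)"))
        # Concatenate source label values (Prometheus joins with ';' by default)
        value = ";".join(labels.get(l, "") for l in cfg.get("source_labels", []))
        if action == "drop":
            if regex.fullmatch(value):
                return None
        elif action == "keep":
            if not regex.fullmatch(value):
                return None
        elif action == "labeldrop":
            labels = {k: v for k, v in labels.items() if not regex.fullmatch(k)}
        elif action == "replace":
            if regex.fullmatch(value):
                labels[cfg["target_label"]] = cfg.get("replacement", value)
    return labels

series = {"__name__": "go_goroutines", "job": "app", "pod_uid": "abc123"}
rules = [{"source_labels": ["__name__"], "regex": "go_.*", "action": "drop"}]
print(apply_relabel(series, rules))  # → None: the series is dropped
```

Note that with no source_labels and the default regex, a replace rule always fires, which is exactly how the "add labels" pattern above works.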

Remote Write Versions

Remote Write 1.0 (Classic)

Protocol:

  • Protobuf encoding (snappy compressed)
  • HTTP POST to /api/v1/write
  • Series sent as repeated timestamps/values

Limitations:

  • No out-of-order writes support
  • Limited compression efficiency
  • No native histogram support (initially)

Remote Write 2.0

Introduced: specification finalized in 2024; experimental sender support from Prometheus v2.54

Improvements:

  • Better compression - up to 50% reduction in bandwidth
  • Out-of-order samples - handle late-arriving data
  • Native histograms - full support
  • Metadata optimization - deduplicated metadata
  • Backward compatible - servers auto-negotiate version

Enable Remote Write 2.0:

remote_write:
  - url: "https://remote-storage.example.com/write"
    protobuf_message: "io.prometheus.write.v2.Request"  # Default: prometheus.WriteRequest (1.0)

    # Send native histograms
    send_native_histograms: true

    # Send exemplars
    send_exemplars: true

Monitoring Remote Write

Key Metrics

Queue status:

# Current queue size
prometheus_remote_storage_samples_pending

# Queue fullness (pending vs. shards × per-shard capacity)
prometheus_remote_storage_samples_pending
/
(prometheus_remote_storage_shards * prometheus_remote_storage_shard_capacity)

# Shards in use
prometheus_remote_storage_shards

# Dropped samples due to full queue
rate(prometheus_remote_storage_samples_dropped_total[5m])

Throughput:

# Samples sent per second
rate(prometheus_remote_storage_samples_total[5m])

# Samples failed
rate(prometheus_remote_storage_samples_failed_total[5m])

# Samples retried
rate(prometheus_remote_storage_samples_retried_total[5m])

# Bytes sent
rate(prometheus_remote_storage_bytes_total[5m])

Latency:

# Send latency histogram
histogram_quantile(0.99,
  rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[5m])
)

# How far remote write lags behind ingestion (seconds)
prometheus_remote_storage_highest_timestamp_in_seconds
- ignoring(remote_name, url) group_right
  prometheus_remote_storage_queue_highest_sent_timestamp_seconds

Success rate:

# Write failure ratio (should stay near zero)
rate(prometheus_remote_storage_samples_failed_total[5m])
/
rate(prometheus_remote_storage_samples_total[5m])

# Failed samples per second
rate(prometheus_remote_storage_samples_failed_total[5m])

Alerts for Remote Write

groups:
  - name: remote_write
    rules:
      - alert: RemoteWriteBehind
        expr: |
          (
            prometheus_remote_storage_samples_pending
            /
            (prometheus_remote_storage_shards * prometheus_remote_storage_shard_capacity)
          ) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Remote write queue on {{ $labels.instance }} is {{ $value | humanizePercentage }} full"
          description: "Remote write is struggling to keep up. Consider increasing queue capacity or shards."

      - alert: RemoteWriteDropping
        expr: rate(prometheus_remote_storage_samples_dropped_total[5m]) > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Remote write on {{ $labels.instance }} is dropping samples"
          description: "{{ $value }} samples/sec are being dropped. Queue is full."

      - alert: RemoteWriteFailing
        expr: |
          rate(prometheus_remote_storage_samples_failed_total[5m])
          /
          rate(prometheus_remote_storage_samples_total[5m])
          > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Remote write on {{ $labels.instance }} has a {{ $value | humanizePercentage }} failure rate"

      - alert: RemoteWriteSlow
        expr: |
          histogram_quantile(0.99,
            rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[5m])
          ) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Remote write on {{ $labels.instance }} is slow (p99: {{ $value }}s)"

      - alert: RemoteWriteDown
        expr: up{job="remote-storage"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Remote write endpoint {{ $labels.instance }} is down"

Best Practices

1. Use Write Relabeling to Reduce Volume

Filter unnecessary metrics:

write_relabel_configs:
  # Drop debug metrics
  - source_labels: [__name__]
    regex: '.*_debug_.*'
    action: drop

  # Drop high-cardinality labels
  - regex: 'user_id|session_id|request_id'
    action: labeldrop

  # Keep only important metrics
  - source_labels: [__name__]
    regex: '(up|.*_total|.*_errors|.*_duration_.*)'
    action: keep

2. Configure External Labels

Add cluster/datacenter context:

global:
  external_labels:
    cluster: 'prod-k8s-1'
    datacenter: 'us-east-1'
    environment: 'production'

Benefits:

  • Global query filtering
  • Multi-cluster aggregation
  • Deduplication in HA setups

3. Tune Queue for Your Workload

Calculate required capacity:

samples_per_second = total_series × (1 / scrape_interval)
required_capacity = samples_per_second × max_acceptable_delay_seconds

Example:

  • 1M series, 15s scrape interval ≈ 66,666 samples/sec
  • Max 60s delay acceptable ≈ 4M samples of total buffer

# capacity is per shard: 200 shards × 20,000 ≈ 4M samples buffered
queue_config:
  capacity: 20000
  max_shards: 200
  max_samples_per_send: 10000
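
The same capacity math in code, using the hypothetical numbers above:

```python
def required_capacity(total_series, scrape_interval_s, max_delay_s):
    """Total buffer needed to absorb max_delay_s worth of samples
    if the remote endpoint stalls (spread across shards in practice)."""
    samples_per_sec = total_series / scrape_interval_s
    return round(samples_per_sec * max_delay_s)

# 1M series scraped every 15s, tolerating a 60s outage:
print(required_capacity(1_000_000, 15, 60))  # → 4000000
```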

4. Use Multiple Endpoints for HA

remote_write:
  # Primary
  - url: "https://storage-1.example.com/write"
    queue_config:
      capacity: 10000
      max_shards: 50

  # Secondary (same data for HA)
  - url: "https://storage-2.example.com/write"
    queue_config:
      capacity: 10000
      max_shards: 50

  # Analytics (filtered data)
  - url: "https://analytics.example.com/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '(business_.*|user_.*)'
        action: keep

5. Monitor Queue Health

Dashboard queries:

# Queue fullness by endpoint (pending vs. shards × per-shard capacity)
prometheus_remote_storage_samples_pending
/
(prometheus_remote_storage_shards * prometheus_remote_storage_shard_capacity)

# Send rate by endpoint
sum by (url) (rate(prometheus_remote_storage_samples_total[5m]))

# Shard count per endpoint
prometheus_remote_storage_shards

# Latency percentiles
histogram_quantile(0.50, sum by (le, url) (
  rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[5m])
))

6. Handle Backpressure

If remote write can’t keep up:

Option 1: Increase resources

queue_config:
  capacity: 50000      # Increase buffer
  max_shards: 500      # More parallelism

Option 2: Reduce data

write_relabel_configs:
  - source_labels: [__name__]
    regex: 'unnecessary_.*'
    action: drop

Option 3: Downsample at source

scrape_configs:
  - job_name: 'low-priority'
    scrape_interval: 60s  # Scrape less frequently

7. Use Remote Write for Specific Use Cases

✅ Good use cases:

  • Long-term storage (>30 days)
  • Cross-datacenter replication
  • Compliance/audit logs
  • Integration with commercial platforms
  • Multi-tenant data isolation

❌ Avoid for:

  • Real-time querying (use local storage)
  • High-frequency updates (sub-second)
  • Temporary dev/test environments

Troubleshooting

Remote Write Queue Growing

Symptoms:

prometheus_remote_storage_samples_pending > 5000

Causes:

  1. Remote endpoint slow/down
  2. Too few shards
  3. Network issues
  4. Insufficient queue capacity

Solutions:

queue_config:
  max_shards: 200        # Increase parallelism
  capacity: 50000        # Increase buffer
  max_samples_per_send: 10000  # Larger batches

Samples Being Dropped

Symptoms:

rate(prometheus_remote_storage_samples_dropped_total[5m]) > 0

Causes:

  • Queue full
  • Can’t keep up with scrape rate

Solutions:

  1. Increase queue capacity
  2. Filter metrics (write_relabel_configs)
  3. Reduce scrape frequency
  4. Add more remote write endpoints

High Error Rate

Symptoms:

rate(prometheus_remote_storage_failed_samples_total[5m]) > 100

Causes:

  1. Authentication failures
  2. Remote endpoint errors (5xx)
  3. Network connectivity
  4. Invalid data format

Debug:

# Check Prometheus logs
tail -f /var/log/prometheus/prometheus.log | grep "remote_write"

# Test endpoint manually (sample.pb must be a snappy-compressed WriteRequest)
curl -X POST https://remote-storage.example.com/write \
  -H "Content-Type: application/x-protobuf" \
  -H "Content-Encoding: snappy" \
  -H "X-Prometheus-Remote-Write-Version: 0.1.0" \
  --data-binary @sample.pb
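
A connectivity/auth smoke test can also be scripted. This sketch (hypothetical URL and token) sends an empty body, which a real receiver will reject — but any HTTP response at all proves network, TLS, and routing work, while 401/403 points at credentials:

```python
import urllib.request
import urllib.error

def build_probe_request(url, bearer_token=None):
    """Build the same kind of POST a remote write sender issues,
    with an empty body (enough to exercise connectivity and auth)."""
    req = urllib.request.Request(url, data=b"", method="POST")
    req.add_header("Content-Type", "application/x-protobuf")
    req.add_header("Content-Encoding", "snappy")
    req.add_header("X-Prometheus-Remote-Write-Version", "0.1.0")
    if bearer_token:
        req.add_header("Authorization", f"Bearer {bearer_token}")
    return req

def probe(url, bearer_token=None, timeout=10):
    """Return the HTTP status the endpoint answers with.
    400 for the invalid empty body still proves reachability;
    401/403 indicates an authentication problem instead."""
    try:
        with urllib.request.urlopen(build_probe_request(url, bearer_token),
                                    timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # reachable; the request itself was rejected

# e.g. probe("https://remote-storage.example.com/write")  # 400 → reachable, 401 → auth issue
```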

Slow Remote Write

Symptoms:

histogram_quantile(0.99,
  rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[5m])
) > 10

Causes:

  1. Network latency
  2. Remote endpoint overloaded
  3. Too few shards
  4. Large batches

Solutions:

queue_config:
  max_shards: 100
  max_samples_per_send: 1000  # Smaller batches
  batch_send_deadline: 5s

remote_timeout: 30s  # Increase timeout

Security Considerations

1. Use TLS:

remote_write:
  - url: "https://secure-storage.example.com/write"
    tls_config:
      ca_file: /etc/prometheus/ca.pem

2. Authenticate:

remote_write:
  - url: "https://storage.example.com/write"
    bearer_token_file: /etc/prometheus/token  # Don't embed secrets

3. Network policies:

  • Restrict Prometheus → remote write endpoint traffic
  • Use VPN/private networks for cross-datacenter
  • Enable firewall rules

4. Audit logging:

  • Monitor failed authentication attempts
  • Track unusual traffic patterns
  • Alert on configuration changes

5. Least privilege:

  • Use separate credentials per Prometheus instance
  • Grant only write permissions (not read/admin)
  • Rotate credentials regularly

Remote Write vs Federation

Aspect        Remote Write                    Federation
─────────────────────────────────────────────────────────────────────
Direction     Push (Prometheus → storage)     Pull (global Prom ← local Prom)
Latency       Near real-time (seconds)        Periodic (scrape interval)
Storage       Remote system                   Local TSDB
Use case      Long-term storage, HA           Hierarchical aggregation
Data volume   All samples                     Typically aggregates only
Complexity    Simple config                   Requires recording rules
Network       Outbound HTTP                   Inbound scrape

When to use Remote Write:

  • Need long-term storage (>90 days)
  • Want managed/cloud storage
  • Require high availability
  • Multiple destinations

When to use Federation:

  • Building hierarchies
  • Need pull-based model
  • Want to aggregate before sending
  • Firewall restrictions
