Remote Write

Remote Write is Prometheus’s mechanism for sending collected metrics to external storage systems. This enables long-term storage, cross-datacenter replication, and integration with other monitoring systems.

Why Remote Write?

Prometheus’s local storage limitations:

  • Limited retention - typically 15-30 days (disk space constraints)
  • Single node - no built-in high availability
  • No replication - data loss if server fails
  • Vertical scaling only - can’t distribute load across multiple servers

Remote Write benefits:

  • Unlimited retention - long-term storage in dedicated systems
  • High availability - replicate to multiple endpoints
  • Horizontal scaling - distribute metrics across multiple backends
  • Cross-datacenter replication - disaster recovery
  • Cost optimization - use cheaper storage for historical data
  • Integration - connect to various observability platforms

How Remote Write Works

┌─────────────────────────────────────────────────────┐
│              Prometheus Server                      │
│                                                     │
│  ┌──────────┐    ┌──────────┐    ┌──────────────┐ │
│  │ Scraping │───→│ Local    │───→│ Remote Write │ │
│  │          │    │ TSDB     │    │ Queue        │ │
│  └──────────┘    └──────────┘    └──────┬───────┘ │
└─────────────────────────────────────────┼─────────┘
                                          │
                                          │ HTTP POST
                                          ↓
                        ┌─────────────────────────────┐
                        │   Remote Storage System     │
                        │   (Thanos, Mimir, etc.)     │
                        └─────────────────────────────┘

Process:

  1. Prometheus scrapes metrics and stores locally
  2. Metrics are queued for remote write
  3. Queue batches metrics and compresses (Snappy)
  4. Sends via HTTP POST to remote endpoint(s)
  5. Remote system acknowledges receipt
  6. Queue marks data as sent
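
The batching behavior in steps 3–4 can be sketched in Python. This is a simplified model, not Prometheus's actual implementation: a batch is flushed when it reaches `max_samples_per_send` or when `batch_send_deadline` elapses, whichever comes first.

```python
import time

class RemoteWriteQueue:
    """Simplified model of remote write batching:
    flush when the batch is full OR the deadline has passed."""

    def __init__(self, max_samples_per_send=500, batch_send_deadline=5.0, send=print):
        self.max_samples = max_samples_per_send
        self.deadline = batch_send_deadline
        self.send = send                  # stand-in for the snappy-compressed HTTP POST
        self.batch = []
        self.batch_started = None

    def enqueue(self, sample):
        if not self.batch:
            self.batch_started = time.monotonic()
        self.batch.append(sample)
        if len(self.batch) >= self.max_samples:
            self.flush()

    def tick(self):
        """Called periodically: flush a partial batch once the deadline expires."""
        if self.batch and time.monotonic() - self.batch_started >= self.deadline:
            self.flush()

    def flush(self):
        self.send(self.batch)             # real impl: compress + POST, retry on failure
        self.batch = []

# A full batch is sent immediately, without waiting for the deadline:
sent = []
q = RemoteWriteQueue(max_samples_per_send=3, send=sent.append)
for s in [("up", 1), ("up", 1), ("up", 0), ("up", 1)]:
    q.enqueue(s)
# → sent holds one batch of 3 samples; the 4th sample is still queued
```

The deadline matters for low-traffic series: without it, a half-full batch could wait indefinitely.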

Configuration

Basic Remote Write Setup

# prometheus.yml

# External labels are configured globally, not per remote_write endpoint;
# they are attached to every series sent via remote write
global:
  external_labels:
    cluster: 'prod-cluster-1'
    region: 'us-east-1'

remote_write:
  - url: "https://remote-storage.example.com/api/v1/write"

    # Optional: Write relabeling
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop

Multiple Remote Write Endpoints

remote_write:
  # Primary long-term storage
  - url: "https://thanos.example.com/api/v1/receive"
    queue_config:
      capacity: 10000
      max_shards: 50
      min_shards: 1
      max_samples_per_send: 5000
      batch_send_deadline: 5s

    metadata_config:
      send: true
      send_interval: 1m

  # Secondary analytics platform
  - url: "https://analytics.example.com/write"
    basic_auth:
      username: 'prometheus'
      password_file: /etc/prometheus/password

    # Only send specific metrics
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '(http_requests_total|http_request_duration_seconds_.*)'
        action: keep

  # Third-party monitoring service
  - url: "https://monitoring-service.com/api/prom/push"
    bearer_token_file: /etc/prometheus/bearer_token

    remote_timeout: 30s
    queue_config:
      max_samples_per_send: 1000
      batch_send_deadline: 10s

Configuration Parameters

Connection Settings

remote_write:
  - url: "https://remote-storage.example.com/api/v1/write"

    # Timeout for HTTP requests
    remote_timeout: 30s

    # HTTP headers
    headers:
      X-Custom-Header: "value"

    # Proxy URL
    proxy_url: "http://proxy.example.com:8080"

    # Follow HTTP redirects
    follow_redirects: true

    # HTTP protocol version
    enable_http2: true  # Use HTTP/2 when the server supports it (default: true)

Authentication

Basic Auth:

remote_write:
  - url: "https://remote-storage.example.com/write"
    basic_auth:
      username: 'prometheus'
      password: 'secret'
      # OR use password_file:
      password_file: /etc/prometheus/password

Bearer Token:

remote_write:
  - url: "https://remote-storage.example.com/write"
    # bearer_token / bearer_token_file still work but are deprecated
    # in favour of the generic authorization block:
    authorization:
      type: Bearer
      credentials: "your-token-here"
      # OR use credentials_file:
      credentials_file: /etc/prometheus/bearer_token

OAuth2:

remote_write:
  - url: "https://remote-storage.example.com/write"
    oauth2:
      client_id: "prometheus"
      client_secret: "secret"
      token_url: "https://auth.example.com/oauth/token"
      scopes:
        - "metrics.write"
      endpoint_params:
        audience: "monitoring"

TLS:

remote_write:
  - url: "https://remote-storage.example.com/write"
    tls_config:
      ca_file: /etc/prometheus/ca.pem
      cert_file: /etc/prometheus/client-cert.pem
      key_file: /etc/prometheus/client-key.pem
      insecure_skip_verify: false
      server_name: "remote-storage.example.com"

Sigv4 (AWS):

remote_write:
  - url: "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-xxx/api/v1/remote_write"
    sigv4:
      region: us-east-1
      access_key: "AKIAIOSFODNN7EXAMPLE"
      secret_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      # OR use AWS profile:
      profile: "default"
      role_arn: "arn:aws:iam::123456789012:role/PrometheusRole"

Queue Configuration

Critical for performance and reliability:

remote_write:
  - url: "https://remote-storage.example.com/write"
    queue_config:
      # Samples each shard can buffer before reads from the WAL block
      capacity: 10000  # Default: 10000 (2500 in older releases)

      # Maximum number of concurrent shards
      max_shards: 200  # Default: 50 (200 in older releases)

      # Minimum number of shards
      min_shards: 1    # Default: 1

      # Maximum samples per request
      max_samples_per_send: 5000  # Default: 2000 (500 in older releases)

      # Time to wait before sending (even if batch not full)
      batch_send_deadline: 5s     # Default: 5s

      # Initial retry delay (doubles after each failed attempt)
      min_backoff: 30ms           # Default: 30ms

      # Maximum retry delay
      max_backoff: 5s             # Default: 5s

      # Retry when the endpoint responds with HTTP 429 (Too Many Requests)
      retry_on_http_429: true     # Default: true
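
The min_backoff/max_backoff pair drives exponential retry: the delay roughly doubles after each failed send, starting at min_backoff and capped at max_backoff. A sketch of that schedule (illustrative, not Prometheus's exact code):

```python
def backoff_schedule(min_backoff=0.03, max_backoff=5.0, attempts=10):
    """Exponential backoff between remote write retries:
    double the delay each attempt, capped at max_backoff (seconds)."""
    delay, out = min_backoff, []
    for _ in range(attempts):
        out.append(round(delay, 3))
        delay = min(delay * 2, max_backoff)
    return out

print(backoff_schedule(attempts=6))  # → [0.03, 0.06, 0.12, 0.24, 0.48, 0.96]
```

With the defaults, retries settle at one attempt every 5 seconds during a prolonged outage, while the queue continues to buffer incoming samples.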

Queue tuning guidelines:

High throughput (millions of samples/sec):

queue_config:
  capacity: 100000
  max_shards: 500
  max_samples_per_send: 10000
  batch_send_deadline: 10s

Low latency (real-time streaming):

queue_config:
  capacity: 5000
  max_shards: 50
  max_samples_per_send: 500
  batch_send_deadline: 1s

Resource-constrained (limited CPU/memory):

queue_config:
  capacity: 1000
  max_shards: 10
  max_samples_per_send: 100
  batch_send_deadline: 10s
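
Behind these profiles, Prometheus reshards automatically: it periodically estimates how many shards it needs from the incoming sample rate and the observed per-shard send throughput, clamped between min_shards and max_shards. A rough sketch of that estimate (simplified; the real algorithm also weighs pending samples and smooths the result):

```python
def desired_shards(samples_in_per_sec, samples_out_per_sec_per_shard,
                   min_shards=1, max_shards=200):
    """Simplified resharding estimate: enough shards to drain the
    incoming rate, clamped to the configured bounds."""
    if samples_out_per_sec_per_shard <= 0:
        return max_shards  # nothing is getting through: scale out to the limit
    needed = samples_in_per_sec / samples_out_per_sec_per_shard
    return max(min_shards, min(max_shards, round(needed)))

# 66,666 samples/sec coming in; each shard observed sending ~2,000 samples/sec:
print(desired_shards(66_666, 2_000))  # → 33
```

This is why a slow remote endpoint drives the shard count up: the per-shard throughput drops, so more shards are needed to keep pace.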

Metadata Configuration

remote_write:
  - url: "https://remote-storage.example.com/write"
    metadata_config:
      # Send metric metadata (TYPE, HELP)
      send: true

      # How often to send metadata
      send_interval: 1m

      # Maximum samples per metadata request
      max_samples_per_send: 500

Write Relabeling

Filter metrics before sending:

remote_write:
  - url: "https://remote-storage.example.com/write"
    write_relabel_configs:
      # Drop metrics by name
      - source_labels: [__name__]
        regex: 'go_.*|process_.*'
        action: drop

      # Keep only specific jobs
      - source_labels: [job]
        regex: 'kubernetes-.*|node-exporter'
        action: keep

      # Drop high-cardinality labels
      - regex: 'pod_uid|container_id'
        action: labeldrop

      # Rename labels
      - source_labels: [__name__]
        regex: 'old_metric_name'
        replacement: 'new_metric_name'
        target_label: __name__

      # Add labels
      - target_label: environment
        replacement: 'production'
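
The drop/keep/labeldrop/replace semantics above can be illustrated with a tiny evaluator. This is a sketch of the rule semantics, not Prometheus's implementation (which also supports separators, regex group expansion in replacements, hashmod, and more):

```python
import re

def apply_relabel(labels, configs):
    """Apply write_relabel_configs-style rules to one series' label dict.
    Returns the (possibly modified) labels, or None if the series is dropped."""
    labels = dict(labels)
    for cfg in configs:
        action = cfg.get("action", "replace")
        regex = re.compile(cfg.get("regex", "(.*)"))
        # Concatenate source label values (Prometheus joins with ';' by default)
        value = ";".join(labels.get(l, "") for l in cfg.get("source_labels", []))
        if action == "drop":
            if regex.fullmatch(value):
                return None
        elif action == "keep":
            if not regex.fullmatch(value):
                return None
        elif action == "labeldrop":
            labels = {k: v for k, v in labels.items() if not regex.fullmatch(k)}
        elif action == "replace":
            if regex.fullmatch(value):
                labels[cfg["target_label"]] = cfg.get("replacement", value)
    return labels

series = {"__name__": "go_goroutines", "job": "app", "pod_uid": "abc123"}
rules = [{"source_labels": ["__name__"], "regex": "go_.*", "action": "drop"}]
print(apply_relabel(series, rules))  # → None: the series is dropped
```

Note that with no source_labels and the default regex, a replace rule always fires, which is exactly how the "add labels" pattern above works.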

Remote Write Versions

Remote Write 1.0 (Classic)

Protocol:

  • Protobuf encoding (snappy compressed)
  • HTTP POST to /api/v1/write
  • Series sent as repeated timestamps/values

Limitations:

  • No out-of-order writes support
  • Limited compression efficiency
  • No native histogram support (initially)

Remote Write 2.0

Introduced: specification finalized in 2024; experimental sender support from Prometheus v2.54

Improvements:

  • Better compression - up to 50% reduction in bandwidth
  • Out-of-order samples - handle late-arriving data
  • Native histograms - full support
  • Metadata optimization - deduplicated metadata
  • Backward compatible - servers auto-negotiate version

Enable Remote Write 2.0:

remote_write:
  - url: "https://remote-storage.example.com/write"
    protobuf_message: "io.prometheus.write.v2.Request"  # Default: prometheus.WriteRequest (1.0)

    # Send native histograms
    send_native_histograms: true

    # Send exemplars
    send_exemplars: true

Monitoring Remote Write

Key Metrics

Queue status:

# Current queue size
prometheus_remote_storage_samples_pending

# Queue fullness (pending vs. shards × per-shard capacity)
prometheus_remote_storage_samples_pending
/
(prometheus_remote_storage_shards * prometheus_remote_storage_shard_capacity)

# Shards in use
prometheus_remote_storage_shards

# Dropped samples due to full queue
rate(prometheus_remote_storage_samples_dropped_total[5m])

Throughput:

# Samples sent per second
rate(prometheus_remote_storage_samples_total[5m])

# Samples failed
rate(prometheus_remote_storage_samples_failed_total[5m])

# Samples retried
rate(prometheus_remote_storage_samples_retried_total[5m])

# Bytes sent
rate(prometheus_remote_storage_bytes_total[5m])

Latency:

# Send latency histogram
histogram_quantile(0.99,
  rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[5m])
)

# How far remote write lags behind ingestion (seconds)
prometheus_remote_storage_highest_timestamp_in_seconds
- ignoring(remote_name, url) group_right
  prometheus_remote_storage_queue_highest_sent_timestamp_seconds

Success rate:

# Write failure ratio (should stay near zero)
rate(prometheus_remote_storage_samples_failed_total[5m])
/
rate(prometheus_remote_storage_samples_total[5m])

# Failed samples per second
rate(prometheus_remote_storage_samples_failed_total[5m])

Alerts for Remote Write

groups:
  - name: remote_write
    rules:
      - alert: RemoteWriteBehind
        expr: |
          (
            prometheus_remote_storage_samples_pending
            /
            (prometheus_remote_storage_shards * prometheus_remote_storage_shard_capacity)
          ) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Remote write queue on {{ $labels.instance }} is {{ $value | humanizePercentage }} full"
          description: "Remote write is struggling to keep up. Consider increasing queue capacity or shards."

      - alert: RemoteWriteDropping
        expr: rate(prometheus_remote_storage_samples_dropped_total[5m]) > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Remote write on {{ $labels.instance }} is dropping samples"
          description: "{{ $value }} samples/sec are being dropped. Queue is full."

      - alert: RemoteWriteFailing
        expr: |
          rate(prometheus_remote_storage_samples_failed_total[5m])
          /
          rate(prometheus_remote_storage_samples_total[5m])
          > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Remote write on {{ $labels.instance }} has a {{ $value | humanizePercentage }} failure rate"

      - alert: RemoteWriteSlow
        expr: |
          histogram_quantile(0.99,
            rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[5m])
          ) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Remote write on {{ $labels.instance }} is slow (p99: {{ $value }}s)"

      - alert: RemoteWriteDown
        expr: up{job="remote-storage"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Remote write endpoint {{ $labels.instance }} is down"

Best Practices

1. Use Write Relabeling to Reduce Volume

Filter unnecessary metrics:

write_relabel_configs:
  # Drop debug metrics
  - source_labels: [__name__]
    regex: '.*_debug_.*'
    action: drop

  # Drop high-cardinality labels
  - regex: 'user_id|session_id|request_id'
    action: labeldrop

  # Keep only important metrics
  - source_labels: [__name__]
    regex: '(up|.*_total|.*_errors|.*_duration_.*)'
    action: keep

2. Configure External Labels

Add cluster/datacenter context:

global:
  external_labels:
    cluster: 'prod-k8s-1'
    datacenter: 'us-east-1'
    environment: 'production'

Benefits:

  • Global query filtering
  • Multi-cluster aggregation
  • Deduplication in HA setups

3. Tune Queue for Your Workload

Calculate required capacity:

samples_per_second = total_series × (1 / scrape_interval)
required_capacity = samples_per_second × max_acceptable_delay_seconds

Example:

  • 1M series, 15s scrape interval ≈ 66,666 samples/sec
  • Max 60s delay acceptable ≈ 4M samples of total buffer

# capacity is per shard: 200 shards × 20,000 ≈ 4M samples buffered
queue_config:
  capacity: 20000
  max_shards: 200
  max_samples_per_send: 10000
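
The same capacity math in code, using the hypothetical numbers above:

```python
def required_capacity(total_series, scrape_interval_s, max_delay_s):
    """Total buffer needed to absorb max_delay_s worth of samples
    if the remote endpoint stalls (spread across shards in practice)."""
    samples_per_sec = total_series / scrape_interval_s
    return round(samples_per_sec * max_delay_s)

# 1M series scraped every 15s, tolerating a 60s outage:
print(required_capacity(1_000_000, 15, 60))  # → 4000000
```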

4. Use Multiple Endpoints for HA

remote_write:
  # Primary
  - url: "https://storage-1.example.com/write"
    queue_config:
      capacity: 10000
      max_shards: 50

  # Secondary (same data for HA)
  - url: "https://storage-2.example.com/write"
    queue_config:
      capacity: 10000
      max_shards: 50

  # Analytics (filtered data)
  - url: "https://analytics.example.com/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '(business_.*|user_.*)'
        action: keep

5. Monitor Queue Health

Dashboard queries:

# Queue fullness by endpoint (pending vs. shards × per-shard capacity)
prometheus_remote_storage_samples_pending
/
(prometheus_remote_storage_shards * prometheus_remote_storage_shard_capacity)

# Send rate by endpoint
sum by (url) (rate(prometheus_remote_storage_samples_total[5m]))

# Shard count per endpoint
prometheus_remote_storage_shards

# Latency percentiles
histogram_quantile(0.50, sum by (le, url) (
  rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[5m])
))

6. Handle Backpressure

If remote write can’t keep up:

Option 1: Increase resources

queue_config:
  capacity: 50000      # Increase buffer
  max_shards: 500      # More parallelism

Option 2: Reduce data

write_relabel_configs:
  - source_labels: [__name__]
    regex: 'unnecessary_.*'
    action: drop

Option 3: Downsample at source

scrape_configs:
  - job_name: 'low-priority'
    scrape_interval: 60s  # Scrape less frequently

7. Use Remote Write for Specific Use Cases

✅ Good use cases:

  • Long-term storage (>30 days)
  • Cross-datacenter replication
  • Compliance/audit logs
  • Integration with commercial platforms
  • Multi-tenant data isolation

❌ Avoid for:

  • Real-time querying (use local storage)
  • High-frequency updates (sub-second)
  • Temporary dev/test environments

Troubleshooting

Remote Write Queue Growing

Symptoms:

prometheus_remote_storage_samples_pending > 5000

Causes:

  1. Remote endpoint slow/down
  2. Too few shards
  3. Network issues
  4. Insufficient queue capacity

Solutions:

queue_config:
  max_shards: 200        # Increase parallelism
  capacity: 50000        # Increase buffer
  max_samples_per_send: 10000  # Larger batches

Samples Being Dropped

Symptoms:

rate(prometheus_remote_storage_samples_dropped_total[5m]) > 0

Causes:

  • Queue full
  • Can’t keep up with scrape rate

Solutions:

  1. Increase queue capacity
  2. Filter metrics (write_relabel_configs)
  3. Reduce scrape frequency
  4. Add more remote write endpoints

High Error Rate

Symptoms:

rate(prometheus_remote_storage_failed_samples_total[5m]) > 100

Causes:

  1. Authentication failures
  2. Remote endpoint errors (5xx)
  3. Network connectivity
  4. Invalid data format

Debug:

# Check Prometheus logs
tail -f /var/log/prometheus/prometheus.log | grep "remote_write"

# Test endpoint manually (sample.pb must be a snappy-compressed WriteRequest)
curl -X POST https://remote-storage.example.com/write \
  -H "Content-Type: application/x-protobuf" \
  -H "Content-Encoding: snappy" \
  -H "X-Prometheus-Remote-Write-Version: 0.1.0" \
  --data-binary @sample.pb
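
A connectivity/auth smoke test can also be scripted. This sketch (hypothetical URL and token) sends an empty body, which a real receiver will reject — but any HTTP response at all proves network, TLS, and routing work, while 401/403 points at credentials:

```python
import urllib.request
import urllib.error

def build_probe_request(url, bearer_token=None):
    """Build the same kind of POST a remote write sender issues,
    with an empty body (enough to exercise connectivity and auth)."""
    req = urllib.request.Request(url, data=b"", method="POST")
    req.add_header("Content-Type", "application/x-protobuf")
    req.add_header("Content-Encoding", "snappy")
    req.add_header("X-Prometheus-Remote-Write-Version", "0.1.0")
    if bearer_token:
        req.add_header("Authorization", f"Bearer {bearer_token}")
    return req

def probe(url, bearer_token=None, timeout=10):
    """Return the HTTP status the endpoint answers with.
    400 for the invalid empty body still proves reachability;
    401/403 indicates an authentication problem instead."""
    try:
        with urllib.request.urlopen(build_probe_request(url, bearer_token),
                                    timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # reachable; the request itself was rejected

# e.g. probe("https://remote-storage.example.com/write")  # 400 → reachable, 401 → auth issue
```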

Slow Remote Write

Symptoms:

histogram_quantile(0.99,
  rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[5m])
) > 10

Causes:

  1. Network latency
  2. Remote endpoint overloaded
  3. Too few shards
  4. Large batches

Solutions:

queue_config:
  max_shards: 100
  max_samples_per_send: 1000  # Smaller batches
  batch_send_deadline: 5s

remote_timeout: 30s  # Increase timeout

Security Considerations

1. Use TLS:

remote_write:
  - url: "https://secure-storage.example.com/write"
    tls_config:
      ca_file: /etc/prometheus/ca.pem

2. Authenticate:

remote_write:
  - url: "https://storage.example.com/write"
    bearer_token_file: /etc/prometheus/token  # Don't embed secrets

3. Network policies:

  • Restrict Prometheus → remote write endpoint traffic
  • Use VPN/private networks for cross-datacenter
  • Enable firewall rules

4. Audit logging:

  • Monitor failed authentication attempts
  • Track unusual traffic patterns
  • Alert on configuration changes

5. Least privilege:

  • Use separate credentials per Prometheus instance
  • Grant only write permissions (not read/admin)
  • Rotate credentials regularly

Remote Write vs Federation

Aspect        Remote Write                    Federation
─────────────────────────────────────────────────────────────────────
Direction     Push (Prometheus → storage)     Pull (global Prom ← local Prom)
Latency       Near real-time (seconds)        Periodic (scrape interval)
Storage       Remote system                   Local TSDB
Use case      Long-term storage, HA           Hierarchical aggregation
Data volume   All samples                     Typically aggregates only
Complexity    Simple config                   Requires recording rules
Network       Outbound HTTP                   Inbound scrape

When to use Remote Write:

  • Need long-term storage (>90 days)
  • Want managed/cloud storage
  • Require high availability
  • Multiple destinations

When to use Federation:

  • Building hierarchies
  • Need pull-based model
  • Want to aggregate before sending
  • Firewall restrictions
