Remote Write
Remote Write is Prometheus’s mechanism for sending collected metrics to external storage systems. This enables long-term storage, cross-datacenter replication, and integration with other monitoring systems.
Why Remote Write?
Prometheus’s local storage limitations:
- Limited retention - typically 15-30 days (disk space constraints)
- Single node - no built-in high availability
- No replication - data loss if server fails
- Vertical scaling only - can’t distribute load across multiple servers
Remote Write benefits:
- Unlimited retention - long-term storage in dedicated systems
- High availability - replicate to multiple endpoints
- Horizontal scaling - distribute metrics across multiple backends
- Cross-datacenter replication - disaster recovery
- Cost optimization - use cheaper storage for historical data
- Integration - connect to various observability platforms
How Remote Write Works
┌─────────────────────────────────────────────────────┐
│ Prometheus Server │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Scraping │───→│ Local │───→│ Remote Write │ │
│ │ │ │ TSDB │ │ Queue │ │
│ └──────────┘ └──────────┘ └──────┬───────┘ │
└─────────────────────────────────────────┼─────────┘
│
│ HTTP POST
↓
┌─────────────────────────────┐
│ Remote Storage System │
│ (Thanos, Mimir, etc.) │
└─────────────────────────────┘
Process:
- Prometheus scrapes metrics and stores locally
- Metrics are queued for remote write
- Queue batches metrics and compresses (Snappy)
- Sends via HTTP POST to remote endpoint(s)
- Remote system acknowledges receipt
- Queue marks data as sent
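The queueing and batching in steps 2–4 can be sketched in miniature. This is an illustrative Python model, not Prometheus's actual Go implementation; all names here are invented:

```python
import time

class RemoteWriteQueue:
    """Toy model of one remote-write shard: flushes a batch when it
    reaches max_samples_per_send or when batch_send_deadline expires."""

    def __init__(self, send_fn, max_samples_per_send=500, batch_send_deadline=5.0):
        self.send_fn = send_fn            # e.g. snappy-compressed protobuf POST
        self.max_samples = max_samples_per_send
        self.deadline = batch_send_deadline
        self.batch = []
        self.batch_started = None

    def append(self, sample):
        if self.batch_started is None:
            self.batch_started = time.monotonic()
        self.batch.append(sample)
        # Flush when the batch is full or the deadline has passed
        if (len(self.batch) >= self.max_samples
                or time.monotonic() - self.batch_started >= self.deadline):
            self.flush()

    def flush(self):
        if self.batch:
            self.send_fn(self.batch)
            self.batch = []
            self.batch_started = None

sent = []
q = RemoteWriteQueue(sent.append, max_samples_per_send=3)
for i in range(7):
    q.append((f"metric_{i}", i))
q.flush()  # flush the partial final batch
# sent now holds batches of sizes 3, 3, 1
```

The real queue additionally shards series across parallel senders and retries failed batches, but the size-or-deadline flush rule is the core of the batching behavior.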
Configuration
Basic Remote Write Setup
# prometheus.yml
# Note: external labels are configured under `global:`, not per
# remote_write entry; they are attached to every series sent upstream.
global:
  external_labels:
    cluster: 'prod-cluster-1'
    region: 'us-east-1'

remote_write:
  - url: "https://remote-storage.example.com/api/v1/write"
    # Optional: write relabeling
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop
Multiple Remote Write Endpoints
remote_write:
  # Primary long-term storage
  - url: "https://thanos.example.com/api/v1/receive"
    queue_config:
      capacity: 10000
      max_shards: 50
      min_shards: 1
      max_samples_per_send: 5000
      batch_send_deadline: 5s
    metadata_config:
      send: true
      send_interval: 1m

  # Secondary analytics platform
  - url: "https://analytics.example.com/write"
    basic_auth:
      username: 'prometheus'
      password_file: /etc/prometheus/password
    # Only send specific metrics
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '(http_requests_total|http_request_duration_seconds_.*)'
        action: keep

  # Third-party monitoring service
  - url: "https://monitoring-service.com/api/prom/push"
    bearer_token_file: /etc/prometheus/bearer_token
    remote_timeout: 30s
    queue_config:
      max_samples_per_send: 1000
      batch_send_deadline: 10s
Configuration Parameters
Connection Settings
remote_write:
  - url: "https://remote-storage.example.com/api/v1/write"
    # Timeout for HTTP requests
    remote_timeout: 30s
    # HTTP headers
    headers:
      X-Custom-Header: "value"
    # Proxy URL
    proxy_url: "http://proxy.example.com:8080"
    # Follow HTTP redirects
    follow_redirects: true
    # Use HTTP/2 (default: true)
    enable_http2: true
Authentication
Basic Auth:
remote_write:
  - url: "https://remote-storage.example.com/write"
    basic_auth:
      username: 'prometheus'
      password: 'secret'
      # OR use password_file instead (only one may be set):
      # password_file: /etc/prometheus/password
Bearer Token:
remote_write:
  - url: "https://remote-storage.example.com/write"
    bearer_token: "your-token-here"
    # OR use bearer_token_file instead (only one may be set):
    # bearer_token_file: /etc/prometheus/bearer_token
OAuth2:
remote_write:
  - url: "https://remote-storage.example.com/write"
    oauth2:
      client_id: "prometheus"
      client_secret: "secret"
      token_url: "https://auth.example.com/oauth/token"
      scopes:
        - "metrics.write"
      endpoint_params:
        audience: "monitoring"
TLS:
remote_write:
  - url: "https://remote-storage.example.com/write"
    tls_config:
      ca_file: /etc/prometheus/ca.pem
      cert_file: /etc/prometheus/client-cert.pem
      key_file: /etc/prometheus/client-key.pem
      insecure_skip_verify: false
      server_name: "remote-storage.example.com"
Sigv4 (AWS):
remote_write:
  - url: "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-xxx/api/v1/remote_write"
    sigv4:
      region: us-east-1
      access_key: "AKIAIOSFODNN7EXAMPLE"
      secret_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      # OR use an AWS profile / assumed role instead of static keys:
      # profile: "default"
      # role_arn: "arn:aws:iam::123456789012:role/PrometheusRole"
Queue Configuration
Critical for performance and reliability:
remote_write:
  - url: "https://remote-storage.example.com/write"
    queue_config:
      # Buffer capacity per shard (number of samples)
      capacity: 10000
      # Maximum number of concurrent shards
      max_shards: 200
      # Minimum number of shards
      min_shards: 1
      # Maximum samples per request
      max_samples_per_send: 5000
      # Time to wait before sending (even if batch not full)
      batch_send_deadline: 5s
      # Initial retry backoff (doubles on each retry)
      min_backoff: 30ms
      # Maximum retry backoff
      max_backoff: 5s
      # Retry on HTTP 429 (Too Many Requests) responses
      retry_on_http_429: true
      # Note: defaults for these fields vary by Prometheus version;
      # check the configuration docs for your release.
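The min_backoff/max_backoff pair defines an exponential retry schedule: a retryable failure is retried after min_backoff, and the delay doubles until it is capped at max_backoff. A sketch of that schedule (illustrative Python, not Prometheus source):

```python
def backoff_schedule(min_backoff=0.03, max_backoff=5.0, retries=10):
    """Exponential backoff: start at min_backoff (seconds), double each
    retry, cap at max_backoff - mirroring min_backoff/max_backoff."""
    delay = min_backoff
    schedule = []
    for _ in range(retries):
        schedule.append(delay)
        delay = min(delay * 2, max_backoff)
    return schedule

# Delays double from 0.03s until capped at 5.0s
print(backoff_schedule(retries=9))
```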
Queue tuning guidelines:
High throughput (millions of samples/sec):
queue_config:
  capacity: 100000
  max_shards: 500
  max_samples_per_send: 10000
  batch_send_deadline: 10s
Low latency (real-time streaming):
queue_config:
  capacity: 5000
  max_shards: 50
  max_samples_per_send: 500
  batch_send_deadline: 1s
Resource-constrained (limited CPU/memory):
queue_config:
  capacity: 1000
  max_shards: 10
  max_samples_per_send: 100
  batch_send_deadline: 10s
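Behind these profiles, Prometheus continuously resizes the shard count to match the inbound sample rate against observed send throughput. A deliberately simplified model of that calculation (the real heuristic also factors in pending samples and smoothing; the function and parameter names here are invented):

```python
def desired_shards(samples_in_per_sec, samples_out_per_sec_per_shard,
                   min_shards=1, max_shards=200):
    """Shards needed so outbound throughput keeps pace with the
    inbound sample rate, clamped to [min_shards, max_shards]."""
    if samples_out_per_sec_per_shard <= 0:
        return max_shards
    needed = samples_in_per_sec / samples_out_per_sec_per_shard
    return max(min_shards, min(max_shards, round(needed)))

# 66,666 samples/sec in, each shard pushing ~5,000 samples/sec:
print(desired_shards(66_666, 5_000))  # 13
```

This is why raising max_shards helps when the queue falls behind: it lifts the ceiling on this calculation.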
Metadata Configuration
remote_write:
  - url: "https://remote-storage.example.com/write"
    metadata_config:
      # Send metric metadata (TYPE, HELP)
      send: true
      # How often to send metadata
      send_interval: 1m
      # Maximum metadata entries per request
      max_samples_per_send: 500
Write Relabeling
Filter metrics before sending:
remote_write:
  - url: "https://remote-storage.example.com/write"
    write_relabel_configs:
      # Drop metrics by name
      - source_labels: [__name__]
        regex: 'go_.*|process_.*'
        action: drop
      # Keep only specific jobs
      - source_labels: [job]
        regex: 'kubernetes-.*|node-exporter'
        action: keep
      # Drop high-cardinality labels
      - regex: 'pod_uid|container_id'
        action: labeldrop
      # Rename a metric
      - source_labels: [__name__]
        regex: 'old_metric_name'
        replacement: 'new_metric_name'
        target_label: __name__
      # Add labels
      - target_label: environment
        replacement: 'production'
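The drop/keep/labeldrop semantics above can be mimicked in a few lines. This is a simplified Python model of relabeling (only those three actions), matching the regex against the joined source-label values; like Prometheus, the regex is fully anchored:

```python
import re

def apply_relabel(labels, configs):
    """Simplified write_relabel_configs supporting drop/keep/labeldrop.
    Returns the (possibly modified) label dict, or None if the series
    is dropped and should not be sent."""
    for cfg in configs:
        sep = cfg.get("separator", ";")
        values = sep.join(labels.get(l, "") for l in cfg.get("source_labels", []))
        regex = re.compile(r"^(?:" + cfg.get("regex", "(.*)") + r")$")  # anchored
        action = cfg.get("action", "replace")
        if action == "drop" and regex.match(values):
            return None
        if action == "keep" and not regex.match(values):
            return None
        if action == "labeldrop":
            labels = {k: v for k, v in labels.items() if not regex.match(k)}
    return labels

series = {"__name__": "go_goroutines", "job": "api"}
print(apply_relabel(series, [
    {"source_labels": ["__name__"], "regex": "go_.*", "action": "drop"},
]))  # None -> this series is filtered out before sending
```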
Remote Write Versions
Remote Write 1.0 (Classic)
Protocol:
- Protobuf encoding (Snappy-compressed)
- HTTP POST to /api/v1/write
- Series sent as repeated timestamp/value pairs
Limitations:
- No out-of-order writes support
- Limited compression efficiency
- No native histogram support (initially)
Remote Write 2.0
Introduced: specification finalized in 2024; experimental support shipped in Prometheus 2.53+
Improvements:
- Better compression - up to 50% reduction in bandwidth
- Out-of-order samples - handle late-arriving data
- Native histograms - full support
- Metadata optimization - deduplicated metadata
- Backward compatible - servers auto-negotiate version
Enable Remote Write 2.0:
remote_write:
  - url: "https://remote-storage.example.com/write"
    # Opt in to the 2.0 message format
    # (the default is the 1.0 message, "prometheus.WriteRequest")
    protobuf_message: "io.prometheus.write.v2.Request"
    # Send native histograms
    send_native_histograms: true
    # Send exemplars
    send_exemplars: true
Monitoring Remote Write
Key Metrics
Queue status:
# Current queue size
prometheus_remote_storage_samples_pending
# Queue capacity utilization
prometheus_remote_storage_samples_pending /
prometheus_remote_storage_queue_capacity
# Shards in use
prometheus_remote_storage_shards
# Dropped samples due to full queue
rate(prometheus_remote_storage_samples_dropped_total[5m])
Throughput:
# Samples sent per second
rate(prometheus_remote_storage_samples_total[5m])
# Samples failed
rate(prometheus_remote_storage_samples_failed_total[5m])
# Samples retried
rate(prometheus_remote_storage_samples_retried_total[5m])
# Bytes sent
rate(prometheus_remote_storage_bytes_total[5m])
Latency:
# Send latency histogram
histogram_quantile(0.99,
rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[5m])
)
# Queue duration (how long samples wait)
histogram_quantile(0.99,
rate(prometheus_remote_storage_queue_duration_seconds_bucket[5m])
)
Success rate:
# Write failure ratio (should stay near 0)
rate(prometheus_remote_storage_samples_failed_total[5m])
/
rate(prometheus_remote_storage_samples_total[5m])
# Note: very old Prometheus releases exposed these as
# prometheus_remote_storage_succeeded_samples_total and
# prometheus_remote_storage_failed_samples_total
Alerts for Remote Write
groups:
  - name: remote_write
    rules:
      - alert: RemoteWriteBehind
        expr: |
          (
            prometheus_remote_storage_samples_pending
            /
            prometheus_remote_storage_queue_capacity
          ) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Remote write queue is nearly full"
          description: "Remote write is struggling to keep up. Consider increasing queue capacity or shards."

      - alert: RemoteWriteDropping
        expr: rate(prometheus_remote_storage_samples_dropped_total[5m]) > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Remote write is dropping samples"
          description: "{{ $value }} samples/sec are being dropped. Queue is full."

      - alert: RemoteWriteFailing
        expr: |
          rate(prometheus_remote_storage_samples_failed_total[5m])
          /
          rate(prometheus_remote_storage_samples_total[5m])
          > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Remote write failure rate is above 1%"

      - alert: RemoteWriteSlow
        expr: |
          histogram_quantile(0.99,
            rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[5m])
          ) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Remote write is slow (p99: {{ $value }}s)"

      - alert: RemoteWriteDown
        expr: up{job="remote-storage"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Remote write endpoint is down"
Best Practices
1. Use Write Relabeling to Reduce Volume
Filter unnecessary metrics:
write_relabel_configs:
  # Drop debug metrics
  - source_labels: [__name__]
    regex: '.*_debug_.*'
    action: drop
  # Drop high-cardinality labels
  - regex: 'user_id|session_id|request_id'
    action: labeldrop
  # Keep only important metrics
  - source_labels: [__name__]
    regex: '(up|.*_total|.*_errors|.*_duration_.*)'
    action: keep
2. Configure External Labels
Add cluster/datacenter context:
global:
  external_labels:
    cluster: 'prod-k8s-1'
    datacenter: 'us-east-1'
    environment: 'production'
Benefits:
- Global query filtering
- Multi-cluster aggregation
- Deduplication in HA setups
3. Tune Queue for Your Workload
Calculate required capacity:
samples_per_second = total_series × (1 / scrape_interval)
required_capacity = samples_per_second × max_acceptable_delay_seconds
Example:
- 1M series, 15s scrape interval = 66,666 samples/sec
- Max 60s delay acceptable = 4M samples capacity
queue_config:
  capacity: 4000000
  max_shards: 200
  max_samples_per_send: 10000
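The capacity arithmetic above, wrapped in a small helper (illustrative; the function name is invented):

```python
def required_queue_capacity(total_series, scrape_interval_s, max_delay_s):
    """samples/sec = series / scrape interval; the queue must buffer
    that rate for the longest delay you are willing to tolerate."""
    samples_per_second = total_series / scrape_interval_s
    return round(samples_per_second * max_delay_s)

# 1M series scraped every 15s, tolerating a 60s backlog:
print(required_queue_capacity(1_000_000, 15, 60))  # 4000000
```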
4. Use Multiple Endpoints for HA
remote_write:
  # Primary
  - url: "https://storage-1.example.com/write"
    queue_config:
      capacity: 10000
      max_shards: 50

  # Secondary (same data for HA)
  - url: "https://storage-2.example.com/write"
    queue_config:
      capacity: 10000
      max_shards: 50

  # Analytics (filtered data)
  - url: "https://analytics.example.com/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '(business_.*|user_.*)'
        action: keep
5. Monitor Queue Health
Dashboard queries:
# Queue fullness by endpoint
prometheus_remote_storage_samples_pending
/ ignoring(remote_name, url) group_left
prometheus_remote_storage_queue_capacity
# Send rate by endpoint
sum by (url) (rate(prometheus_remote_storage_samples_total[5m]))
# Shard count per endpoint
prometheus_remote_storage_shards
# Latency percentiles
histogram_quantile(0.50, sum by (le, url) (
rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[5m])
))
6. Handle Backpressure
If remote write can’t keep up:
Option 1: Increase resources
queue_config:
  capacity: 50000  # Increase buffer
  max_shards: 500  # More parallelism
Option 2: Reduce data
write_relabel_configs:
  - source_labels: [__name__]
    regex: 'unnecessary_.*'
    action: drop
Option 3: Downsample at source
scrape_configs:
  - job_name: 'low-priority'
    scrape_interval: 60s  # Scrape less frequently
7. Use Remote Write for Specific Use Cases
✅ Good use cases:
- Long-term storage (>30 days)
- Cross-datacenter replication
- Compliance/audit logs
- Integration with commercial platforms
- Multi-tenant data isolation
❌ Avoid for:
- Real-time querying (use local storage)
- High-frequency updates (sub-second)
- Temporary dev/test environments
Troubleshooting
Remote Write Queue Growing
Symptoms:
prometheus_remote_storage_samples_pending > 5000
Causes:
- Remote endpoint slow/down
- Too few shards
- Network issues
- Insufficient queue capacity
Solutions:
queue_config:
  max_shards: 200              # Increase parallelism
  capacity: 50000              # Increase buffer
  max_samples_per_send: 10000  # Larger batches
Samples Being Dropped
Symptoms:
rate(prometheus_remote_storage_samples_dropped_total[5m]) > 0
Causes:
- Queue full
- Can’t keep up with scrape rate
Solutions:
- Increase queue capacity
- Filter metrics (write_relabel_configs)
- Reduce scrape frequency
- Add more remote write endpoints
High Error Rate
Symptoms:
rate(prometheus_remote_storage_samples_failed_total[5m]) > 100
Causes:
- Authentication failures
- Remote endpoint errors (5xx)
- Network connectivity
- Invalid data format
Debug:
# Check Prometheus logs
tail -f /var/log/prometheus/prometheus.log | grep "remote_write"
# Test endpoint manually
curl -X POST https://remote-storage.example.com/write \
  -H "Content-Type: application/x-protobuf" \
  -H "Content-Encoding: snappy" \
  --data-binary @sample.pb
Slow Remote Write
Symptoms:
histogram_quantile(0.99,
rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[5m])
) > 10
Causes:
- Network latency
- Remote endpoint overloaded
- Too few shards
- Large batches
Solutions:
remote_timeout: 30s  # Increase timeout (set on the remote_write entry, not in queue_config)
queue_config:
  max_shards: 100
  max_samples_per_send: 1000  # Smaller batches
  batch_send_deadline: 5s
Security Considerations
1. Use TLS:
remote_write:
  - url: "https://secure-storage.example.com/write"
    tls_config:
      ca_file: /etc/prometheus/ca.pem
2. Authenticate:
remote_write:
  - url: "https://storage.example.com/write"
    bearer_token_file: /etc/prometheus/token  # Don't embed secrets
3. Network policies:
- Restrict Prometheus → remote write endpoint traffic
- Use VPN/private networks for cross-datacenter
- Enable firewall rules
4. Audit logging:
- Monitor failed authentication attempts
- Track unusual traffic patterns
- Alert on configuration changes
5. Least privilege:
- Use separate credentials per Prometheus instance
- Grant only write permissions (not read/admin)
- Rotate credentials regularly
Remote Write vs Federation
| Aspect | Remote Write | Federation |
|---|---|---|
| Direction | Push (Prometheus → Storage) | Pull (Global Prom ← Local Prom) |
| Latency | Real-time (seconds) | Periodic (scrape interval) |
| Storage | Remote system | Local TSDB |
| Use case | Long-term storage, HA | Hierarchical aggregation |
| Data volume | All samples | Typically aggregates only |
| Complexity | Simple config | Requires recording rules |
| Network | Outbound HTTP | Inbound scrape |
When to use Remote Write:
- Need long-term storage (>90 days)
- Want managed/cloud storage
- Require high availability
- Multiple destinations
When to use Federation:
- Building hierarchies
- Need pull-based model
- Want to aggregate before sending
- Firewall restrictions