Grafana Alloy
Grafana Alloy is an OpenTelemetry Collector distribution by Grafana Labs that acts as the single collection point for all observability signals in our setup. It replaces the need for separate Prometheus scrapers, Promtail for logs, and standalone OTel Collectors.
Role in the Stack
| Signal | Collection Method | Runs on | Destination |
|---|---|---|---|
| Metrics | ServiceMonitor scraping (30s interval) + pod annotation scraping | gateway (Deployment) | Prometheus, Mimir |
| Traces | OTLP receiver (gRPC :4317, HTTP :4318) | gateway (Deployment) | Tempo |
| Logs (OTLP) | OTLP receiver | gateway (Deployment) | Loki |
| Logs (pod tail) | Kubernetes pod log tailing (/var/log/pods/) | collector (DaemonSet) | Loki |
| Profiles (eBPF) | eBPF kernel sampling (97 Hz) | collector (DaemonSet) | Pyroscope |
| Profiles (SDK) | Pyroscope SDK scraping | gateway (Deployment) | Pyroscope |
Key decision: Prometheus is configured as a receiver only (no scraping). All metric collection flows through Alloy.
Versions
| Component | Version |
|---|---|
| Chart | grafana/alloy 1.8.1 |
| App | Alloy v1.16.1 |
Deployment
Alloy is split into two Helm releases so node-local concerns and cluster-scoped concerns can scale independently:
| Release | Controller | Scaling | Owns |
|---|---|---|---|
| alloy | Deployment | HPA (2–6 replicas, CPU 70%) + clustering | OTLP receiver (4317/4318), ServiceMonitor scraping, pod-annotation scraping, pull-based pyroscope, all fan-out to backends |
| alloy-collector | DaemonSet (1 pod / node) | implicit (one per node) | eBPF CPU profiling, pod-log tailing from /var/log/pods |
Why split? eBPF needs hostPID, BPF capabilities, and host filesystem mounts — every node needs a pod. OTLP receiving and metric scraping are CPU-bound and benefit from horizontal scaling; pinning them to one pod per node wastes resources on idle nodes and bottlenecks on busy ones.
Gateway (alloy) — Deployment + Clustering + HPA
- Controller: Deployment, replicas: 2 initially
- HPA: minReplicas: 2, maxReplicas: 6, target CPU 70% / memory 80%
- Clustering: enabled. The chart auto-creates a headless service, alloy-cluster.monitoring.svc.cluster.local. Replicas form a hash ring via DNS discovery and shard scrape targets so each target is owned by exactly one replica.
- Components opted into clustering: prometheus.operator.servicemonitors, prometheus.scrape "annotated_pods", pyroscope.scrape "cpu". The OTLP receiver does not shard at the application layer; the K8s alloy Service round-robins new gRPC/HTTP connections across replicas.
- Security: non-privileged, drops all capabilities. Runs as UID 473.
- Resources: 1Gi–4Gi memory, 500m–4000m CPU.
- UI: port 12345, exposed via ingress at alloy.<your-domain>.
- Service name preserved: still alloy.monitoring.svc.cluster.local:4317/4318, so producers (otel-demo, telemetrygen, etc.) need no changes.
Collector (alloy-collector) — DaemonSet
- Controller: DaemonSet, hostPID: true
- Privileged: Yes, required for eBPF
- Capabilities: SYS_ADMIN, SYS_PTRACE, SYS_RESOURCE, PERFMON, BPF
- Resources: 512Mi–2Gi memory, 200m–2000m CPU
- Init container: sets perf_event_paranoid=-1
- No clustering: each pod is already scoped to its node via the discovery.kubernetes field selector spec.nodeName=$HOSTNAME
- No OTLP receiver, no Prometheus scraping; those live on the gateway
Metrics Collection
Alloy discovers metrics targets through two mechanisms:
1. ServiceMonitor scraping — discovers all ServiceMonitors cluster-wide, resolving endpoints and scraping at 30s intervals. This is the primary mechanism.
2. Pod annotation scraping — fallback for pods without ServiceMonitors. Pods with prometheus.io/scrape: "true" are automatically scraped. Monitoring stack pods (prometheus, alloy, mimir, loki, tempo, pyroscope) are excluded to avoid duplication.
Both paths support native histograms (protobuf scraping).
Remote write targets:
- Prometheus (short-term): prometheus-and-grafana-kub-prometheus:9090
- Mimir (long-term): mimir-gateway:80
Trace Collection
- Receives OTLP traces via gRPC (:4317) and HTTP (:4318)
- Adds the k8s.namespace.name attribute to all spans for Kubernetes context
- Forwards to Tempo via OTLP gRPC
Log Collection
Two parallel log streams:
1. Kubernetes pod logs — tails /var/log/pods/ on each node, parses CRI format, maps container labels to Loki labels.
2. OTLP logs — receives structured logs from applications via OTLP, enriches with Kubernetes metadata (namespace, pod, container), maps OpenTelemetry severity to Loki’s detected_level.
Both streams are sent to Loki via native Loki protocol.
Profiling
eBPF profiling (all processes, no instrumentation needed):
- Sample rate: 97 Hz
- Collects both kernel and user-space stacks
- Python-specific profiling enabled
- Covers every process on the node — including services that have no SDK instrumentation
Pyroscope SDK scraping (richer data for instrumented services):
- Discovers pods with annotation profiles.grafana.com/cpu_scrape: "true"
- Scrapes CPU, memory, mutex, block, and goroutine profiles
- Provides language-specific profile types (JFR for Java, pprof for Go, etc.)
Volume Mounts
| Mount | Purpose |
|---|---|
| /sys/fs/bpf | BPF filesystem for pinned maps and programs |
| /sys/kernel/debug | Debugfs for kprobes/uprobes |
| /sys/kernel/btf | BTF type information for CO-RE |
| /var/log/pods | Kubernetes pod logs |
| /run/containerd | Container runtime socket for PID-to-pod mapping |
Integration Points
Applications ──OTLP──→ Alloy ──→ Tempo (traces)
──→ Loki (logs)
──→ Prometheus → Mimir (metrics)
──→ Pyroscope (profiles)
K8s components ──scrape──→ Alloy ──→ Prometheus → Mimir
All processes ──eBPF──→ Alloy ──→ Pyroscope
The collector DaemonSet is the only component that needs to run privileged — the gateway Deployment, and all backends, run as regular pods.
Clustering
Clustering is how multiple gateway replicas cooperate instead of duplicating each other’s work. Without it, three Alloy replicas all discovering the same 200 ServiceMonitors would each scrape all 200 targets — triple the load on the scraped services, triple the metric volume sent to Prometheus, and three times the samples-per-second counted by Mimir. With clustering enabled, the three replicas form a single logical collector: each target is owned by exactly one replica, and the 200 targets split roughly 67/67/66.
The collector DaemonSet does not use clustering — each DaemonSet pod is already implicitly sharded to its own node by a Kubernetes field selector (spec.nodeName=$HOSTNAME). Only the gateway Deployment runs clustered.
Peer Discovery
Alloy replicas find each other through a headless Kubernetes Service that the Helm chart creates automatically when alloy.clustering.enabled: true:
alloy-cluster.monitoring.svc.cluster.local
A headless Service returns the pod IPs of all replicas when you resolve its DNS name, instead of a single ClusterIP. Alloy uses this to bootstrap its peer list, then the HashiCorp memberlist library takes over: replicas gossip on port 7946 (TCP + UDP) to maintain a live view of who’s in the cluster, detect failures, and agree on membership.
| Service | Purpose | Port(s) |
|---|---|---|
| alloy (ClusterIP) | OTLP ingress for producers; UI ingress | 4317, 4318, 12345 |
| alloy-cluster (headless) | Peer discovery + gossip | 12345, 4317, 4318 (not used for cluster); gossip goes over 7946 via pod IPs |
Sharding — Consistent Hashing
Once the peer list is stable, Alloy builds a hash ring over the active replicas. For every scrape target, it computes a hash of the target’s identity (labels) and maps it to a point on the ring — the nearest replica clockwise owns that target. This is the same consistent-hashing approach used by Cortex / Mimir / Loki ingesters.
Properties:
- Exactly-once ownership: under a stable membership, each target is scraped by exactly one replica.
- Minimal disruption on scale events: adding or removing a replica only redistributes targets near the changed ring position; most targets stay where they were. With N targets and a scale from k to k+1 replicas, roughly N/(k+1) targets change owners.
- Deterministic: every replica independently computes the same ring, so they all agree on ownership without a coordinator.
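The properties above can be illustrated with a toy hash ring. This is a minimal sketch of generic consistent hashing, not Alloy's actual implementation: the SHA-256 hash, the 128 virtual nodes per replica, the replica names, and the target addresses are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def ring_point(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring; virtual nodes smooth out each replica's share."""

    def __init__(self, replicas, vnodes=128):
        self._ring = sorted(
            (ring_point(f"{replica}#{i}"), replica)
            for replica in replicas
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, target: str) -> str:
        # First ring position clockwise from the target's hash, wrapping around.
        idx = bisect.bisect_right(self._points, ring_point(target)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical scrape targets; real identities would be derived from labels.
targets = [f"10.0.{i // 250}.{i % 250}:8080" for i in range(200)]
before = HashRing(["alloy-0", "alloy-1", "alloy-2"])
after = HashRing(["alloy-0", "alloy-1", "alloy-2", "alloy-3"])

moved = [t for t in targets if before.owner(t) != after.owner(t)]
print(f"{len(moved)} of {len(targets)} targets changed owner")
print("all moved targets went to the new replica:",
      all(after.owner(t) == "alloy-3" for t in moved))
```

Because the new ring is a superset of the old one, a target can only change owner by landing on one of the new replica's positions, so every moved target ends up on alloy-3 and the rest stay put. The moved count should come out near N/(k+1) = 200/4, though the exact number depends on the hash.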
Opting Components into Clustering
Clustering is opt-in per component. In our alloy-gateway.values.yaml, these blocks add clustering { enabled = true }:
| Component | Why it’s clustered |
|---|---|
| prometheus.operator.servicemonitors "default" | Fan out ServiceMonitor targets (the majority of scrape load) |
| prometheus.scrape "annotated_pods" | Fan out pod-annotation targets |
| prometheus.scrape "native_histograms" | One static target: harmless to cluster, but won’t split further |
| pyroscope.scrape "cpu" | Fan out SDK-based profile scraping |
The OTLP receiver (otelcol.receiver.otlp "default") is not application-clustered. Incoming OTLP traffic is load-balanced by the regular alloy Kubernetes Service, which round-robins new gRPC/HTTP connections across replicas. This works for OTLP because each request is independent; there’s no shared-state decision like “who scrapes this target.”
Behavior During Scale Events
Scale-up (HPA adds a replica):
- New pod starts, joins alloy-cluster via DNS.
- Memberlist gossip propagates the new peer to existing replicas (~1–2 s).
- Every replica recomputes the hash ring. Some targets transfer to the new replica on their next scrape tick.
- A brief window exists (~one scrape interval) where a handful of targets might be scraped twice or zero times during the cutover. Metric samples dedupe cleanly at the Prometheus remote-write layer; pull-based profiles tolerate brief gaps.
Scale-down (HPA removes a replica):
- Pod gets SIGTERM, marks itself as leaving the cluster.
- Peers update their ring and pick up the orphaned targets within gossip latency (~2 s).
- Kubernetes waits for the pod’s terminationGracePeriodSeconds (default 30 s in the chart) so in-flight OTLP requests can drain.
Ungraceful replica loss (OOMKill, node failure):
- Memberlist’s failure detector marks the peer dead after a few missed pings (default ~10 s).
- Targets redistribute. Data loss is limited to the in-flight buffer on the dead replica (for Prometheus remote_write, that’s the WAL segments that hadn’t been shipped yet — recoverable on pod restart; for OTLP, anything not already accepted is lost at the sender retry level).
Inspecting and Debugging
The live cluster ring is exposed at the Alloy UI endpoint:
kubectl port-forward -n monitoring svc/alloy 12345:12345
curl http://localhost:12345/api/v0/web/cluster | jq
Returns each peer’s name, advertise address, gossip state, and observed round-trip time.
Useful Alloy metrics to watch:
| Metric | What it tells you |
|---|---|
| cluster_node_peers | Current peer count (should match HPA’s current replica count) |
| cluster_node_info | Per-node membership state, emits once per peer |
| cluster_transport_tx_packets_total / rx_packets_total | Gossip traffic volume; sudden drops indicate split-brain |
| cluster_transport_stream_tx_bytes_total | Larger messages (full state syncs); spikes on joins/leaves |
| prometheus_remote_storage_samples_total | Per-replica sample throughput; should be roughly equal across replicas under steady load |
A quick health check: kubectl get hpa alloy -n monitoring should show all replicas (REPLICAS column) and the per-peer metrics above should show the same count. Mismatch indicates a pod that joined K8s but not the cluster — usually a network-policy issue blocking port 7946.
Failure Modes to Know About
- Split-brain via network partition: if gossip packets can’t cross between a subset of pods, both halves of the partition will treat themselves as authoritative and re-scrape targets. Mitigation: same-namespace, same-cluster traffic should never be partitioned in practice; we don’t use NetworkPolicies restricting port 7946.
- Clock skew: memberlist tolerates small skew but heavy skew (>30 s between nodes) can cause flapping peer status. AKS node NTP keeps this a non-issue.
- Gossip port blocked: if anything drops 7946, peers see each other as down. The Alloy UI /api/v0/web/cluster endpoint on each pod will disagree about membership, which is the smoking gun.
Scaling Behavior
Gateway (HPA) — the Deployment scales 2 → 6 replicas based on CPU 70% / memory 80%. Each new replica joins the cluster and the hash ring redistributes scrape targets within seconds. Scrape work and OTLP CPU load are split across all replicas, so 3× the load needs roughly 3× the replicas, not 3× the pod size.
Collector (DaemonSet) — automatically scales with the number of nodes. There is no HPA because eBPF profiling and pod-log tailing are inherently per-node: each pod handles only its own node’s processes and /var/log/pods directory. Adding nodes adds collectors. Removing nodes removes them. To handle more per-node load (very chatty containers), increase the collector’s resource limits rather than its replica count.
Inspect the cluster ring: curl http://<alloy-pod>:12345/api/v0/web/cluster shows the live peer list and which targets each replica owns.
Performance Test Results — Log Ingestion
We ran a scaling test to measure Alloy’s behavior under increasing log ingestion load. The test uses telemetrygen to generate OTLP logs at controlled rates, sending them to Alloy’s gRPC receiver (:4317). Each rate was sustained for 2 minutes with a 60-second cooldown between rounds.
Test environment: AKS cluster, Alloy as DaemonSet (256Mi–1Gi memory, 100m–1000m CPU), Loki as log backend.
Scaling Results
| Target Rate | Throughput | Received/s | Sent/s | Loss | Avg CPU | Max CPU | Avg Mem (MiB) | Queue% |
|---|---|---|---|---|---|---|---|---|
| 500 logs/s | 0.08 MB/s | 259 | 282 | 0% | 0.114 | 0.111 | 1,729 | 0% |
| 2,000 logs/s | 0.31 MB/s | 942 | 1,015 | 0% | 0.256 | 0.248 | 1,707 | 0% |
| 5,000 logs/s | 0.78 MB/s | 1,940 | 2,008 | 0% | 0.576 | 0.536 | 1,710 | 0% |
| 10,000 logs/s | 1.56 MB/s | 5,415 | 5,826 | 0% | 1.079 | 1.079 | 3,187 | 69.4% |
Note: “Loss” column shows 0% refused logs at all rates — no data was dropped. The negative loss% in raw data indicates Alloy was still draining buffered logs when measurements were taken.
Key Findings
CPU scales linearly: CPU usage grows proportionally with log rate — from ~0.11 cores at 500 logs/s to ~1.08 cores at 10,000 logs/s.
Memory stays flat until saturation: Memory holds steady around 1,710 MiB for rates up to 5,000 logs/s, then jumps to 3,187 MiB (77.8% of limit) at 10,000 logs/s due to queue backpressure.
Queue backpressure at 10K logs/s: The exporter queue hit 69.4% capacity (694/1000) at 10,000 logs/s, indicating Loki’s write path was becoming the bottleneck — not Alloy itself.
Zero data loss: No refused or failed log records at any rate. The OTEL receiver accepted everything; the exporter successfully delivered everything to Loki.
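The linear-CPU finding above can be sanity-checked with a least-squares fit over the four Avg CPU values from the Scaling Results table. This is only a rough sketch: four points is a small sample, and the fit says nothing about behavior beyond 10,000 logs/s.

```python
# (rate in thousands of logs/s, average CPU cores) from the Scaling Results table
points = [(0.5, 0.114), (2.0, 0.256), (5.0, 0.576), (10.0, 1.079)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Ordinary least squares for y = slope * x + intercept
slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sum(
    (x - mean_x) ** 2 for x, _ in points
)
intercept = mean_y - slope * mean_x

print(f"~{slope:.3f} cores per 1,000 logs/s, ~{intercept:.3f} cores baseline")
```

The fit comes out at roughly 0.10 cores per 1,000 logs/s plus about 0.06 cores of fixed overhead, which is why the per-log cost looks higher at 500 logs/s than at 10,000 logs/s.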
Peak Values (from time-series samples during the test)
| Metric | Peak Value |
|---|---|
| Alloy Max CPU (cores) | 1.42 |
| Alloy Max Memory (MiB) | 3,244 |
| OTEL Receiver Accepted logs/s | 9,912 |
| OTEL Processor In/s | 25,649 |
| Batch Bytes/s | 4.5 MB/s |
| Exporter Queue Size | 848 / 1,000 |
| Loki Distributor Lines Recv/s | 20,306 |
| Loki Distributor Bytes Recv/s | 6.4 MB/s |
| Loki Write Dropped Entries/s | 0 |
Recommendations
- Up to 5,000 logs/s: Default resource limits (1Gi memory, 1000m CPU) are sufficient. Queue stays empty, memory is stable.
- 10,000+ logs/s: Increase Alloy memory limit beyond 2Gi and consider increasing exporter queue capacity (queue_size > 1000). Monitor otelcol_exporter_queue_size for backpressure.
- Loki is the bottleneck: At high rates, Loki ingestion speed limits overall throughput. Scale Loki ingesters before increasing Alloy resources.
The full test workflow is available at .github/workflows/performance-test.yml. Test artifacts (CSVs with per-sample time series) are attached to each run.
Post-Split Results — 2026-04-13
Re-ran the same scaling test against the new gateway Deployment (2 replicas, HPA on) to compare. Rates: 5 k → 15 k → 30 k → 50 k logs/s, 3 min per round.
Gateway Scaling Results (time-series peaks)
Throughput measured at the Loki distributor (loki_distributor_bytes_received_total) — actual wire traffic leaving Alloy, including the resource attributes / k8s labels Alloy adds to each record.
| Target Rate | Throughput | Peak Accepted/s (gateway) | Peak Loki Dist Lines/s | Max CPU (total cores) | Max Memory (MiB) | Dropped | Queue% |
|---|---|---|---|---|---|---|---|
| 5,000 | 7.63 MB/s | 17,000 | 22,300 | 0.49 | 967 | 0 | 0% |
| 15,000 | 16.56 MB/s | 38,300 | 49,300 | 0.80 | 955 | 0 | 0% |
| 30,000 | 29.04 MB/s | 55,500 | 87,500 | 0.85 | 950 | 0 | 0% |
| 50,000 | 36.13 MB/s | 80,700 | 106,000 | 0.94 | 968 | 0 | 0% |
Note: “Peak Accepted/s” exceeds target rate because telemetrygen fans out to multiple sender replicas (each capped at 5 k/s), and Alloy’s 1 min rate averaging captures the composite input from overlapping senders. The true per-rate throughput is the Loki Dist Lines/s column.
Key Takeaways vs the Old DaemonSet
- Throughput: old DaemonSet saturated at ~10 k logs/s with a 69.4 % exporter queue. The new gateway handled 5× that (50 k logs/s target, ~80 k actual accepted) with zero queue backpressure and zero drops.
- CPU per pod: distributed roughly evenly across 2 replicas — 0.94 cores total at peak load ≈ 0.47 cores per pod, vs 1.08 cores on a single DaemonSet pod at 10 k.
- Memory: held at ~965 MiB per pod (vs 3.2 GiB on the saturated DaemonSet). No queue backpressure means no buffer growth.
- Loki is the bottleneck that matters now: Loki distributor happily accepted 106 k lines/s in this test. Chunk-flush pressure on the ingester would be the next thing to watch if load increased.
HPA Did Not Scale Up in This Test
The HPA stayed at minReplicas: 2 the entire run because:
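- Peak CPU per pod was ~0.47 cores. The HPA target is 70 % of the CPU request (500m), i.e. 350 m per pod, so peak load was over the threshold. However, the HPA acts on utilization averaged across pods as reported by the metrics pipeline, not on the instantaneous peak.
- Each test round was only 3 min, with a cooldown between rounds, so CPU pressure likely came and went before the HPA saw a sustained breach and a new pod became ready. Note that --horizontal-pod-autoscaler-downscale-stabilization (default 5 minutes) governs scale-down only; scale-up has no stabilization window by default unless behavior.scaleUp.stabilizationWindowSeconds is set on the HPA.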
To actually exercise HPA scaling, use longer durations:
gh workflow run "Performance Test - Log Ingestion" --ref main \
-f rates="30000 50000" \
-f duration="10m"
Sustained 10 min at 50 k should push replica count to 3–4.
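For a rough idea of what the HPA should ask for once it sees sustained load, the core formula from the Kubernetes documentation is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the min/max bounds. The sketch below applies only that formula; it ignores pod readiness, missing metrics, and scaling policies, so treat it as an approximation.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas, max_replicas, tolerance=0.10):
    """Replica count the core HPA formula would request.

    desired = ceil(current_replicas * current / target), unchanged when the
    ratio is within the tolerance band, then clamped to [min, max].
    """
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 0.47 cores on a 500m request is 94% utilization against the 70% target.
print(desired_replicas(2, 94, 70, min_replicas=2, max_replicas=6))
# A much hotter average reading, e.g. 212%, hits the maxReplicas clamp.
print(desired_replicas(2, 212, 70, min_replicas=2, max_replicas=6))
```

With these inputs the first call returns 3 and the second returns 6 (7 before clamping), which is consistent with the expectation of 3–4 replicas under sustained 50 k load.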
Post-Split Results — 2026-04-13 (Round 2, Higher Load)
Ran the test again at 50 k → 100 k → 150 k logs/s with 6 min per round (twice as long as round 1) specifically to find the throughput ceiling and to see HPA actually move. Gateway HPA was already at REPLICAS=6 (maxReplicas) at the start because earlier runs had scaled it up and cooldown hadn’t fully released.
Results (time-series peaks)
Throughput column is measured at the Loki distributor (loki_distributor_bytes_received_total), which is the actual wire traffic leaving Alloy. It’s higher than raw payload × rate because Alloy adds resource attributes / k8s labels before forwarding.
| Target Rate | Throughput | Peak Accepted/s (gateway) | Peak Loki Dist Lines/s | Max CPU per pod | Loki Flush Queue | Dropped |
|---|---|---|---|---|---|---|
| 50,000 | 31.97 MB/s | 59,913 | 99,120 | 1.27 cores | 1 | 0 |
| 100,000 | 44.64 MB/s | 114,054 | 122,301 | 1.21 cores | 28–29 | 0 |
| 150,000 | 46.19 MB/s | 124,225 | 123,779 | 1.28 cores | 33–37 | 0 |
The Real Ceiling
Throughput plateaus at ~120–125 k logs/s regardless of whether the target is 100 k or 150 k. Both rounds produced essentially the same peak — telemetrygen could push above 100 k, but more load beyond that didn’t translate to more throughput. This is the single-cluster ceiling for the stack as currently sized.
The ceiling sits in two places:
- Gateway per-pod CPU: the hottest replica hits ~1.28 cores during steady load, so 6 replicas × 1.28 ≈ 7.7 cores of work at most. Live kubectl get hpa alloy during the 50 k round showed cpu: 212%/70%, i.e. average per-pod CPU usage of 212 % of the 500 m request (1.06 cores/pod). So the gateway is doing work, just well under its 4000 m limit.
- Loki ingester flush path: loki_ingester_flush_queue_length grew from 0 → 37 and then held at 37; a persistently non-empty flush queue is the warning sign. Flush rate matches input rate (chunks aren’t accumulating unboundedly), but they’re queued longer than they should be. Post-test also showed loki_ingester_chunks_flush_failures_total at 0.48/s, a small but non-zero stream of failed chunk flushes to Azure Blob. Retries succeed, so no entries drop, but P99 Loki request duration climbed to 435 ms (from ~20 ms baseline).
What Didn’t Happen (Importantly)
- Zero dropped entries end-to-end: loki_write_dropped_entries_total = 0, otelcol_receiver_refused = 0, otelcol_exporter_send_failed = 0.
- Exporter queue stayed at 0 on Alloy. The gateway is not backpressured internally; whatever throughput it can achieve, it ships immediately.
- HPA didn’t scale up further, not because it couldn’t, but because it was already at maxReplicas=6 from earlier runs. Raising maxReplicas to 10 would give the gateway more parallelism and push the ceiling higher.
- Alloy memory remained flat at ~965 MiB per pod across all rounds. No buffer growth means no hidden backpressure.
Measurement Note — Don’t Trust the per-Rate CSV Row at High Rates
The scaling-results-*.csv summary row for 100 k shows received_rate=11,340 (11 %!) and loss_pct=69.79. The time-series samples during the same window show Alloy receiving 106 k–114 k/s steadily. The per-rate summary caught a cooldown-adjacent slice of the 1-min rate window, not the active-load period. Always cross-check the per-rate row against the perf-metrics-samples.csv time-series before concluding anything about loss.
A measurement-window bug worth fixing in load_test/run-scaling-test.sh: the summary should average over the middle of each duration window, not the end, to avoid catching the ramp-down.
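One way to implement the "average over the middle" idea is a trimmed window mean. This is a Python sketch only; the real summary logic lives in the shell script, and the function name, the 25 % trim, and the sample values below are illustrative assumptions rather than data from a run.

```python
def mid_window_mean(samples, trim_fraction=0.25):
    """Average the middle of a window, dropping ramp-up and ramp-down samples."""
    if not samples:
        raise ValueError("no samples in window")
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    trim = int(len(samples) * trim_fraction)
    middle = samples[trim:len(samples) - trim]
    return sum(middle) / len(middle)


# Hypothetical received-rate samples (logs/s) for one round at 30 s spacing:
# ramp-up, steady load, then a sample that straddles the cooldown.
window = [40_000, 106_000, 110_000, 114_000, 112_000, 108_000, 109_000, 11_340]

print(mid_window_mean(window))  # steady-state estimate from the middle half
print(window[-1])               # what an end-of-window read would report
```

With this input the trimmed mean is 111,000 logs/s, while the last sample alone would suggest 11,340 and a large apparent loss.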
Takeaways
- The gateway scales well: 6 replicas comfortably handle 120 k logs/s with headroom on CPU (1.28/4.0 cores per pod) and memory (969/4096 MiB).
- Loki (Distributed, 3 ingester replicas) is the actual ceiling in this setup. To go above 120 k we need to scale ingester.replicas (the write-path bottleneck) before touching Alloy. The distributor is stateless and rarely the constraint.
- Degradation is gradual, not catastrophic. Even when chunk-flush to Azure Blob fails occasionally, nothing drops; Loki’s WAL + retry preserves every entry. End-to-end integrity survives the pressure regime.
- The 150 k target was not reached by the sender either: load generation tops out around 125 k in this test harness. To actually push 150 k, raise MAX_RATE_PER_REPLICA in run-scaling-test.sh and add a bigger node pool (~30 sender pods need somewhere to run).
Next Experiments Justified by These Numbers
- Scale Loki ingester to 4 replicas and repeat the 100 k/150 k run. Flush queue should clear; P99 latency should drop back toward ~50 ms.
- Raise gateway maxReplicas to 10, MAX_RATE_PER_REPLICA to 8000, and try rates="150000 200000". Sender-side capacity becomes the next question.
- Enable OTLP gzip compression on telemetrygen to confirm the 75 MB/s network-bound prediction for the 500 k tier.
Next Steps — Pushing the Load Higher
The original log-ingestion baseline (the 500 to 10,000 logs/s runs) was measured on the old single-pod DaemonSet. After the gateway/collector split, the gateway scales horizontally (2 → 6 replicas at CPU 70% / memory 80%) and targets are sharded across replicas. This changes the scaling behavior: instead of growing a single pod vertically, we add replicas and each new one immediately takes its share of the work.
This section describes how to re-baseline with larger loads to find the new saturation points.
Recommended Load Tiers
| Tier | Rate | What you’re exercising |
|---|---|---|
| Smoke | 5,000 logs/s | Baseline — 2 replicas barely notice. Sanity-check the whole pipeline (OTLP → processor → Loki). |
| Per-replica saturation | 15,000 logs/s | Matches the old single-pod saturation point (~1.4 cores). With 2 replicas it splits to ~7.5k each — still comfortable, HPA stays at minReplicas. |
| HPA scale-up trigger | 30,000 logs/s | Each replica needs ~1.5 cores, crossing the 70% CPU target. HPA should add replicas within ~1 minute. Watch the ring redistribute. |
| Full scale-out | 50,000 logs/s | Drives the gateway to maxReplicas=6. Loki distributor/ingester pressure starts becoming visible. |
| Loki bottleneck hunt | 75,000–100,000 logs/s | Gateway has headroom (raise maxReplicas first). Now you’re measuring Loki — distributor CPU, ingester memory, write-path queue depth. |
| Loki ingester scale-out | 150,000 logs/s | Single Loki ingester replica can’t keep up — chunk flush queue grows, loki_ingester_flush_queue_length climbs. Scale loki.ingester.replicas to 3 before this tier. |
| Distributor saturation | 250,000 logs/s | Loki distributor CPU becomes the wall. Scale loki.distributor.replicas to 3+, watch loki_distributor_ingester_appends_total vs loki_distributor_lines_received_total for fan-out efficiency. |
| Network-bound regime | 500,000 logs/s | At ~75 MB/s serialized OTLP, you’re using a noticeable slice of the AKS node’s NIC. Consider topology-aware routing (pods on same nodes as Loki ingesters) and enable OTLP gzip compression on the producer side. |
| Storage-bound regime | 1,000,000 logs/s | Azure Blob egress on the Loki backend becomes visible in loki_azure_blob_request_duration_seconds. Raise loki.ingester.chunk_target_size so fewer/larger blobs are written; consider premium storage SKU. |
| Cluster-scale load | 2,000,000+ logs/s | You’ve outgrown the single-AKS-cluster setup. Add a dedicated node pool for the monitoring namespace (ensures gateway replicas don’t compete with app workloads for CPU/network), shard Loki by tenant, and consider a Mimir-style read/write split for metrics derived from these logs. |
Rate per telemetrygen sender is capped at 5,000 logs/s (see MAX_RATE_PER_REPLICA in load_test/run-scaling-test.sh); the script fans out into multiple sender replicas automatically. At 500 k/s that’s 100 sender replicas, so check the load-test-scaling namespace quota before starting.
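The sender count per tier follows directly from the per-replica cap. A small sketch, assuming the script simply rounds up (the actual logic in run-scaling-test.sh may differ):

```python
import math

MAX_RATE_PER_REPLICA = 5_000  # logs/s per telemetrygen sender, per the cap above


def sender_replicas(target_rate, max_rate_per_replica=MAX_RATE_PER_REPLICA):
    """Number of sender pods needed so no sender exceeds its per-replica cap."""
    return max(1, math.ceil(target_rate / max_rate_per_replica))


for rate in (5_000, 50_000, 150_000, 500_000):
    print(f"{rate:>9,} logs/s -> {sender_replicas(rate)} sender replicas")
```

This gives 1, 10, 30, and 100 senders for the four rates, which is the number to compare against the namespace pod quota.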
Tier-by-tier Prep Checklist
Don’t jump straight to a high tier — each step above assumes the previous tier’s prerequisites are in place. Before running tier N, complete the prep for tiers 1…N−1.
| Tier | Prep required before running |
|---|---|
| 150 k | ingester.replicas: 3; verify replication factor in Loki config matches (RF=3 needs ≥3 ingesters). |
| 250 k | distributor.replicas: 3+; bump distributor.resources.limits.cpu to 4000m. Confirm loki.limits_config.ingestion_rate_mb (and ingestion_burst_size_mb) are high enough — defaults will throttle hard at this rate. |
| 500 k | Enable OTLP gzip compression on producers (OTEL_EXPORTER_OTLP_COMPRESSION=gzip); raise Alloy maxReplicas to ≥8; consider scrape_interval-side batching on producers. Verify AKS node SKU has ≥8 Gbps NIC (Standard_D8_v5 or similar). |
| 1 M | Loki storage: switch Azure Blob SKU from Standard_LRS to Premium_LRS or use tiered storage; raise chunk_target_size to 2 MiB. Watch loki_azure_blob_egress_bytes_total and loki_azure_blob_request_duration_seconds p99. |
| 2 M+ | Dedicate a monitoring node pool with taints/tolerations so Alloy + Loki + Mimir run on hardware separate from applications. Consider multi-region replication if your SLA requires it. |
Running the Test
From the GitHub UI (Actions → “Performance Test - Log Ingestion” → Run workflow):
rates: 5000 15000 30000 50000
duration: 3m
delete_otel_namespace: true
delete_otel_namespace=true removes the demo app so only the scaling traffic hits Alloy — cleaner numbers. Set it to false to measure realistic mixed load.
Or from CLI:
gh workflow run "Performance Test - Log Ingestion" --ref main \
-f rates="5000 15000 30000 50000 75000" \
-f duration="3m"
Each round: 30 s stabilize + duration of sustained traffic + 60 s cooldown. Total runtime ≈ (rounds × (duration + 90 s)) plus ~2 min setup and ~2 min metrics collection.
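To estimate wall-clock time before starting a run, the formula above can be written out as follows. The setup and collection figures are the approximate ~2 min values from the text, so the result is an estimate rather than a guarantee.

```python
def total_runtime_seconds(rounds, duration_s, stabilize_s=30, cooldown_s=60,
                          setup_s=120, collection_s=120):
    """Approximate wall-clock time for one workflow run, in seconds."""
    per_round = stabilize_s + duration_s + cooldown_s
    return rounds * per_round + setup_s + collection_s


# Four rates at 3 minutes each, as in the UI example above.
seconds = total_runtime_seconds(rounds=4, duration_s=180)
print(f"{seconds / 60:.0f} min")
```

Four 3-minute rounds come to about 22 minutes; the same four rounds at 10 minutes each come to about 50.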
What to Watch
Gateway HPA:
kubectl get hpa alloy -n monitoring -w
At 30 k you should see REPLICAS climb from 2 to 3–4. At 50 k, 5–6. If it pegs at maxReplicas while CPU is still at 100 %, raise the max in alloy-gateway.values.yaml and redeploy.
Cluster ring:
kubectl port-forward -n monitoring svc/alloy 12345:12345
curl -s http://localhost:12345/api/v0/web/cluster | jq '.peers | length'
The peer count should match HPA’s current replica count. Mismatch means a pod joined K8s but not the Alloy cluster (port 7946 blocked, or a network policy).
Exporter queue saturation (the first internal warning sign before Loki drops):
max(otelcol_exporter_queue_size) / max(otelcol_exporter_queue_capacity)
Under 0.3 — healthy. 0.3–0.7 — Loki is getting slow. Above 0.8 — sustained backpressure, samples start dropping at the exporter retry limit.
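The thresholds above translate into a small classifier, sketched here for use in a report script. The label strings are made up for illustration, and the text leaves the 0.7–0.8 band unspecified, so treating it as a generic warning is an assumption.

```python
def queue_health(queue_size, queue_capacity):
    """Classify exporter queue fill using the thresholds from the text."""
    if queue_capacity <= 0:
        raise ValueError("queue_capacity must be positive")
    ratio = queue_size / queue_capacity
    if ratio < 0.3:
        return "healthy"
    if ratio <= 0.7:
        return "loki-slowing"
    if ratio <= 0.8:
        # Assumption: the 0.7-0.8 band is not defined in the text.
        return "warning"
    return "sustained-backpressure"


print(queue_health(0, 1000))    # gateway runs: queue stayed empty
print(queue_health(694, 1000))  # old DaemonSet at 10,000 logs/s
print(queue_health(848, 1000))  # old DaemonSet peak queue size
```

These three calls return healthy, loki-slowing, and sustained-backpressure respectively.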
Loki write path — usually the actual bottleneck once the gateway scales out:
sum(rate(loki_distributor_lines_received_total[1m]))
sum(rate(loki_write_dropped_entries_total[1m]))
sum(loki_ingester_flush_queue_length)
Non-zero loki_write_dropped_entries_total means Loki is refusing ingestion, not Alloy failing.
Prometheus remote_write shard health (Alloy → Prometheus path for its own metrics):
rate(prometheus_remote_storage_samples_failed_total[1m])
prometheus_remote_storage_shards
When the Bottleneck Moves to Loki
This is the expected transition after the Alloy split. Order of operations to push further:
- Scale Loki ingesters first: set loki.ingester.replicas higher in the Loki values file. Ingesters are stateful (own a portion of streams), so this is where most throughput bottlenecks sit.
- Scale the Loki distributor if loki_distributor_lines_received_total plateaus while ingesters have headroom.
- Raise Alloy maxReplicas only if gateway CPU stays ≥ 70 % at the current max and Loki has headroom.
- Tune the OTLP batch processor: increase send_batch_size and send_batch_max_size in alloy-gateway.values.yaml to reduce per-request overhead at very high rates.
At this stage the gateway is genuinely horizontally scalable. Don’t grow pod-size; grow replica count and downstream capacity.
Interpreting the CSVs
The workflow uploads two artifacts per run:
- load_test/scaling-results-*.csv: one row per rate tested (throughput, loss %, CPU, memory, queue %)
- /tmp/perf-metrics-samples.csv: per-sample time series (30 s interval) covering Alloy self-metrics, OTLP receiver/processor/exporter, Loki distributor/ingester/write path, and ES doc counts if Elasticsearch is enabled
Peak-value table in the GitHub Actions job summary is derived from the time-series CSV. When a new test completes, replace the “Scaling Results” table above with the latest numbers so the documented baseline stays accurate.