Grafana Alloy

Grafana Alloy is an OpenTelemetry Collector distribution by Grafana Labs that acts as the single collection point for all observability signals in our setup. It replaces the need for separate Prometheus scrapers, Promtail for logs, and standalone OTel Collectors.

Role in the Stack

Signal	Collection Method	Runs on	Destination
Metrics	ServiceMonitor scraping (30s interval) + pod annotation scraping	gateway (Deployment)	Prometheus, Mimir
Traces	OTLP receiver (gRPC :4317, HTTP :4318)	gateway (Deployment)	Tempo
Logs (OTLP)	OTLP receiver	gateway (Deployment)	Loki
Logs (pod tail)	Kubernetes pod log tailing (`/var/log/pods/`)	collector (DaemonSet)	Loki
Profiles (eBPF)	eBPF kernel sampling (97 Hz)	collector (DaemonSet)	Pyroscope
Profiles (SDK)	Pyroscope SDK scraping	gateway (Deployment)	Pyroscope

Key decision: Prometheus is configured as a receiver only (no scraping). All metric collection flows through Alloy.

Versions


Component	Version
Chart	`grafana/alloy` 1.8.1
App	Alloy v1.16.1

Deployment

Alloy is split into two Helm releases so node-local concerns and cluster-scoped concerns can scale independently:

Release	Controller	Scaling	Owns
`alloy`	Deployment	HPA (2–6 replicas, CPU 70%) + clustering	OTLP receiver (4317/4318), ServiceMonitor scraping, pod-annotation scraping, pull-based pyroscope, all fan-out to backends
`alloy-collector`	DaemonSet (1 pod / node)	implicit (one per node)	eBPF CPU profiling, pod-log tailing from `/var/log/pods`

Why split? eBPF needs hostPID, BPF capabilities, and host filesystem mounts — every node needs a pod. OTLP receiving and metric scraping are CPU-bound and benefit from horizontal scaling; pinning them to one pod per node wastes resources on idle nodes and bottlenecks on busy ones.

Gateway (`alloy`) — Deployment + Clustering + HPA

Controller: Deployment, replicas: 2 initially
HPA: minReplicas: 2, maxReplicas: 6, target CPU 70% / memory 80%
Clustering: enabled — chart auto-creates a headless service alloy-cluster.monitoring.svc.cluster.local. Replicas form a hash ring via DNS discovery and shard scrape targets so each target is owned by exactly one replica.
Components opted into clustering: prometheus.operator.servicemonitors, prometheus.scrape "annotated_pods", prometheus.scrape "native_histograms", pyroscope.scrape "cpu". The OTLP receiver does not shard at the application layer; the K8s alloy Service round-robins new gRPC/HTTP connections across replicas.
Security: non-privileged, drops all capabilities. Runs as UID 473.
Resources: 1Gi–4Gi memory, 500m–4000m CPU.
UI: port 12345, exposed via ingress at alloy.<your-domain>.
Service name preserved: still alloy.monitoring.svc.cluster.local:4317/4318, so producers (otel-demo, telemetrygen, etc.) need no changes.

Collector (`alloy-collector`) — DaemonSet

Controller: DaemonSet, hostPID: true
Privileged: Yes — required for eBPF
Capabilities: SYS_ADMIN, SYS_PTRACE, SYS_RESOURCE, PERFMON, BPF
Resources: 512Mi–2Gi memory, 200m–2000m CPU
Init container: sets perf_event_paranoid=-1
No clustering: each pod is already scoped to its node via discovery.kubernetes field selector spec.nodeName=$HOSTNAME
No OTLP receiver, no Prometheus scraping — those live on the gateway

Metrics Collection

Alloy discovers metrics targets through two mechanisms:

1. ServiceMonitor scraping — discovers all ServiceMonitors cluster-wide, resolving endpoints and scraping at 30s intervals. This is the primary mechanism.

2. Pod annotation scraping — fallback for pods without ServiceMonitors. Pods with prometheus.io/scrape: "true" are automatically scraped. Monitoring stack pods (prometheus, alloy, mimir, loki, tempo, pyroscope) are excluded to avoid duplication.

Both paths support native histograms (protobuf scraping).
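
The pod-annotation fallback can be read as a simple selection rule. The sketch below is illustrative only: it is not the Alloy configuration, and matching excluded pods by the `app.kubernetes.io/name` label is an assumption (the real relabel rules may match on a different label or on the pod name).

```python
# Monitoring-stack components excluded from annotation scraping (from the text).
EXCLUDED_APPS = {"prometheus", "alloy", "mimir", "loki", "tempo", "pyroscope"}


def scrape_via_annotation(pod: dict) -> bool:
    """Return True if the pod would be picked up by the annotation path.

    `pod` is a plain dict with optional "annotations" and "labels" keys.
    Assumption: exclusion is keyed on the app.kubernetes.io/name label.
    """
    annotations = pod.get("annotations", {})
    labels = pod.get("labels", {})
    if annotations.get("prometheus.io/scrape") != "true":
        return False
    return labels.get("app.kubernetes.io/name") not in EXCLUDED_APPS


checkout = {
    "annotations": {"prometheus.io/scrape": "true"},
    "labels": {"app.kubernetes.io/name": "checkout"},  # hypothetical app
}
print(scrape_via_annotation(checkout))
```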

Remote write targets:

Prometheus (short-term): prometheus-and-grafana-kub-prometheus:9090
Mimir (long-term): mimir-gateway:80

Trace Collection

Receives OTLP traces via gRPC (:4317) and HTTP (:4318)
Adds k8s.namespace.name attribute to all spans for Kubernetes context
Forwards to Tempo via OTLP gRPC

Log Collection

Two parallel log streams:

1. Kubernetes pod logs — tails /var/log/pods/ on each node, parses CRI format, maps container labels to Loki labels.

2. OTLP logs — receives structured logs from applications via OTLP, enriches with Kubernetes metadata (namespace, pod, container), maps OpenTelemetry severity to Loki’s detected_level.

Both streams are sent to Loki via native Loki protocol.
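
As a rough illustration of the severity mapping on the OTLP path: the OpenTelemetry log data model groups SeverityNumber into six ranges of four (1–4 TRACE through 21–24 FATAL). The level strings returned below are an assumption about what ends up in detected_level; the actual mapping lives in the gateway config and may use different spellings.

```python
# One level name per block of four SeverityNumber values (1-4, 5-8, ...).
LEVELS = ("trace", "debug", "info", "warn", "error", "fatal")


def detected_level(severity_number: int) -> str:
    """Map an OTel SeverityNumber (1..24) to a coarse level string.

    0 means "unspecified" in the OTel data model; anything outside 1..24
    is reported as "unknown" here (an assumption, not Loki's own rule).
    """
    if not 1 <= severity_number <= 24:
        return "unknown"
    return LEVELS[(severity_number - 1) // 4]


for number in (0, 9, 13, 17, 24):
    print(number, detected_level(number))
```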

Profiling

eBPF profiling (all processes, no instrumentation needed):

Sample rate: 97 Hz
Collects both kernel and user-space stacks
Python-specific profiling enabled
Covers every process on the node — including services that have no SDK instrumentation

Pyroscope SDK scraping (richer data for instrumented services):

Discovers pods with annotation profiles.grafana.com/cpu_scrape: "true"
Scrapes CPU, memory, mutex, block, and goroutine profiles
Provides language-specific profile types (JFR for Java, pprof for Go, etc.)

Volume Mounts

Mount	Purpose
`/sys/fs/bpf`	BPF filesystem for pinned maps and programs
`/sys/kernel/debug`	Debugfs for kprobes/uprobes
`/sys/kernel/btf`	BTF type information for CO-RE
`/var/log/pods`	Kubernetes pod logs
`/run/containerd`	Container runtime socket for PID-to-pod mapping

Integration Points

Applications ──OTLP──→ Alloy ──→ Tempo (traces)
                                ──→ Loki (logs)
                                ──→ Prometheus → Mimir (metrics)
                                ──→ Pyroscope (profiles)

K8s components ──scrape──→ Alloy ──→ Prometheus → Mimir

All processes ──eBPF──→ Alloy ──→ Pyroscope

The collector DaemonSet is the only component that needs to run privileged — the gateway Deployment, and all backends, run as regular pods.

Clustering

Clustering is how multiple gateway replicas cooperate instead of duplicating each other’s work. Without it, three Alloy replicas all discovering the same 200 ServiceMonitors would each scrape all 200 targets — triple the load on the scraped services, triple the metric volume sent to Prometheus, and three times the samples-per-second counted by Mimir. With clustering enabled, the three replicas form a single logical collector: each target is owned by exactly one replica, and the 200 targets split roughly 67/67/66.

The collector DaemonSet does not use clustering — each DaemonSet pod is already implicitly sharded to its own node by a Kubernetes field selector (spec.nodeName=$HOSTNAME). Only the gateway Deployment runs clustered.

Peer Discovery

Alloy replicas find each other through a headless Kubernetes Service that the Helm chart creates automatically when alloy.clustering.enabled: true:

alloy-cluster.monitoring.svc.cluster.local

A headless Service returns the pod IPs of all replicas when you resolve its DNS name, instead of a single ClusterIP. Alloy uses this to bootstrap its peer list, then the HashiCorp memberlist library takes over: replicas gossip on port 7946 (TCP + UDP) to maintain a live view of who’s in the cluster, detect failures, and agree on membership.

Service	Purpose	Port(s)
`alloy` (ClusterIP)	OTLP ingress for producers; UI ingress	4317, 4318, 12345
`alloy-cluster` (headless)	Peer discovery + gossip	12345, 4317, 4318 (not used for cluster); gossip goes over 7946 via pod IPs
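
A minimal sketch of the DNS bootstrap step, to show what "resolving a headless Service" returns. This is not Alloy's discovery code; the `resolver` parameter, `fake_resolver`, and the pod IPs are hypothetical and exist only so the example runs without a cluster.

```python
import socket


def resolve_peers(hostname, port=12345, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPs behind a DNS name.

    For a headless Service the answer is one A record per ready pod,
    rather than a single ClusterIP.
    """
    infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def fake_resolver(hostname, port, proto=0):
    """Stand-in for cluster DNS: pretends two gateway pods are ready."""
    pod_ips = ["10.0.0.5", "10.0.0.4", "10.0.0.5"]  # duplicate on purpose
    return [(socket.AF_INET, socket.SOCK_STREAM, proto, "", (ip, port)) for ip in pod_ips]


peers = resolve_peers("alloy-cluster.monitoring.svc.cluster.local", resolver=fake_resolver)
print(peers)
```

Inside the cluster, calling `resolve_peers` with the default resolver from any pod in the namespace should list one IP per ready gateway replica.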

Sharding — Consistent Hashing

Once the peer list is stable, Alloy builds a hash ring over the active replicas. For every scrape target, it computes a hash of the target’s identity (labels) and maps it to a point on the ring — the nearest replica clockwise owns that target. This is the same consistent-hashing approach used by Cortex / Mimir / Loki ingesters.

Properties:

Exactly-once ownership: under a stable membership, each target is scraped by exactly one replica.
Minimal disruption on scale events: adding or removing a replica only redistributes targets near the changed ring position — most targets stay where they were. With N targets and a scale from k to k+1 replicas, roughly N/(k+1) targets change owners.
Deterministic: every replica independently computes the same ring, so they all agree on ownership without a coordinator.
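
The three properties can be demonstrated with a toy hash ring. This is a sketch of the general technique, not Alloy's implementation: the hash function, the 128 virtual nodes per replica, and the target and replica names are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (md5 is used for determinism, not security)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with several virtual nodes per replica."""

    def __init__(self, replicas, tokens_per_replica=128):
        # Sorting makes the ring independent of the order replicas are listed in.
        self._ring = sorted(
            (_hash(f"{replica}#{i}"), replica)
            for replica in replicas
            for i in range(tokens_per_replica)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, target: str) -> str:
        # First token clockwise from the target's hash, wrapping past the end.
        idx = bisect.bisect_right(self._points, _hash(target)) % len(self._ring)
        return self._ring[idx][1]


targets = [f"job-{i}/instance-{i}:9090" for i in range(2000)]
before = HashRing(["alloy-0", "alloy-1", "alloy-2"])
after = HashRing(["alloy-0", "alloy-1", "alloy-2", "alloy-3"])

moved = [t for t in targets if before.owner(t) != after.owner(t)]
print(f"moved {len(moved)} of {len(targets)} targets on scale-up from 3 to 4")
```

Every target that changes owner lands on the new replica, and the moved share should sit near N/(k+1), a quarter of the targets in this 3 → 4 case.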

Opting Components into Clustering

Clustering is opt-in per component. In our alloy-gateway.values.yaml, these blocks add clustering { enabled = true }:

Component	Why it’s clustered
`prometheus.operator.servicemonitors "default"`	Fan out ServiceMonitor targets (the majority of scrape load)
`prometheus.scrape "annotated_pods"`	Fan out pod-annotation targets
`prometheus.scrape "native_histograms"`	One static target — harmless to cluster, but won’t split further
`pyroscope.scrape "cpu"`	Fan out SDK-based profile scraping

The OTLP receiver (otelcol.receiver.otlp "default") is not application-clustered. Incoming OTLP traffic is load-balanced by the regular alloy Kubernetes Service, which round-robins new gRPC/HTTP connections across replicas. This works for OTLP because each request is independent; there’s no shared-state decision like “who scrapes this target.”

Behavior During Scale Events

Scale-up (HPA adds a replica):

New pod starts, joins alloy-cluster via DNS.
Memberlist gossip propagates the new peer to existing replicas (~1–2 s).
Every replica recomputes the hash ring. Some targets transfer to the new replica on their next scrape tick.
A brief window exists (~one scrape interval) where a handful of targets might be scraped twice or zero times during the cutover. Metric samples dedupe cleanly at the Prometheus remote-write layer; pull-based profiles tolerate brief gaps.

Scale-down (HPA removes a replica):

Pod gets SIGTERM, marks itself as leaving the cluster.
Peers update their ring and pick up the orphaned targets within gossip latency (~2 s).
Kubernetes waits for the pod’s terminationGracePeriodSeconds (default 30 s in the chart) so in-flight OTLP requests can drain.

Ungraceful replica loss (OOMKill, node failure):

Memberlist’s failure detector marks the peer dead after a few missed pings (default ~10 s).
Targets redistribute. Data loss is limited to the in-flight buffer on the dead replica: for Prometheus remote_write, that’s the WAL segments that hadn’t been shipped yet (recoverable on pod restart); for OTLP, anything not yet accepted is lost unless the sender retries.

Inspecting and Debugging

The live cluster ring is exposed at the Alloy UI endpoint:

kubectl port-forward -n monitoring svc/alloy 12345:12345
curl http://localhost:12345/api/v0/web/cluster | jq

Returns each peer’s name, advertise address, gossip state, and observed round-trip time.

Useful Alloy metrics to watch:

Metric	What it tells you
`cluster_node_peers`	Current peer count (should match HPA’s current replica count)
`cluster_node_info`	Per-node membership state, emits once per peer
`cluster_transport_tx_packets_total` / `rx_packets_total`	Gossip traffic volume — sudden drops indicate split-brain
`cluster_transport_stream_tx_bytes_total`	Larger messages (full state syncs) — spikes on joins/leaves
`prometheus_remote_storage_samples_total`	Per-replica sample throughput — should be roughly equal across replicas under steady load

A quick health check: kubectl get hpa alloy -n monitoring should show all replicas (REPLICAS column) and the per-peer metrics above should show the same count. Mismatch indicates a pod that joined K8s but not the cluster — usually a network-policy issue blocking port 7946.

Failure Modes to Know About

Split-brain via network partition: if gossip packets can’t cross between a subset of pods, both halves of the partition will treat themselves as authoritative and re-scrape targets. Mitigation: same-namespace, same-cluster traffic should never be partitioned in practice; we don’t use NetworkPolicies restricting port 7946.
Clock skew: memberlist tolerates small skew but heavy skew (>30 s between nodes) can cause flapping peer status. AKS node NTP keeps this a non-issue.
Gossip port blocked: if anything drops 7946, peers see each other as down. The Alloy UI /api/v0/web/cluster endpoint on each pod will disagree about membership — that’s the smoking gun.

Scaling Behavior

Gateway (HPA) — the Deployment scales 2 → 6 replicas based on CPU 70% / memory 80%. Each new replica joins the cluster and the hash ring redistributes scrape targets within seconds. Scrape work and OTLP CPU load are split across all replicas, so 3× the load needs roughly 3× the replicas, not 3× the pod size.

Collector (DaemonSet) — automatically scales with the number of nodes. There is no HPA because eBPF profiling and pod-log tailing are inherently per-node: each pod handles only its own node’s processes and /var/log/pods directory. Adding nodes adds collectors. Removing nodes removes them. To handle more per-node load (very chatty containers), increase the collector’s resource limits rather than its replica count.

Inspect the cluster ring: curl http://<alloy-pod>:12345/api/v0/web/cluster shows the live peer list as seen by that pod (see Inspecting and Debugging above).

Performance Test Results — Log Ingestion

We ran a scaling test to measure Alloy’s behavior under increasing log ingestion load. The test uses telemetrygen to generate OTLP logs at controlled rates, sending them to Alloy’s gRPC receiver (:4317). Each rate was sustained for 2 minutes with a 60-second cooldown between rounds.

Test environment: AKS cluster, Alloy as DaemonSet (256Mi–1Gi memory, 100m–1000m CPU), Loki as log backend.

Scaling Results

Target Rate	Throughput	Received/s	Sent/s	Loss	Avg CPU	Max CPU	Avg Mem (MiB)	Queue%
500 logs/s	0.08 MB/s	259	282	0%	0.114	0.111	1,729	0%
2,000 logs/s	0.31 MB/s	942	1,015	0%	0.256	0.248	1,707	0%
5,000 logs/s	0.78 MB/s	1,940	2,008	0%	0.576	0.536	1,710	0%
10,000 logs/s	1.56 MB/s	5,415	5,826	0%	1.079	1.079	3,187	69.4%

Note: “Loss” column shows 0% refused logs at all rates — no data was dropped. The negative loss% in raw data indicates Alloy was still draining buffered logs when measurements were taken.

Key Findings

CPU scales roughly linearly: CPU usage grows from ~0.11 cores at 500 logs/s to ~1.08 cores at 10,000 logs/s, which is about 0.1 cores per additional 1,000 logs/s on top of a small fixed baseline.

Memory stays flat until saturation: Memory holds steady around 1,710 MiB for rates up to 5,000 logs/s, then jumps to 3,187 MiB (77.8% of limit) at 10,000 logs/s due to queue backpressure.

Queue backpressure at 10K logs/s: The exporter queue hit 69.4% capacity (694/1000) at 10,000 logs/s, indicating Loki’s write path was becoming the bottleneck — not Alloy itself.

Zero data loss: No refused or failed log records at any rate. The OTEL receiver accepted everything; the exporter successfully delivered everything to Loki.
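
To put a number on the CPU trend, the marginal cost between consecutive tiers can be computed from the Avg CPU column of the Scaling Results table. This is plain arithmetic on the published figures; the function name is ours.

```python
# (target rate in logs/s, Avg CPU in cores) copied from the Scaling Results table.
RESULTS = [(500, 0.114), (2_000, 0.256), (5_000, 0.576), (10_000, 1.079)]


def marginal_cores_per_1k(results):
    """Extra cores consumed per extra 1,000 logs/s between adjacent tiers."""
    slopes = []
    for (rate_lo, cpu_lo), (rate_hi, cpu_hi) in zip(results, results[1:]):
        slopes.append((cpu_hi - cpu_lo) / ((rate_hi - rate_lo) / 1000))
    return slopes


slopes = marginal_cores_per_1k(RESULTS)
for (rate, _), slope in zip(RESULTS[1:], slopes):
    print(f"up to {rate:>6} logs/s: {slope:.3f} cores per extra 1k logs/s")
```

A near-constant slope across tiers is what "linear" means here; total CPU is not strictly proportional to rate because the lowest tier already carries some fixed overhead.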

Peak Values (from time-series samples during the test)

Metric	Peak Value
Alloy Max CPU (cores)	1.42
Alloy Max Memory (MiB)	3,244
OTEL Receiver Accepted logs/s	9,912
OTEL Processor In/s	25,649
Batch Bytes/s	4.5 MB/s
Exporter Queue Size	848 / 1,000
Loki Distributor Lines Recv/s	20,306
Loki Distributor Bytes Recv/s	6.4 MB/s
Loki Write Dropped Entries/s	0

Recommendations

Up to 5,000 logs/s: Default resource limits (1Gi memory, 1000m CPU) are sufficient. Queue stays empty, memory is stable.
10,000+ logs/s: Increase Alloy memory limit beyond 2Gi and consider increasing exporter queue capacity (queue_size > 1000). Monitor otelcol_exporter_queue_size for backpressure.
Loki is the bottleneck: At high rates, Loki ingestion speed limits overall throughput. Scale Loki ingesters before increasing Alloy resources.

The full test workflow is available at .github/workflows/performance-test.yml. Test artifacts (CSVs with per-sample time series) are attached to each run.

Post-Split Results — 2026-04-13

Re-ran the same scaling test against the new gateway Deployment (2 replicas, HPA on) to compare. Rates: 5 k → 15 k → 30 k → 50 k logs/s, 3 min per round.

Gateway Scaling Results (time-series peaks)

Throughput measured at the Loki distributor (loki_distributor_bytes_received_total) — actual wire traffic leaving Alloy, including the resource attributes / k8s labels Alloy adds to each record.

Target Rate	Throughput	Peak Accepted/s (gateway)	Peak Loki Dist Lines/s	Max CPU (total cores)	Max Memory (MiB)	Queue%
5,000	7.63 MB/s	17,000	22,300	0.49	967	0%
15,000	16.56 MB/s	38,300	49,300	0.80	955	0%
30,000	29.04 MB/s	55,500	87,500	0.85	950	0%
50,000	36.13 MB/s	80,700	106,000	0.94	968	0%

Note: “Peak Accepted/s” exceeds target rate because telemetrygen fans out to multiple sender replicas (each capped at 5 k/s), and Alloy’s 1 min rate averaging captures the composite input from overlapping senders. The true per-rate throughput is the Loki Dist Lines/s column.

Key Takeaways vs the Old DaemonSet

Throughput: old DaemonSet saturated at ~10 k logs/s with a 69.4 % exporter queue. The new gateway handled 5× that (50 k logs/s target, ~80 k actual accepted) with zero queue backpressure and zero drops.
CPU per pod: distributed roughly evenly across 2 replicas — 0.94 cores total at peak load ≈ 0.47 cores per pod, vs 1.08 cores on a single DaemonSet pod at 10 k.
Memory: held at ~965 MiB per pod (vs 3.2 GiB on the saturated DaemonSet). No queue backpressure means no buffer growth.
Loki is the bottleneck that matters now: Loki distributor happily accepted 106 k lines/s in this test. Chunk-flush pressure on the ingester would be the next thing to watch if load increased.

HPA Did Not Scale Up in This Test

The HPA stayed at minReplicas: 2 the entire run because:

Peak CPU per pod was ~0.47 cores. The HPA target is 70 % of the CPU request (500m), i.e. 350 m per pod. Load was over the threshold at peak, but…
The pressure was short-lived. Each test round was only 3 min, and the HPA acts on averaged pod CPU metrics that lag real usage, so CPU pressure likely came and went before a scale-up was computed. Note that the default 5-minute stabilization window (--horizontal-pod-autoscaler-downscale-stabilization) applies to scale-down only; scale-up has no stabilization window by default.
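
For reference, the replica count the HPA would ask for can be estimated with the formula from the Kubernetes HPA documentation, desired = ceil(current × currentUtilization / targetUtilization), skipped when the ratio is within the default 10 % tolerance. The sketch below applies it to figures quoted in this document; it ignores metric averaging, readiness delays, and any behavior policies, so treat the output as an upper-level estimate only.

```python
import math


def desired_replicas(current_replicas, usage_cores, request_cores,
                     target_utilization, min_replicas, max_replicas,
                     tolerance=0.1):
    """HPA core rule: ceil(current * currentUtilization / targetUtilization)."""
    utilization = usage_cores / request_cores       # e.g. 0.47 / 0.5 = 94 %
    ratio = utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:               # within tolerance: no change
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 2 replicas at ~0.47 cores each, 500m request, 70 % target, HPA range 2-6.
at_peak = desired_replicas(2, 0.47, 0.5, 0.70, 2, 6)
# 6 replicas at 212 % of request (1.06 cores): capped by maxReplicas.
at_max = desired_replicas(6, 1.06, 0.5, 0.70, 2, 6)
print(at_peak, at_max)
```

Had the ~0.47 cores per pod been a sustained average rather than a short peak, the formula would have asked for a third replica.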

To actually exercise HPA scaling, use longer durations:

gh workflow run "Performance Test - Log Ingestion" --ref main \
  -f rates="30000 50000" \
  -f duration="10m"

Sustained 10 min at 50 k should push replica count to 3–4.

Post-Split Results — 2026-04-13 (Round 2, Higher Load)

Ran the test again at 50 k → 100 k → 150 k logs/s with 6 min per round (twice as long as round 1) specifically to find the throughput ceiling and to see HPA actually move. Gateway HPA was already at REPLICAS=6 (maxReplicas) at the start because earlier runs had scaled it up and cooldown hadn’t fully released.

Results (time-series peaks)

Throughput column is measured at the Loki distributor (loki_distributor_bytes_received_total), which is the actual wire traffic leaving Alloy. It’s higher than raw payload × rate because Alloy adds resource attributes / k8s labels before forwarding.

Target Rate	Throughput	Peak Accepted/s (gateway)	Peak Loki Dist Lines/s	Max CPU per pod	Loki Flush Queue
50,000	31.97 MB/s	59,913	99,120	1.27 cores	1
100,000	44.64 MB/s	114,054	122,301	1.21 cores	28–29
150,000	46.19 MB/s	124,225	123,779	1.28 cores	33–37

The Real Ceiling

Throughput plateaus at ~120–125 k logs/s regardless of whether the target is 100 k or 150 k. Both rounds produced essentially the same peak — telemetrygen could push above 100 k, but more load beyond that didn’t translate to more throughput. This is the single-cluster ceiling for the stack as currently sized.

The ceiling sits in two places:

Gateway per-pod CPU: the hottest replica hits ~1.28 cores during steady load, so 6 replicas do at most 6 × 1.28 ≈ 7.7 cores of work. Live kubectl get hpa alloy during the 50 k round showed cpu: 212%/70%, i.e. average per-pod CPU usage of 212 % of the 500 m request (1.06 cores/pod). The gateway is doing real work, but well under its 4000 m per-pod limit.
Loki ingester flush path: loki_ingester_flush_queue_length grew from 0 → 37 and plateaued there, which is the warning sign. Flush rate matches input rate (chunks aren’t accumulating unboundedly), but chunks sit in the queue longer than they should. Post-test also showed loki_ingester_chunks_flush_failures_total at 0.48/s, a small but non-zero stream of failed chunk flushes to Azure Blob. Retries succeed, so no entries drop, but P99 Loki request duration climbed to 435 ms (from ~20 ms baseline).

What Didn’t Happen (Importantly)

Zero dropped entries end-to-end. loki_write_dropped_entries_total = 0, otelcol_receiver_refused = 0, otelcol_exporter_send_failed = 0.
Exporter queue stayed at 0 on Alloy. The gateway is not backpressured internally — whatever throughput it can achieve, it ships immediately.
HPA didn’t scale up further — not because it couldn’t, but because it was already at maxReplicas=6 from earlier runs. Raising maxReplicas to 10 would give the gateway more parallelism and push the ceiling higher.
Alloy memory remained flat at ~965 MiB per pod across all rounds. No buffer growth means no hidden backpressure.

Measurement Note — Don’t Trust the per-Rate CSV Row at High Rates

The scaling-results-*.csv summary row for 100 k shows received_rate=11,340 (11 %!) and loss_pct=69.79. The time-series samples during the same window show Alloy receiving 106 k–114 k/s steadily. The per-rate summary caught a cooldown-adjacent slice of the 1-min rate window, not the active-load period. Always cross-check the per-rate row against the perf-metrics-samples.csv time-series before concluding anything about loss.

A measurement-window bug worth fixing in load_test/run-scaling-test.sh: the summary should average over the middle of each duration window, not the end, to avoid catching the ramp-down.
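
One possible shape for that fix, sketched in Python rather than in the shell script itself: average only the middle of each round's samples and discard the ramp-up and ramp-down edges. The 25 % trim on each side and the sample values are illustrative assumptions, not numbers taken from the CSVs.

```python
def steady_state_mean(samples, trim_fraction=0.25):
    """Mean of the middle of a window, dropping `trim_fraction` from each end."""
    if not samples:
        raise ValueError("no samples in window")
    trim = int(len(samples) * trim_fraction)
    # Fall back to the full window if trimming would leave nothing.
    middle = samples[trim:len(samples) - trim] or samples
    return sum(middle) / len(middle)


# Hypothetical received-rate samples for one round: ramp-up, plateau, ramp-down.
SAMPLES = [40_000, 106_000, 110_000, 114_000, 108_000, 112_000, 60_000, 11_340]

print(f"last sample only : {SAMPLES[-1]}")
print(f"middle of window : {steady_state_mean(SAMPLES):.0f}")
```

Reading the final sample reproduces the misleading summary row, while the trimmed mean reflects the active-load period.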

Takeaways

The gateway scales well: 6 replicas comfortably handle 120 k logs/s with headroom on CPU (1.28/4.0 cores per pod) and memory (969/4096 MiB).
Loki (Distributed, 3 ingester replicas) is the actual ceiling in this setup. To go above 120 k we need to scale ingester.replicas (the write-path bottleneck) before touching Alloy. distributor is stateless and rarely the constraint.
Degradation is gradual, not catastrophic. Even when chunk-flush to Azure Blob fails occasionally, nothing drops: Loki’s WAL + retry preserves every entry. End-to-end integrity survives the pressure regime.
150 k target was not reached by the sender either — load generation tops out around 125 k in this test harness. To actually push 150 k, raise MAX_RATE_PER_REPLICA in run-scaling-test.sh and add a bigger node pool (~30 sender pods need somewhere to run).

Next Experiments Justified by These Numbers

Scale Loki ingester to 4 replicas and repeat the 100 k/150 k run. Flush queue should clear; P99 latency should drop back toward ~50 ms.
Raise gateway maxReplicas to 10, MAX_RATE_PER_REPLICA to 8000, and try rates="150000 200000". Sender-side capacity becomes the next question.
Enable OTLP gzip compression on telemetrygen to confirm the 75 MB/s network-bound prediction for the 500 k tier.

Next Steps — Pushing the Load Higher

The original baseline in “Performance Test Results” was measured on the old single-pod DaemonSet. After the gateway/collector split, the gateway scales horizontally (2 → 6 replicas at CPU 70% / memory 80%) and targets are sharded across replicas. This changes the scaling behavior: instead of growing a single pod vertically, we add replicas and each new one immediately takes its share of the work.

This section describes how to re-baseline with larger loads to find the new saturation points.

Recommended Load Tiers

Tier	Rate	What you’re exercising
Smoke	5,000 logs/s	Baseline — 2 replicas barely notice. Sanity-check the whole pipeline (OTLP → processor → Loki).
Per-replica saturation	15,000 logs/s	Matches the old single-pod saturation point (~1.4 cores). With 2 replicas it splits to ~7.5k each — still comfortable, HPA stays at minReplicas.
HPA scale-up trigger	30,000 logs/s	Each replica needs ~1.5 cores, crossing the 70% CPU target. HPA should add replicas within ~1 minute. Watch the ring redistribute.
Full scale-out	50,000 logs/s	Drives the gateway to maxReplicas=6. Loki distributor/ingester pressure starts becoming visible.
Loki bottleneck hunt	75,000–100,000 logs/s	Gateway has headroom (raise `maxReplicas` first). Now you’re measuring Loki — distributor CPU, ingester memory, write-path queue depth.
Loki ingester scale-out	150,000 logs/s	Single Loki ingester replica can’t keep up — chunk flush queue grows, `loki_ingester_flush_queue_length` climbs. Scale `loki.ingester.replicas` to 3 before this tier.
Distributor saturation	250,000 logs/s	Loki distributor CPU becomes the wall. Scale `loki.distributor.replicas` to 3+, watch `loki_distributor_ingester_appends_total` vs `loki_distributor_lines_received_total` for fan-out efficiency.
Network-bound regime	500,000 logs/s	At ~75 MB/s serialized OTLP, you’re using a noticeable slice of the AKS node’s NIC. Consider topology-aware routing (pods on same nodes as Loki ingesters) and enable OTLP gzip compression on the producer side.
Storage-bound regime	1,000,000 logs/s	Azure Blob egress on the Loki backend becomes visible in `loki_azure_blob_request_duration_seconds`. Raise `loki.ingester.chunk_target_size` so fewer/larger blobs are written; consider premium storage SKU.
Cluster-scale load	2,000,000+ logs/s	You’ve outgrown the single-AKS-cluster setup. Add a dedicated node pool for the monitoring namespace (ensures gateway replicas don’t compete with app workloads for CPU/network), shard Loki by tenant, and consider a Mimir-style read/write split for metrics derived from these logs.

Rate per telemetrygen sender is capped at 5,000 logs/s (see MAX_RATE_PER_REPLICA in load_test/run-scaling-test.sh); the script fans out into multiple sender replicas automatically. At 500 k/s that’s 100 sender replicas, so check the load-test-scaling namespace quota before starting.
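
The sender count for any tier follows directly from the per-sender cap. A small sketch of the arithmetic (the function name is ours; the script's own fan-out logic may differ in detail):

```python
MAX_RATE_PER_REPLICA = 5_000  # logs/s per telemetrygen sender, as in the script


def sender_replicas(target_rate, max_rate_per_replica=MAX_RATE_PER_REPLICA):
    """Number of sender pods needed to reach target_rate (ceiling division)."""
    if target_rate <= 0 or max_rate_per_replica <= 0:
        raise ValueError("rates must be positive")
    return -(-target_rate // max_rate_per_replica)


for rate in (50_000, 150_000, 500_000):
    print(f"{rate:>7} logs/s -> {sender_replicas(rate)} sender replicas")
```

Raising the cap shrinks the pod count accordingly, for example `sender_replicas(150_000, 8_000)` for the proposed 8000 setting.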

Tier-by-tier Prep Checklist

Don’t jump straight to a high tier — each step above assumes the previous tier’s prerequisites are in place. Before running tier N, complete the prep for tiers 1…N−1.

Tier	Prep required before running
150 k	`ingester.replicas: 3`; verify replication factor in Loki config matches (RF=3 needs ≥3 ingesters).
250 k	`distributor.replicas: 3`+; bump `distributor.resources.limits.cpu` to 4000m. Confirm `loki.limits_config.ingestion_rate_mb` (and `ingestion_burst_size_mb`) are high enough — defaults will throttle hard at this rate.
500 k	Enable OTLP gzip compression on producers (`OTEL_EXPORTER_OTLP_COMPRESSION=gzip`); raise Alloy `maxReplicas` to ≥8; consider `scrape_interval`-side batching on producers. Verify AKS node SKU has ≥8 Gbps NIC (`Standard_D8_v5` or similar).
1 M	Loki storage: switch Azure Blob SKU from `Standard_LRS` to `Premium_LRS` or use tiered storage; raise `chunk_target_size` to 2 MiB. Watch `loki_azure_blob_egress_bytes_total` and `loki_azure_blob_request_duration_seconds` p99.
2 M+	Dedicate a monitoring node pool with taints/tolerations so Alloy + Loki + Mimir run on hardware separate from applications. Consider multi-region replication if your SLA requires it.

Running the Test

From the GitHub UI (Actions → “Performance Test - Log Ingestion” → Run workflow):

rates:     5000 15000 30000 50000
duration:  3m
delete_otel_namespace: true

delete_otel_namespace=true removes the demo app so only the scaling traffic hits Alloy — cleaner numbers. Set it to false to measure realistic mixed load.

Or from CLI:

gh workflow run "Performance Test - Log Ingestion" --ref main \
  -f rates="5000 15000 30000 50000 75000" \
  -f duration="3m"

Each round: 30 s stabilize + duration of sustained traffic + 60 s cooldown. Total runtime ≈ (rounds × (duration + 90 s)) plus ~2 min setup and ~2 min metrics collection.
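
A worked version of that estimate, using the timings stated above (30 s stabilize, 60 s cooldown, roughly 2 min each for setup and metrics collection). The setup and collection figures are approximate, so the result is a planning number rather than a guarantee.

```python
def estimated_runtime_seconds(rounds, duration_s, stabilize_s=30, cooldown_s=60,
                              setup_s=120, collection_s=120):
    """rounds x (stabilize + duration + cooldown) + setup + metrics collection."""
    return rounds * (stabilize_s + duration_s + cooldown_s) + setup_s + collection_s


# Four rates at 3 min each (the GitHub UI example) and five rates (the CLI example).
for rounds in (4, 5):
    total = estimated_runtime_seconds(rounds, duration_s=180)
    print(f"{rounds} rounds x 3m: about {total / 60:.1f} min")
```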

What to Watch

Gateway HPA:

kubectl get hpa alloy -n monitoring -w

At 30 k you should see REPLICAS climb from 2 to 3–4. At 50 k, 5–6. If it pegs at maxReplicas while CPU is still at 100 %, raise the max in alloy-gateway.values.yaml and redeploy.

Cluster ring:

kubectl port-forward -n monitoring svc/alloy 12345:12345
curl -s http://localhost:12345/api/v0/web/cluster | jq '.peers | length'

The peer count should match HPA’s current replica count. Mismatch means a pod joined K8s but not the Alloy cluster (port 7946 blocked, or a network policy).

Exporter queue saturation (the first internal warning sign before Loki drops):

max(otelcol_exporter_queue_size) / max(otelcol_exporter_queue_capacity)

Under 0.3: healthy. 0.3–0.8: Loki is getting slow. Above 0.8: sustained backpressure; log records start dropping once the exporter exhausts its retries.
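
The same thresholds as a small classifier, handy for annotating the per-sample CSV. The status labels are ours, and treating everything between 0.3 and 0.8 as a single warning band is an assumption.

```python
def queue_health(queue_size, queue_capacity):
    """Classify exporter queue fill level using the thresholds above."""
    if queue_capacity <= 0:
        raise ValueError("queue_capacity must be positive")
    ratio = queue_size / queue_capacity
    if ratio < 0.3:
        return "healthy"
    if ratio <= 0.8:
        return "loki-slowing"   # assumed single warning band, 0.3 to 0.8
    return "backpressure"


# Queue readings quoted in this document (size / capacity).
for size in (0, 694, 848):
    print(f"{size}/1000 -> {queue_health(size, 1000)}")
```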

Loki write path — usually the actual bottleneck once the gateway scales out:

sum(rate(loki_distributor_lines_received_total[1m]))
sum(rate(loki_write_dropped_entries_total[1m]))
sum(loki_ingester_flush_queue_length)

Non-zero loki_write_dropped_entries_total means Loki is refusing ingestion, not Alloy failing.

Prometheus remote_write shard health (Alloy → Prometheus path for its own metrics):

rate(prometheus_remote_storage_samples_failed_total[1m])
prometheus_remote_storage_shards

When the Bottleneck Moves to Loki

This is the expected transition after the Alloy split. Order of operations to push further:

Scale Loki ingesters first — set loki.ingester.replicas higher in the Loki values file. Ingesters are stateful (own a portion of streams), so this is where most throughput bottlenecks sit.
Scale the Loki distributor — if loki_distributor_lines_received_total plateaus while ingesters have headroom.
Raise Alloy maxReplicas — only if gateway CPU stays ≥ 70 % at the current max and Loki has headroom.
Tune the OTLP batch processor — increase send_batch_size and send_batch_max_size in alloy-gateway.values.yaml to reduce per-request overhead at very high rates.

At this stage the gateway is genuinely horizontally scalable. Don’t grow pod-size; grow replica count and downstream capacity.

Interpreting the CSVs

The workflow uploads two artifacts per run:

load_test/scaling-results-*.csv — one row per rate tested (throughput, loss %, CPU, memory, queue %)
/tmp/perf-metrics-samples.csv — per-sample time series (30 s interval) covering Alloy self-metrics, OTLP receiver/processor/exporter, Loki distributor/ingester/write path, and ES doc counts if Elasticsearch is enabled

Peak-value table in the GitHub Actions job summary is derived from the time-series CSV. When a new test completes, replace the “Scaling Results” table above with the latest numbers so the documented baseline stays accurate.