Architecture Overview

Putting it all together - Observability Architecture

flowchart TB
    subgraph sources ["Data Sources"]
        app["Applications\n(instrumented with OTel SDK)"]

        subgraph k8s ["Kubernetes"]
            ksm["kube-state-metrics"]
            node["node-exporter"]
            kubelet["kubelet"]
        end

        subgraph exporters ["Exporters"]
            pg["PostgreSQL Exporter\n:9187"]
            redis["Redis Exporter\n:9121"]
        end
    end

    subgraph alloy_box ["Grafana Alloy — Unified Collection Layer"]
        direction LR
        otel_recv["OTLP Receiver\n:4317 gRPC / :4318 HTTP"]
        scraper["Prometheus Scraper\nServiceMonitors\nPod Annotations"]
        log_scrape["Log Scraper\nPod stdout/stderr"]
        ebpf_prof["eBPF Profiler\n97 Hz sampling"]
    end

    %% Applications → Alloy (OTLP)
    app -- "OTLP\n(traces, metrics, logs)" --> otel_recv

    %% Kubernetes metrics → Alloy (scrape)
    k8s -- "scrape metrics" --> scraper
    exporters -- "scrape metrics" --> scraper

    %% Alloy scrapes logs from pods
    app -. "stdout/stderr\n(pod logs)" .-> log_scrape

    %% eBPF profiles all processes
    app -. "kernel-level\nstack traces" .-> ebpf_prof

    prometheus["Prometheus\n(short-term metrics)\n:9090"]
    loki["Loki\n(logs)"]
    tempo["Tempo\n(traces)\n:4317"]
    pyroscope["Pyroscope\n(profiles)\n:4040"]
    mimir["Mimir\n(long-term metrics)"]

    %% Alloy → backends
    otel_recv -- "metrics" --> prometheus
    scraper -- "metrics\n(remote write)" --> prometheus
    otel_recv -- "logs (OTLP)" --> loki
    log_scrape -- "logs" --> loki
    otel_recv -- "traces (OTLP)" --> tempo
    ebpf_prof -- "profiles" --> pyroscope

    %% Metrics from Traces
    tempo -- "Span metrics generator\n(RED metrics, service graphs,\nTraceQL metrics)" --> mimir

    prometheus -- "remote write" --> mimir

    grafana["Grafana\n:3000"]

    mimir --> grafana
    prometheus --> grafana
    loki --> grafana
    tempo --> grafana
    pyroscope --> grafana

    %% Cross-signal links in Grafana
    grafana -. "Traces → Logs\n(TraceID correlation)" .-> loki
    grafana -. "Traces → Profiles\n(service_name mapping)" .-> pyroscope
    grafana -. "Traces → Metrics\n(span metrics queries)" .-> prometheus

    %% Styling
    style alloy_box fill:#f59e0b,stroke:#d97706,color:#000
    style sources fill:#e5e7eb,stroke:#9ca3af,color:#000
    style grafana fill:#10b981,stroke:#059669,color:#fff
    style prometheus fill:#3b82f6,stroke:#2563eb,color:#fff
    style loki fill:#3b82f6,stroke:#2563eb,color:#fff
    style tempo fill:#3b82f6,stroke:#2563eb,color:#fff
    style pyroscope fill:#3b82f6,stroke:#2563eb,color:#fff
    style mimir fill:#3b82f6,stroke:#2563eb,color:#fff
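
The OTLP paths through Alloy in the diagram can be sketched in Alloy's configuration language. The component names (`otelcol.receiver.otlp`, `otelcol.exporter.prometheus`, `loki.write`, and so on) are real Alloy components; the backend hostnames and the wiring are assumptions about this cluster, not a verbatim copy of its config:

```alloy
// Receive OTLP from instrumented apps on the standard ports.
otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }

  output {
    metrics = [otelcol.exporter.prometheus.default.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.exporter.otlp.tempo.input]
  }
}

// Convert OTLP metrics into Prometheus remote-write samples.
otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.remote_write.local.receiver]
}

prometheus.remote_write "local" {
  endpoint { url = "http://prometheus:9090/api/v1/write" }
}

// Convert OTLP logs into Loki push requests.
otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint { url = "http://loki:3100/loki/api/v1/push" }
}

// Pass traces through to Tempo's OTLP gRPC endpoint unchanged.
otelcol.exporter.otlp "tempo" {
  client { endpoint = "tempo:4317" }
}
```

Writing to Prometheus this way assumes Prometheus runs with its remote-write receiver enabled (`--web.enable-remote-write-receiver`).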

Signal Flow Summary

| Signal | Source | Collector | Backend | Long-term |
| --- | --- | --- | --- | --- |
| Metrics | Apps (OTLP), K8s components, Exporters | Alloy (scrape + OTLP) | Prometheus | Mimir |
| Logs | App stdout/stderr, OTLP structured logs | Alloy (file tail + OTLP) | Loki | Loki (Azure Blob) |
| Traces | Apps (OTLP) | Alloy (OTLP passthrough) | Tempo | Tempo (Azure Blob) |
| Profiles | All processes (eBPF), SDK-instrumented apps | Alloy (eBPF + scrape) | Pyroscope | Pyroscope (Azure Blob) |
| Metrics from Traces | Trace spans | Tempo metrics generator | Mimir | Mimir (Azure Blob) |
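
The scrape half of the metrics row can be sketched the same way, assuming the cluster declares its scrape targets with the Prometheus Operator's ServiceMonitor CRDs (both components below are real Alloy components; the Prometheus URL is a placeholder):

```alloy
// Discover and scrape every target declared by a ServiceMonitor CRD.
prometheus.operator.servicemonitors "default" {
  forward_to = [prometheus.remote_write.local.receiver]
}

// Ship the scraped samples to the short-term Prometheus instance.
prometheus.remote_write "local" {
  endpoint { url = "http://prometheus:9090/api/v1/write" }
}
```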

Cross-Signal Correlations

The stack is wired so that you can navigate between signals without leaving Grafana:

| From | To | How |
| --- | --- | --- |
| Trace | Logs | TraceID derived field in Loki; click a trace span → see matching logs (±1h time window) |
| Trace | Metrics | Span metrics (RED) generated by Tempo; click a trace span → see rate/error/duration metrics for that service |
| Trace | Profile | service_name mapping to Pyroscope; click a trace span → see the CPU/memory profile for that service at that time |
| Trace | PostgreSQL | db.statement attribute extraction; click a trace span with a DB query → run the SQL directly against the PostgreSQL datasource |
| Trace | Redis | db.statement attribute extraction; click a trace span with a Redis command → run HGET/GET directly against the Redis datasource |
| Metric | Trace | Exemplars on metrics; click an exemplar point on a metric graph → jump to the corresponding trace |
| Log | Trace | TraceID field in structured logs; click a TraceID in a log line → open the full trace in Tempo |
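
The Trace → Logs and Log → Trace hops are configured on the Grafana datasources themselves. A minimal provisioning sketch — the `matcherRegex`, datasource UIDs, and URLs are assumptions about this stack's log format and naming, not its actual provisioning files:

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Turn a trace_id field in a log line into a clickable Tempo link.
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'
          url: '$${__value.raw}'
          datasourceUid: tempo
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        # Widen the log search window around the span (the ±1h above).
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        filterByTraceID: true
```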

Key Design Decisions

  1. Alloy as the single collection point — Prometheus does not scrape directly. All metric collection flows through Alloy, which gives a unified configuration point and enables eBPF profiling from the same DaemonSet.

  2. Dual metrics path — Prometheus holds short-term metrics (hours), Mimir holds long-term (weeks/months) in Azure Blob Storage. Both are queryable as Grafana datasources.

  3. Tempo metrics generator — Tempo extracts RED metrics, service graphs, and TraceQL metrics from traces and writes them to Mimir. This means you get metrics-based alerting on traces without manual instrumentation.

  4. eBPF profiling by default — Every process on every node is profiled at 97 Hz with zero code changes. SDK-based profiling adds richer data (goroutines, locks, exceptions) for instrumented services.
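
Decision 2's dual path is a single remote_write rule on the short-term Prometheus; the retention flag and the Mimir push URL are assumptions about how this cluster is deployed:

```yaml
# prometheus.yml — local retention stays short (e.g. --storage.tsdb.retention.time=24h);
# everything is also streamed to Mimir for long-term storage in Azure Blob.
remote_write:
  - url: http://mimir:8080/api/v1/push
```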
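
Decision 3 corresponds to a Tempo configuration along these lines — a sketch only, since the overrides layout varies between Tempo versions and the Mimir URL is assumed:

```yaml
# tempo.yaml — enable the metrics generator and remote-write its output to Mimir
metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://mimir:8080/api/v1/push

overrides:
  defaults:
    metrics_generator:
      # RED metrics per service + service-graph edges, derived from spans.
      processors: [service-graphs, span-metrics]
```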
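
Decision 4 maps onto Alloy's `pyroscope.ebpf` component, which runs in the same DaemonSet as the collectors above. The component names are real; the target wiring and Pyroscope URL are assumptions:

```alloy
// Discover the pods running on this node.
discovery.kubernetes "pods" {
  role = "pod"
}

// Sample kernel-level stack traces for those pods (default rate is 97 Hz).
pyroscope.ebpf "default" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [pyroscope.write.default.receiver]
}

pyroscope.write "default" {
  endpoint { url = "http://pyroscope:4040" }
}
```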
