OpenTelemetry Architecture

🏗️ Main OTel Components

  1. SDK & API – libraries in applications
  2. Instrumentation Libraries – ready-made integrations
  3. OpenTelemetry Collector – central data processing point

🗺️ Observability Architecture in Our Solution

flowchart TB
    subgraph sources ["Data Sources"]
        app["Applications\n(instrumented with OTel SDK)"]

        subgraph k8s ["Kubernetes"]
            ksm["kube-state-metrics"]
            node["node-exporter"]
            kubelet["kubelet"]
        end

        subgraph exporters ["Exporters"]
            pg["PostgreSQL Exporter\n:9187"]
            redis["Redis Exporter\n:9121"]
        end

        nodes["All node processes\n(eBPF target)"]
    end

    subgraph alloy_gw ["alloy – Deployment + HPA (2–6) + Clustering"]
        direction LR
        otel_recv["OTLP Receiver\n:4317 gRPC / :4318 HTTP"]
        scraper["Prometheus Scraper\nServiceMonitors + Pod Annotations\n(targets sharded across replicas)"]
        sdk_prof["pyroscope.scrape\n(SDK profiles, sharded)"]
    end

    subgraph alloy_col ["alloy-collector – DaemonSet (1 / node)"]
        direction LR
        log_scrape["Log Scraper\nPod stdout/stderr\n(/var/log/pods on this node)"]
        ebpf_prof["pyroscope.ebpf\n(node-local CPU profiling)"]
    end

    %% Applications → Alloy gateway (OTLP)
    app -- "OTLP\n(traces, metrics, logs)\nalloy.monitoring:4317/4318" --> otel_recv

    %% Kubernetes metrics → Alloy gateway (scrape)
    k8s -- "scrape metrics" --> scraper
    exporters -- "scrape metrics" --> scraper

    %% SDK profiles pulled by gateway
    app -. "/debug/pprof scrape" .-> sdk_prof

    %% Pod logs collected by per-node DaemonSet
    app -. "stdout/stderr\n(pod logs)" .-> log_scrape

    %% eBPF profiles all node processes
    nodes -. "kernel sampling 97 Hz" .-> ebpf_prof

    prometheus["Prometheus:9090"]
    loki["Loki"]
    tempo["Tempo:4317"]
    pyroscope["Pyroscope:4040"]

    %% Gateway → backends
    otel_recv -- "metrics" --> prometheus
    scraper -- "metrics\n(remote write)" --> prometheus
    otel_recv -- "logs (OTLP)" --> loki
    otel_recv -- "traces (OTLP)" --> tempo
    sdk_prof -- "profiles" --> pyroscope

    %% Collector → backends
    log_scrape -- "logs" --> loki
    ebpf_prof -- "profiles" --> pyroscope

    %% Metrics from Traces
    tempo -- "Span metrics generator\n(RED metrics: rate, errors, duration)" --> prometheus

    mimir["Mimir\n(long-term metrics)"]

    prometheus -- "remote write" --> mimir

    mimir --> grafana
    grafana["Grafana :3000"]

    prometheus --> grafana
    loki --> grafana
    tempo --> grafana
    pyroscope --> grafana

    %% Styling
    style alloy_gw fill:#f59e0b,stroke:#d97706,color:#000
    style alloy_col fill:#fbbf24,stroke:#d97706,color:#000

    style sources fill:#e5e7eb,stroke:#9ca3af,color:#000

    style grafana fill:#10b981,stroke:#059669,color:#fff
    style prometheus fill:#3b82f6,stroke:#2563eb,color:#fff
    style loki fill:#3b82f6,stroke:#2563eb,color:#fff
    style tempo fill:#3b82f6,stroke:#2563eb,color:#fff
    style pyroscope fill:#3b82f6,stroke:#2563eb,color:#fff
    style mimir fill:#3b82f6,stroke:#2563eb,color:#fff

Two Alloy releases: cluster-scoped work (OTLP receiving, Prometheus scraping, SDK profile pulling) runs on the horizontally-scaled alloy Deployment with HPA + clustering. Node-scoped work (pod log tailing, eBPF profiling) runs on the alloy-collector DaemonSet because it needs host filesystem and kernel access. Producers continue to use the unchanged alloy.monitoring:4317/4318 service name; the K8s Service round-robins OTLP connections across gateway replicas.
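
The diagram notes that scrape targets are sharded across gateway replicas. One way to picture "every target has exactly one owner" is rendezvous (highest-random-weight) hashing. The sketch below is an illustrative stand-in, not Alloy's actual clustering code; the replica names and target addresses are hypothetical.

```python
import hashlib


def owner(target: str, replicas: list[str]) -> str:
    """Return the replica that owns `target` (highest hash score wins)."""
    def score(replica: str) -> int:
        digest = hashlib.sha256(f"{replica}|{target}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(replicas, key=score)


def shard(targets: list[str], replicas: list[str]) -> dict[str, list[str]]:
    """Group targets by owning replica; every target appears exactly once."""
    assignment: dict[str, list[str]] = {r: [] for r in replicas}
    for target in targets:
        assignment[owner(target, replicas)].append(target)
    return assignment


if __name__ == "__main__":
    replicas = ["alloy-0", "alloy-1", "alloy-2"]          # hypothetical pod names
    targets = [f"10.0.0.{i}:9100" for i in range(1, 11)]  # hypothetical scrape targets
    for replica, owned in shard(targets, replicas).items():
        print(replica, owned)
```

With this scheme, adding a replica only moves targets onto the new replica; targets never shuffle between the existing ones, which is the property that makes scaling out cheap.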

Key Flows

1. Metrics – The gateway scrapes metrics from both applications (ServiceMonitors, pod annotations) and Kubernetes and infrastructure components (kube-state-metrics, node-exporter, kubelet, PostgreSQL/Redis exporters) and sends them to Prometheus via remote write. Scrape targets are sharded across gateway replicas via Alloy clustering: each target is owned by exactly one replica, so scaling out splits the load instead of duplicating it.

2. Logs – Application OTLP logs (structured, with trace context) are received by the gateway. Pod stdout/stderr logs are tailed from /var/log/pods by the collector DaemonSet (one pod per node, each handling only its own node’s logs). Both streams go to Loki.

3. Profiles – Two paths: SDK-instrumented services expose /debug/pprof and are scraped by the gateway (sharded). All other processes, including those with no instrumentation, are profiled by the collector DaemonSet’s eBPF probe (97 Hz kernel sampling). Both feed Pyroscope.

4. Metrics from Traces – Tempo automatically generates RED metrics (Rate, Errors, Duration) from spans using the built-in span metrics generator and sends them to Prometheus. This allows creating alerts and dashboards based on traces without manual metric instrumentation.
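
To make the RED idea concrete, here is a minimal sketch of deriving rate, error ratio, and duration from a batch of spans. It is a simplification under stated assumptions: the `Span` type and field names are hypothetical, and plain averages are used where a real generator would emit counters and histograms.

```python
from dataclasses import dataclass


@dataclass
class Span:
    service: str        # hypothetical field names, for illustration only
    duration_ms: float
    is_error: bool


def red_metrics(spans: list[Span], window_seconds: float) -> dict[str, dict[str, float]]:
    """Aggregate Rate, Errors, and Duration per service over one time window."""
    by_service: dict[str, list[Span]] = {}
    for span in spans:
        by_service.setdefault(span.service, []).append(span)

    result: dict[str, dict[str, float]] = {}
    for service, group in by_service.items():
        result[service] = {
            "rate_per_s": len(group) / window_seconds,
            "error_ratio": sum(s.is_error for s in group) / len(group),
            "avg_duration_ms": sum(s.duration_ms for s in group) / len(group),
        }
    return result


if __name__ == "__main__":
    spans = [Span("checkout", 100.0, False), Span("checkout", 300.0, True)]
    print(red_metrics(spans, window_seconds=60))
```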

🧱 OpenTelemetry Collector

The Collector is a service that sits between applications and observability backends.

Application → OTel SDK → OTel Collector → Grafana / Tempo / Prometheus / Loki

⚙️ Collector Architecture

The Collector consists of 3 key parts:

| Component | Role |
| --- | --- |
| Receivers | Receive data (OTLP, Jaeger, Prometheus, Zipkin) |
| Processors | Process data (batch, sampling, transformations) |
| Exporters | Send data to backends |
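
The three roles can be pictured as a chain of small components. The sketch below is a toy model, not the Collector's implementation: the class and function names are hypothetical, and the batch processor flushes on size only (the real batch processor also flushes on a timeout).

```python
from typing import Callable, Iterable

Record = dict
Exporter = Callable[[list], None]


class BatchProcessor:
    """Processor: buffers records and forwards them in batches of `max_size`."""

    def __init__(self, exporter: Exporter, max_size: int = 3) -> None:
        self.exporter = exporter
        self.max_size = max_size
        self.buffer: list = []

    def consume(self, record: Record) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.max_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.exporter(self.buffer)
            self.buffer = []  # start a fresh list so exported batches are not mutated


def receive(records: Iterable[Record], processor: BatchProcessor) -> None:
    """Receiver: accepts incoming records and hands them to the next stage."""
    for record in records:
        processor.consume(record)


if __name__ == "__main__":
    exported: list = []                      # stand-in for a backend
    processor = BatchProcessor(exported.append, max_size=2)
    receive(({"id": i} for i in range(5)), processor)
    processor.flush()                        # flush the remainder on shutdown
    print([len(batch) for batch in exported])
```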

🧭 Collector Deployment Modes

| Mode | Description | Example Use Case |
| --- | --- | --- |
| Agent mode | Collector running locally on a node | Collecting data from a single host |
| Gateway mode | Central collector gathering data from multiple sources | Scaled environments |
| Hybrid mode | Combination of both approaches | Large distributed systems |

🔄 Grafana Alloy vs OpenTelemetry Collector

Grafana Alloy (the successor to Grafana Agent) and the OpenTelemetry Collector serve a similar role: they collect, process, and forward telemetry data. They differ in philosophy and ecosystem.

| Feature | OpenTelemetry Collector | Grafana Alloy |
| --- | --- | --- |
| Project | CNCF (vendor-neutral) | Grafana Labs (open source) |
| Configuration | YAML (pipelines: receivers → processors → exporters) | River (HCL-like, declarative, with typing) |
| Signals | Traces, Metrics, Logs | Traces, Metrics, Logs, Profiles (Pyroscope) |
| Prometheus Scraping | Yes (prometheus receiver) | Native – full compatibility with prometheus.scrape |
| Grafana Stack Integration | Requires exporter configuration | Out-of-the-box (Loki, Tempo, Mimir, Pyroscope) |
| Debug UI | None (CLI/logs) | Built-in UI with component graph (localhost:12345) |
| Clustering | None (requires external load balancer) | Built-in – automatic target sharding between instances |
| Distributions | otelcol-core, otelcol-contrib, custom builder (ocb) | Single binary – all components in one |
| Config conversion | None | alloy convert – automatic migration from OTel Collector, Prometheus, Promtail |

When to choose OTel Collector?

  • Multi-vendor environment – data goes to different backends (Datadog, Jaeger, Elastic, Grafana)
  • You want to stay with the CNCF standard without vendor lock-in
  • You need custom builds with selected components (ocb)

When to choose Grafana Alloy?

  • Stack based on Grafana (Loki, Tempo, Mimir, Pyroscope)
  • You need profiling (native Pyroscope integration)
  • You want clustering and automatic target sharding without external tools
  • You prefer declarative configuration (River) over YAML pipelines

Example – the same pipeline in both tools

OpenTelemetry Collector (YAML):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"

processors:
  batch:
    timeout: 5s

exporters:
  otlp:
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

Grafana Alloy (River):

otelcol.receiver.otlp "default" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  output {
    traces = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  timeout = "5s"
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo:4317"
    tls {
      insecure = true
    }
  }
}

4️⃣ Instrumentation Methods

🎯 Instrumentation Goal

Collect telemetry from applications, automatically or manually, to get a complete picture of system behavior.

🧰 1. Application Instrumentation (Manual)

  • You add code to the application (@WithSpan, Tracer.startSpan(), etc.)
  • Most of the heavy lifting is done by libraries.
  • ✅ Advantages:
    • full control
    • precise data
  • ❌ Disadvantages:
    • time-consuming
    • requires code maintenance
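
# Install NuGet packages
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol

// Program.cs
using OpenTelemetry.Exporter;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    // "tracing" avoids shadowing the outer "builder" variable.
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()   // spans for incoming HTTP requests
        .AddHttpClientInstrumentation()   // spans for outgoing HttpClient calls
        .AddOtlpExporter(options =>
        {
            // Port 4318 is OTLP over HTTP; the exporter defaults to gRPC, so select HTTP explicitly.
            options.Protocol = OtlpExportProtocol.HttpProtobuf;
            // When the endpoint is set in code, the HTTP URL must include the signal path.
            options.Endpoint = new Uri("http://otel-collector:4318/v1/traces");
        }));

var app = builder.Build();
app.Run();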

⚙️ 2. Auto-Instrumentation

  • Language agent that automatically tracks calls (e.g., HTTP, DB, Kafka)
  • ✅ Advantages:
    • quick start
    • no code changes
  • ❌ Disadvantages:
    • limited flexibility
    • framework-dependent

Example for Java:

# 1. Download OpenTelemetry Java Agent
wget -O opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

# 2. Run application with agent
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-java-app \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4318 \
  -Dotel.exporter.otlp.protocol=http/protobuf \
  -jar my-application.jar

Java auto-instrumentation covers:

  • HTTP clients/servers (OkHttp, Apache HttpClient, Spring WebMVC)
  • Database drivers (JDBC, MongoDB, Redis)
  • Messaging (Kafka, RabbitMQ, JMS)
  • Frameworks (Spring Boot, Quarkus, Micronaut)

Example for .NET:

# 1. Install OpenTelemetry .NET Automatic Instrumentation
# Download and install from GitHub releases
wget -O otel-dotnet-auto-install.sh \
  https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/releases/latest/download/otel-dotnet-auto-install.sh
chmod +x otel-dotnet-auto-install.sh
./otel-dotnet-auto-install.sh

# 2. Set environment variables
export CORECLR_ENABLE_PROFILING=1
export CORECLR_PROFILER={918728DD-259F-4A6A-AC2B-B85E1B658318}
export CORECLR_PROFILER_PATH=/opt/opentelemetry/OpenTelemetry.AutoInstrumentation.Native.so
export DOTNET_ADDITIONAL_DEPS=/opt/opentelemetry/AdditionalDeps
export DOTNET_SHARED_STORE=/opt/opentelemetry/store
export DOTNET_STARTUP_HOOKS=/opt/opentelemetry/net/OpenTelemetry.AutoInstrumentation.StartupHook.dll
export OTEL_DOTNET_AUTO_HOME=/opt/opentelemetry

# 3. Configure OpenTelemetry
export OTEL_SERVICE_NAME=my-dotnet-app
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

# 4. Run the application
dotnet MyApplication.dll

.NET auto-instrumentation covers:

  • HTTP clients/servers (HttpClient, ASP.NET Core)
  • Database providers (Entity Framework, SqlClient, MongoDB)
  • Messaging (Azure Service Bus, RabbitMQ, Kafka)
  • gRPC clients/servers

🧱 3. OBI – OpenTelemetry eBPF Instrumentation

  • Ready-made binary or sidecar that uses eBPF to capture telemetry from running processes, with no code changes
  • ✅ Advantages:
    • ideal for legacy systems
    • quick deployment
  • ❌ Disadvantages:
    • less control
    • harder debugging

🧩 Integration with Grafana Stack

  • Prometheus / Mimir → metrics
  • Loki → logs
  • Tempo → traces
  • Grafana → visualization and data correlation

📦 Collector in Kubernetes

  • Deployed as:
    • DaemonSet – one agent per node
    • Sidecar – alongside the application
    • Deployment – in gateway mode
  • Configuration in YAML (receivers, processors, exporters)

🔬 OpenTelemetry in Practice – Demo Application

The demo application at github.com/ProtopiaTech/opentelemetry-demo shows how real services are instrumented with OpenTelemetry. Three services demonstrate three distinct integration patterns.

Pattern 1: Manual SDK Setup (Go)

Service: src/product-catalog (gRPC service)

| What to look at | Where |
| --- | --- |
| SDK initialization (TracerProvider, MeterProvider, LoggerProvider) | main.go – main() function |
| OTLP gRPC exporter configuration | main.go – provider setup |
| gRPC auto-instrumentation | otelgrpc.NewServerHandler(), otelgrpc.NewClientHandler() |
| Custom span attributes | app.product.id, app.product.name, app.products.count |
| Structured logging via OTel | otelslog bridge – logs carry trace context automatically |
| Runtime metrics | runtime.Start() for Go runtime statistics |

Key takeaway: In Go, you initialize providers explicitly and wire them into gRPC server/client options. Auto-instrumentation handles the span lifecycle; you add custom attributes for business context.

Pattern 2: Auto-Instrumentation (Python)

Service: src/recommendation (gRPC service)

| What to look at | Where |
| --- | --- |
| Auto-instrumentation setup | Dockerfile – entrypoint uses opentelemetry-instrument |
| Dependencies | requirements.txt – opentelemetry-distro, opentelemetry-bootstrap |
| Custom metric (counter) | recommendation_server.py – app_recommendations_counter |
| Manual spans alongside auto-instrumentation | tracer.start_as_current_span("get_product_list") |
| Log trace context injection | logger.py – custom JSON formatter injects trace/span IDs |

Key takeaway: Python auto-instrumentation requires zero code changes for HTTP, gRPC, and DB calls. You add the opentelemetry-instrument wrapper at startup and get spans automatically. Custom counters and manual spans are added on top for business-specific telemetry.
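
As a rough illustration of log trace context injection, the sketch below shows a JSON formatter that adds trace and span IDs to each log line. It is not the demo's logger.py: the `current_span` context variable is a hypothetical stand-in for the OTel current-span context, and only the standard library is used.

```python
import contextvars
import json
import logging

# Hypothetical stand-in for the active span context: (trace_id, span_id) as ints.
current_span = contextvars.ContextVar("current_span", default=None)


class TraceJsonFormatter(logging.Formatter):
    """Render each record as JSON, adding trace/span IDs when a span is active."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        ctx = current_span.get()
        if ctx is not None:
            trace_id, span_id = ctx
            payload["trace_id"] = format(trace_id, "032x")  # 128-bit ID as 32 hex chars
            payload["span_id"] = format(span_id, "016x")    # 64-bit ID as 16 hex chars
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(TraceJsonFormatter())
    logger = logging.getLogger("recommendation")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    current_span.set((0xABC, 0x1))  # pretend a span is active
    logger.info("listing %d products", 5)
```

With IDs in every log line, a log viewer can link a log entry to the matching trace.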

Pattern 3: Explicit Instrumentors + Baggage (Python)

Service: src/load-generator (Locust-based)

| What to look at | Where |
| --- | --- |
| Manual SDK init with explicit instrumentors | locustfile.py – Jinja2Instrumentor, RequestsInstrumentor, URLLib3Instrumentor |
| Baggage propagation | session.id, synthetic_request – carried across service boundaries |
| Per-task manual spans | user_browse_product, user_checkout_single, etc. with business attributes |
| System metrics collection | SystemMetricsInstrumentor for host-level metrics |

Key takeaway: When auto-instrumentation doesn’t cover your framework (Locust), you instantiate specific instrumentors manually. Baggage lets you attach metadata (e.g., session ID, synthetic flag) that propagates through the entire request chain.
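
Baggage travels between services as a `baggage` HTTP header of comma-separated `key=value` pairs with percent-encoded values. The sketch below shows that wire format in simplified form; the helper names are hypothetical, and in a real service the OTel propagators handle this for you.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict[str, str]) -> str:
    """Serialize entries as a `baggage` header value (values percent-encoded)."""
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in entries.items())


def decode_baggage(header: str) -> dict[str, str]:
    """Parse a `baggage` header value back into a dict, ignoring properties."""
    entries: dict[str, str] = {}
    for member in header.split(","):
        member = member.strip()
        if not member or "=" not in member:
            continue  # skip empty or malformed members
        key, _, value = member.partition("=")
        value = value.split(";", 1)[0]  # drop optional ;property suffixes
        entries[key.strip()] = unquote(value.strip())
    return entries


if __name__ == "__main__":
    header = encode_baggage({"session.id": "abc 123", "synthetic_request": "true"})
    print(header)
    print(decode_baggage(header))
```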

What You Get for Free vs What You Add

| Signal | Auto-instrumentation (free) | Manual additions |
| --- | --- | --- |
| Traces | HTTP/gRPC spans, DB query spans, messaging spans | Custom spans for business operations, extra attributes (product.id, user.id) |
| Metrics | Runtime metrics (GC, goroutines, threads), HTTP request duration/count | Custom counters and histograms for business KPIs |
| Logs | Trace context injection (trace ID, span ID in every log line) | Structured fields, custom log levels, business event logs |

The Export Flow

All three services follow the same pattern:

Application (OTel SDK)
    → OTLP gRPC :4317
        → Grafana Alloy (Collector)
            → Prometheus (metrics)
            → Loki (logs)
            → Tempo (traces)
            → Pyroscope (profiles)

Key Environment Variables

Every OTel-instrumented service is configured via environment variables, with no vendor-specific code needed:

| Variable | Purpose | Example |
| --- | --- | --- |
| OTEL_SERVICE_NAME | Identifies the service in traces/metrics | product-catalog |
| OTEL_EXPORTER_OTLP_ENDPOINT | Where to send telemetry | http://alloy:4317 |
| OTEL_EXPORTER_OTLP_PROTOCOL | Wire protocol | grpc or http/protobuf |
| OTEL_TRACES_SAMPLER | Sampling strategy | parentbased_traceidratio |
| OTEL_TRACES_SAMPLER_ARG | Sampling rate | 0.1 (10%) |
| OTEL_RESOURCE_ATTRIBUTES | Extra resource attributes | deployment.environment=production |
| OTEL_LOGS_EXPORTER | Log exporter type | otlp |
| OTEL_METRICS_EXPORTER | Metrics exporter type | otlp |

These variables are standardized by the OTel specification. They work the same way regardless of language or backend.
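
To show what parentbased_traceidratio with an argument of 0.1 means in practice, here is a minimal sketch of the decision logic. It follows the general idea of ratio-based sampling; the function names are hypothetical and the exact bit handling may differ between SDK implementations.

```python
from typing import Mapping, Optional

MASK_64 = (1 << 64) - 1


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: compare the low 64 bits of the trace ID to a bound."""
    bound = round(ratio * (1 << 64))
    return (trace_id & MASK_64) < bound


def parent_based_ratio(trace_id: int, ratio: float, parent_sampled: Optional[bool] = None) -> bool:
    """Follow the parent's decision when there is one; otherwise sample by ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, ratio)


def ratio_from_env(env: Mapping[str, str]) -> float:
    """Read the sampling rate from OTEL_TRACES_SAMPLER_ARG (assumed default: 1.0)."""
    return float(env.get("OTEL_TRACES_SAMPLER_ARG", "1.0"))


if __name__ == "__main__":
    ratio = ratio_from_env({"OTEL_TRACES_SAMPLER_ARG": "0.1"})
    print(parent_based_ratio(trace_id=0x1234, ratio=ratio))                        # root span, decided by ratio
    print(parent_based_ratio(trace_id=1 << 63, ratio=ratio, parent_sampled=True))  # inherits parent decision
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same verdict, so traces are kept or dropped as a whole.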

📚 Additional Resources
