Logs

🪵 Introduction

What are they for?

To reconstruct exactly what happened

General Characteristics

  • Text
  • No uniform structure
  • Collected from many sources
  • Writing is NOT transactional

Classic Classification

  • System / Infrastructure logs
    • Tracking the operation of the environment and resources (servers, containers, network)
    • Container start/stop, server restart, disk errors, out of memory
  • Application logs
    • Operation of the application itself, its components, and business logic
    • Application startup, request processing, operation results, code errors
  • Access / HTTP logs
    • Monitoring network traffic, user requests, and APIs
    • HTTP request/response, method, URL, status code, execution time
  • Security / Audit logs
    • Recording events related to security and auditing
    • User login, failed login attempts, permission changes, operations on sensitive data
  • Business / Domain logs
    • Tracking key domain events (useful in analytics)
    • Order placement, course start, successful payment
  • Diagnostic / Debug / Trace logs
    • Detailed data for error and performance analysis
    • Method entry/exit, runtime variables, SQL query details

Formats

Plain Text (Unstructured logs)

Example:

2025-10-27 13:14:05 ERROR Payment failed for user 1234: Timeout connecting to gateway

Advantages:

  • Easy to read for humans
  • Simple to create (plain strings)

Disadvantages:

  • Difficult to parse programmatically
  • Unsuitable for structured queries and analytics (e.g., in Loki or Elasticsearch)

Typical use: Legacy systems, CLI tools, small embedded devices (e.g., microcontrollers, ESP sensors in Home Assistant)
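The parsing difficulty can be illustrated with a minimal sketch. The patterns below are tailored to the single example line above, and the names (`LINE_RE`, `USER_RE`, `parse_plain`) are hypothetical. Any change in message wording would need a new pattern.

```python
import re
from typing import Optional

# Matches only the "<date> <time> <LEVEL> <free text>" shape of the example line.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)$"
)
# The user id lives inside the free text, so it needs a second, wording-specific pattern.
USER_RE = re.compile(r"for user (?P<user_id>\d+)")


def parse_plain(line: str) -> Optional[dict]:
    """Return timestamp, level, message and user id (if present), or None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    user = USER_RE.search(fields["message"])
    fields["user_id"] = int(user.group("user_id")) if user else None
    return fields


if __name__ == "__main__":
    print(parse_plain(
        "2025-10-27 13:14:05 ERROR Payment failed for user 1234: "
        "Timeout connecting to gateway"
    ))
```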

Syslog Format (RFC 3164 / RFC 5424)

<34>1 2025-10-27T13:14:05Z myapp.example.com myapp 12345 ID47 [exampleSDID@32473 iut="3" eventSource="App"] Payment failed

Structure (RFC 5424):

  • <PRI> (facility and severity)
  • Version
  • Timestamp
  • Hostname
  • Application name
  • Process ID
  • Message ID
  • Structured Data
  • Message text

Advantages:

  • Standardized, supported by most infrastructure
  • Works across systems and network devices
  • Compatible with rsyslog, journald, etc.

Disadvantages:

  • Limited structure; requires parsers to extract metadata
  • Often inconsistent implementations

Typical use: Linux system logs (/var/log/syslog), network devices, daemons, container runtimes
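As a sketch of how the RFC 5424 header can be split, the snippet below decodes the example line. The PRI value encodes facility and severity as `facility * 8 + severity`, so `<34>` means facility 4 (security/authorization) and severity 2 (critical). The function and field names are illustrative, and structured data is left unparsed in `rest`.

```python
import re

# Severity names in numeric order 0..7, as defined by the syslog standard.
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

HEADER_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<version>\d+) "
    r"(?P<timestamp>\S+) (?P<hostname>\S+) (?P<app_name>\S+) "
    r"(?P<procid>\S+) (?P<msgid>\S+) (?P<rest>.*)$"
)


def parse_rfc5424_header(line: str) -> dict:
    """Split the header fields and decode PRI into facility and severity."""
    match = HEADER_RE.match(line)
    if match is None:
        raise ValueError("not an RFC 5424 line")
    fields = match.groupdict()
    pri = int(fields.pop("pri"))
    fields["facility"] = pri // 8              # numeric facility code
    fields["severity"] = SEVERITIES[pri % 8]   # severity name
    return fields


if __name__ == "__main__":
    example = (
        '<34>1 2025-10-27T13:14:05Z myapp.example.com myapp 12345 ID47 '
        '[exampleSDID@32473 iut="3" eventSource="App"] Payment failed'
    )
    print(parse_rfc5424_header(example))
```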

Common Log Format (CLF / NCSA)

127.0.0.1 - james [27/Oct/2025:13:14:05 +0000] "GET /index.html HTTP/1.1" 200 1043

Structure:

host ident authuser [date] "request" status bytes

Advantages:

  • Standardized for web servers (Apache, NGINX)
  • Works with many log analyzers

Disadvantages:

  • Fixed fields — not extensible
  • No JSON-like structure for modern metadata

Typical use: Web server access logs, HTTP request analytics

Combined Log Format (CLF Extension)

Adds referrer and user-agent fields:

127.0.0.1 - james [27/Oct/2025:13:14:05 +0000] "GET /index.html HTTP/1.1" 200 1043 "https://example.com/start" "Mozilla/5.0"

Typical use: Web traffic monitoring and analytics (NGINX, Apache, IIS)
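Because the field order is fixed, both the Common and the Combined format can be parsed with one regular expression. The following is a minimal sketch. The names are hypothetical, and escaped quotes inside the request or user-agent are not handled.

```python
import re
from typing import Optional

ACCESS_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    # The Combined format appends two quoted fields; they are optional here.
    r'(?: "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)")?$'
)


def parse_access_line(line: str) -> Optional[dict]:
    """Parse a CLF or Combined line; returns None when the line does not match."""
    match = ACCESS_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Servers write "-" when no body was sent.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields


if __name__ == "__main__":
    print(parse_access_line(
        '127.0.0.1 - james [27/Oct/2025:13:14:05 +0000] '
        '"GET /index.html HTTP/1.1" 200 1043 '
        '"https://example.com/start" "Mozilla/5.0"'
    ))
```

For a plain CLF line the `referrer` and `user_agent` keys are present but set to `None`.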

Structured Logging (Key-Value Pairs)

level=error ts=2025-10-27T13:14:05Z msg="payment failed" user=1234 error="timeout" duration=2000ms

Advantages:

  • Still human-readable
  • Easy to parse with regex or log shippers (Promtail, Fluentd)

Disadvantages:

  • No strict schema; field naming conventions vary
  • Escaping quotes/spaces can be problematic

Typical use: Go (Zap, Logrus), systemd-journald, Grafana components
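This key=value style is often called logfmt. A minimal parsing sketch using only the standard library is shown below. `parse_key_value` is a hypothetical helper, and it relies on shell-style quoting rules, which is an assumption that happens to fit the example line.

```python
import shlex


def parse_key_value(line: str) -> dict:
    """Split a key=value line into a dict; quoted values may contain spaces."""
    fields = {}
    # shlex.split honours double quotes, so msg="payment failed" stays one token.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore bare words that carry no '='
            fields[key] = value
    return fields


if __name__ == "__main__":
    print(parse_key_value(
        'level=error ts=2025-10-27T13:14:05Z msg="payment failed" '
        'user=1234 error="timeout" duration=2000ms'
    ))
```

All values come back as strings (`"1234"`, `"2000ms"`). The format carries no type information, which is part of the "no strict schema" drawback.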

JSON Structured Logging

{
  "timestamp": "2025-10-27T13:14:05Z",
  "level": "error",
  "service": "payment-service",
  "user_id": 1234,
  "message": "Timeout connecting to gateway",
  "trace_id": "a1b2c3d4e5"
}

Advantages:

  • Machine- and human-readable
  • Ideal for log aggregation systems (Loki, Elasticsearch, Splunk, Datadog)
  • Enables querying and filtering based on fields
  • Works well with OpenTelemetry and Grafana Tempo correlation

Disadvantages:

  • Slightly larger data sizes
  • Harder to read in plain terminals

Typical use: Modern cloud-native and microservices systems (Kubernetes, containers, distributed tracing)
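One way to produce such records is a custom formatter on top of Python's standard `logging` module. This is a sketch under assumptions: `JsonFormatter`, `build_logger` and the `extra={"fields": ...}` convention are made up for illustration and are not part of any library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object on a single line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Extra fields arrive via extra={"fields": {...}} (hypothetical convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, ensure_ascii=False)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("payment-service")
    log.error(
        "Timeout connecting to gateway",
        extra={"fields": {"user_id": 1234, "trace_id": "a1b2c3d4e5"}},
    )
```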

OpenTelemetry Log Data Model (OTel format)

{
  "time_unix_nano": "1761570845000000000",
  "severity_text": "ERROR",
  "body": "Payment failed for user 1234: timeout",
  "trace_id": "a1b2c3d4e5",
  "attributes": {
    "service.name": "payment-service",
    "user.id": "1234",
    "error.code": "ETIMEOUT"
  }
}

Advantages:

  • Standardized structure across telemetry types (metrics, traces, logs)
  • Supports context linking (trace/span IDs)
  • Portable across backends (Grafana Loki, OTLP receivers, Datadog, etc.)

Disadvantages:

  • Newer ecosystem; not all logging frameworks natively support OTel

Typical use: Observability pipelines using OpenTelemetry Collector → Loki/Tempo/Grafana
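A record in this shape can be assembled by hand, as sketched below. The snake_case field names follow the example in this section rather than any official wire encoding, and `to_log_record` is a hypothetical helper. The OpenTelemetry data model defines the trace ID as a top-level field of the record, so the sketch keeps it outside `attributes`.

```python
from datetime import datetime, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_log_record(ts: datetime, severity: str, body: str,
                  attributes: dict, trace_id: str = "") -> dict:
    """Build an OTel-style log record dict from a timezone-aware timestamp."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    delta = ts - _EPOCH
    # Integer arithmetic avoids float rounding at nanosecond scale.
    nanos = (delta.days * 86_400 + delta.seconds) * 1_000_000_000 \
        + delta.microseconds * 1_000
    record = {
        "time_unix_nano": str(nanos),
        "severity_text": severity.upper(),
        "body": body,
        "attributes": dict(attributes),
    }
    if trace_id:
        record["trace_id"] = trace_id
    return record


if __name__ == "__main__":
    print(to_log_record(
        datetime.now(timezone.utc),
        "error",
        "Payment failed for user 1234: timeout",
        {"service.name": "payment-service", "user.id": "1234"},
        trace_id="a1b2c3d4e5",
    ))
```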

CEF (Common Event Format)

CEF:0|Security|IDS|1.0|100|Intrusion Detected|10|src=192.168.1.1 dst=10.0.0.2 spt=1232

Advantages:

  • Standard for security logs (SIEMs)
  • Extensible, consistent parsing

Disadvantages:

  • Security-focused, less relevant for application logs

Typical use: Firewalls, intrusion detection systems, security devices, SIEMs (ArcSight, Splunk)
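A CEF line consists of seven pipe-separated header fields followed by a key=value extension. The sketch below splits the example line accordingly. The dictionary key names are chosen here for readability, and escaped pipes or extension values containing spaces are not handled.

```python
CEF_HEADER_FIELDS = [
    "version", "device_vendor", "device_product", "device_version",
    "signature_id", "name", "severity",
]


def parse_cef(line: str) -> dict:
    """Split a CEF line into its seven header fields and an extension dict."""
    if not line.startswith("CEF:"):
        raise ValueError("not a CEF line")
    # At most 7 splits: everything after the 7th pipe is the extension.
    parts = line[len("CEF:"):].split("|", 7)
    if len(parts) < 7:
        raise ValueError("incomplete CEF header")
    record = dict(zip(CEF_HEADER_FIELDS, parts[:7]))
    extension = parts[7] if len(parts) == 8 else ""
    record["extension"] = dict(
        pair.split("=", 1) for pair in extension.split() if "=" in pair
    )
    return record


if __name__ == "__main__":
    print(parse_cef(
        "CEF:0|Security|IDS|1.0|100|Intrusion Detected|10|"
        "src=192.168.1.1 dst=10.0.0.2 spt=1232"
    ))
```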

🧠 Summary Table

Format            | Structured   | Typical Use                  | Example Systems
------------------|--------------|------------------------------|--------------------------
Plain Text        | No           | Legacy apps, IoT, CLI        | System logs, ESP devices
Syslog (RFC5424)  | 🟡 Partially | Infrastructure, OS, network  | journald, rsyslog
CLF / Combined    |              | Web access logs              | Apache, NGINX
Key-Value         | 🟡 Partially | Go apps, Prometheus          | Loki, Grafana
JSON              | Yes          | Cloud-native, microservices  | Kubernetes, Loki
OTel Log Model    | Yes          | Unified observability        | OpenTelemetry Collector
CEF               |              | Security and audit logs      | Firewalls, SIEMs

Best Practices

Format

  • JSON in UTF-8
  • One event per line
  • Prefer structured logs: they allow fields to be defined explicitly

Mandatory Fields

  • Date and time - in a fixed format and timezone (UTC preferred)
  • Log level - standardized level names
  • Service name - consistent naming
  • Trace ID and parent span ID - for correlation with traces
  • Message - readable event description
  • Error - error information (in a single line!)
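A check like the one below could run in tests or CI to catch records that miss the mandatory fields. It is a sketch. The field names in `REQUIRED_FIELDS` are an assumed naming convention, and treating `error` as required only for error-level records is an interpretation, not something the list above prescribes.

```python
import json
from typing import List

# Assumed field names; adapt to the organization's own field specification.
REQUIRED_FIELDS = ("timestamp", "level", "service", "trace_id", "message")


def missing_fields(line: str) -> List[str]:
    """Return the names of mandatory fields that are absent or empty."""
    record = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
    # Assumption: error details are only expected on error-level records.
    if record.get("level") == "error" and not record.get("error"):
        missing.append("error")
    return missing


if __name__ == "__main__":
    sample = (
        '{"timestamp": "2025-10-27T13:14:05Z", "level": "error", '
        '"service": "payment-service", "message": "failed"}'
    )
    print(missing_fields(sample))
```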

Standards and Documentation

  • Write a document specifying fields and their meaning
  • Use external standards (e.g., OpenTelemetry) instead of inventing your own
  • Clearly define what we collect and from where

Log Shipping Architecture

Low volume

  • Direct sending to the logging system

Medium/high volume

  • Use OpenTelemetry Collector or Grafana Alloy
  • Buffering and batching
  • Ability to enrich logs with additional fields
  • Filtering at the collector level
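In practice the collector handles buffering and batching through its own configuration. The sketch below only illustrates the idea in application code: lines are held back until a size or age limit is reached and then handed over as one batch. `Batcher` and its parameters are hypothetical, and the age limit is only checked when a new line arrives (there is no background timer).

```python
import time
from typing import Callable, List


class Batcher:
    """Collect log lines and flush them as one batch by size or age."""

    def __init__(self, send: Callable[[List[str]], None], max_size: int = 100,
                 max_age_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send
        self._max_size = max_size
        self._max_age_s = max_age_s
        self._clock = clock
        self._buffer: List[str] = []
        self._first_at = 0.0

    def add(self, line: str) -> None:
        if not self._buffer:
            self._first_at = self._clock()  # age is measured from the oldest line
        self._buffer.append(line)
        too_big = len(self._buffer) >= self._max_size
        too_old = self._clock() - self._first_at >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)


if __name__ == "__main__":
    batcher = Batcher(send=lambda batch: print(f"sending {len(batch)} lines"), max_size=2)
    for text in ("a", "b", "c"):
        batcher.add(text)
    batcher.flush()  # flush the remainder on shutdown
```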

Shipping rules

  • Asynchronous sending - don’t block the application
  • Application must have retry and buffering mechanisms
  • Fallback to stdout only in critical situations
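Asynchronous sending can be sketched with the standard library's `QueueHandler` and `QueueListener`: the application thread only puts records on an in-memory queue, and a background thread forwards them to the real target. `build_async_logger` and the capacity value are assumptions. Retry logic is not shown and would belong in the target handler.

```python
import logging
import logging.handlers
import queue
import sys
from typing import Tuple


def build_async_logger(
    name: str, target: logging.Handler, capacity: int = 10_000
) -> Tuple[logging.Logger, logging.handlers.QueueListener]:
    """Return a logger whose records are forwarded to `target` off-thread."""
    log_queue: queue.Queue = queue.Queue(maxsize=capacity)  # bounded buffer
    # QueueHandler enqueues without blocking; when the queue is full the record
    # goes to the handler's error hook instead of stalling the application.
    queue_handler = logging.handlers.QueueHandler(log_queue)
    listener = logging.handlers.QueueListener(log_queue, target)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.addHandler(queue_handler)
    logger.propagate = False
    listener.start()  # background thread that drains the queue
    return logger, listener


if __name__ == "__main__":
    log, listener = build_async_logger("payment-service", logging.StreamHandler(sys.stdout))
    log.info("order accepted")
    listener.stop()  # processes what is still queued, then joins the thread
```

Calling `listener.stop()` during shutdown matters: without it, records still sitting in the queue may never reach the target.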

Organization and Management

Using tenants

  • Don’t run a single shared tenant for the entire organization
  • Split tenants by system or team
  • Follow Conway’s Law: let tenants mirror the organizational structure

Log retention

  • Different retention for different log types
  • Better to collect more, but keep for shorter periods
  • Don’t archive info logs - recovering from archive is expensive

Logging budgets

  • Set limits for teams (dev/test)
  • Prevent “overzealous” teams
  • Cost control

Key Quality Principles

Quantity vs quality

  • Quantity does not go hand in hand with quality
  • We need good quality logs, not everything
  • Log statements tend to be added and never removed - this needs to be controlled

Formatting

  • One line per log - avoids multiline problems
  • Replace newline characters in error messages
  • JSON as format
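Replacing newlines in error output can be sketched as follows. `single_line_error` is a hypothetical helper that turns a full traceback into one line by writing each line break as a literal `\n`.

```python
import traceback


def single_line_error(exc: BaseException) -> str:
    """Render an exception with its traceback as a single line of text."""
    text = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    # Replace real line breaks with the two characters backslash + n.
    return text.strip().replace("\r\n", "\\n").replace("\n", "\\n")


if __name__ == "__main__":
    try:
        int("not a number")
    except ValueError as err:
        print(single_line_error(err))
```

JSON encoders escape newlines inside string values on their own, so this step matters mostly for plain-text and key-value output.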

Log levels

  • Establish level standards
  • Consistent use of names (don’t mix Fatal/Critical)
  • Don’t create custom levels (max 5-6 levels)
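A level standard can be enforced with a small normalization step, sketched below. The five canonical names and the alias table are assumptions chosen for illustration. The point is that every service maps onto one agreed set, and anything else is rejected.

```python
# Assumed canonical set; the organization's standard may name levels differently.
CANONICAL_LEVELS = ("debug", "info", "warning", "error", "critical")

# Common spellings mapped onto the canonical names.
ALIASES = {"warn": "warning", "err": "error", "fatal": "critical"}


def normalize_level(name: str) -> str:
    """Map a level name onto the canonical set or raise ValueError."""
    key = name.strip().lower()
    key = ALIASES.get(key, key)
    if key not in CANONICAL_LEVELS:
        raise ValueError(f"unknown log level: {name!r}")
    return key


if __name__ == "__main__":
    print(normalize_level("FATAL"), normalize_level("Warn"))
```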

Sampling - NO!

  • Don’t sample logs
  • Better shorter retention than losing data
  • During outages we need all logs

Separation of Concerns

Application vs system logs

  • Stdout for system logs and critical situations
  • Application logs to the central system
  • Developers should not need direct access to pods to read logs

Business logs

  • Replace with metrics or database records
  • Don’t mix with technical logs
  • For billing - database, not logs

Diagnostics

  • Use traces instead of diagnostic logs
  • Enable detailed logs only when needed
