Logs
🪵 Introduction
What are they for? Recording what exactly happened.
General Characteristics
- Text
- No uniform structure
- Collected from many sources
- Writing is NOT transactional
Classic Classification
- System / Infrastructure logs
- Tracking the operation of the environment and resources (servers, containers, network)
- Container start/stop, server restart, disk errors, out of memory
- Application logs
- Operation of the application itself, its components, and business logic
- Application startup, request processing, operation results, code errors
- Access / HTTP logs
- Monitoring network traffic, user requests, and APIs
- HTTP request/response, method, URL, status code, execution time
- Security / Audit logs
- Recording events related to security and auditing
- User login, failed login attempts, permission changes, operations on sensitive data
- Business / Domain logs
- Tracking key domain events (useful in analytics)
- Order placement, course start, successful payment
- Diagnostic / Debug / Trace logs
- Detailed data for error and performance analysis
- Method entry/exit, runtime variables, SQL query details
Formats
Plain Text (Unstructured logs)
Example:
2025-10-27 13:14:05 ERROR Payment failed for user 1234: Timeout connecting to gateway
Advantages:
- Easy to read for humans
- Simple to create (plain strings)
Disadvantages:
- Difficult to parse programmatically
- Unsuitable for structural queries and analytics (e.g., in Loki or Elasticsearch)
Typical use: Legacy systems, CLI tools, small embedded devices (e.g., microcontrollers, ESP sensors in Home Assistant)
Syslog Format (RFC 3164 / RFC 5424)
<34>1 2025-10-27T13:14:05Z myapp.example.com myapp 12345 ID47 [exampleSDID@32473 iut="3" eventSource="App"] Payment failed
Structure:
- <PRI> (facility and severity)
- Version
- Timestamp
- Hostname
- Application name
- Process ID
- Message ID
- Structured Data
- Message text
Advantages:
- Standardized, supported by most infrastructure
- Works across systems and network devices
- Compatible with rsyslog, journald, etc.
Disadvantages:
- Limited structure; requires parsers to extract metadata
- Often inconsistent implementations
Typical use: Linux system logs (/var/log/syslog), network devices, daemons, container runtimes
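The RFC 5424 header above can be pulled apart with a rough regex. A minimal Python sketch (illustrative only — production parsing should rely on rsyslog, journald, or a dedicated parser; it skips escaping rules and multi-element structured data):

```python
import re

# Rough RFC 5424 parser: PRI, version, then space-separated header fields,
# structured data in brackets (or "-"), then the free-form message.
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<version>\d) (?P<timestamp>\S+) (?P<hostname>\S+) "
    r"(?P<app>\S+) (?P<procid>\S+) (?P<msgid>\S+) "
    r"(?P<sd>\[.*?\]|-) ?(?P<msg>.*)$"
)

line = ('<34>1 2025-10-27T13:14:05Z myapp.example.com myapp 12345 ID47 '
        '[exampleSDID@32473 iut="3" eventSource="App"] Payment failed')
m = SYSLOG_RE.match(line)
pri = int(m.group("pri"))
facility, severity = divmod(pri, 8)  # PRI encodes facility * 8 + severity
print(facility, severity, m.group("app"), m.group("msg"))
```

Note that the metadata is positional and terse — exactly the "limited structure, requires parsers" trade-off listed above.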
Common Log Format (CLF / NCSA)
127.0.0.1 - james [27/Oct/2025:13:14:05 +0000] "GET /index.html HTTP/1.1" 200 1043
Structure:
host ident authuser [date] "request" status bytes
Advantages:
- Standardized for web servers (Apache, NGINX)
- Works with many log analyzers
Disadvantages:
- Fixed fields — not extensible
- No JSON-like structure for modern metadata
Typical use: Web server access logs, HTTP request analytics
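Because CLF has fixed positional fields, a single regex recovers the full structure. A sketch following the `host ident authuser [date] "request" status bytes` layout:

```python
import re

# CLF fields are positional; "-" marks a missing value (ident, authuser, bytes).
CLF_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)$'
)

line = ('127.0.0.1 - james [27/Oct/2025:13:14:05 +0000] '
        '"GET /index.html HTTP/1.1" 200 1043')
m = CLF_RE.match(line)
print(m.group("request"), m.group("status"), m.group("bytes"))
```

The rigidity that makes this easy to parse is also why the format cannot carry extra metadata without breaking consumers.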
Combined Log Format (CLF Extension)
Adds referrer and user-agent fields:
127.0.0.1 - james [27/Oct/2025:13:14:05 +0000] "GET /index.html HTTP/1.1" 200 1043 "https://example.com/start" "Mozilla/5.0"
Typical use: Web traffic monitoring and analytics (NGINX, Apache, IIS)
Structured Logging (Key-Value Pairs)
level=error ts=2025-10-27T13:14:05Z msg="payment failed" user=1234 error="timeout" duration=2000ms
Advantages:
- Still human-readable
- Easy to parse with regex or log shippers (Promtail, Fluentd)
Disadvantages:
- No strict schema; field naming conventions vary
- Escaping quotes/spaces can be problematic
Typical use: Go (Zap, Logrus), systemd-journald, Grafana components
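A tiny logfmt-style emitter illustrates both the appeal and the escaping pitfall mentioned above — values with spaces or quotes need quoting, and there is no standard that says exactly how (this is a sketch, not any particular library's behavior):

```python
# Emit key=value pairs; quote values containing spaces or double quotes,
# escaping embedded quotes (the exact rules vary between implementations).
def logfmt(**fields):
    parts = []
    for key, value in fields.items():
        value = str(value)
        if " " in value or '"' in value:
            value = '"' + value.replace('"', '\\"') + '"'
        parts.append(f"{key}={value}")
    return " ".join(parts)

line = logfmt(level="error", msg="payment failed", user=1234, duration="2000ms")
print(line)  # level=error msg="payment failed" user=1234 duration=2000ms
```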
JSON Structured Logging
{
"timestamp": "2025-10-27T13:14:05Z",
"level": "error",
"service": "payment-service",
"user_id": 1234,
"message": "Timeout connecting to gateway",
"trace_id": "a1b2c3d4e5"
}
Advantages:
- Machine and human readable
- Ideal for log aggregation systems (Loki, Elasticsearch, Splunk, Datadog)
- Enables querying and filtering based on fields
- Works well with OpenTelemetry and Grafana Tempo correlation
Disadvantages:
- Slightly larger data sizes
- Harder to read in plain terminals
Typical use: Modern cloud-native and microservices systems (Kubernetes, containers, distributed tracing)
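Most logging frameworks can emit this shape via a custom formatter. A minimal sketch with the Python standard library (the field names mirror the example above and are a convention, not a standard; `payment-service` is a placeholder):

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    converter = time.gmtime  # timestamps in UTC, not local time

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "payment-service",  # placeholder service name
            "message": record.getMessage(),
        })

record = logging.LogRecord("app", logging.ERROR, "", 0,
                           "Timeout connecting to gateway", None, None)
line = JsonFormatter().format(record)
print(line)
```

One JSON object per line ("NDJSON") is what aggregators like Loki and Elasticsearch expect to ingest.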
OpenTelemetry Log Data Model (OTel format)
{
"time_unix_nano": "1730030045000000000",
"severity_text": "ERROR",
"body": "Payment failed for user 1234: timeout",
"attributes": {
"service.name": "payment-service",
"user.id": "1234",
"error.code": "ETIMEOUT",
"trace_id": "a1b2c3d4e5"
}
}
Advantages:
- Standardized structure across telemetry types (metrics, traces, logs)
- Supports context linking (trace/span IDs)
- Portable across backends (Grafana Loki, OTLP receivers, Datadog, etc.)
Disadvantages:
- Newer ecosystem; not all logging frameworks natively support OTel
Typical use: Observability pipelines using OpenTelemetry Collector → Loki/Tempo/Grafana
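In practice an exporter maps application events onto this model. A sketch of that mapping as a plain dict, with keys following the example above (not an SDK call — the real OpenTelemetry SDK has its own API for this):

```python
import time

# Map an application log event onto the OTel log data model shape shown
# above; attribute names like "service.name" follow OTel semantic conventions.
def to_otel(level, body, **attrs):
    return {
        "time_unix_nano": str(time.time_ns()),
        "severity_text": level.upper(),
        "body": body,
        "attributes": attrs,
    }

rec = to_otel("error", "Payment failed for user 1234: timeout",
              **{"service.name": "payment-service", "error.code": "ETIMEOUT"})
print(rec["severity_text"], rec["attributes"]["service.name"])
```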
CEF (Common Event Format)
CEF:0|Security|IDS|1.0|100|Intrusion Detected|10|src=192.168.1.1 dst=10.0.0.2 spt=1232
Advantages:
- Standard for security logs (SIEMs)
- Extensible, consistent parsing
Disadvantages:
- Security-focused, less relevant for application logs
Typical use: Firewalls, intrusion detection systems, security devices, SIEMs (ArcSight, Splunk)
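The CEF layout is a pipe-separated header followed by a key=value extension. A rough parser sketch (it ignores CEF escaping rules and extension values containing spaces, which real SIEM parsers must handle):

```python
# CEF: seven pipe-separated header fields, then the key=value extension.
def parse_cef(line):
    body = line[len("CEF:"):]
    version, vendor, product, dev_version, sig_id, name, severity, ext = \
        body.split("|", 7)
    extension = dict(pair.split("=", 1) for pair in ext.split())
    return {"vendor": vendor, "name": name,
            "severity": severity, "extension": extension}

event = parse_cef("CEF:0|Security|IDS|1.0|100|Intrusion Detected|10|"
                  "src=192.168.1.1 dst=10.0.0.2 spt=1232")
print(event["name"], event["extension"]["src"])
```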
🧠 Summary Table
| Format | Structured | Typical Use | Example Systems |
|---|---|---|---|
| Plain Text | ❌ | Legacy apps, IoT, CLI | System logs, ESP devices |
| Syslog (RFC5424) | 🟡 Partially | Infrastructure, OS, network | journald, rsyslog |
| CLF / Combined | ❌ | Web access logs | Apache, NGINX |
| Key-Value | 🟡 Partially | Go apps, Prometheus | Loki, Grafana |
| JSON | ✅ | Cloud-native, microservices | Kubernetes, Loki |
| OTel Log Model | ✅ | Unified observability | OpenTelemetry Collector |
| CEF | ✅ | Security and audit logs | Firewalls, SIEMs |
Best Practices
Format
- JSON in UTF-8
- Single line
- Structured logs are best
- They allow explicit field definition
Mandatory Fields
- Date and time - in a fixed format and timezone (UTC preferred)
- Log level - standardized level names
- Service name - consistent naming
- Trace ID and Parent Trace ID - for correlation
- Message - readable event description
- Error - error information (in a single line!)
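A record carrying the mandatory fields above might be built like this (a sketch; field names such as `trace_id` follow common convention — adjust to whatever standard your team documents):

```python
import json
from datetime import datetime, timezone

# One single-line JSON record with the mandatory fields: fixed-format UTC
# timestamp, level, service name, trace ID, message, and single-line error.
def log_record(level, service, message, trace_id, error=None):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # UTC, fixed format
        "level": level,
        "service": service,
        "trace_id": trace_id,
        "message": message,
    }
    if error:
        record["error"] = str(error).replace("\n", "\\n")  # keep it single-line
    return json.dumps(record)

line = log_record("error", "payment-service", "payment failed",
                  "a1b2c3d4e5", error="Timeout\nconnecting to gateway")
print(line)
```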
Standards and Documentation
- Write a document specifying fields and their meaning
- Use external standards (e.g., OpenTelemetry) instead of inventing your own
- Clearly define what we collect and from where
Log Shipping Architecture
Low volume
- Direct sending to the logging system
Medium/high volume
- Use OpenTelemetry Collector or Grafana Alloy
- Buffering and batching
- Ability to enrich logs with additional fields
- Filtering at the collector level
Shipping rules
- Asynchronous sending - don’t block the application
- Application must have retry and buffering mechanisms
- Fallback to stdout only in critical situations
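The asynchronous-sending rule can be sketched with the standard library's queue-based handlers: the application thread only enqueues, and a background listener thread does the actual shipping (here `ListHandler` stands in for a real exporter to a collector such as OTLP or Loki):

```python
import logging
import logging.handlers
import queue

shipped = []

# Stand-in for a real shipper (e.g. an OTLP or Loki exporter handler).
class ListHandler(logging.Handler):
    def emit(self, record):
        shipped.append(record.getMessage())

log_queue = queue.Queue(maxsize=10000)  # bounded in-memory buffer

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))

listener = logging.handlers.QueueListener(log_queue, ListHandler())
listener.start()
logger.info("order placed")  # returns immediately; no blocking I/O here
listener.stop()              # joins the worker and flushes pending records
```

Retry logic and persistent buffering would live in the shipping handler, not in application code.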
Organization and Management
Using tenants
- Not one system for the entire organization
- Division by systems/teams
- According to Conway’s Law
Log retention
- Different retention for different log types
- Better to collect more, but keep for shorter periods
- Don’t archive info logs - recovering from archive is expensive
Logging budgets
- Set limits for teams (dev/test)
- Prevent “overzealous” teams
- Cost control
Key Quality Principles
Quantity vs quality
- Quantity does not go hand in hand with quality
- We need good quality logs, not everything
- Logs are only added, never removed - this needs to be controlled
Formatting
- One line per log - avoids multiline problems
- Replace newline characters in error messages
- JSON as format
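Escaping newlines before emitting keeps stack traces on one line, which is the point of the rule above. A trivial sketch:

```python
# Collapse multiline messages (e.g. stack traces) into a single log line
# by escaping newline characters before emitting.
def single_line(message):
    return message.replace("\r\n", "\\n").replace("\n", "\\n")

msg = single_line('Traceback (most recent call last):\n  File "app.py", line 3')
print(msg)
```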
Log levels
- Establish level standards
- Consistent use of names (don’t mix Fatal/Critical)
- Don’t create custom levels (max 5-6 levels)
Sampling - NO!
- Don’t sample logs
- Better shorter retention than losing data
- During outages we need all logs
Separation of Concerns
Application vs system logs
- Stdout for system logs and critical situations
- Application logs to the central system
- Developers should not access pods directly
Business logs
- Replace with metrics or database records
- Don’t mix with technical logs
- For billing - database, not logs
Diagnostics
- Use traces instead of diagnostic logs
- Enable detailed logs only when needed