Logs
🪵 Introduction
What are they for? Recording what exactly happened.
General Characteristics
- Text
- No uniform structure
- Collected from many sources
- Writing is NOT transactional
Classic Classification
- System / Infrastructure logs
- Tracking the operation of the environment and resources (servers, containers, network)
- Container start/stop, server restart, disk errors, out of memory
- Application logs
- Operation of the application itself, its components, and business logic
- Application startup, request processing, operation results, code errors
- Access / HTTP logs
- Monitoring network traffic, user requests, and APIs
- HTTP request/response, method, URL, status code, execution time
- Security / Audit logs
- Recording events related to security and auditing
- User login, failed login attempts, permission changes, operations on sensitive data
- Business / Domain logs
- Tracking key domain events (useful in analytics)
- Order placement, course start, successful payment
- Diagnostic / Debug / Trace logs
- Detailed data for error and performance analysis
- Method entry/exit, runtime variables, SQL query details
Formats
Plain Text (Unstructured logs)
Example:
2025-10-27 13:14:05 ERROR Payment failed for user 1234: Timeout connecting to gateway
Advantages:
- Easy to read for humans
- Simple to create (plain strings)
Disadvantages:
- Difficult to parse programmatically
- Unsuitable for structural queries and analytics (e.g., in Loki or Elasticsearch)
Typical use: Legacy systems, CLI tools, small embedded devices (e.g., microcontrollers, ESP sensors in Home Assistant)
Syslog Format (RFC 3164 / RFC 5424)
<34>1 2025-10-27T13:14:05Z myapp.example.com myapp 12345 ID47 [exampleSDID@32473 iut="3" eventSource="App"] Payment failed
Structure:
- <PRI> (facility and severity)
- Version
- Timestamp
- Hostname
- Application name
- Process ID
- Message ID
- Structured Data
- Message text
Advantages:
- Standardized, supported by most infrastructure
- Works across systems and network devices
- Compatible with rsyslog, journald, etc.
Disadvantages:
- Limited structure; requires parsers to extract metadata
- Often inconsistent implementations
Typical use: Linux system logs (/var/log/syslog), network devices, daemons, container runtimes
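The RFC 5424 header above can be pulled apart with a rough regex. A minimal Python sketch (illustrative only — production parsing should rely on rsyslog, journald, or a dedicated parser; it skips escaping rules and multi-element structured data):

```python
import re

# Rough RFC 5424 parser: PRI, version, then space-separated header fields,
# structured data in brackets (or "-"), then the free-form message.
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<version>\d) (?P<timestamp>\S+) (?P<hostname>\S+) "
    r"(?P<app>\S+) (?P<procid>\S+) (?P<msgid>\S+) "
    r"(?P<sd>\[.*?\]|-) ?(?P<msg>.*)$"
)

line = ('<34>1 2025-10-27T13:14:05Z myapp.example.com myapp 12345 ID47 '
        '[exampleSDID@32473 iut="3" eventSource="App"] Payment failed')
m = SYSLOG_RE.match(line)
pri = int(m.group("pri"))
facility, severity = divmod(pri, 8)  # PRI encodes facility * 8 + severity
print(facility, severity, m.group("app"), m.group("msg"))
```

Note that the metadata is positional and terse — exactly the "limited structure, requires parsers" trade-off listed above.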
Common Log Format (CLF / NCSA)
127.0.0.1 - james [27/Oct/2025:13:14:05 +0000] "GET /index.html HTTP/1.1" 200 1043
Structure:
host ident authuser [date] "request" status bytes
Advantages:
- Standardized for web servers (Apache, NGINX)
- Works with many log analyzers
Disadvantages:
- Fixed fields — not extensible
- No JSON-like structure for modern metadata
Typical use: Web server access logs, HTTP request analytics
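Because CLF has fixed positional fields, a single regex recovers the full structure. A sketch following the `host ident authuser [date] "request" status bytes` layout:

```python
import re

# CLF fields are positional; "-" marks a missing value (ident, authuser, bytes).
CLF_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)$'
)

line = ('127.0.0.1 - james [27/Oct/2025:13:14:05 +0000] '
        '"GET /index.html HTTP/1.1" 200 1043')
m = CLF_RE.match(line)
print(m.group("request"), m.group("status"), m.group("bytes"))
```

The rigidity that makes this easy to parse is also why the format cannot carry extra metadata without breaking consumers.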
Combined Log Format (CLF Extension)
Adds referrer and user-agent fields:
127.0.0.1 - james [27/Oct/2025:13:14:05 +0000] "GET /index.html HTTP/1.1" 200 1043 "https://example.com/start" "Mozilla/5.0"
Typical use: Web traffic monitoring and analytics (NGINX, Apache, IIS)
Structured Logging (Key-Value Pairs)
level=error ts=2025-10-27T13:14:05Z msg="payment failed" user=1234 error="timeout" duration=2000ms
Advantages:
- Still human-readable
- Easy to parse with regex or log shippers (Promtail, Fluentd)
Disadvantages:
- No strict schema; field naming conventions vary
- Escaping quotes/spaces can be problematic
Typical use: Go (Zap, Logrus), systemd-journald, Grafana components
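A tiny logfmt-style emitter illustrates both the appeal and the escaping pitfall mentioned above — values with spaces or quotes need quoting, and there is no standard that says exactly how (this is a sketch, not any particular library's behavior):

```python
# Emit key=value pairs; quote values containing spaces or double quotes,
# escaping embedded quotes (the exact rules vary between implementations).
def logfmt(**fields):
    parts = []
    for key, value in fields.items():
        value = str(value)
        if " " in value or '"' in value:
            value = '"' + value.replace('"', '\\"') + '"'
        parts.append(f"{key}={value}")
    return " ".join(parts)

line = logfmt(level="error", msg="payment failed", user=1234, duration="2000ms")
print(line)  # level=error msg="payment failed" user=1234 duration=2000ms
```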
JSON Structured Logging
{
"timestamp": "2025-10-27T13:14:05Z",
"level": "error",
"service": "payment-service",
"user_id": 1234,
"message": "Timeout connecting to gateway",
"trace_id": "a1b2c3d4e5"
}
Advantages:
- Machine and human readable
- Ideal for log aggregation systems (Loki, Elasticsearch, Splunk, Datadog)
- Enables querying and filtering based on fields
- Works well with OpenTelemetry and Grafana Tempo correlation
Disadvantages:
- Slightly larger data sizes
- Harder to read in plain terminals
Typical use: Modern cloud-native and microservices systems (Kubernetes, containers, distributed tracing)
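Most logging frameworks can emit this shape via a custom formatter. A minimal sketch with the Python standard library (the field names mirror the example above and are a convention, not a standard; `payment-service` is a placeholder):

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    converter = time.gmtime  # timestamps in UTC, not local time

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "payment-service",  # placeholder service name
            "message": record.getMessage(),
        })

record = logging.LogRecord("app", logging.ERROR, "", 0,
                           "Timeout connecting to gateway", None, None)
line = JsonFormatter().format(record)
print(line)
```

One JSON object per line ("NDJSON") is what aggregators like Loki and Elasticsearch expect to ingest.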
OpenTelemetry Log Data Model (OTel format)
{
"time_unix_nano": "1730030045000000000",
"severity_text": "ERROR",
"body": "Payment failed for user 1234: timeout",
"attributes": {
"service.name": "payment-service",
"user.id": "1234",
"error.code": "ETIMEOUT",
"trace_id": "a1b2c3d4e5"
}
}
Advantages:
- Standardized structure across telemetry types (metrics, traces, logs)
- Supports context linking (trace/span IDs)
- Portable across backends (Grafana Loki, OTLP receivers, Datadog, etc.)
Disadvantages:
- Newer ecosystem; not all logging frameworks natively support OTel
Typical use: Observability pipelines using OpenTelemetry Collector → Loki/Tempo/Grafana
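In practice an exporter maps application events onto this model. A sketch of that mapping as a plain dict, with keys following the example above (not an SDK call — the real OpenTelemetry SDK has its own API for this):

```python
import time

# Map an application log event onto the OTel log data model shape shown
# above; attribute names like "service.name" follow OTel semantic conventions.
def to_otel(level, body, **attrs):
    return {
        "time_unix_nano": str(time.time_ns()),
        "severity_text": level.upper(),
        "body": body,
        "attributes": attrs,
    }

rec = to_otel("error", "Payment failed for user 1234: timeout",
              **{"service.name": "payment-service", "error.code": "ETIMEOUT"})
print(rec["severity_text"], rec["attributes"]["service.name"])
```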
CEF (Common Event Format)
CEF:0|Security|IDS|1.0|100|Intrusion Detected|10|src=192.168.1.1 dst=10.0.0.2 spt=1232
Advantages:
- Standard for security logs (SIEMs)
- Extensible, consistent parsing
Disadvantages:
- Security-focused, less relevant for application logs
Typical use: Firewalls, intrusion detection systems, security devices, SIEMs (ArcSight, Splunk)
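The CEF layout is a pipe-separated header followed by a key=value extension. A rough parser sketch (it ignores CEF escaping rules and extension values containing spaces, which real SIEM parsers must handle):

```python
# CEF: seven pipe-separated header fields, then the key=value extension.
def parse_cef(line):
    body = line[len("CEF:"):]
    version, vendor, product, dev_version, sig_id, name, severity, ext = \
        body.split("|", 7)
    extension = dict(pair.split("=", 1) for pair in ext.split())
    return {"vendor": vendor, "name": name,
            "severity": severity, "extension": extension}

event = parse_cef("CEF:0|Security|IDS|1.0|100|Intrusion Detected|10|"
                  "src=192.168.1.1 dst=10.0.0.2 spt=1232")
print(event["name"], event["extension"]["src"])
```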
🧠 Summary Table
| Format | Structured | Typical Use | Example Systems |
|---|---|---|---|
| Plain Text | ❌ | Legacy apps, IoT, CLI | System logs, ESP devices |
| Syslog (RFC5424) | 🟡 Partially | Infrastructure, OS, network | journald, rsyslog |
| CLF / Combined | ❌ | Web access logs | Apache, NGINX |
| Key-Value | 🟡 Partially | Go apps, Prometheus | Loki, Grafana |
| JSON | ✅ | Cloud-native, microservices | Kubernetes, Loki |
| OTel Log Model | ✅ | Unified observability | OpenTelemetry Collector |
| CEF | ✅ | Security and audit logs | Firewalls, SIEMs |
Best Practices
Format
- JSON in UTF-8
- Single line
- Structured logs are best
- They allow explicit field definition
Mandatory Fields
- Date and time - in a fixed format and timezone (UTC preferred)
- Log level - standardized level names
- Service name - consistent naming
- Trace ID and Parent Trace ID - for correlation
- Message - readable event description
- Error - error information (in a single line!)
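A record carrying the mandatory fields above might be built like this (a sketch; field names such as `trace_id` follow common convention — adjust to whatever standard your team documents):

```python
import json
from datetime import datetime, timezone

# One single-line JSON record with the mandatory fields: fixed-format UTC
# timestamp, level, service name, trace ID, message, and single-line error.
def log_record(level, service, message, trace_id, error=None):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # UTC, fixed format
        "level": level,
        "service": service,
        "trace_id": trace_id,
        "message": message,
    }
    if error:
        record["error"] = str(error).replace("\n", "\\n")  # keep it single-line
    return json.dumps(record)

line = log_record("error", "payment-service", "payment failed",
                  "a1b2c3d4e5", error="Timeout\nconnecting to gateway")
print(line)
```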
Standards and Documentation
- Write a document specifying fields and their meaning
- Use external standards (e.g., OpenTelemetry) instead of inventing your own
- Clearly define what we collect and from where
Log Shipping Architecture
Low volume
- Direct sending to the logging system
Medium/high volume
- Use OpenTelemetry Collector or Grafana Alloy
- Buffering and batching
- Ability to enrich logs with additional fields
- Filtering at the collector level
Shipping rules
- Asynchronous sending - don’t block the application
- Application must have retry and buffering mechanisms
- Fallback to stdout only in critical situations
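The asynchronous-sending rule can be sketched with the standard library's queue-based handlers: the application thread only enqueues, and a background listener thread does the actual shipping (here `ListHandler` stands in for a real exporter to a collector such as OTLP or Loki):

```python
import logging
import logging.handlers
import queue

shipped = []

# Stand-in for a real shipper (e.g. an OTLP or Loki exporter handler).
class ListHandler(logging.Handler):
    def emit(self, record):
        shipped.append(record.getMessage())

log_queue = queue.Queue(maxsize=10000)  # bounded in-memory buffer

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))

listener = logging.handlers.QueueListener(log_queue, ListHandler())
listener.start()
logger.info("order placed")  # returns immediately; no blocking I/O here
listener.stop()              # joins the worker and flushes pending records
```

Retry logic and persistent buffering would live in the shipping handler, not in application code.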
Organization and Management
Using tenants
- Not one system for the entire organization
- Division by systems/teams
- According to Conway’s Law
Log retention
- Different retention for different log types
- Better to collect more, but keep for shorter periods
- Don’t archive info logs - recovering from archive is expensive
Logging budgets
- Set limits for teams (dev/test)
- Prevent “overzealous” teams
- Cost control
Key Quality Principles
Quantity vs quality
- Quantity does not go hand in hand with quality
- We need good quality logs, not everything
- Logs are only added, never removed - this needs to be controlled
Formatting
- One line per log - avoids multiline problems
- Replace newline characters in error messages
- JSON as format
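Escaping newlines before emitting keeps stack traces on one line, which is the point of the rule above. A trivial sketch:

```python
# Collapse multiline messages (e.g. stack traces) into a single log line
# by escaping newline characters before emitting.
def single_line(message):
    return message.replace("\r\n", "\\n").replace("\n", "\\n")

msg = single_line('Traceback (most recent call last):\n  File "app.py", line 3')
print(msg)
```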
Log levels
- Establish level standards
- Consistent use of names (don’t mix Fatal/Critical)
- Don’t create custom levels (max 5-6 levels)
Sampling - NO!
- Don’t sample logs
- Better shorter retention than losing data
- During outages we need all logs
Separation of Concerns
Application vs system logs
- Stdout for system logs and critical situations
- Application logs to the central system
- Developers should not access pods directly
Business logs
- Replace with metrics or database records
- Don’t mix with technical logs
- For billing - database, not logs
Diagnostics
- Use traces instead of diagnostic logs
- Enable detailed logs only when needed