What this really is (in human terms)
Metrics tell you aggregates: p95 latency, error rates, uptime curves. Logs tell you stories: each line is a single request or an exception with the who/what/when/where/why. When your access and error logs are designed for humans (and machines), they become the fastest way to:
- detect that an incident is happening right now;
- pinpoint the root cause (app, DB, external API, cache, network, TLS, CDN);
- cut MTTR and make releases calmer;
- spot bots and abuse early and react with evidence;
- quantify the economics of performance: where milliseconds (and money) are leaking.
Logs aren’t a dusty archive “just in case.” Treat them like an API contract: stable fields, retention, access policy, and clear ownership.
How it works (structure and principles)
1) Two log classes: access and error
Access logs record every inbound request: client identity, method/path, status, size, timing. They are the source of truth for availability and performance from the outside.
Error logs capture exceptional situations: stack traces, timeouts, OS/DB limits, TLS handshake failures, WAF/ACL decisions. They are the key to root cause.
Both must be stitched together with a shared request_id (or trace_id), so one identifier reconstructs the chain: edge → app → DB/queue → external provider.
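A minimal sketch of that stitching in Python, assuming a WSGI-style app; the X-Request-ID header and the helper names are illustrative, not a fixed standard:

```python
import logging
import uuid
from contextvars import ContextVar

# Illustrative names; adapt to your framework and header conventions.
_request_id: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Inject the current request_id into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = _request_id.get()
        return True

class RequestIdMiddleware:
    """WSGI middleware: reuse the edge-supplied X-Request-ID or mint a new one."""
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        rid = environ.get("HTTP_X_REQUEST_ID") or uuid.uuid4().hex
        _request_id.set(rid)

        def _start_response(status, headers, exc_info=None):
            headers.append(("X-Request-ID", rid))  # echo it back for client-side correlation
            return start_response(status, headers, exc_info)

        return self.app(environ, _start_response)
```

With the filter attached to the root logger, every format string can carry the request_id, and the access and error streams join on the same identifier.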
2) Formatting: readable by humans, parsable by machines
Agree on a field contract across services. A practical split:
- Access – readable extended format with: timestamp, client (IP/X-Forwarded-For/ASN), method, path, status, response size, timings (connect_ms, tls_ms, upstream_wait_ms, upstream_resp_ms, total), cache status, request_id, user agent, country.
- Error – compact JSON with: timestamp, level, service, component, request_id, message, error_code, duration_ms, retry_count, resource (db/cache/queue), plus safe context fragments.
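One way to pin down the error side of that contract is a sketch with Python's standard logging module; the field names mirror the list above, while the service name and example values are placeholders:

```python
import json
import logging
import sys
import time

class JsonErrorFormatter(logging.Formatter):
    """Emit one compact JSON object per record with the agreed error fields."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "checkout",  # placeholder service name
            "component": getattr(record, "component", "app"),
            "request_id": getattr(record, "request_id", "-"),
            "message": record.getMessage(),
            "error_code": getattr(record, "error_code", None),
            "duration_ms": getattr(record, "duration_ms", None),
            "retry_count": getattr(record, "retry_count", 0),
            "resource": getattr(record, "resource", None),
        }
        return json.dumps(entry, separators=(",", ":"))

# Write to stdout so the collection agent (Vector/Fluent Bit) picks it up.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonErrorFormatter())
logging.getLogger().addHandler(handler)
logging.getLogger().error(
    "DB timeout on orders lookup",
    extra={"error_code": "DB_TIMEOUT", "duration_ms": 5003, "retry_count": 1, "resource": "db"},
)
```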
Avoid PII: mask emails/phones, strip tokens and secrets. Never log passwords. If you must log identifiers, hash and salt them.
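A sketch of what masking and salted hashing can look like before a line ever reaches the log; the salt, regex patterns, and function names are illustrative only:

```python
import hashlib
import hmac
import re

SALT = b"rotate-me-regularly"  # placeholder; keep the real salt outside the codebase

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def mask_pii(text: str) -> str:
    """Replace emails and phone numbers with neutral tokens."""
    text = EMAIL_RE.sub("<email>", text)
    return PHONE_RE.sub("<phone>", text)

def hash_identifier(user_id: str) -> str:
    """Stable, salted hash: still joinable across log lines, not reversible."""
    return hmac.new(SALT, user_id.encode(), hashlib.sha256).hexdigest()[:16]
```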
3) Shipping, indexing, retention
“Write to stdout, collect with an agent” keeps ops sane. For production: Vector / Fluent Bit / Filebeat → Loki / ELK / OpenSearch → dashboards (Grafana/Kibana) → alerts. Suggested retention:
- Hot (7–30 days): fast search during incidents.
- Warm (1–3 months): trend and regression analysis.
- Archive (6–12 months): investigations, security, compliance.
4) Security and access
Logs carry sensitive data: IPs, referrers, technical headers. Enforce role‑based access, transport and at‑rest encryption, approval workflows for PII de‑obfuscation, and automated redaction/masking.
Why it matters (five quick reasons)
- Lower MTTR. An alert that links straight to a log dashboard with coherent fields saves hours.
- Incident prevention. Repeated WARNs/timeouts in error logs are pre‑failure signals.
- Evidence for optimization. See where the pipeline breaks down: DB, cache, external APIs, CDN, TLS, network peering.
- Security posture. Access logs show brute force, scraping, strange user‑agents and ASNs before they bite.
- Business dialogue. Logs prove how changes (NVMe, cache, connection pools) improved p95 and revenue.
Reading logs without pain: patterns and checklists
Pattern 1 – “The site is slow”
Look at: p95/p99 per key endpoint, 2xx/4xx/5xx distribution, upstream_* timings, cache HIT ratio, response sizes.
If you see:
– growing 502/504 → upstream crashes or stalls;
– spikes of 499 → clients close connections first (they give up);
– higher p95 without 5xx → bottleneck in DB/search/external API.
Do: turn on/fix caching, tune pools and timeouts carefully, compress responses, batch/optimize hot queries.
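For the triage step, a quick offline pass over the access log often helps. This sketch assumes JSON-lines access logs with path and total_ms fields (names per the contract above; adjust to yours) and ranks endpoints by p95:

```python
import json
from collections import defaultdict

def p95(values):
    """Nearest-rank 95th percentile; good enough for a triage pass."""
    ordered = sorted(values)
    return ordered[max(0, int(round(0.95 * len(ordered))) - 1)]

def slow_endpoints(log_path: str, top: int = 10):
    """Return the endpoints with the highest p95 total duration."""
    durations = defaultdict(list)
    with open(log_path) as fh:
        for line in fh:
            rec = json.loads(line)
            durations[rec["path"]].append(rec["total_ms"])  # assumed field name
    ranked = sorted(((p95(v), path) for path, v in durations.items()), reverse=True)
    return ranked[:top]
```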
Pattern 2 – “404/403 everywhere”
Look at: top paths, referrers, user agents, geo/ASN, release calendar.
If you see:
– 404 clusters on one path → broken route/sitemap;
– 403 bursts from a single ASN → WAF/ACL or brute‑force/scanner.
Do: ship 301 redirects, update sitemap/robots, add WAF/rate‑limit/captcha, restrict admin paths by IP.
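A similarly rough sketch for this pattern: counting top 404 paths and top 403 ASNs from a JSON-lines access log (status, path, and asn are assumed field names):

```python
import json
from collections import Counter

def error_hotspots(log_path: str):
    """Top 404 paths and top 403 ASNs from a JSON-lines access log."""
    not_found, forbidden = Counter(), Counter()
    with open(log_path) as fh:
        for line in fh:
            rec = json.loads(line)
            if rec["status"] == 404:
                not_found[rec["path"]] += 1
            elif rec["status"] == 403:
                forbidden[rec.get("asn", "unknown")] += 1
    return not_found.most_common(10), forbidden.most_common(10)
```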
Pattern 3 – “401/429 storm”
Look at: auth service latency, token bucket capacity/consumption, request intervals per client.
If you see:
– 401 on /login → password guessing or SSO glitch;
– 429 → buggy SDK retry loop or abusive client.
Do: strengthen rate limits/captcha, cache tokens, fix the client and add exponential backoff with jitter.
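The client-side fix, exponential backoff with full jitter, can be sketched like this; the retryable status set, base delay, and cap are illustrative defaults:

```python
import random
import time

def backoff_delays(max_retries: int = 5, base: float = 0.5, cap: float = 30.0):
    """Full-jitter backoff: a random sleep up to base * 2**attempt, capped."""
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_backoff(do_request):
    """do_request() returns an HTTP status; retry only on 429/5xx."""
    for delay in backoff_delays():
        status = do_request()
        if status not in (429, 500, 502, 503, 504):
            return status
        time.sleep(delay)
    return do_request()  # final attempt; let the caller handle a persistent failure
```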
Pattern 4 – “TLS/SSL complaints”
Look at: handshake errors (handshake failed, unknown ca, no shared cipher), certificate validity, SAN coverage, TLS version mix.
Do: renew certificates, enable auto‑renewal, update cipher suites, test legacy clients in a controlled matrix.
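For the certificate-validity part, the standard library is enough for a quick check; this sketch feeds the "TLS expiry" alert idea later in the article (example.com is a placeholder host):

```python
import socket
import ssl
import time

def days_until_expiry(host: str, port: int = 443) -> float:
    """Connect, pull the leaf certificate, and return days left before notAfter."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return (ssl.cert_time_to_seconds(cert["notAfter"]) - time.time()) / 86400

print(round(days_until_expiry("example.com")))  # placeholder hostname
```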
Pattern 5 – “Random 5xx during peaks”
Look at: access 5xx correlated with error stack traces via request_id; queues and workers; OS limits (open files/sockets); GC pauses.
Do: tune worker/connection pools, raise ulimit, add workers, optimize hot endpoints, scale horizontally.
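The correlation step can be prototyped in a few lines, assuming both streams are JSON lines that carry request_id (field names as in the contract above):

```python
import json

def correlate_5xx(access_path: str, error_path: str):
    """For each access-log 5xx, attach the error-log entries sharing its request_id."""
    errors = {}
    with open(error_path) as fh:
        for line in fh:
            rec = json.loads(line)
            errors.setdefault(rec.get("request_id"), []).append(rec)
    matched = []
    with open(access_path) as fh:
        for line in fh:
            rec = json.loads(line)
            if rec["status"] >= 500:
                matched.append((rec, errors.get(rec.get("request_id"), [])))
    return matched
```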
What to put in access logs (the minimal useful set)
- Time and zone: UTC, ISO‑8601.
- Client: source IP/X‑Forwarded‑For, ASN/country (if enriched), user agent.
- Request: method, path, protocol version (HTTP/1.1, HTTP/2, HTTP/3).
- Response: status, bytes, cache status (HIT/MISS/BYPASS), redirect target when applicable.
- Timings: connect_ms, tls_ms, upstream_wait_ms, upstream_resp_ms, total duration.
- Correlation: request_id/trace_id and a link to a trace span if you run distributed tracing.
Without timings and a request_id, your logs lose most of their diagnostic value.
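For reference, the same minimal set expressed as a record type; the field names follow the list above and are illustrative, not a required schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AccessLogRecord:
    """The minimal useful access-log contract described above (illustrative names)."""
    timestamp: str            # UTC, ISO-8601
    client_ip: str
    forwarded_for: Optional[str]
    asn: Optional[str]
    country: Optional[str]
    user_agent: str
    method: str
    path: str
    protocol: str             # HTTP/1.1, HTTP/2, HTTP/3
    status: int
    bytes_sent: int
    cache_status: str         # HIT / MISS / BYPASS
    connect_ms: float
    tls_ms: float
    upstream_wait_ms: float
    upstream_resp_ms: float
    total_ms: float
    request_id: str
    trace_id: Optional[str] = None
```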
What to put in error logs (without becoming a dumpster)
- Level: ERROR/WARN/INFO/DEBUG (limit DEBUG in production).
- Component: nginx/app/db/cache/queue/worker/cron.
- Context: endpoint/operation, safe parameters, request_id.
- Tech details: OS/driver/DB error codes, durations, whether it was a retry.
- Fix hint (when reasonable): “increase pool to N,” “add index,” “check external API timeout.”
Short, self‑contained messages beat verbose walls of text. The log should naturally translate into an action.
Log‑driven dashboards: must‑have widgets
- Status codes overall and per key endpoints (with p95 side‑by‑side).
- Latency p50/p95/p99 by route, region, device.
- Top exceptions from error logs with recent stack traces.
- Upstream timings: connect/TLS/wait/response.
- Cache statuses and MISS share impact on p95.
- Security anomalies: 401/403/404/429 bursts, top IP/ASN/user agents, 5xx flares.
The alert should open this dashboard via one link, with filters ready for request_id, path, status, region, user agent.
Playbooks: short action recipes
Playbook “502/504 spike”
- Check upstream availability and p95.
- Compare “before/after” release metrics.
- Inspect fresh error logs for timeouts and DB/cache/API outages.
- Increase connection pools, remove N+1 calls, add a cache in front of the API.
- Temporarily raise timeouts (carefully), enable degradation/fallback modes.
Playbook “499 rising”
- Map to traffic peaks and p95 growth.
- Compress & cache responses, try prerender/edge cache.
- Investigate mobile networks/regions with high baseline latency.
Playbook “401/403/429 waves”
- Isolate IP/ASN and user agents.
- Check auth service performance and limiter health.
- Add captcha/challenges, block abusive ASNs, enforce backoff in SDKs.
Playbook “TLS errors”
- Verify CA chain, SAN entries, and validity.
- Renew and enable auto‑renewal.
- Revisit cipher suites and legacy client policy.
Common mistakes and how to avoid them
- Noisy logs. Log what drives decisions. Move chatter to DEBUG; sample successes, log 100% of errors.
- Inconsistent formats. Different fields/timezones/levels break correlation. Standardize the contract.
- No retention. Full disk = new outage. Rotate and archive by class.
- PII and secrets. Enforce masking/redaction and test it regularly.
- Missing request_id. Root‑cause analysis becomes guesswork without it.
- Logs share disks with the DB. Separate volumes/pools; heavy logging can starve production I/O.
One‑evening upgrade plan for your logs
- Standardize fields for access/error logs; add timings and request_id.
- Enable collection & search (Vector/Fluent Bit → Loki/ELK) with basic indexes.
- Build a baseline dashboard: status codes, p95, exceptions, upstream timings, cache HIT.
- Create alerts for 5xx growth, 502/504 spikes, 499 bursts, 401/403/429 waves, TLS expiry, p95 > SLO.
- Write five playbooks for the most frequent scenarios (templates above).
- Run a drill: disable cache or slow the DB in staging; trace the incident via logs from trigger to fix.
Choosing tools without over‑engineering
- Small project / single server: rotation + local grep/awk, simple reports, log‑pattern alerts (fail2ban style).
- Growing stack / a few services: centralized collection (Vector/Fluent Bit), indexing in Loki/ELK, unified formats, starter dashboards.
- High load / microservices / SLOs: full platform: logs + metrics + tracing (OpenSearch/ELK/Loki + Prometheus + OpenTelemetry), separate storage pools for logs, class‑based retention and access.
Control storage cost: sample successful requests, but keep 100% of errors and security‑relevant events.
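A sampling decision of that kind fits in a few lines; the success rate and the security_flag field are assumptions to adapt to your pipeline:

```python
import random

def should_ship(record: dict, success_rate: float = 0.1) -> bool:
    """Keep every error and security-relevant line; sample the routine 2xx/3xx."""
    if record["status"] >= 400:
        return True                    # 100% of client and server errors
    if record.get("security_flag"):    # hypothetical flag set by WAF/auth layers
        return True
    return random.random() < success_rate
```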
How this ties to SLI/SLO
- Success SLI: fraction of 2xx/3xx.
- Performance SLI: p95/p99 per key endpoints.
- Availability SLI: uptime from health checks.
- SLOs: wire alerts to log‑derived SLIs so you see the error budget burn in real time.
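As a sketch of how log-derived SLIs feed an error-budget view (the 99.9% SLO is just an example):

```python
def success_sli(records) -> float:
    """Fraction of requests answered with 2xx/3xx, computed from access-log records."""
    total = good = 0
    for rec in records:
        total += 1
        if rec["status"] < 400:
            good += 1
    return good / total if total else 1.0

def error_budget_burn(sli: float, slo: float = 0.999) -> float:
    """Burn rate: 1.0 means exactly on budget, >1.0 means the budget is burning too fast."""
    allowed = 1.0 - slo
    actual = 1.0 - sli
    return actual / allowed if allowed else float("inf")
```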
Why Unihost for log‑friendly infrastructure
Network & edge. Smart peering and DDoS filtering reduce noise and make latency predictable. Private VLANs separate log traffic from production flows.
Storage. NVMe Gen4/Gen5 for indices and journals deliver stable IOPS. Keeping logs and databases on separate pools prevents cross‑impact during peaks.
Flexible architecture. Start on VPS, move to dedicated or GPU servers later without rewriting pipelines, thanks to IaC templates and consistent tooling. Snapshots/backups are policy‑driven.
Tooling. Ready profiles for Vector/Fluent Bit, ELK/Loki, Prometheus/Grafana/OTel, plus engineers to standardize formats, build dashboards, and write playbooks.
Conclusion
Great logs aren’t “nice text files.” They’re the operating system of your team: they accelerate investigations, harden security, prove optimizations, and calm down releases. Make it a habit: unified formats, timings and request_id, centralized collection, dashboards, and short playbooks. In a week you’ll see the first insights; in a month you’ll operate predictably.
Try Unihost servers – stable infrastructure for your projects.
Set up logging and observability on Unihost VPS or dedicated servers and cut MTTR in the very first month.