Uptime isn’t magic – it’s discipline. Two things create it: observability (you can see what’s happening) and alerts (you learn about trouble before your users do). Basic monitoring is the minimal set of checks and metrics that watch your site and infrastructure 24/7 and notify the right people when something goes wrong.
This guide gives you a practical, down‑to‑earth plan – no heavy jargon, just checklists, example alerts, and sensible thresholds. It’s written for site owners, tech leads, developers, and DevOps engineers who want to “lock down the basics” in a single evening and sleep better afterward.
How it works (the bare minimum of theory)
Black‑box vs white‑box
- Black‑box monitoring views your system from the outside, like a user would: ping, TCP/HTTP checks, response time, TLS certificate validity, DNS health. It answers: “Is the site up from a client’s perspective?”
- White‑box monitoring looks inside: CPU/RAM/disk/IO, network errors, open file counts, DB connection pools, app errors, web‑server metrics, job queues. It answers: “Why did it break?”
You need both. Black‑box gives a fast signal; white‑box gives the context to fix the cause.
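To make the black‑box idea concrete, here is a minimal probe sketch in Python (standard library only). The URL and latency budget are placeholders; a hosted checker or the Blackbox Exporter does the same thing from multiple regions.

```python
# Minimal black-box HTTP probe: status code plus wall-clock latency.
# URL and thresholds are placeholders; swap in your own endpoints.
import time
import urllib.request

URL = "https://example.com/"   # hypothetical endpoint to probe
TIMEOUT_S = 10                 # request timeout
LATENCY_BUDGET_S = 0.3         # e.g. tied to a "p95 < 300 ms" SLO

def probe(url: str) -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
            elapsed = time.monotonic() - start
            ok = 200 <= resp.status < 400 and elapsed < LATENCY_BUDGET_S
            print(f"{url}: status={resp.status} latency={elapsed:.3f}s ok={ok}")
    except Exception as exc:   # DNS failure, timeout, TLS error, connection refused
        print(f"{url}: DOWN ({exc})")

if __name__ == "__main__":
    probe(URL)
```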
Metrics, logs, traces
- Metrics are numbers over time (p95 latency, error rate, CPU utilization). Compact and perfect for alerting.
- Logs hold event details. Great for incident forensics (stack traces, client IP, request path).
- Traces show a request’s path across services; ideal for finding bottlenecks (DB, external APIs).
For a solid start, metrics + logs are enough. Add tracing later.
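For the metrics piece, here is a small sketch of exposing a request counter and a latency histogram for Prometheus to scrape. It assumes the third‑party prometheus_client package; the metric names and the simulated handler are illustrative only.

```python
# Expose /metrics with a request counter and a latency histogram.
# Assumes: pip install prometheus-client
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Requests handled", ["status"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

def handle_request() -> None:
    with LATENCY.time():                        # records elapsed time as an observation
        time.sleep(random.uniform(0.01, 0.2))   # stand-in for real work
    status = "500" if random.random() < 0.02 else "200"
    REQUESTS.labels(status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)   # /metrics is now served on port 8000
    while True:
        handle_request()
```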
SLI, SLO, and error budget
- SLI (Service Level Indicator): a measurable quality signal – success rate, p95 latency, monthly uptime.
- SLO (Service Level Objective): your target SLI (e.g., “99.9% success and p95 < 300 ms over 30 days”).
- Error budget: how much downtime or bad requests you can “spend” without breaking your SLO. It shapes release pace and incident priorities.
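The error budget itself is plain arithmetic; a quick sketch of what a 99.9% target leaves you over 30 days:

```python
# Error budget math for an availability SLO.
SLO = 0.999                    # 99.9% availability target
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window = 43,200 minutes

budget_minutes = (1 - SLO) * WINDOW_MINUTES
print(f"Allowed downtime: {budget_minutes:.1f} minutes per 30 days")  # ~43.2

# Same idea for requests: with 10M requests/month and a 99.9% success SLO,
# the budget is 10_000_000 * 0.001 = 10,000 failed requests.
```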
Why it matters (and where to start)
1) You hear about issues before your users do
A push alert in the first minute of degradation is cheaper than an hour of lost conversions, carts, and leads. Simple HTTP checks from a few regions catch outages before your support queue does.
2) You reduce MTTR
Mean Time To Recovery drops when every alert links to a clear dashboard and runbook: what to check, where the logs live, how to restart safely, how to roll back via snapshot.
3) You stop fearing releases
A system with alerts is safer to experiment in: you can see where it hurts, and you can undo changes quickly.
4) You make economic decisions on facts
With observability it’s easy to compare: “moved to NVMe,” “added cache,” “optimized SQL” – which change truly delivered gains and paid for itself.
What to monitor: a basic checklist
A. Availability and edge
- DNS: do A/AAAA/CNAME records resolve? Is TTL sensible? Do you have secondary NS?
- ICMP ping: does the host respond? Any rising RTT or packet loss?
- TCP/ports: are 80/443 and service ports open and accepting connections?
- HTTP/HTTPS: 2xx/3xx codes, TTFB and p95 latency, response size, presence of a keyword on the page (content check).
- TLS certificate: validity dates, chain, domain match; alert at 30/14/7 days before expiry.
- WAF/DDoS: no rule changes blocking legitimate traffic?
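Two of the edge checks above, DNS resolution and a keyword content check, as a standard‑library sketch; the domain, URL, and keyword are placeholders.

```python
# DNS resolution check plus an HTTP content (keyword) check.
import socket
import urllib.request

DOMAIN = "example.com"
URL = "https://example.com/"
KEYWORD = b"Welcome"   # a string that must appear in the rendered page

def check_dns(domain: str) -> bool:
    try:
        infos = socket.getaddrinfo(domain, 443)
        print(f"DNS ok: {sorted({info[4][0] for info in infos})}")
        return True
    except socket.gaierror as exc:
        print(f"DNS failure for {domain}: {exc}")
        return False

def check_content(url: str, keyword: bytes) -> bool:
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read()
        status = resp.status
    found = keyword in body
    print(f"Content check: status={status} keyword_present={found}")
    return 200 <= status < 300 and found

if __name__ == "__main__":
    check_dns(DOMAIN)
    check_content(URL, KEYWORD)
```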
B. Application
- Error rate: share of 5xx (track 4xx separately) above X% for N minutes.
- Latency: p95/p99 of key pages or APIs – thresholds tied to SLO (e.g., p95 < 300 ms).
- Critical endpoints: /healthz, /readyz, /login, /checkout, key RPC/GraphQL methods.
- Queues: depth and age (emails/SMS/background jobs).
- Cache: hit ratio below Y%? DB connection pools starving?
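To make “p95” concrete, here is a tiny sketch that computes a nearest‑rank p95 from a window of request durations. In practice Prometheus histograms or your APM do this for you; the sample values are made up.

```python
# Nearest-rank percentile over a window of request durations (milliseconds).
import math

def percentile(samples, pct):
    ordered = sorted(samples)
    # nearest-rank: the smallest value such that at least pct% of samples are <= it
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

durations_ms = [120, 90, 250, 310, 95, 130, 180, 2200, 140, 160]  # sample window
print(f"p95 = {percentile(durations_ms, 95)} ms")  # one slow outlier dominates the tail
```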
C. Infrastructure (white‑box)
- CPU: utilization and steal time (on VMs), load average.
- RAM: free/available memory, page cache size, OOM events.
- Disk/FS: utilization > 80–90%, inodes, IOPS/latency; separate volumes for logs/journals.
- Network: errors/drops, packets per second, bandwidth, SYN backlog, ESTABLISHED/TIME_WAIT connection counts.
- Processes: have critical systemd services crashed or fallen into restart loops?
- Database: replication health, lag, long queries, locks.
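A quick white‑box snapshot sketch, assuming the third‑party psutil package; in the stack described below, Node Exporter ships the same data as Prometheus metrics.

```python
# One-shot host snapshot: CPU, load average, memory, root filesystem.
# Assumes: pip install psutil  (load average is Unix-only)
import os

import psutil

cpu_pct = psutil.cpu_percent(interval=1)   # overall CPU utilization over 1 second, %
mem = psutil.virtual_memory()              # .percent used, .available in bytes
disk = psutil.disk_usage("/")              # root filesystem usage

print(f"CPU: {cpu_pct}%  load avg: {os.getloadavg()}")
print(f"RAM: {mem.percent}% used, {mem.available // 2**20} MiB available")
print(f"Disk /: {disk.percent}% used")
```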
D. Dependencies
- Payments/email/geo APIs: provider SLAs, HTTP errors/timeouts.
- Object storage/CDN: latency and errors on PUT/GET, unexpected egress costs.
- Queues/caches: availability and rising latencies.
E. Back‑office
- Backups: freshness, size trends, routine restore tests.
- Snapshots: scheduled and before releases.
- Schedulers: cron success, error logs.
- Access audit: suspicious logins, sudo escalations, firewall rule changes.
Thresholds: how to avoid alert fatigue
- Night ≠ day. Use different thresholds for peak/off‑peak, or relative ones (“50% above baseline for 10 minutes”).
- Deduplicate and group. One DB failure shouldn’t trigger 20 app alerts – group by root cause.
- Multi‑region validation. Require agreement from at least 2 of 3 probes before paging.
- Grace period. Delay critical pages by 30–60 seconds to filter transient spikes.
- Maintenance windows. Silence rules during deploys and routine ops.
- Message templates. Every alert should include a dashboard link, last deploy info, and a runbook.
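The “relative threshold plus hold period” idea from the list above, as a small sketch. The feed() hook is hypothetical; you would wire it to your real metric stream (one sample per minute here).

```python
# Fire only if a metric stays 50% above its rolling baseline for 10 consecutive samples.
from collections import deque

BASELINE_WINDOW = 60   # samples kept for the baseline (last hour at 1 sample/min)
HOLD_SAMPLES = 10      # consecutive breaches required (10 minutes at 1 sample/min)
RATIO = 1.5            # "50% above baseline"

history = deque(maxlen=BASELINE_WINDOW)
breaches = 0

def feed(value: float) -> bool:
    """Feed one sample; return True when the alert should fire."""
    global breaches
    baseline = sum(history) / len(history) if history else value
    breaches = breaches + 1 if value > baseline * RATIO else 0
    history.append(value)
    return breaches >= HOLD_SAMPLES
```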
Alert delivery channels
- Email – for non‑urgent notices and daily/weekly digests.
- Messengers (Slack/Telegram/Discord) – for the on‑call team: #alerts channels, incident threads, reactions for acks.
- SMS/voice – for critical off‑hours incidents (escalations).
- Webhooks – for status pages, ticketing systems, CI/CD hooks.
Set up escalations: if an alert isn’t acknowledged within N minutes, page the next tier. This prevents “stuck” incidents.
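Delivering to a chat channel is usually just a webhook POST. Here is a sketch for a Slack‑style incoming webhook; the URL is a placeholder, and in the stack below Alertmanager handles this routing for you.

```python
# Push an alert message to a Slack-style incoming webhook.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def notify(text: str) -> None:
    payload = json.dumps({"text": text}).encode()
    req = urllib.request.Request(
        WEBHOOK_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(f"webhook delivery: HTTP {resp.status}")

# notify("checkout-api: 5xx > 5% for 10m. Dashboard: <link>  Runbook: <link>")
```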
A practical starter stack
Here’s a small but effective open‑source bundle you can roll out quickly.
- Node Exporter – system metrics (CPU/RAM/disk/net).
- Blackbox Exporter – external ping/TCP/HTTP checks (including TLS and content match).
- Prometheus – scrapes and stores metrics time‑series.
- Alertmanager – alert routing, grouping/dedup, silences, escalations.
- Grafana – dashboards and release annotations.
- Filebeat/Vector + ELK/OpenSearch – log collection/search.
- Health‑check endpoints – /healthz (liveness), /readyz (readiness), /metrics (app metrics).
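A minimal /healthz and /readyz sketch using only the standard library; deps_ready() is a hypothetical hook you would replace with real DB/cache connectivity checks.

```python
# Liveness (/healthz) and readiness (/readyz) endpoints.
from http.server import BaseHTTPRequestHandler, HTTPServer

def deps_ready() -> bool:
    return True   # replace with real DB/cache connectivity checks

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":          # liveness: the process is up
            self.send_response(200)
        elif self.path == "/readyz":         # readiness: the service can take traffic
            self.send_response(200 if deps_ready() else 503)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Health).serve_forever()
```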
Example alert rule – a simple template instead of YAML
Fill these fields for your service:
- Name: HighErrorRate
- Condition: “share of 5xx over the last 5 minutes > 5% (grouped by service)”
- Hold for: 10 minutes (to avoid flapping)
- Severity: critical
- Message: “{service}: 5xx > 5% (10m)”
- Runbook: link to your “high error rate” playbook
How to validate: chart the 5xx share for the past 1–7 days; pick a threshold that minimizes noise but catches real incidents fast.
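If you want to prototype the logic before writing a real Prometheus rule, here is the same HighErrorRate idea as a standalone sketch; record() is a hypothetical hook fed from your access log or request handler.

```python
# Share of 5xx over a rolling 5-minute window, held for 10 minutes before firing.
import time
from collections import deque

WINDOW_S, HOLD_S, THRESHOLD = 300, 600, 0.05
events = deque()        # (timestamp, is_5xx)
breach_since = None

def record(status: int) -> bool:
    """Record one response; return True when the alert should fire."""
    global breach_since
    now = time.time()
    events.append((now, status >= 500))
    while events and events[0][0] < now - WINDOW_S:   # drop samples older than 5 minutes
        events.popleft()
    share = sum(1 for _, bad in events if bad) / len(events)
    if share > THRESHOLD:
        breach_since = breach_since or now
        return now - breach_since >= HOLD_S
    breach_since = None
    return False
```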
TLS certificate check – parameter template
- Protocol: HTTPS expecting a 2xx
- Request timeout: 10 seconds
- HTTP versions: HTTP/1.1 and HTTP/2 allowed
- TLS chain validation: enabled (not expired; CN/SAN matches domain)
- Alert schedule: 30/14/7 days before expiry from at least two regions
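A standard‑library sketch of the expiry check; the host is a placeholder, and chain/hostname validation comes from ssl.create_default_context().

```python
# Days until the TLS certificate expires for a given host.
import socket
import ssl
import time

HOST = "example.com"   # placeholder

def days_until_expiry(host: str, port: int = 443) -> float:
    ctx = ssl.create_default_context()        # validates chain and hostname by default
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()          # dict form is only returned for validated certs
    not_after = ssl.cert_time_to_seconds(cert["notAfter"])
    return (not_after - time.time()) / 86400

days = days_until_expiry(HOST)
print(f"{HOST}: certificate expires in {days:.0f} days")
if days <= 30:
    print("WARN: renew soon; alert again at 14 and 7 days")
```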
Free disk space alert – parameter template
- Name: LowDiskSpace
- Condition: “free filesystem space < 10%” (exclude tmpfs/devtmpfs)
- Hold for: 15 minutes (ignore short spikes)
- Severity: warning (raise to critical at <5%)
- Message: “INSTANCE: <10% disk left – rotate logs/artifacts or extend volume”
- Tips: check log rotation, cache size, move build artifacts to object storage
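The same LowDiskSpace logic as a standalone sketch; the mount points are placeholders, and in the stack above Node Exporter plus an alert rule would normally own this.

```python
# Warn under 10% free space, go critical under 5%.
import shutil

MOUNTS = ["/", "/var/log"]   # filesystems to watch (placeholders)

for mount in MOUNTS:
    usage = shutil.disk_usage(mount)
    free_pct = usage.free / usage.total * 100
    if free_pct < 5:
        level = "CRITICAL"
    elif free_pct < 10:
        level = "WARNING"
    else:
        level = "ok"
    print(f"{mount}: {free_pct:.1f}% free [{level}]")
```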
A one‑evening rollout plan
Step 1 – Define SLOs. Example: “99.9% uptime over 30 days; p95 < 300 ms; 5xx < 1%.”
Step 2 – Choose probe locations. 3–5 geographies (EU/US/Asia), one probe per key endpoint (home, login, checkout, a core API method).
Step 3 – Add health checks. Implement /healthz and /readyz. Readiness should fail whenever the service can’t serve traffic (for example, the DB or cache is down).
Step 4 – Deploy the stack. Prometheus + Node/Blackbox Exporters, Grafana, Alertmanager. Import ready‑made dashboards.
Step 5 – Create your top‑10 alerts. HTTP availability, TLS expiry, 5xx > 1–5%, p95 above SLO, CPU > 85% for 10m, RAM free < 15%, disk free < 15%, DB replication/lag, job queue depth, error spikes in logs.
Step 6 – Wire alert channels and escalations. Slack/Telegram + Email; add SMS/voice for “reds.” Configure quiet hours and maintenance windows.
Step 7 – Build a status page. Public (sanitized) and internal (detailed). Auto‑updates via Alertmanager webhooks.
Step 8 – Run a mini chaos drill. Intentionally break a dependency (e.g., turn off cache). Verify alerts fire and the runbook is clear.
Step 9 – Write runbooks. For each alert: 5–10 diagnostic steps, mitigation actions, and on‑call contacts.
Step 10 – Recalibrate monthly. Tune thresholds from real data and post‑incident reviews.
Common mistakes
- Too many alerts. Weak signals drown strong ones. Keep a tight set of red/yellow pages; push the rest to reports.
- Context‑free paging. Always include dashboard links and runbooks. “High CPU” with no graph is noise.
- No content check. HTTP 200 doesn’t guarantee the right page rendered. Look for a keyword/pattern.
- One monitoring region. A single DC glitch can page you falsely. Use independent regions.
- Co‑locating monitoring with the app. If everything crashes, nobody sees it. Keep monitoring off‑box.
- No DR or backup tests. A backup you’ve never restored isn’t a backup. Practice restores.
Why Unihost makes uptime easier
Network & infrastructure. Low p95 latency through smart peering, DDoS filtering, private VLANs, IPv4/IPv6, predictable uplinks. Fewer false positives and simpler prod/stage/dev isolation.
Storage. Fast NVMe Gen4/Gen5 for databases/indexes and logs. Stable IOPS means fewer micro‑incidents.
Platform. First‑class IaC (Terraform/Ansible), policy‑driven snapshots/backups, integrations with Prometheus/Grafana/ELK/OTel.
Operations. 24/7 site monitoring, SLAs on uptime/response, engineers who help tune kernel/network/DB settings and set up sensible /healthz endpoints.
Scale path. Start on VPS, move to dedicated or GPU servers without changing providers or re‑architecting – your monitoring moves with your IaC.
TL;DR launch template
- Black‑box: HTTP/HTTPS from 3 regions, TLS expiry, DNS health.
- White‑box: Node Exporter + app metrics at /metrics.
- Alerts: 5xx, p95, CPU/RAM/disk, TLS, DB, queues.
- Channels: Slack/Telegram + Email (+ SMS/voice for critical).
- Runbooks and a status page.
- Monthly postmortems and threshold tuning.
Conclusion & CTA
Basic monitoring isn’t a luxury or “yet another project.” It’s insurance for your revenue and reputation. Start today: add health checks, expose metrics, define your first alerts, and connect channels. In a week you’ll have baseline stats; in a month you’ll have confidence your site will survive peak traffic, a risky deploy, and a flaky third‑party API.
Unihost can speed this up: pick the right footprint (VPS or dedicated), tune network and storage, enable observability and alerts, automate snapshots/backups, and write a DR plan you can actually execute.
Try Unihost servers – stable infrastructure for your projects.
Order a VPS or dedicated server on Unihost and raise your uptime with metrics, alerts, and a clear ops playbook.