Monitoring in Production: What to Track and What to Ignore
The biggest monitoring mistake isn't tracking too little — it's tracking too much. When every metric is "important," nothing is. Teams end up with 50-dashboard setups where no one can answer the basic question: "Is production healthy right now?"
Here's a framework for deciding what deserves your attention in production.
The golden signals
Google's SRE team identified four golden signals that cover most monitoring needs. Start here and add more only when you have a specific reason.
1. Latency
How long requests take to complete. Track percentiles, not averages. The average can look fine while 5% of your users are having a terrible experience.
- p50 — The median experience. Good for baseline.
- p95 — Where problems start to show. This is your primary alert metric.
- p99 — Tail latency. Important for high-traffic services.
Alert on: p95 latency exceeding 2x your baseline for 5+ minutes.
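If it helps to see the rule as code, here's a minimal sketch of nearest-rank percentiles plus the "p95 above 2x baseline for 5+ minutes" check. The window shape, baseline value, and function names are illustrative, not any particular tool's API.

```python
# Minimal sketch: nearest-rank percentiles plus the "p95 above 2x baseline for
# 5+ minutes" rule. The window shape, baseline, and names are illustrative.
import math

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile of a list of latency samples, in milliseconds."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def p95_breached(per_minute_samples: list[list[float]], baseline_p95_ms: float) -> bool:
    """True if p95 exceeded 2x the baseline in each of the last 5 one-minute windows."""
    recent = per_minute_samples[-5:]
    if len(recent) < 5:
        return False
    return all(percentile(minute, 95) > 2 * baseline_p95_ms for minute in recent)

# Five one-minute windows where a slow tail drags p95 up while the median stays flat
windows = [[120, 130, 135, 140, 145, 150, 900, 950, 960, 970]] * 5
print(percentile(windows[-1], 50))                 # 145 (the median looks fine)
print(percentile(windows[-1], 95))                 # 970 (the tail does not)
print(p95_breached(windows, baseline_p95_ms=400))  # True
```

Note how the example's median looks perfectly healthy while the p95 is more than double the baseline; that's exactly the gap averages hide.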
2. Traffic
Requests per second (or per minute). A sudden drop in traffic is often the first sign of a problem — users can't send requests if they can't reach your service.
Alert on: Traffic dropping below 50% of the same hour yesterday (adjusting for weekends).
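Here's the same rule sketched in code, assuming you keep per-minute request counters. The window size and counter layout are assumptions, and the weekend adjustment is deliberately left out to keep the sketch short.

```python
# Sketch of the traffic rule: requests in the last 60 minutes versus the same
# window 24 hours earlier. Per-minute counters are assumed; weekend adjustment
# (per the rule above) is left out for brevity.
from datetime import datetime, timedelta

def traffic_dropped(per_minute_counts: dict[datetime, int], now: datetime) -> bool:
    """True if the last hour's traffic is below 50% of the same hour yesterday."""
    def window_total(end: datetime) -> int:
        return sum(per_minute_counts.get(end - timedelta(minutes=m), 0) for m in range(60))

    baseline = window_total(now - timedelta(days=1))
    if baseline == 0:
        return False  # nothing to compare against, e.g. a brand-new service
    return window_total(now) < 0.5 * baseline

now = datetime(2024, 6, 11, 14, 30)
counts: dict[datetime, int] = {}
for m in range(60):
    counts[now - timedelta(minutes=m)] = 80           # today: 4,800 requests/hour
    counts[now - timedelta(days=1, minutes=m)] = 200  # yesterday: 12,000 requests/hour
print(traffic_dropped(counts, now))  # True
```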
3. Errors
The rate of failed requests. Distinguish between client errors (4xx — usually not your fault) and server errors (5xx — definitely your problem).
Alert on: 5xx error rate above 1% for 3+ minutes.
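A sketch of the error rule, with 4xx responses counted toward total traffic but not toward failures. The counter shape and window handling are assumptions.

```python
# Sketch of the error-rate rule: 5xx share of all requests above 1% for three
# consecutive minutes. Counter shape and window handling are illustrative.

def error_rate(status_counts: dict[int, int]) -> float:
    """Fraction of requests in a window that returned a 5xx status."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    server_errors = sum(n for code, n in status_counts.items() if 500 <= code < 600)
    return server_errors / total

def error_alert(per_minute_windows: list[dict[int, int]]) -> bool:
    """True if the 5xx rate exceeded 1% in each of the last 3 one-minute windows."""
    recent = per_minute_windows[-3:]
    return len(recent) == 3 and all(error_rate(w) > 0.01 for w in recent)

minutes = [{200: 9_700, 404: 150, 500: 90, 503: 60}] * 3  # 150 of 10,000 requests = 1.5%
print(error_alert(minutes))  # True
```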
4. Saturation
How full your resources are. CPU, memory, disk, database connections, queue depth. These are leading indicators — they predict problems before they cause user-visible issues.
Alert on: Any resource consistently above 80% utilization.
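And the saturation rule, sketched against made-up utilization readings. The resource names, sample counts, and threshold handling are illustrative.

```python
# Sketch of the saturation rule: flag any resource that stays above 80%
# utilization across recent samples. Resource names and readings are made up.

SATURATION_THRESHOLD = 0.80

def saturated_resources(samples: dict[str, list[float]]) -> list[str]:
    """Return resources whose utilization exceeded the threshold in every recent sample."""
    return [
        name for name, readings in samples.items()
        if readings and all(r > SATURATION_THRESHOLD for r in readings)
    ]

recent = {
    "cpu": [0.62, 0.71, 0.68],
    "db_connections": [0.84, 0.88, 0.91],  # consistently above 80%
    "disk": [0.55, 0.56, 0.58],
}
print(saturated_resources(recent))  # ['db_connections']
```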
What to track (but not alert on)
Not everything needs an alert. Some metrics are valuable for debugging and capacity planning but don't warrant waking someone up:
- Cache hit rates — Useful for optimization, not incidents
- Build times — Track the trend, fix it during business hours
- Database query counts — Good for identifying N+1 queries, not emergencies
- Deployment frequency — Engineering metric, not operational
- Individual pod/instance metrics — Only matters if it affects the service overall
What to actively ignore
Some metrics that teams commonly track actively make monitoring worse:
- CPU usage below 70% — Normal operation. Stop looking at it.
- Individual 404s — Bots and scrapers cause these. Only alert on 404 rate spikes.
- Garbage collection pauses — Unless they're causing user-visible latency
- Log volume — Noisy metric that rarely indicates real problems
Setting up a monitoring hierarchy
Organize your monitoring into three tiers (a sketch of how the routing can look in code follows the tier lists):
Tier 1: User-facing (alert immediately)
- Endpoint availability (is it returning 200?)
- Error rate spikes
- Response time degradation
- SSL certificate expiry (within 14 days; a way to check this is sketched below)
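The certificate check is the one item on this list you can cover with nothing but the standard library. This sketch assumes a direct TLS connection to the host on port 443; the hostname is a placeholder.

```python
# Sketch of an SSL expiry check using only the standard library. The hostname
# and the 14-day window mirror the rule above; adjust to taste.
import socket
import ssl
import time

def days_until_cert_expiry(hostname: str, port: int = 443) -> float:
    """Days until the TLS certificate presented by hostname:port expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400

if days_until_cert_expiry("example.com") < 14:
    print("certificate expires within 14 days")
```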
Tier 2: Infrastructure (alert during business hours)
- High resource utilization
- Database replication lag
- Queues growing faster than they're being drained
- Disk space trending toward full
Tier 3: Informational (dashboard only)
- Deployment history
- Cache performance
- Dependency response times
- Cost metrics
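To make the tiers concrete, here's a small sketch of how the routing logic could look. The channel names and business-hours window are assumptions, not a prescription for any particular alerting tool.

```python
# Sketch of routing alerts by tier: page for Tier 1, notify during business
# hours for Tier 2, dashboard-only for Tier 3. Channel names and the
# business-hours window are assumptions.
from datetime import datetime
from enum import Enum

class Tier(Enum):
    USER_FACING = 1      # alert immediately
    INFRASTRUCTURE = 2   # alert during business hours
    INFORMATIONAL = 3    # dashboard only

def route_alert(tier: Tier, now: datetime) -> str:
    """Decide how an alert of a given tier should be delivered right now."""
    business_hours = now.weekday() < 5 and 9 <= now.hour < 18
    if tier is Tier.USER_FACING:
        return "page on-call"
    if tier is Tier.INFRASTRUCTURE:
        return "notify team channel" if business_hours else "hold until business hours"
    return "dashboard only"

print(route_alert(Tier.USER_FACING, datetime(2024, 6, 9, 2, 0)))      # page on-call
print(route_alert(Tier.INFRASTRUCTURE, datetime(2024, 6, 9, 2, 0)))   # hold until business hours
print(route_alert(Tier.INFORMATIONAL, datetime(2024, 6, 10, 11, 0)))  # dashboard only
```

The point of encoding it this way is that the tier decision is made once, when the alert is defined, rather than argued about at 2am.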
The "2am test"
For every alert you create, ask: "Would I want to be woken up at 2am for this?" If the answer is no, it shouldn't be a paging alert. Maybe it's a Slack notification, maybe it's just a dashboard metric. The 2am test keeps your alerting focused and your team sane.
Start with the basics
You don't need a complex monitoring stack to cover the essentials. Start with uptime checks on your critical endpoints, track response times, and set up alerts for when things go wrong. If you're looking for a clean, focused monitoring tool that covers Tier 1 without the complexity, PingGuard monitors endpoints from multiple regions, tracks response times, and alerts your team via Slack, email, or webhooks. Free for up to 5 endpoints.
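If you'd rather start by rolling the most basic check yourself, a minimal uptime probe doesn't need more than the standard library. The endpoint URLs below are placeholders.

```python
# A minimal uptime probe: hit each critical endpoint, treat anything other than
# a 200 (or a timeout) as down, and record the response time. URLs are placeholders.
import time
import urllib.error
import urllib.request

ENDPOINTS = ["https://example.com/healthz", "https://example.com/api/status"]

def check(url: str, timeout: float = 10.0) -> tuple[bool, float]:
    """Return (is_up, response_time_seconds) for one endpoint."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200, time.monotonic() - start
    except OSError:  # covers URLError, HTTP errors, timeouts, connection failures
        return False, time.monotonic() - start

for url in ENDPOINTS:
    up, seconds = check(url)
    print(f"{url}: {'up' if up else 'DOWN'} in {seconds * 1000:.0f} ms")
```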