Monitoring in Production: What to Track and What to Ignore
The biggest monitoring mistake isn't tracking too little — it's tracking too much. When every metric is "important," nothing is. Teams end up with 50-dashboard setups where no one can answer the basic question: "Is production healthy right now?"
Here's a framework for deciding what deserves your attention in production.
The golden signals
Google's SRE team identified four golden signals that cover most monitoring needs. Start here and add more only when you have a specific reason.
1. Latency
How long requests take to complete. Track percentiles, not averages. The average can look fine while 5% of your users are having a terrible experience.
- p50 — The median experience. Good for baseline.
- p95 — Where problems start to show. This is your primary alert metric.
- p99 — Tail latency. Important for high-traffic services.
Alert on: p95 latency exceeding 2x your baseline for 5+ minutes.
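If it helps to see the rule as code, here's a minimal sketch of nearest-rank percentiles plus the "p95 above 2x baseline for 5+ minutes" check. The window shape, baseline value, and function names are illustrative, not any particular tool's API.

```python
# Minimal sketch: nearest-rank percentiles plus the "p95 above 2x baseline for
# 5+ minutes" rule. The window shape, baseline, and names are illustrative.
import math

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile of a list of latency samples, in milliseconds."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def p95_breached(per_minute_samples: list[list[float]], baseline_p95_ms: float) -> bool:
    """True if p95 exceeded 2x the baseline in each of the last 5 one-minute windows."""
    recent = per_minute_samples[-5:]
    if len(recent) < 5:
        return False
    return all(percentile(minute, 95) > 2 * baseline_p95_ms for minute in recent)

# Five one-minute windows where a slow tail drags p95 up while the median stays flat
windows = [[120, 130, 135, 140, 145, 150, 900, 950, 960, 970]] * 5
print(percentile(windows[-1], 50))                 # 145 (the median looks fine)
print(percentile(windows[-1], 95))                 # 970 (the tail does not)
print(p95_breached(windows, baseline_p95_ms=400))  # True
```

Note how the example's median looks perfectly healthy while the p95 is more than double the baseline; that's exactly the gap averages hide.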
2. Traffic
Requests per second (or per minute). A sudden drop in traffic is often the first sign of a problem — users can't send requests if they can't reach your service.
Alert on: Traffic dropping below 50% of the same hour yesterday (adjusting for weekends).
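Here's the same rule sketched in code, assuming you keep per-minute request counters. The window size and counter layout are assumptions, and the weekend adjustment is deliberately left out to keep the sketch short.

```python
# Sketch of the traffic rule: requests in the last 60 minutes versus the same
# window 24 hours earlier. Per-minute counters are assumed; weekend adjustment
# (per the rule above) is left out for brevity.
from datetime import datetime, timedelta

def traffic_dropped(per_minute_counts: dict[datetime, int], now: datetime) -> bool:
    """True if the last hour's traffic is below 50% of the same hour yesterday."""
    def window_total(end: datetime) -> int:
        return sum(per_minute_counts.get(end - timedelta(minutes=m), 0) for m in range(60))

    baseline = window_total(now - timedelta(days=1))
    if baseline == 0:
        return False  # nothing to compare against, e.g. a brand-new service
    return window_total(now) < 0.5 * baseline

now = datetime(2024, 6, 11, 14, 30)
counts: dict[datetime, int] = {}
for m in range(60):
    counts[now - timedelta(minutes=m)] = 80           # today: 4,800 requests/hour
    counts[now - timedelta(days=1, minutes=m)] = 200  # yesterday: 12,000 requests/hour
print(traffic_dropped(counts, now))  # True
```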
3. Errors
The rate of failed requests. Distinguish between client errors (4xx — usually not your fault) and server errors (5xx — definitely your problem).
Alert on: 5xx error rate above 1% for 3+ minutes.
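A sketch of the error rule, with 4xx responses counted toward total traffic but not toward failures. The counter shape and window handling are assumptions.

```python
# Sketch of the error-rate rule: 5xx share of all requests above 1% for three
# consecutive minutes. Counter shape and window handling are illustrative.

def error_rate(status_counts: dict[int, int]) -> float:
    """Fraction of requests in a window that returned a 5xx status."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    server_errors = sum(n for code, n in status_counts.items() if 500 <= code < 600)
    return server_errors / total

def error_alert(per_minute_windows: list[dict[int, int]]) -> bool:
    """True if the 5xx rate exceeded 1% in each of the last 3 one-minute windows."""
    recent = per_minute_windows[-3:]
    return len(recent) == 3 and all(error_rate(w) > 0.01 for w in recent)

minutes = [{200: 9_700, 404: 150, 500: 90, 503: 60}] * 3  # 150 of 10,000 requests = 1.5%
print(error_alert(minutes))  # True
```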
4. Saturation
How full your resources are. CPU, memory, disk, database connections, queue depth. These are leading indicators — they predict problems before they cause user-visible issues.
Alert on: Any resource consistently above 80% utilization.
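And the saturation rule, sketched against made-up utilization readings. The resource names, sample counts, and threshold handling are illustrative.

```python
# Sketch of the saturation rule: flag any resource that stays above 80%
# utilization across recent samples. Resource names and readings are made up.

SATURATION_THRESHOLD = 0.80

def saturated_resources(samples: dict[str, list[float]]) -> list[str]:
    """Return resources whose utilization exceeded the threshold in every recent sample."""
    return [
        name for name, readings in samples.items()
        if readings and all(r > SATURATION_THRESHOLD for r in readings)
    ]

recent = {
    "cpu": [0.62, 0.71, 0.68],
    "db_connections": [0.84, 0.88, 0.91],  # consistently above 80%
    "disk": [0.55, 0.56, 0.58],
}
print(saturated_resources(recent))  # ['db_connections']
```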
What to track (but not alert on)
Not everything needs an alert. Some metrics are valuable for debugging and capacity planning but don't warrant waking someone up:
- Cache hit rates — Useful for optimization, not incidents
- Build times — Track the trend, fix it during business hours
- Database query counts — Good for identifying N+1 queries, not emergencies
- Deployment frequency — Engineering metric, not operational
- Individual pod/instance metrics — Only matters if it affects the service overall
What to actively ignore
Some metrics that teams commonly track actively make monitoring worse:
- CPU usage below 70% — Normal operation. Stop looking at it.
- Individual 404s — Bots and scrapers cause these. Only alert on 404 rate spikes.
- Garbage collection pauses — Unless they're causing user-visible latency
- Log volume — Noisy metric that rarely indicates real problems
Setting up a monitoring hierarchy
Organize your monitoring into three tiers (a sketch of how the routing can look in code follows the tier lists):
Tier 1: User-facing (alert immediately)
- Endpoint availability (is it returning 200?)
- Error rate spikes
- Response time degradation
- SSL certificate expiry (within 14 days; a way to check this is sketched below)
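The certificate check is the one item on this list you can cover with nothing but the standard library. This sketch assumes a direct TLS connection to the host on port 443; the hostname is a placeholder.

```python
# Sketch of an SSL expiry check using only the standard library. The hostname
# and the 14-day window mirror the rule above; adjust to taste.
import socket
import ssl
import time

def days_until_cert_expiry(hostname: str, port: int = 443) -> float:
    """Days until the TLS certificate presented by hostname:port expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400

if days_until_cert_expiry("example.com") < 14:
    print("certificate expires within 14 days")
```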
Tier 2: Infrastructure (alert during business hours)
- High resource utilization
- Database replication lag
- Queues growing faster than they're being drained
- Disk space trending toward full
Tier 3: Informational (dashboard only)
- Deployment history
- Cache performance
- Dependency response times
- Cost metrics
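To make the tiers concrete, here's a small sketch of how the routing logic could look. The channel names and business-hours window are assumptions, not a prescription for any particular alerting tool.

```python
# Sketch of routing alerts by tier: page for Tier 1, notify during business
# hours for Tier 2, dashboard-only for Tier 3. Channel names and the
# business-hours window are assumptions.
from datetime import datetime
from enum import Enum

class Tier(Enum):
    USER_FACING = 1      # alert immediately
    INFRASTRUCTURE = 2   # alert during business hours
    INFORMATIONAL = 3    # dashboard only

def route_alert(tier: Tier, now: datetime) -> str:
    """Decide how an alert of a given tier should be delivered right now."""
    business_hours = now.weekday() < 5 and 9 <= now.hour < 18
    if tier is Tier.USER_FACING:
        return "page on-call"
    if tier is Tier.INFRASTRUCTURE:
        return "notify team channel" if business_hours else "hold until business hours"
    return "dashboard only"

print(route_alert(Tier.USER_FACING, datetime(2024, 6, 9, 2, 0)))      # page on-call
print(route_alert(Tier.INFRASTRUCTURE, datetime(2024, 6, 9, 2, 0)))   # hold until business hours
print(route_alert(Tier.INFORMATIONAL, datetime(2024, 6, 10, 11, 0)))  # dashboard only
```

The point of encoding it this way is that the tier decision is made once, when the alert is defined, rather than argued about at 2am.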
The "2am test"
For every alert you create, ask: "Would I want to be woken up at 2am for this?" If the answer is no, it shouldn't be a paging alert. Maybe it's a Slack notification, maybe it's just a dashboard metric. The 2am test keeps your alerting focused and your team sane.
Start with the basics
You don't need a complex monitoring stack to cover the essentials. Start with uptime checks on your critical endpoints, track response times, and set up alerts for when things go wrong. If you're looking for a clean, focused monitoring tool that covers Tier 1 without the complexity, PingGuard monitors endpoints from multiple regions, tracks response times, and alerts your team via Slack, email, or webhooks. Free for up to 5 endpoints.
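If you'd rather start by rolling the most basic check yourself, a minimal uptime probe doesn't need more than the standard library. The endpoint URLs below are placeholders.

```python
# A minimal uptime probe: hit each critical endpoint, treat anything other than
# a 200 (or a timeout) as down, and record the response time. URLs are placeholders.
import time
import urllib.error
import urllib.request

ENDPOINTS = ["https://example.com/healthz", "https://example.com/api/status"]

def check(url: str, timeout: float = 10.0) -> tuple[bool, float]:
    """Return (is_up, response_time_seconds) for one endpoint."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200, time.monotonic() - start
    except OSError:  # covers URLError, HTTP errors, timeouts, connection failures
        return False, time.monotonic() - start

for url in ENDPOINTS:
    up, seconds = check(url)
    print(f"{url}: {'up' if up else 'DOWN'} in {seconds * 1000:.0f} ms")
```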