Guide · 2026-03-21 · 9 min read

Monitoring Microservices: What Actually Matters

Microservices solve organizational scaling problems. They also create observability nightmares. When a request flows through five services before reaching the user, figuring out where it broke is genuinely hard. This guide focuses on practical monitoring strategies that work in the real world, not theoretical frameworks.

The three pillars (and which one matters most)

You've heard about the three pillars of observability: metrics, logs, and traces. They're all important, but if you're just starting out, put basic health checks ahead of all three and prioritize in this order:

Health checks and uptime — Is each service reachable and responding?
Key metrics — Response time, error rate, throughput per service
Structured logs — When something breaks, can you find out why?
Distributed traces — For complex debugging across service boundaries

Many teams jump straight to traces and complex dashboards. Start with health checks. If you can't answer "is service X up right now?", you have bigger problems than tracing.

What to monitor in each service

Every microservice should expose these signals at minimum:

Health endpoint

Every service needs a /health or /healthz endpoint that checks its critical dependencies. Returning 200 unconditionally isn't enough: the endpoint should verify database connections, cache availability, and any external services the service depends on.

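// Express health check example.
// Assumes `app` is an Express application and that `db` and `redis` are
// already-connected client objects created elsewhere in the service.
app.get('/health', async (req, res) => {
  const checks = {};

  // Database: a trivial query proves the connection is usable.
  try {
    await db.query('SELECT 1');
    checks.database = 'ok';
  } catch {
    checks.database = 'error';
  }

  // Cache: PING confirms the cache server is reachable.
  try {
    await redis.ping();
    checks.cache = 'ok';
  } catch {
    checks.cache = 'error';
  }

  // Healthy only if every dependency check passed.
  const healthy = Object.values(checks)
    .every(s => s === 'ok');

  // 503 tells load balancers and monitors that the service is not fully usable.
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'degraded',
    checks
  });
});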

The RED method

For each service, track the RED metrics:

Rate — Requests per second. Sudden drops often indicate upstream issues.
Errors — Error rate as a percentage. Alert when it exceeds your baseline.
Duration — Response time percentiles (p50, p95, p99). The p99 catches tail latency.
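
To make the three RED numbers concrete, here is a minimal sketch that derives them from a window of request samples. The sample shape ({ durationMs, ok }), the function names, and the nearest-rank percentile method are assumptions chosen for illustration; in practice a metrics library would usually compute these for you.

```javascript
// Nearest-rank percentile over an ascending-sorted array of numbers.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return null;
  const rank = Math.ceil((p * sortedValues.length) / 100);
  return sortedValues[Math.max(rank, 1) - 1];
}

// Compute RED metrics for one service over a time window.
// `samples` is an array of { durationMs, ok } (hypothetical shape).
function redMetrics(samples, windowSeconds) {
  const durations = samples.map(s => s.durationMs).sort((a, b) => a - b);
  const errors = samples.filter(s => !s.ok).length;

  return {
    rate: samples.length / windowSeconds, // requests per second
    errorRate: samples.length === 0 ? 0 : (errors * 100) / samples.length, // percent
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}

// Example: 100 requests in a 10-second window, two of them slow failures.
const samples = Array.from({ length: 98 }, (_, i) => ({ durationMs: 10 + i, ok: true }));
samples.push({ durationMs: 2000, ok: false }, { durationMs: 2000, ok: false });

const red = redMetrics(samples, 10);
console.log(red);
```

In this example the median stays at 59 ms while the p99 jumps to 2000 ms, which is the tail latency that an average or a p50 alone would hide.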

Inter-service communication monitoring

The hardest part of microservices monitoring is the space between services. When Service A calls Service B, which calls Service C, a timeout in C cascades back through B and A.

Key patterns for inter-service monitoring:

Circuit breakers — Track when circuits open. A tripped circuit breaker is an early warning.
Retry rates — If retries spike, a downstream service is struggling.
Queue depth — For async communication, monitor queue sizes. Growing queues mean consumers can't keep up.
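
As a rough illustration of tracking when circuits open, the sketch below keeps counters alongside the breaker state so they can be exported as metrics. The class, its method names, and the default thresholds are hypothetical; a production service would more likely use an existing circuit breaker library and read its events.

```javascript
// Minimal circuit breaker that counts state changes for monitoring.
// All names and defaults here are illustrative assumptions.
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetAfterMs = 10000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.consecutiveFailures = 0;
    this.openedAt = null; // null means the circuit is closed
    // Counters to export to whatever metrics system is in use.
    this.stats = { failures: 0, rejected: 0, opens: 0 };
  }

  // Returns true if a call to the downstream service may proceed.
  allowRequest(nowMs = Date.now()) {
    if (this.openedAt === null) return true;
    if (nowMs - this.openedAt >= this.resetAfterMs) return true; // trial call
    this.stats.rejected += 1; // failing fast while the circuit is open
    return false;
  }

  recordSuccess() {
    this.consecutiveFailures = 0;
    this.openedAt = null; // close the circuit
  }

  recordFailure(nowMs = Date.now()) {
    this.stats.failures += 1;
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      if (this.openedAt === null) this.stats.opens += 1; // closed -> open
      this.openedAt = nowMs; // start (or restart) the reset window
    }
  }
}

// Example with explicit timestamps so the behavior is easy to follow.
const breaker = new CircuitBreaker({ failureThreshold: 3, resetAfterMs: 10000 });
for (let i = 0; i < 3; i++) breaker.recordFailure(0);

const blocked = breaker.allowRequest(1000); // still inside the reset window
const trial = breaker.allowRequest(10000);  // window elapsed, trial allowed
console.log(breaker.stats, { blocked, trial });
```

The `opens` counter is the early-warning signal: a change from 0 to 1 means a downstream dependency has failed repeatedly, often before user-facing error rates move.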

Alerting without alert fatigue

The biggest mistake teams make is alerting on everything. When you have 20 services each with 10 alerts, that's 200 potential notifications. Your team will quickly start ignoring them.

Rules for sustainable alerting:

Alert on symptoms, not causes — "User-facing error rate above 1%" is better than "Service B CPU at 80%"
Use severity levels — Critical (pages someone) vs. Warning (Slack notification) vs. Info (logged)
Require multi-signal confirmation — One failed health check is noise. Three consecutive failures from multiple regions is signal.
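
The multi-signal rule can be sketched as a small function over recent health check results. The input shape (region name mapped to an array of pass/fail booleans, oldest first), the function name, and the defaults of three consecutive failures in at least two regions are assumptions for illustration.

```javascript
// Decide whether to alert, given recent health check results per region.
// `history` maps region name -> array of booleans (true = check passed),
// ordered oldest first. Shape and defaults are illustrative assumptions.
function shouldAlert(history, { consecutiveFailures = 3, minRegions = 2 } = {}) {
  let failingRegions = 0;

  for (const results of Object.values(history)) {
    const recent = results.slice(-consecutiveFailures);
    const confirmed =
      recent.length === consecutiveFailures && recent.every(passed => !passed);
    if (confirmed) failingRegions += 1;
  }

  return failingRegions >= minRegions;
}

// A single failed check in one region: treated as noise.
const blip = {
  'us-east': [true, true, false],
  'eu-west': [true, true, true],
};

// Three consecutive failures in two regions: treated as signal.
const outage = {
  'us-east': [true, false, false, false],
  'eu-west': [false, false, false],
  'ap-south': [true, true, true],
};

console.log(shouldAlert(blip));   // false
console.log(shouldAlert(outage)); // true
```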

Practical dashboard layout

Your main monitoring dashboard should answer one question: "Is everything okay right now?"

Organize it like this:

Top row — Service health grid (green/yellow/red for each service)
Second row — Overall error rate and response time graphs
Third row — Per-service key metrics
Bottom — Recent incidents and deployments timeline
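
One way to drive the top-row health grid is to reduce each service's signals to a single color and take the worst color as the overall answer. The metric field names and the threshold values below are placeholders, not recommendations; tune them to your own baselines.

```javascript
// Placeholder thresholds; these values are assumptions, not recommendations.
const DEFAULT_THRESHOLDS = {
  warnErrorPct: 1,
  critErrorPct: 5,
  warnP99Ms: 500,
  critP99Ms: 2000,
};

// Map one service's metrics to a grid color.
// `m` is { healthy, errorRatePct, p99Ms } (hypothetical shape).
function serviceColor(m, t = DEFAULT_THRESHOLDS) {
  if (!m.healthy || m.errorRatePct >= t.critErrorPct || m.p99Ms >= t.critP99Ms) {
    return 'red';
  }
  if (m.errorRatePct >= t.warnErrorPct || m.p99Ms >= t.warnP99Ms) {
    return 'yellow';
  }
  return 'green';
}

// Build the grid cells plus a single overall status (the worst cell).
function healthGrid(services) {
  const order = { green: 0, yellow: 1, red: 2 };
  const cells = Object.entries(services).map(([name, m]) => ({
    name,
    color: serviceColor(m),
  }));
  const overall = cells.reduce(
    (worst, c) => (order[c.color] > order[worst] ? c.color : worst),
    'green'
  );
  return { overall, cells };
}

const grid = healthGrid({
  checkout: { healthy: true, errorRatePct: 0.2, p99Ms: 180 },
  search: { healthy: true, errorRatePct: 1.5, p99Ms: 300 },
  payments: { healthy: false, errorRatePct: 0, p99Ms: 0 },
});
console.log(grid);
```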

Start simple

You don't need a $50,000/year observability platform to monitor microservices effectively. Start with health check monitoring for every service, add basic metrics, and build from there. If you're looking for a simple way to monitor your service endpoints across regions, PingGuard provides multi-region health checks with smart alerting — free for up to 5 endpoints. Start with visibility, then add complexity as your system grows.