How to Reduce False Uptime Alerts by 90%
False uptime alerts are the fastest way to make your team ignore real alerts. When your phone buzzes at 3am because a monitoring probe had a network hiccup — not because your service is actually down — you start to distrust the system. And then when a real outage happens, you're slower to respond.
Here are four techniques that, combined, can reduce false alerts by 90% or more.
The false alert problem
Most basic monitoring works like this: send an HTTP request, check if the response is a 200. If it's not, fire an alert. The problem is obvious — there are many reasons a single check might fail that have nothing to do with your service being down:
- Transient network issues between the monitoring server and your app
- DNS resolution hiccups
- Brief cloud provider blips (AWS, GCP, etc.)
- Monitoring service's own network problems
- CDN edge node issues
- One-off request timeouts under load
None of these mean your service is down. Your actual users might not be affected at all. But a simple "check failed = alert" system can't tell the difference.
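To make that concrete, here's a minimal sketch of the naive approach (the URL, check interval, and sendAlert helper are placeholders, assuming a runtime with a global fetch):

// Naive single-location check: any failure fires an alert (simplified)
const sendAlert = (message) => console.error(`ALERT: ${message}`); // stand-in for paging

async function naiveCheck(url) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
    if (!res.ok) sendAlert(`${url} returned ${res.status}`);
  } catch (err) {
    // DNS hiccups, transient network issues, and timeouts all land here too
    sendAlert(`${url} failed: ${err.message}`);
  }
}

setInterval(() => naiveCheck("https://example.com/health"), 30_000);

Every failure mode in the list above trips that catch block and pages someone, even when real users are unaffected.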
Solution 1: Multi-region verification
Instead of checking from a single location, check from multiple geographic regions simultaneously. If your service is truly down, it'll be down from everywhere. If only one region sees a failure, it's almost certainly a network issue, not a real outage.
The key is majority voting. With 3 checking regions (e.g., US East, EU West, Asia Pacific), you only mark an endpoint as down if 2 out of 3 regions agree the check failed. This single technique eliminates most false positives from localized network issues.
// Majority voting logic (simplified)
const results = await Promise.all([
checkFromRegion("us-east"),
checkFromRegion("eu-west"),
checkFromRegion("ap-south"),
]);
const failures = results.filter(r => !r.ok).length;
const isDown = failures >= 2; // majority must fail

This is more expensive to run (3x the checks), but the reduction in false alerts is dramatic.
Solution 2: Consecutive failure thresholds
Even with multi-region checks, a single round of failures shouldn't immediately trigger an alert. Require multiple consecutive failed checks before marking an endpoint as down.
For example, with 30-second check intervals and a threshold of 2 consecutive failures, the endpoint has to keep failing across two checks in a row, roughly 30 to 60 seconds of sustained failure, before an alert fires. This filters out brief transient issues while still catching real outages quickly.
The tradeoff is detection latency. A 2-failure threshold adds 30 seconds to your detection time. For most applications, that's an acceptable tradeoff for significantly fewer false alerts. For ultra-critical systems, you might use a threshold of 1 but rely on multi-region voting instead.
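Here's a minimal sketch of that threshold logic; the function and variable names are illustrative, not from any particular tool:

// Consecutive failure threshold (simplified)
const FAILURE_THRESHOLD = 2; // consecutive failed checks required before alerting
let consecutiveFailures = 0;

function shouldAlert(checkPassed) {
  if (checkPassed) {
    consecutiveFailures = 0; // a single success resets the streak
    return false;
  }
  consecutiveFailures += 1;
  return consecutiveFailures >= FAILURE_THRESHOLD; // alert only once the streak is long enough
}

// With 30-second intervals: shouldAlert(false) twice in a row
// means roughly 30-60 seconds of real failure before anyone is paged.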
Solution 3: Smart state machine
The most sophisticated approach is modeling your endpoint's status as a state machine rather than a simple binary up/down toggle. Here's how it works:
- Up — Endpoint is healthy. Normal state.
- Confirming Down — A check failed. Run additional verification checks before alerting.
- Down — Multiple checks confirmed the endpoint is down. Alert sent.
- Recovering — A check passed while in Down state. Wait for additional passes to confirm recovery.
- Up (again) — Recovery confirmed. Send recovery notification.
The "confirming" states are what make this powerful. When a check fails, you don't immediately alert. You enter a "confirming down" state and run one or two more checks from different regions. Only if those also fail do you transition to "down" and send the alert.
Similarly, when a down endpoint passes one check, you don't immediately send a recovery notification. You wait for one or two more successful checks to confirm it's actually back. This prevents "flapping" notifications (down → up → down → up) during intermittent issues.
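Here's one way that lifecycle might look in code. The state names mirror the list above, while the confirmation and recovery counts are assumed values you'd tune:

// Endpoint status state machine (simplified)
const CONFIRM_CHECKS = 2;  // consecutive failures needed to confirm "down" (assumed value)
const RECOVER_CHECKS = 2;  // consecutive passes needed to confirm recovery (assumed value)

function nextState(state, checkPassed) {
  switch (state.name) {
    case "up":
      return checkPassed ? state : { name: "confirming-down", fails: 1 };
    case "confirming-down":
      if (checkPassed) return { name: "up" }; // transient blip, no alert sent
      return state.fails + 1 >= CONFIRM_CHECKS
        ? { name: "down" } // alert fires on this transition
        : { name: "confirming-down", fails: state.fails + 1 };
    case "down":
      return checkPassed ? { name: "recovering", passes: 1 } : state;
    case "recovering":
      if (!checkPassed) return { name: "down" }; // still flapping, no recovery notice yet
      return state.passes + 1 >= RECOVER_CHECKS
        ? { name: "up" } // recovery notification fires on this transition
        : { name: "recovering", passes: state.passes + 1 };
    default:
      return { name: "up" };
  }
}

Because notifications only fire on the confirming-down to down and recovering to up transitions, a flapping endpoint produces one alert and one recovery notice instead of a burst of both.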
Solution 4: Configurable degraded thresholds
Not every issue is a binary up or down. Sometimes your endpoint responds with a 200 but takes 5 seconds instead of the usual 200ms. That's not an outage, but it's not healthy either.
A "degraded" state fills this gap. You can set a response time threshold — say, 2 seconds — where the endpoint is marked as degraded rather than down. This gives you a separate, lower-urgency alert channel for performance issues versus actual outages.
Typical configuration:
- Healthy: Response in under 1 second with 2xx status
- Degraded: Response in 1-5 seconds with 2xx status
- Down: Response time exceeds 5 seconds, or non-2xx status
This lets you reserve your urgent notification channel (phone calls, Slack alerts with @here) for real outages while still tracking performance degradation.
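A sketch of that classification, using the thresholds from the list above (the function name and exact cutoffs are illustrative):

// Healthy / degraded / down classification (simplified)
const DEGRADED_MS = 1_000; // a 2xx slower than this is "degraded"
const DOWN_MS = 5_000;     // slower than this counts as "down", even with a 2xx

function classify(statusCode, responseTimeMs) {
  const is2xx = statusCode >= 200 && statusCode < 300;
  if (!is2xx || responseTimeMs > DOWN_MS) return "down";
  if (responseTimeMs > DEGRADED_MS) return "degraded";
  return "healthy";
}

// classify(200, 180)  -> "healthy"
// classify(200, 2500) -> "degraded"  (lower-urgency channel)
// classify(503, 120)  -> "down"      (urgent channel)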
How PingGuard implements these
PingGuard uses all four techniques by default:
- Multi-region checks from US, EU, and Asia with majority voting — 2 out of 3 regions must agree
- Consecutive failure threshold before transitioning from "confirming" to "down"
- Smart state machine with confirming, down, and recovering states to prevent flapping
- Response time thresholds for degraded state detection
You don't have to configure any of this manually. It's how PingGuard works out of the box. The result is that when you get an alert, you can trust it's a real issue — not a network blip.
The bottom line
False alerts erode trust in your monitoring system. When your team starts ignoring alerts because "it's probably nothing," you've lost the entire point of monitoring.
Multi-region voting, consecutive failure thresholds, smart state machines, and degraded states work together to ensure that when your phone buzzes, something is actually wrong. Implement these and you'll sleep a lot better.