# Why Your CI/CD Pipeline Needs Uptime Monitoring
The most common cause of production incidents isn't hardware failure or traffic spikes — it's deployments. Bad deployments cause 60-70% of outages according to multiple industry surveys. Yet most teams deploy blind, only finding out something broke when users complain.
Integrating uptime monitoring into your CI/CD pipeline catches these regressions in minutes instead of hours.
## The deployment risk window
Every deployment creates a risk window — a period where things might break. This window starts when the deployment begins and ends when you've confirmed everything is working. Most teams leave this window open indefinitely because they don't have automated post-deployment verification.
```
// The risk window timeline

Deploy starts → New code is live → Verified healthy
|_______________|__________________|
   Deployment        Risk window
    (2-5 min)    (??? - often hours)
```
The goal: close the risk window within 5 minutes of deployment, not hours.
## Post-deployment health verification
After every deployment, automatically verify that your critical endpoints are healthy. This is the minimum viable deployment monitoring:
```yaml
# GitHub Actions post-deploy verification
- name: Verify deployment health
  run: |
    echo "Waiting 30s for deployment to stabilize..."
    sleep 30

    # Check health endpoint
    STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
      https://myapp.com/api/health)

    if [ "$STATUS" != "200" ]; then
      echo "Health check failed with status $STATUS"
      echo "Rolling back deployment..."
      vercel rollback
      exit 1
    fi
    echo "Health check passed"
```
## Beyond simple status codes
A 200 status code doesn't mean everything is fine. Your post-deployment check should also verify the points below; the sketch after the list shows the first two in code:
- Response time — Is the response within normal bounds?
- Response body — Does the health endpoint report all dependencies as healthy?
- Critical user flows — Can users log in? Can they make purchases?
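Here's a minimal sketch of the first two checks as a Node script you could run from a CI step. The health endpoint URL, the 1-second latency budget, and the response shape (`{ status, dependencies }`) are assumptions; adjust them to your own API:

```javascript
// Hypothetical post-deploy check: verifies latency and dependency health,
// not just the status code. URL, threshold, and response shape are assumptions.
const HEALTH_URL = 'https://myapp.com/api/health';
const MAX_LATENCY_MS = 1000; // adjust to your normal baseline

async function verifyHealth() {
  const start = Date.now();
  const res = await fetch(HEALTH_URL);
  const latency = Date.now() - start;

  if (res.status !== 200) {
    throw new Error(`Health check returned ${res.status}`);
  }
  if (latency > MAX_LATENCY_MS) {
    throw new Error(`Health check took ${latency}ms (limit ${MAX_LATENCY_MS}ms)`);
  }

  // Assumes the endpoint reports its dependencies, e.g.
  // { "status": "ok", "dependencies": { "db": "ok", "cache": "ok" } }
  const body = await res.json();
  const unhealthy = Object.entries(body.dependencies ?? {})
    .filter(([, state]) => state !== 'ok');
  if (unhealthy.length > 0) {
    throw new Error(`Unhealthy dependencies: ${unhealthy.map(([name]) => name).join(', ')}`);
  }
}

verifyHealth()
  .then(() => console.log('Health check passed'))
  .catch((err) => { console.error(err.message); process.exit(1); });
```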
## Continuous monitoring during rollout
If you use progressive rollouts (canary deployments, blue-green, rolling updates), monitoring during the rollout is essential. You need to compare error rates and response times between the old and new versions:
```javascript
// Canary monitoring check
// getMetrics, rollbackCanary, and notify are integration helpers you supply.
async function checkCanaryHealth() {
  const canaryMetrics = await getMetrics('canary');
  const stableMetrics = await getMetrics('stable');

  // Compare error rates
  if (canaryMetrics.errorRate > stableMetrics.errorRate * 1.5) {
    await rollbackCanary();
    await notify('Canary rolled back: error rate 50% higher');
    return false;
  }

  // Compare latency
  if (canaryMetrics.p95Latency > stableMetrics.p95Latency * 2) {
    await rollbackCanary();
    await notify('Canary rolled back: latency doubled');
    return false;
  }
  return true;
}
```
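In practice you'd run this check on a loop for the whole rollout and only promote the canary once the bake period passes cleanly. A sketch of that loop, where the 30-second interval, the 15-minute bake time, and the `promoteCanary` helper are all assumptions:

```javascript
// Poll the canary for a fixed bake period before promoting.
// Interval and bake time are illustrative; tune them to your traffic.
async function monitorRollout() {
  const BAKE_TIME_MS = 15 * 60 * 1000; // watch the canary for 15 minutes
  const INTERVAL_MS = 30 * 1000;       // check every 30 seconds
  const deadline = Date.now() + BAKE_TIME_MS;

  while (Date.now() < deadline) {
    const healthy = await checkCanaryHealth();
    if (!healthy) return; // checkCanaryHealth already rolled back and notified
    await new Promise((resolve) => setTimeout(resolve, INTERVAL_MS));
  }

  // Every check passed for the whole bake period: shift full traffic.
  await promoteCanary(); // hypothetical helper
}
```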
## Deployment annotations
Mark deployments on your monitoring timeline. When an incident occurs, the first question is always "did anything change?" Deployment annotations make this instantly visible.
Most monitoring tools support annotations via API. Trigger one from your CI/CD pipeline:
```yaml
# Add deployment annotation after successful deploy
- name: Annotate deployment
  run: |
    curl -X POST https://monitoring.example.com/api/annotations \
      -H "Authorization: Bearer $MONITORING_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "title": "Deployment v${{ github.sha }}",
        "description": "Deployed by ${{ github.actor }}",
        "timestamp": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
      }'
```
## Smoke tests vs. monitoring
Smoke tests and monitoring serve different purposes:
| Aspect | Smoke tests | Uptime monitoring |
|---|---|---|
| When | Once, after deploy | Continuously |
| Catches | Immediate breakage | Degradation over time |
| Duration | Seconds | Ongoing |
| Scope | Predefined test cases | Real user paths |
You need both. Smoke tests catch "the deploy is completely broken" scenarios. Continuous monitoring catches "the deploy caused a slow memory leak that crashes the service after 2 hours."
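A post-deploy smoke test doesn't need a framework; one request per critical flow is enough to start. A sketch, where the endpoints, payloads, and expected statuses are placeholders for your own flows:

```javascript
// Minimal post-deploy smoke test: one request per critical flow.
// Endpoints and payloads are placeholders for your own flows.
const BASE = 'https://myapp.com';

const flows = [
  { name: 'homepage', url: `${BASE}/`, expect: 200 },
  { name: 'login', url: `${BASE}/api/login`, method: 'POST',
    body: { email: 'smoke@example.com', password: process.env.SMOKE_PASSWORD },
    expect: 200 },
  { name: 'product search', url: `${BASE}/api/search?q=test`, expect: 200 },
];

async function runSmokeTests() {
  for (const flow of flows) {
    const res = await fetch(flow.url, {
      method: flow.method ?? 'GET',
      headers: flow.body ? { 'Content-Type': 'application/json' } : undefined,
      body: flow.body ? JSON.stringify(flow.body) : undefined,
    });
    if (res.status !== flow.expect) {
      console.error(`Smoke test '${flow.name}' failed: ${res.status}`);
      process.exit(1); // fail the CI step so the pipeline can roll back
    }
    console.log(`Smoke test '${flow.name}' passed`);
  }
}

runSmokeTests();
```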
## Notification routing
Route deployment-related alerts differently from general alerts. The person who deployed should be the first to know if something breaks; the sketch after this list shows one way to implement the routing:
- Post-deployment failure → Notify the deployer directly
- Degradation within 1 hour of deployment → Notify the deployer + on-call
- Issue after 1 hour → Normal on-call rotation
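Here's a sketch of that routing logic. It assumes your alert payload carries a type and that your CI/CD system can tell you who deployed last and when; the notify helpers are stubs to replace with your real integrations:

```javascript
// Route an alert based on how long ago the last deployment happened.
// lastDeploy would come from your CI/CD system or the annotations above.
const ONE_HOUR_MS = 60 * 60 * 1000;

// Stubs: swap in your real Slack/webhook/paging integrations.
async function notifyDeployer(user, alert) {
  console.log(`Notify deployer @${user}: ${alert.message}`);
}
async function notifyOnCall(alert) {
  console.log(`Notify on-call: ${alert.message}`);
}

async function routeAlert(alert, lastDeploy) {
  const sinceDeploy = Date.now() - lastDeploy.timestamp;

  if (alert.type === 'post-deploy-failure') {
    // Post-deployment failure: the deployer hears about it first.
    return notifyDeployer(lastDeploy.author, alert);
  }
  if (sinceDeploy < ONE_HOUR_MS) {
    // Degradation within an hour of a deploy: deployer plus on-call.
    await notifyDeployer(lastDeploy.author, alert);
    return notifyOnCall(alert);
  }
  // Anything later goes through the normal on-call rotation.
  return notifyOnCall(alert);
}
```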
## Setting it up
You don't need a complex monitoring infrastructure to get started. The minimum setup is an uptime monitor on your health endpoint that checks frequently (every 30-60 seconds) and alerts your team when it detects a problem. If you're looking for a straightforward way to add endpoint monitoring that works with any CI/CD pipeline, PingGuard monitors your endpoints from 3 regions, sends alerts via Slack and webhooks, and gives you an uptime history that makes deployment-caused issues immediately visible. Free for up to 5 endpoints.
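If you want to see the mechanics before adopting a tool, here's a bare-bones sketch of that minimum setup: a loop that checks the health endpoint every 30 seconds and posts to a Slack incoming webhook on failure. Both URLs are placeholders:

```javascript
// Bare-bones uptime monitor: check every 30s, alert Slack on failure.
// Both URLs are placeholders; run this on any always-on machine.
const HEALTH_URL = 'https://myapp.com/api/health';
const SLACK_WEBHOOK = process.env.SLACK_WEBHOOK_URL;

async function check() {
  try {
    const res = await fetch(HEALTH_URL, { signal: AbortSignal.timeout(10_000) });
    if (res.status !== 200) await alert(`Health check returned ${res.status}`);
  } catch (err) {
    await alert(`Health check failed: ${err.message}`);
  }
}

async function alert(message) {
  await fetch(SLACK_WEBHOOK, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: message }),
  });
}

setInterval(check, 30_000); // every 30 seconds
check();
```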