The Developer's Incident Response Playbook
When production breaks at 2am, the difference between a 5-minute fix and a 2-hour outage comes down to one thing: preparation. Most teams don't have an incident response plan until after their first painful outage. This guide gives you the playbook before that happens.
Phase 1: Detection
The fastest incidents to resolve are the ones you catch before users notice. Automated monitoring is your first line of defense. You need three types of detection in place:
- Uptime monitoring — Is the service reachable? Are endpoints returning expected status codes?
- Performance monitoring — Are response times within acceptable bounds?
- Error rate tracking — Is the error rate spiking above your baseline?
Set up alerts with appropriate thresholds. A single timeout isn't an incident — but three consecutive failures from multiple regions are a strong signal. Use multi-region checks and majority voting to reduce false positives.
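The exact thresholds depend on your traffic, but the rule itself is simple enough to sketch. Here's what the "consecutive failures plus region quorum" logic might look like in TypeScript (the type, region handling, and thresholds are illustrative, not any particular tool's API):

// Decide whether a run of check results should page a human.
// A "check result" is the outcome of one uptime probe from one region.
// history is assumed to be ordered oldest-to-newest.
type CheckResult = { region: string; ok: boolean; timestampMs: number };

const CONSECUTIVE_FAILURES = 3; // per-region threshold
const REGION_QUORUM = 2;        // how many regions must agree

function shouldAlert(history: CheckResult[]): boolean {
  // Group results by region, preserving chronological order.
  const byRegion = new Map<string, CheckResult[]>();
  for (const result of history) {
    const list = byRegion.get(result.region) ?? [];
    list.push(result);
    byRegion.set(result.region, list);
  }

  // A region "votes" for an incident if its last N checks all failed.
  let failingRegions = 0;
  for (const results of byRegion.values()) {
    const recent = results.slice(-CONSECUTIVE_FAILURES);
    if (recent.length === CONSECUTIVE_FAILURES && recent.every((r) => !r.ok)) {
      failingRegions++;
    }
  }

  // Only alert when enough regions agree; a single flaky probe stays quiet.
  return failingRegions >= REGION_QUORUM;
}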
Phase 2: Triage
When an alert fires, the first 60 seconds matter most. Here's a triage checklist:
- Acknowledge the alert — Let your team know someone is on it
- Check the scope — Is it one endpoint, one service, or everything?
- Check recent changes — Was there a deployment in the last hour?
- Check dependencies — Is the database up? Are third-party APIs responding?
Most incidents fall into three categories: bad deployment, infrastructure failure, or dependency outage. Knowing which category you're in determines your next step.
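Much of that checklist can be scripted ahead of time. A scope check, for instance, might just probe your key endpoints and dependencies in parallel and report which ones are down. A minimal sketch, assuming a recent Node.js runtime with global fetch (the URLs are placeholders for your own services):

// Quick scope check: probe key endpoints and report which are failing.
const endpoints = [
  "https://api.example.com/health",
  "https://app.example.com/health",
  "https://db-proxy.internal.example.com/health",
];

async function checkScope(): Promise<void> {
  const results = await Promise.all(
    endpoints.map(async (url) => {
      try {
        // A timeout keeps one hung dependency from stalling the whole triage.
        const res = await fetch(url, { signal: AbortSignal.timeout(5000) });
        return { url, ok: res.ok, status: res.status };
      } catch {
        return { url, ok: false, status: 0 };
      }
    })
  );
  for (const r of results) {
    console.log(`${r.ok ? "UP  " : "DOWN"} ${r.url} (${r.status || "no response"})`);
  }
}

checkScope();

If everything is down, you're likely looking at infrastructure; if one service is down right after it deployed, you're looking at a rollback.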
The deployment rollback decision
If a deployment happened recently and the symptoms match, roll back first, investigate later. A 30-second rollback is almost always better than a 30-minute debugging session while users are affected. You can always redeploy once you understand the issue.
# Quick rollback on Vercel
vercel rollback
# Or on Kubernetes
kubectl rollout undo deployment/my-app
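Either way, confirm the rollback actually completed before moving on. On Kubernetes, for example:
# Watch the rollback roll out
kubectl rollout status deployment/my-app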
Phase 3: Communication
While you're fixing the issue, someone needs to communicate. On a small team, that someone might be the person doing the fixing. Key communications:
- Update your status page — Even a simple "investigating" message reduces support tickets by 50%
- Notify stakeholders — A brief message in Slack with what you know so far
- Set expectations — "We're investigating, next update in 15 minutes" is better than silence
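Most status page providers expose an API for exactly this, which means updates can be posted from a script or a chat command instead of a dashboard click. A sketch of the idea; the endpoint, token, and payload shape here are hypothetical, so check your provider's docs:

// Post an "investigating" update to a status page.
// URL, auth, and payload are hypothetical placeholders; adapt to your provider.
async function postStatusUpdate(message: string): Promise<void> {
  const res = await fetch("https://api.statuspage.example.com/v1/incidents", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.STATUSPAGE_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      status: "investigating",
      message,
      nextUpdateMinutes: 15, // set expectations, per the checklist above
    }),
  });
  if (!res.ok) throw new Error(`Status page update failed: ${res.status}`);
}

postStatusUpdate("We're investigating elevated error rates on the API.").catch(console.error);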
Phase 4: Resolution
Once you've identified the root cause, fix it. But don't just fix it — verify it. After deploying a fix:
- Watch your monitoring dashboards for 10-15 minutes
- Verify the specific endpoint or feature that was broken
- Check error rates are back to baseline
- Update your status page to "resolved"
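That watch period can be partly automated. A minimal sketch that re-probes the previously broken endpoint every 30 seconds for 15 minutes (the URL is a placeholder for whatever was actually failing):

// After deploying a fix, re-probe the broken endpoint on an interval.
const target = "https://api.example.com/checkout/health";
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function watchRecovery(): Promise<void> {
  let failures = 0;
  for (let i = 0; i < 30; i++) { // 30 checks x 30s = 15 minutes
    const res = await fetch(target).catch(() => null);
    if (!res || !res.ok) {
      failures++;
      console.log(`${new Date().toISOString()} still failing (${res ? res.status : "no response"})`);
    }
    await sleep(30_000);
  }
  console.log(failures === 0 ? "Clean for 15 minutes." : `${failures} failures - keep digging.`);
}

watchRecovery();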
Phase 5: Post-mortem
The post-mortem is where teams actually improve. Write it within 48 hours while details are fresh. A good post-mortem includes:
- Timeline — When did it start, when was it detected, when was it resolved?
- Root cause — What actually broke and why?
- Detection gap — Could we have caught this sooner?
- Action items — What will we change to prevent this class of issue?
The most important rule: post-mortems are blameless. Focus on systems and processes, not individuals. "The deployment pipeline didn't run integration tests" is actionable. "John pushed bad code" is not.
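A fixed skeleton makes the 48-hour deadline easier to hit, because nobody has to invent a structure under pressure. Something like this is usually enough:

Post-mortem: <incident name>, <date>
Impact: <who was affected, how badly, and for how long>
Timeline (UTC):
  <time> change deployed / incident began
  <time> alert fired / incident detected
  <time> mitigated (rollback, failover, etc.)
  <time> resolved and verified
Root cause: <what actually broke, and why>
Detection gap: <could monitoring have caught this sooner?>
Action items: <each one with a single owner>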
Building your runbook
Every team should maintain a simple runbook — a document that answers "what do I do when X happens?" Common entries include:
- Database connection failures — check connection pool, restart service
- High memory usage — check for memory leaks, scale horizontally
- Third-party API down — enable circuit breaker, serve cached data
- SSL certificate expired — renew and redeploy
Keep it simple, keep it updated, and make sure everyone on the team knows where to find it.
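Runbook entries are even more useful when they point at real code. The "third-party API down" entry above, for example, might link to a circuit breaker along these lines (a minimal sketch; the wrapped API and cached fallback are placeholders):

// Minimal circuit breaker: after N consecutive failures, stop calling the
// third-party API for a cooldown period and serve a fallback instead.
class CircuitBreaker<T> {
  private failures = 0;
  private openUntil = 0; // epoch ms; 0 means the circuit is closed

  constructor(
    private fetchLive: () => Promise<T>,
    private fallback: () => T,
    private maxFailures = 3,
    private cooldownMs = 60_000,
  ) {}

  async call(): Promise<T> {
    if (Date.now() < this.openUntil) return this.fallback(); // circuit open: skip the API
    try {
      const result = await this.fetchLive();
      this.failures = 0; // success closes the circuit
      return result;
    } catch {
      if (++this.failures >= this.maxFailures) {
        this.openUntil = Date.now() + this.cooldownMs; // trip the breaker
        this.failures = 0;
      }
      return this.fallback();
    }
  }
}

// Usage: wrap the flaky dependency once, call it everywhere.
const rates = new CircuitBreaker(
  () => fetch("https://rates.example.com/latest").then((r) => r.json()),
  () => ({ cached: true }), // serve stale/cached data while the API is down
);
// rates.call() now degrades gracefully instead of failing loudly.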
Start with monitoring
None of this works without reliable detection. If you're looking for a simple way to monitor your endpoints and get alerted when things break, PingGuard gives you multi-region uptime checks, status pages, and instant alerts — free for up to 5 endpoints. It's the foundation your incident response playbook needs.