SRE On-Call Best Practices: How to Reduce Burnout and Improve Incident Response
Direct answer: Reducing on-call burnout requires three changes: (1) reducing alert volume by eliminating noise at the source (target 70%+ alert-to-action ratio), (2) improving incident context so engineers spend less time reconstructing what happened and more time fixing it, and (3) establishing sustainable on-call policies with explicit recovery time, escalation paths, and post-incident learning loops.
By AlertStellar Team · 9 min read · Updated 2026-03-01
Tags: SRE, on-call, incident response, burnout, platform engineering
What the Research Says
On-call burnout is not a soft problem — it has hard business consequences. Blameless's 2024 SRE Report (n=1,200 engineers) found that engineers who are paged more than 5 times per on-call shift have a 2.7x higher attrition rate within 12 months. The same report found that mean time to resolution (MTTR) increases by 38% after the first 2 hours of an on-call shift due to cognitive fatigue.
Google's Site Reliability Engineering book establishes a 50% cap on operational work for SREs, but a 2024 survey by LinearB found that 64% of SREs spend more than 60% of their time on reactive operational work — well above the healthy threshold.
The AlertStellar On-Call Health Framework
The On-Call Health Framework is a 4-step system for auditing your current on-call posture and making targeted improvements. Teams that complete all 4 steps typically reduce page volume by 45–65% and MTTR by 30–50% within 90 days.
- Step 1 — Baseline measurement: Export 90 days of alert data and compute: total pages per on-call shift, alert-to-action ratio, false-positive rate, and MTTR by alert type. This is your before state. Without a baseline, you can't measure improvement.
- Step 2 — Noise elimination: Apply the 3-Signal Assessment (actionability audit, correlation mapping, ownership gap analysis). Target: eliminate at least 40% of pages within 30 days. Anything with an alert-to-action ratio below 30% gets deleted or moved to a non-paging channel.
- Step 3 — Context injection: For remaining alerts, ensure each page contains: the service name and team owner, a direct link to the runbook, the current and historical metric values, and the last 3 related alerts for context. Engineers should arrive at an incident with a diagnosis hypothesis, not a blank slate.
- Step 4 — Policy and recovery: Establish explicit policies: maximum 2 pages per night per on-call engineer, guaranteed 8 hours of recovery time after a night incident, weekly on-call retrospectives (even for quiet weeks), and rotation schedules that spread load fairly. These policies must be enforced by tooling, not goodwill.
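The Step 2 triage rule can be sketched in a few lines of Python. This is a minimal illustration, not an AlertStellar API: the `Page` record, its `alert_name` and `actioned` fields, and the `triage` function are hypothetical names, and the sketch assumes your exported alert data already records whether each page led to a non-trivial action. Only the 30% threshold comes from the framework.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    alert_name: str  # hypothetical field: the alert rule that fired
    actioned: bool   # True if the responder took a non-trivial action


DEMOTE_BELOW = 0.30  # Step 2 threshold: below this, delete or stop paging


def triage(pages):
    """Return {alert_name: (ratio, decision)} for every alert rule seen."""
    totals = defaultdict(int)
    actioned = defaultdict(int)
    for page in pages:
        totals[page.alert_name] += 1
        if page.actioned:
            actioned[page.alert_name] += 1

    report = {}
    for name, total in totals.items():
        ratio = actioned[name] / total
        decision = "demote" if ratio < DEMOTE_BELOW else "keep"
        report[name] = (ratio, decision)
    return report


# Made-up sample data: one noisy rule and one useful rule.
pages = (
    [Page("disk_usage_warn", False)] * 9
    + [Page("disk_usage_warn", True)]
    + [Page("checkout_5xx", True)] * 4
    + [Page("checkout_5xx", False)]
)
report = triage(pages)
print(report["disk_usage_warn"])  # (0.1, 'demote')
print(report["checkout_5xx"])     # (0.8, 'keep')
```

Grouping by alert rule rather than computing a single global ratio is a design choice: it points at the specific rules to delete or reroute, which a team-wide average would hide.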
On-Call Health Benchmarks by Maturity Tier
| Metric | Struggling (<P50) | Healthy (P50–P75) | World-Class (P75+) |
|---|---|---|---|
| Pages per on-call shift | >15 pages/shift | 5–15 pages/shift | <5 pages/shift |
| Alert-to-action ratio | <40% | 40–70% | >70% |
| MTTR (median) | >60 min | 20–60 min | <20 min |
| False-positive rate | >40% | 15–40% | <15% |
| Time to context (diagnosis) | >15 min | 5–15 min | <5 min |
| On-call-related attrition | >25%/year | 10–25%/year | <10%/year |
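To place your own baseline numbers against the table above, a small lookup like the following may help. It is a sketch under assumptions: the metric keys and the `tier` function are hypothetical names, and values that fall exactly on a boundary (for example, 15 pages per shift) are treated as "healthy" because the table lists them in the middle column.

```python
# (struggling_bound, world_class_bound, higher_is_better), copied from the table.
BENCHMARKS = {
    "pages_per_shift": (15, 5, False),
    "alert_to_action_pct": (40, 70, True),
    "mttr_minutes": (60, 20, False),
    "false_positive_pct": (40, 15, False),
    "time_to_context_minutes": (15, 5, False),
    "attrition_pct_per_year": (25, 10, False),
}


def tier(metric, value):
    """Classify a measured value as 'struggling', 'healthy', or 'world-class'."""
    struggling, world_class, higher_is_better = BENCHMARKS[metric]
    if higher_is_better:
        if value > world_class:
            return "world-class"
        if value < struggling:
            return "struggling"
    else:
        if value < world_class:
            return "world-class"
        if value > struggling:
            return "struggling"
    return "healthy"


print(tier("pages_per_shift", 3))       # world-class
print(tier("alert_to_action_pct", 35))  # struggling
print(tier("mttr_minutes", 45))         # healthy
```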
When This Framework Isn't Enough
The On-Call Health Framework solves the structural problem, but it doesn't address the dynamic problem: modern systems change faster than static alert configurations. A noise-free alert stack at 9am can become alert-flooded at 9pm after a deployment or traffic spike. Sustainable on-call health requires continuous alert quality monitoring — not just a one-time audit.
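One simple way to approximate continuous alert quality monitoring is to compare the most recent page rate against a trailing baseline. The sketch below is illustrative only: `page_rate_spike` is a hypothetical function, and the one-hour window, seven-day baseline, 3x factor, and three-page minimum are assumed defaults you would tune for your own system.

```python
from datetime import datetime, timedelta


def page_rate_spike(page_times, now, window=timedelta(hours=1),
                    baseline=timedelta(days=7), factor=3.0, min_pages=3):
    """Return True if pages in the last `window` exceed `factor` times the
    average per-window rate over the preceding `baseline` period."""
    window_start = now - window
    baseline_start = window_start - baseline
    recent = sum(1 for t in page_times if window_start < t <= now)
    history = sum(1 for t in page_times if baseline_start < t <= window_start)
    # Dividing two timedeltas gives the number of windows in the baseline.
    expected = history / (baseline / window)
    # min_pages avoids flagging a single page after a very quiet week.
    return recent >= min_pages and recent > factor * expected


now = datetime(2026, 3, 1, 21, 0)
# Made-up history: one page a day over the previous six days.
quiet = [now - timedelta(hours=h) for h in range(24, 168, 24)]
# Made-up burst: four pages in the last hour, e.g. after a deployment.
burst = [now - timedelta(minutes=m) for m in (5, 15, 25, 40)]

print(page_rate_spike(quiet, now))          # False
print(page_rate_spike(quiet + burst, now))  # True
```

A check like this could run on a schedule and notify the team channel rather than page, so that monitoring alert quality does not itself add to page volume.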
Frequently Asked Questions
What is a healthy number of alerts per on-call shift?
The target is fewer than 5 actionable pages per engineer per 12-hour shift; 5–15 pages is still within the healthy range. Consistently exceeding 15 pages per shift indicates severe alert fatigue and is a retention risk. Google's Site Reliability Engineering book sets a ceiling of 2 incidents per 12-hour on-call shift. The key metric is not just volume but alert-to-action ratio: even 2 pages are too many if both are false positives.
How do you calculate alert-to-action ratio?
Alert-to-action ratio = (number of alerts that resulted in a non-trivial human action) / (total number of alerts) × 100. A "non-trivial action" means the engineer did something beyond acknowledging the alert: investigated the cause, made a configuration change, escalated, or opened an incident. Silencing, snoozing, or ignoring an alert does not count as an action. Track this metric monthly and target 70% or higher.
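The formula above translates directly into code. In this sketch the response labels (`"investigated"`, `"silenced"`, and so on) are hypothetical; map them to whatever your paging tool actually records. The split between actions that count and those that do not follows the definition given above.

```python
# Responses that count as a non-trivial human action.
NON_TRIVIAL = {"investigated", "config_change", "escalated", "incident_opened"}
# Anything else (acknowledged, silenced, snoozed, ignored) does not count.


def alert_to_action_ratio(responses):
    """Return the alert-to-action ratio as a percentage."""
    if not responses:
        raise ValueError("no alerts in the period; the ratio is undefined")
    actioned = sum(1 for r in responses if r in NON_TRIVIAL)
    return 100 * actioned / len(responses)


# Made-up sample: 10 alerts, 6 of which led to a non-trivial action.
responses = [
    "investigated", "silenced", "escalated", "acknowledged", "snoozed",
    "incident_opened", "config_change", "ignored", "investigated",
    "investigated",
]
ratio = alert_to_action_ratio(responses)
print(ratio)  # 60.0
```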
What is the SRE 50% operational work rule?
Google's SRE model establishes that SREs should spend no more than 50% of their time on operational (reactive) work, leaving at least 50% for engineering (proactive) work. When reactive work exceeds 50%, the excess is handed back to the development team as a forcing function to fix the reliability issues causing the operational load. This rule prevents SRE teams from becoming a permanent firefighting function and creates incentives to address root causes.
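As a rough worked example of the rule, the hours to hand back are whatever operational time exceeds the cap. The function name and the sample figures below are made up for illustration; how you attribute hours to "operational" work is an assumption you will need to define for your own team.

```python
def operational_overflow(ops_hours, total_hours, cap=0.5):
    """Hours of operational work above the cap, i.e. the amount to redirect."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return max(0.0, ops_hours - cap * total_hours)


# Made-up week: 5 SREs x 40 hours, 128 of which went to reactive work (64%).
overflow = operational_overflow(ops_hours=128, total_hours=200)
print(overflow)  # 28.0
```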
How long should on-call rotations be?
Industry consensus is 1-week rotations for most teams. Shorter rotations (3–4 days) reduce individual burnout but increase handoff overhead. Longer rotations (2 weeks) compound fatigue. The critical requirement is guaranteed recovery time: after a night with more than 2 pages, the on-call engineer should have 8 hours of no-page time before their next working day. This recovery guarantee is more important than rotation length.
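The recovery guarantee is easy to encode so that tooling, rather than goodwill, enforces it. This is a minimal sketch: `no_page_until` is a hypothetical helper, and it assumes the 8-hour window starts at the last page of the night, which is one reasonable reading of the policy rather than the only one.

```python
from datetime import datetime, timedelta


def no_page_until(night_pages, max_pages=2, recovery=timedelta(hours=8)):
    """Return the time before which the engineer should not be paged,
    or None if the night stayed within the page limit."""
    if len(night_pages) <= max_pages:
        return None
    return max(night_pages) + recovery


# Made-up night: three pages, the last at 04:30.
night = [
    datetime(2026, 3, 2, 1, 10),
    datetime(2026, 3, 2, 2, 45),
    datetime(2026, 3, 2, 4, 30),
]
print(no_page_until(night))      # 2026-03-02 12:30:00
print(no_page_until(night[:2]))  # None
```

A scheduler could use the returned timestamp to route pages to the secondary on-call until the recovery window ends.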