SRE On-Call Best Practices: How to Reduce Burnout and Improve Incident Response

Direct answer: Reducing on-call burnout requires three changes: (1) reducing alert volume by eliminating noise at the source (target 70%+ alert-to-action ratio), (2) improving incident context so engineers spend less time reconstructing what happened and more time fixing it, and (3) establishing sustainable on-call policies with explicit recovery time, escalation paths, and post-incident learning loops.

By AlertStellar Team · 9 min read · Updated 2026-03-01

Tags: SRE, on-call, incident response, burnout, platform engineering

What the Research Says

On-call burnout is not a soft problem — it has hard business consequences. Blameless's 2024 SRE Report (n=1,200 engineers) found that engineers who are paged more than 5 times per on-call shift have a 2.7x higher attrition rate within 12 months. The same report found that mean time to resolution (MTTR) increases by 38% after the first 2 hours of an on-call shift due to cognitive fatigue.

Google's Site Reliability Engineering book establishes a 50% cap on operational work for SREs, but a 2024 survey by LinearB found that 64% of SREs spend more than 60% of their time on reactive operational work — well above the healthy threshold.

The AlertStellar On-Call Health Framework

The On-Call Health Framework is a 4-step system for auditing your current on-call posture and making targeted improvements. Teams that complete all 4 steps typically reduce page volume by 45–65% and MTTR by 30–50% within 90 days.

  1. Baseline measurement: Export 90 days of alert data and compute total pages per on-call shift, alert-to-action ratio, false-positive rate, and MTTR by alert type. This is your before state. Without a baseline, you can't measure improvement.
  2. Noise elimination: Apply the 3-Signal Assessment (actionability audit, correlation mapping, ownership gap analysis). Target: eliminate at least 40% of pages within 30 days. Any alert with an alert-to-action ratio below 30% gets deleted or moved to a non-paging channel.
  3. Context injection: For the remaining alerts, ensure each page contains the service name and team owner, a direct link to the runbook, the current and historical metric values, and the last 3 related alerts. Engineers should arrive at an incident with a working diagnosis, not a blank slate.
  4. Policy and recovery: Establish explicit policies: a maximum of 2 pages per night per on-call engineer, a guaranteed 8 hours of recovery time after a night incident, weekly on-call retrospectives (even for quiet weeks), and rotation schedules that spread load fairly. These policies must be enforced by tooling, not goodwill.
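
Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Page` record and its field names are hypothetical and stand in for whatever your alerting tool exports, and the resolution time is summarized as a median to match the benchmark table's "MTTR (median)" row.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass
class Page:
    # Hypothetical record shape; map your alerting tool's export onto it.
    shift_id: str              # on-call shift that received the page
    alert_type: str            # e.g. "disk_full", "high_latency"
    actioned: bool             # engineer did more than acknowledge
    false_positive: bool       # nothing was actually wrong
    minutes_to_resolve: float


def pct(flags):
    """Percentage of True values in a non-empty list of booleans."""
    return 100 * sum(flags) / len(flags)


def baseline(pages):
    """Compute the Step 1 baseline metrics from a list of Page records."""
    if not pages:
        raise ValueError("export contains no pages")
    by_type = defaultdict(list)
    for p in pages:
        by_type[p.alert_type].append(p)
    # Only shifts with at least one page appear in the export, so this
    # overstates pages per shift when quiet shifts are common.
    shifts = {p.shift_id for p in pages}
    return {
        "pages_per_shift": len(pages) / len(shifts),
        "alert_to_action_pct": pct([p.actioned for p in pages]),
        "false_positive_pct": pct([p.false_positive for p in pages]),
        "median_minutes_to_resolve": {
            t: median(p.minutes_to_resolve for p in ps)
            for t, ps in by_type.items()
        },
        "action_pct_by_type": {
            t: pct([p.actioned for p in ps]) for t, ps in by_type.items()
        },
    }


def noisy_alert_types(metrics, threshold_pct=30.0):
    """Step 2 rule: alert types whose alert-to-action ratio is below the threshold."""
    return sorted(
        t for t, v in metrics["action_pct_by_type"].items() if v < threshold_pct
    )
```

Running `baseline` on the 90-day export gives the before state, and `noisy_alert_types` lists the candidates for deletion or demotion to a non-paging channel.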

On-Call Health Benchmarks by Performance Tier

Metric                       | Struggling (<P50) | Healthy (P50–P75) | World-Class (P75+)
Pages per on-call shift      | >15 pages/shift   | 5–15 pages/shift  | <5 pages/shift
Alert-to-action ratio        | <40%              | 40–70%            | >70%
MTTR (median)                | >60 min           | 20–60 min         | <20 min
False-positive rate          | >40%              | 15–40%            | <15%
Time to context (diagnosis)  | >15 min           | 5–15 min          | <5 min
On-call-related attrition    | >25%/year         | 10–25%/year       | <10%/year
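
The benchmark thresholds above can be encoded directly so a team's measured values map to a tier. The sketch below is one possible encoding: the metric keys and the `tier` function are illustrative names, and values that sit exactly on a boundary are treated as "healthy", following the inclusive middle ranges in the table.

```python
# Thresholds copied from the benchmark table. Each entry is
# (world_class_bound, struggling_bound, higher_is_better).
# The metric keys are illustrative names, not a standard schema.
BENCHMARKS = {
    "pages_per_shift": (5, 15, False),
    "alert_to_action_pct": (70, 40, True),
    "median_mttr_min": (20, 60, False),
    "false_positive_pct": (15, 40, False),
    "time_to_context_min": (5, 15, False),
    "attrition_pct_per_year": (10, 25, False),
}


def tier(metric, value):
    """Return 'world-class', 'healthy', or 'struggling' for a measured value."""
    world_class, struggling, higher_is_better = BENCHMARKS[metric]
    if higher_is_better:
        if value > world_class:
            return "world-class"
        if value < struggling:
            return "struggling"
    else:
        if value < world_class:
            return "world-class"
        if value > struggling:
            return "struggling"
    # Boundary values fall into the inclusive middle range.
    return "healthy"
```

For example, `tier("pages_per_shift", 22)` returns `"struggling"` and `tier("alert_to_action_pct", 55)` returns `"healthy"`.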

When This Framework Isn't Enough

The On-Call Health Framework solves the structural problem, but it doesn't address the dynamic problem: modern systems change faster than static alert configurations. A noise-free alert stack at 9am can become alert-flooded at 9pm after a deployment or traffic spike. Sustainable on-call health requires continuous alert quality monitoring — not just a one-time audit.
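
One simple form of continuous monitoring is to compare the page rate in the most recent window against a trailing baseline. The sketch below assumes you can obtain page timestamps from your alerting tool. The function name, the one-hour window, the seven-day baseline, the 3x factor, and the five-page floor are all illustrative defaults, not recommendations from the framework.

```python
from datetime import datetime, timedelta


def page_rate_spike(
    page_times,
    now,
    window=timedelta(hours=1),
    baseline=timedelta(days=7),
    factor=3.0,
    min_pages=5,
):
    """Return True if the recent page count is at least `factor` times the
    count expected per window from the trailing baseline period."""
    recent_start = now - window
    baseline_start = recent_start - baseline
    recent = sum(recent_start <= t <= now for t in page_times)
    prior = sum(baseline_start <= t < recent_start for t in page_times)
    # Ignore tiny bursts so one or two pages never count as a flood.
    if recent < min_pages:
        return False
    # Expected pages per window, averaged over the baseline period.
    expected = prior / (baseline / window)
    return recent >= factor * expected
```

Run on a schedule, a check like this can flag the 9pm flood from the example above shortly after it starts, rather than at the next quarterly audit.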

Frequently Asked Questions

What is a healthy number of alerts per on-call shift?

A healthy on-call shift has fewer than 5 actionable pages per engineer per 12-hour period. Consistently exceeding 15 pages per shift indicates severe alert fatigue and is a retention risk. Google's Site Reliability Engineering book sets a maximum of two incidents per 12-hour on-call shift. The key metric is not just volume but alert-to-action ratio: even 2 pages are too many if both are false positives.

How do you calculate alert-to-action ratio?

Alert-to-action ratio = (number of alerts that resulted in a non-trivial human action) / (total number of alerts) × 100. A "non-trivial action" means the engineer did something beyond acknowledging the alert: investigated the cause, made a configuration change, escalated, or opened an incident. Silencing, snoozing, or ignoring an alert does not count as an action. Track this metric monthly and target 70% or higher.
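
As a worked example of the formula, the sketch below assumes each alert has been labeled with a single outcome. The label strings are hypothetical; what matters is that only the non-trivial actions listed above are counted.

```python
# Outcome labels that count as a non-trivial action. The strings are
# hypothetical; use whatever labels your incident tooling records.
NON_TRIVIAL = {"investigated", "config_change", "escalated", "opened_incident"}


def alert_to_action_ratio(outcomes):
    """Return the alert-to-action ratio as a percentage.

    `outcomes` holds one outcome label per alert. Labels outside
    NON_TRIVIAL (acknowledged, silenced, snoozed, ignored) do not count.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no alerts in the period")
    actioned = sum(o in NON_TRIVIAL for o in outcomes)
    return 100 * actioned / len(outcomes)
```

With 10 alerts in a month, of which 5 were investigated, escalated, or otherwise acted on, the ratio is 50%, below the 70% target.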

What is the SRE 50% operational work rule?

Google's SRE model establishes that SREs should spend no more than 50% of their time on operational (reactive) work, with the other 50% on engineering (proactive) work. When reactive work exceeds 50%, the excess is handed back to the development team as a forcing function to fix the reliability issues causing the operational load. This rule prevents SRE teams from becoming a permanent firefighting function and creates incentives to address root causes.
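
The arithmetic behind the rule is simple enough to automate from time-tracking data. This is a small sketch under the assumption that hours are already split into operational and engineering buckets; the function name and return shape are illustrative.

```python
def ops_overflow(ops_hours, eng_hours, cap=0.5):
    """Return (operational share of total time, hours above the cap).

    The second value is the amount of operational work that, under the
    50% rule, would be handed back to the development team.
    """
    total = ops_hours + eng_hours
    if total <= 0:
        raise ValueError("no hours recorded")
    share = ops_hours / total
    excess = max(0.0, ops_hours - cap * total)
    return share, excess
```

For a 40-hour week with 26 hours of reactive work, the operational share is 65% and 6 hours exceed the cap.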

How long should on-call rotations be?

Industry consensus is 1-week rotations for most teams. Shorter rotations (3–4 days) reduce individual burnout but increase handoff overhead. Longer rotations (2 weeks) compound fatigue. The critical requirement is guaranteed recovery time: after a night with more than 2 pages, the on-call engineer should have 8 hours of no-page time before their next working day. This recovery guarantee is more important than rotation length.
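
A recovery guarantee is easier to enforce when tooling computes it. The sketch below assumes the recovery clock starts at the last page of the night, which is a simplification (you may prefer the time the last incident was resolved). The function name and defaults are illustrative.

```python
from datetime import datetime, timedelta


def protected_until(night_pages, max_pages=2, recovery=timedelta(hours=8)):
    """Return the time before which the engineer should not be paged,
    or None if the night stayed within the page limit.

    `night_pages` is a list of datetimes for pages received overnight.
    Assumption: recovery starts at the last page, not at resolution.
    """
    if len(night_pages) <= max_pages:
        return None
    return max(night_pages) + recovery
```

An escalation policy could consult this value and route pages to the secondary on-call until the protected time has passed.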