Alerting & Alert Fatigue

Tune threshold, duration and impact. A bare 'CPU > 80%' alert fires constantly and gets ignored; an SLO-style 'booking errors > 5% for 5 min affecting >100 users' alert fires only on real incidents. Watch how false alarms trade off against missed incidents.
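To make the contrast concrete, here is a minimal Python sketch of the two rule styles. Everything in it is an assumption made for illustration: the `MinuteSample` type, the function names, and the reading of "affecting >100 users" as "more than 100 users hit an error in each minute of the window". A real monitoring system may define impact differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MinuteSample:
    """One minute of (hypothetical) booking metrics."""
    error_rate: float    # fraction of booking requests that failed this minute
    affected_users: int  # users who hit an error this minute


def bare_alert(samples: Sequence[MinuteSample], threshold: float = 0.05) -> bool:
    """Raw threshold rule: fires as soon as the latest minute breaches."""
    return bool(samples) and samples[-1].error_rate > threshold


def slo_alert(samples: Sequence[MinuteSample], threshold: float = 0.05,
              duration_min: int = 5, min_users: int = 100) -> bool:
    """SLO-style rule: every one of the last `duration_min` minutes must
    breach the threshold AND exceed the impact gate."""
    if len(samples) < duration_min:
        return False
    window = samples[-duration_min:]
    return all(s.error_rate > threshold and s.affected_users > min_users
               for s in window)


# A one-minute blip after four quiet minutes, and a five-minute outage.
blip = [MinuteSample(0.01, 3)] * 4 + [MinuteSample(0.09, 40)]
outage = [MinuteSample(0.08, 250)] * 5

print(bare_alert(blip), slo_alert(blip))      # True False
print(bare_alert(outage), slo_alert(outage))  # True True
```

The bare rule pages on the blip; the SLO-style rule stays quiet until the breach is both sustained and affecting enough users.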

Here are 16 minutes of booking error-rate data: a few harmless 1-minute blips and one real sustained incident (shaded). Tune the alert so it fires on the incident, and only on the incident.

[Interactive chart: error rate over time (minutes). 🚨 = alert fires; shaded = the real incident. Controls: Threshold (> 2%), Sustained for (1 min). Readouts: false alarms; real incident caught ✓.]

2 false alarms — the blips are paging you. Raise the threshold, require it to be sustained, and gate on impact.
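The exercise can be approximated offline. The sketch below is not the page's actual data or logic: the 16-value series, the incident window, and the rule "fire once when a run of breaching minutes reaches the required length" are all assumptions chosen to mimic the behaviour described above.

```python
# Hypothetical per-minute error rate in percent (16 minutes). Made-up data:
# blips at minutes 2, 4 and 13; a sustained incident during minutes 7-11.
ERROR_RATE = [0.5, 0.4, 3.0, 0.6, 1.8, 0.5, 0.7, 6.0,
              8.0, 9.0, 7.0, 6.0, 0.8, 4.0, 0.5, 0.4]
INCIDENT = range(7, 12)  # minutes 7..11 inclusive (assumed)


def fire_minutes(series, threshold, sustained):
    """Minutes at which the alert fires: once per run of consecutive
    breaching minutes, at the moment the run reaches `sustained`."""
    fires = []
    run = 0
    for minute, value in enumerate(series):
        run = run + 1 if value > threshold else 0
        if run == sustained:
            fires.append(minute)
    return fires


def score(series, incident, threshold, sustained):
    """Return (false_alarms, incident_caught) for one alert setting."""
    fires = fire_minutes(series, threshold, sustained)
    false_alarms = sum(1 for m in fires if m not in incident)
    caught = any(m in incident for m in fires)
    return false_alarms, caught


print(score(ERROR_RATE, INCIDENT, threshold=2.0, sustained=1))   # (2, True)
print(score(ERROR_RATE, INCIDENT, threshold=2.0, sustained=3))   # (0, True)
print(score(ERROR_RATE, INCIDENT, threshold=10.0, sustained=1))  # (0, False)
```

On this made-up series, requiring three sustained minutes removes both false alarms while still catching the incident, whereas pushing the threshold too high silences the alert entirely: the false-alarm versus missed-incident trade-off in miniature.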

What just happened

▹ A bad alert fires on a raw threshold ('error rate > 2%') and goes off on every harmless blip. Flooded with false alarms, the on-call engineer starts ignoring them (alert fatigue) and misses the real one.
▹ A good alert is actionable: it requires the condition to be SUSTAINED (e.g. 3+ minutes) and to have real IMPACT (e.g. >100 users), so transient noise is filtered out and only genuine incidents page someone.
▹ Tune all three dials (threshold, duration, impact) to catch the real incident with zero false alarms. Every alert should be worth waking someone up for, and carry a severity, owner and runbook.