Alerting & Alert Fatigue

Tune threshold, duration and impact. A bare 'CPU > 80%' fires constantly and gets ignored; an SLO-style 'booking errors > 5% for 5 min affecting >100 users' fires only on real incidents. Watch false alarms trade off against missed incidents.
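The three knobs named above (threshold, duration, impact) can be sketched as a small rule evaluator. This is a minimal illustration, not the lab's actual implementation: `AlertRule`, `firing_minutes`, and the sample series are hypothetical names and made-up data, and the rule is assumed to be evaluated once per minute.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class AlertRule:
    """Hypothetical alert rule with the three knobs from the text."""
    threshold_pct: float             # breach when error rate exceeds this
    sustained_min: int               # ...for this many consecutive minutes
    min_users: Optional[int] = None  # ...while more than this many users are affected (None = no impact gate)


def firing_minutes(rule: AlertRule,
                   error_pct: Sequence[float],
                   users_affected: Sequence[int]) -> List[int]:
    """Return the minute indices at which the rule is firing."""
    firing = []
    streak = 0  # consecutive minutes in breach so far
    for minute, (err, users) in enumerate(zip(error_pct, users_affected)):
        over_threshold = err > rule.threshold_pct
        enough_impact = rule.min_users is None or users > rule.min_users
        streak = streak + 1 if (over_threshold and enough_impact) else 0
        if streak >= rule.sustained_min:
            firing.append(minute)
    return firing


# Made-up per-minute data: two short blips, then a sustained incident.
error_pct = [1, 1, 6, 1, 1, 7, 1, 6, 7, 8, 9, 8, 7, 1, 1, 1]
users_affected = [10, 10, 20, 10, 10, 30, 10, 150, 200, 250, 300, 250, 200, 10, 10, 10]

bare = AlertRule(threshold_pct=2, sustained_min=1)
slo_style = AlertRule(threshold_pct=5, sustained_min=5, min_users=100)

print(firing_minutes(bare, error_pct, users_affected))
print(firing_minutes(slo_style, error_pct, users_affected))
```

On this sample data the bare rule fires on both blips as well as the incident, while the SLO-style rule stays quiet until the breach has lasted five minutes with more than 100 users affected.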

Here are 16 minutes of booking error-rate data: a few harmless 1-minute blips and one real sustained incident (shaded). Tune the alert so it fires on the incident, and only the incident.
Threshold: > 2%
Sustained for: 1 min
[Chart: error rate over time (minutes). 🚨 = alert fires; shaded = the real incident. Six 🚨 markers at these settings.]
False alarms: 2
Real incident: caught ✓
2 false alarms: the blips are paging you. Raise the threshold, require it to be sustained, and gate on impact.
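The trade-off between false alarms and missed incidents can be explored with a short sketch that scores a few threshold and duration settings against a labelled incident window. Everything here is an assumption for illustration: the 16-minute series is invented (it is not the lab's data), and `alert_episodes` and `score` are hypothetical helpers that treat each run of consecutive firing minutes as one alert.

```python
def alert_episodes(error_pct, threshold_pct, sustained_min):
    """Group consecutive firing minutes into (start, end) alert episodes."""
    episodes = []
    streak = 0    # consecutive minutes above the threshold
    start = None  # first minute of the episode currently firing, if any
    for minute, err in enumerate(error_pct):
        streak = streak + 1 if err > threshold_pct else 0
        if streak >= sustained_min:
            if start is None:
                start = minute
        elif start is not None:
            episodes.append((start, minute - 1))
            start = None
    if start is not None:  # series ended while still firing
        episodes.append((start, len(error_pct) - 1))
    return episodes


def score(episodes, incident):
    """Return (false_alarms, caught) for a labelled incident window."""
    inc_start, inc_end = incident
    caught = any(s <= inc_end and e >= inc_start for s, e in episodes)
    false_alarms = sum(1 for s, e in episodes if e < inc_start or s > inc_end)
    return false_alarms, caught


# Invented data: blips at minutes 2 and 6, sustained incident at minutes 9-12.
series = [1, 1, 4, 1, 1, 1, 3, 1, 1, 6, 7, 8, 7, 1, 1, 1]
incident = (9, 12)

for threshold, sustained in [(2, 1), (5, 1), (2, 3), (9, 1), (2, 5)]:
    false_alarms, caught = score(alert_episodes(series, threshold, sustained), incident)
    status = "caught" if caught else "MISSED"
    print(f"> {threshold}% for {sustained} min: {false_alarms} false alarms, incident {status}")
```

With this series, a low threshold and no duration requirement pages on both blips, raising either knob removes the false alarms, and pushing either knob too far (a threshold above the incident's peak, or a duration longer than the incident) misses the incident entirely.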
What just happened