Tune threshold, duration and impact. A bare 'CPU > 80%' alert fires constantly and gets ignored; an SLO-style 'booking errors > 5% for 5 min affecting >100 users' alert fires only on real incidents. Watch how false alarms trade off against missed incidents.
Here's 16 minutes of booking error rate: a few harmless 1-minute blips and one real sustained incident (shaded). Tune the alert so it fires on the incident, and only the incident.
Threshold: > 2%
Sustained for: 1 min
[Chart: error rate over time (minutes). 🚨 = alert fires; shaded = the real incident.]
False alarms: 2
Real incident: caught ✓
2 false alarms: the blips are paging you. Raise the threshold, require the breach to be sustained, and gate on impact.
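To make the scoring concrete, here is a minimal sketch of how a threshold-plus-duration alert could be evaluated against a minute-by-minute series. The error-rate values, the incident window and the function names are all made up for illustration; they are not the data behind the chart.

```python
# Hypothetical per-minute booking error rate (%), minutes 0..15.
# Illustrative values only, not the data behind the chart.
ERROR_RATE = [0.4, 0.5, 3.5, 0.6, 0.4, 1.5, 0.5, 0.6,
              6.0, 7.0, 6.5, 5.5, 0.7, 2.8, 0.5, 0.4]
INCIDENT = set(range(8, 12))  # assume minutes 8-11 are the real incident


def firing_minutes(series, threshold, sustain):
    """Return the minutes at which the alert fires: the rate has been
    above `threshold` for at least `sustain` consecutive minutes."""
    fired, streak = [], 0
    for minute, rate in enumerate(series):
        streak = streak + 1 if rate > threshold else 0
        if streak >= sustain:
            fired.append(minute)
    return fired


def score(series, incident, threshold, sustain):
    """Return (false_alarms, incident_caught) for one alert setting."""
    fired = firing_minutes(series, threshold, sustain)
    false_alarms = sum(1 for m in fired if m not in incident)
    caught = any(m in incident for m in fired)
    return false_alarms, caught


print(score(ERROR_RATE, INCIDENT, threshold=2.0, sustain=1))  # (2, True)
print(score(ERROR_RATE, INCIDENT, threshold=2.0, sustain=3))  # (0, True)
print(score(ERROR_RATE, INCIDENT, threshold=8.0, sustain=1))  # (0, False)
```

With this sample data, a 1-minute requirement pages on two blips, a 3-minute requirement filters them out while still catching the incident, and an overly high threshold silences the alert entirely and misses the incident.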
What just happened
ℹ A bad alert fires on a raw threshold ('error rate > 2%') and goes off on every harmless blip. Flooded with false alarms, the on-call engineer starts ignoring them (alert fatigue) and misses the real one.
ℹ A good alert is actionable: it requires the condition to be SUSTAINED (e.g. 3+ minutes) and to have real IMPACT (e.g. >100 users), so transient noise is filtered out and only genuine incidents page someone.
ℹ Tune all three dials (threshold, duration, impact) to catch the real incident with zero false alarms. Every alert should be worth waking someone up for, and should carry a severity, owner and runbook.
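The three dials and the required metadata can be sketched as a single rule object. This is one possible shape, not a real alerting API: the `AlertRule` class, its field names, and the owner and runbook values are hypothetical, and the numbers reuse the 'booking errors > 5% for 5 min affecting >100 users' example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert rule: three dials plus routing metadata."""
    name: str
    threshold_pct: float  # error rate must exceed this...
    sustain_min: int      # ...for this many consecutive minutes (>= 1)...
    min_users: int        # ...while affecting more than this many users
    severity: str
    owner: str
    runbook: str          # placeholder reference, not a real URL

    def should_page(self, samples):
        """samples: (error_rate_pct, affected_users) per minute, oldest first.
        Page only if every one of the last `sustain_min` samples breaches
        both the threshold and the impact gate."""
        if len(samples) < self.sustain_min:
            return False
        recent = samples[-self.sustain_min:]
        return all(rate > self.threshold_pct and users > self.min_users
                   for rate, users in recent)


rule = AlertRule(
    name="booking-errors",
    threshold_pct=5.0,
    sustain_min=5,
    min_users=100,
    severity="page",
    owner="booking-oncall",             # hypothetical team name
    runbook="runbooks/booking-errors",  # hypothetical path
)

blip = [(0.5, 3)] * 4 + [(6.0, 250)]  # one bad minute: not sustained
low_impact = [(6.0, 40)] * 5          # sustained, but few users affected
incident = [(6.0, 250)] * 5           # sustained and high impact

print(rule.should_page(blip))        # False
print(rule.should_page(low_impact))  # False
print(rule.should_page(incident))    # True
```

Keeping severity, owner and runbook on the rule itself means a page always arrives with who should act and where to start.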