False Positives will kill your alerting system faster than False Negatives

Several years ago, I inherited a legacy application. It had multiple components, all logging back to a central ELK stack that had later been supplemented with Prometheus metrics. And every morning, I would wake up to a dozen alerts.

Obviously, we tried to reduce the number of alerts sent. Where possible, the system would self-heal, bringing up new nodes and killing off old ones. Some alerts were collected into weekly reports, so they were still seen but, as they didn’t require immediate triage, could wait.

But the damage was done.

No-one read the Slack alerts channel. Everyone had filtered the emails to spam. The system cried wolf, but all the villagers had learnt to cover their ears.

With a newer project, I wanted an implicit rule: if an alert pops up, it is because it requires human intervention. A service shut off because of a billing issue. A new deploy causing a regression in signups. These are things a human needs to step in and do something about (caveat: there is wiggle room in this).
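To make that concrete, here is a minimal sketch, assuming Alertmanager-style alert payloads with a `severity` label. The label values and alert names are my own invention, not from any particular setup; the point is just that only alerts explicitly marked as page-worthy get to interrupt a person.

```python
# Minimal sketch: decide whether an alert should page a human.
# Assumes Alertmanager-style payloads with a "labels" dict;
# the label values and alert names below are hypothetical.

PAGE_SEVERITIES = {"page", "critical"}

def requires_human(alert: dict) -> bool:
    """Return True only if someone needs to act on this right now."""
    return alert.get("labels", {}).get("severity") in PAGE_SEVERITIES

# A billing suspension pages; a stray NullPointer in the logs does not.
billing_down = {"labels": {"alertname": "BillingSuspended", "severity": "page"}}
stray_npe = {"labels": {"alertname": "NullPointerLogged", "severity": "warning"}}

assert requires_human(billing_down)
assert not requires_human(stray_npe)
```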

There are still warnings being flagged up, but developers can check in on these in their own time. Attention is precious, and being pulled out of it every hour because of a NullPointer is not a worthwhile trade-off, in my experience.
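Where those warnings end up is a routing decision rather than a code change. As a sketch, again assuming the hypothetical `severity` label from above, the split can live in a small webhook receiver: page-worthy alerts reach the on-call person, everything else lands in a low-urgency log the team reviews when it suits them. The `notify_oncall` target, port, and file path are placeholders.

```python
# Sketch of a webhook-style receiver that enforces the split:
# severity=page interrupts a human, everything else goes to a
# low-urgency log reviewed on the team's own schedule.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

WARNINGS_LOG = "warnings-to-review.log"  # hypothetical path

def notify_oncall(alert: dict) -> None:
    # Placeholder: in practice this would hit a pager or chat webhook.
    print(f"PAGE: {alert['labels'].get('alertname', 'unknown')}")

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for alert in payload.get("alerts", []):
            if alert.get("labels", {}).get("severity") == "page":
                notify_oncall(alert)
            else:
                with open(WARNINGS_LOG, "a") as fh:
                    fh.write(json.dumps(alert) + "\n")
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 9095), AlertHandler).serve_forever()
```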

A flood of false positives will make you blind to the real point of alerting: knowing when you’re needed.