Have you ever heard the tale of Peter, the young shepherd? Every once in a while Peter cried for help because the wolf was coming, and the town would come to his rescue, only to find out he was lying. The one time the wolf really appeared, no one answered Peter’s call. They had lost confidence in him.
One of the most important things about alerts is confidence. You should be confident that when they sound, it’s because something is really happening, and that when they are silent, everything is OK.
Which is the worst alert? The one that sounds every time: Homer’s “everything’s OK” alarm. When you are on call and alarms don’t stop coming, you’ll start “acking” them just because they are annoying. And then the one alarm you should have handled will get swept under the rug.
It’s really important to measure things, take care of your services, use APMs, and make the team responsible for the product they are building. But it’s often difficult to measure correctly. Do not measure everything. Set thresholds that make sense, learn from past alarms and hold post-mortem sessions. Once the team gets used to alarms going off, it’s game over.
I’ve been on a team in this situation. The week you were on call was a constant waterfall of alarms. My first week there I asked about some of them and the answer was “Oh, that one doesn’t matter, just ack it”… We did nothing to make it disappear. Why were we receiving it in the first place if it didn’t matter?
Notice that measuring is difficult. Consider the alarms about anomalies in a service’s normal throughput. They compare actual usage with how much we expect our users to use our application. But what about weekends? Holidays? Special events (e.g. Christmas, Black Friday)? Are our measurements capable of predicting these events, or are we going to have a person suffering during their weekend because of bad predictions?
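To make the weekend problem concrete, here is a minimal sketch of a throughput check that compares each sample only against the same weekday and hour in the past, instead of one global “normal”. Everything in it is an assumption for illustration: the function names `build_baseline` and `is_anomalous`, the `tolerance` of three standard deviations, the `known_special_days` set and the sample numbers are all hypothetical, not taken from any particular monitoring tool.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean, stdev


def build_baseline(samples):
    """Group (timestamp, requests_per_minute) samples by (weekday, hour)."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    # stdev needs at least two points, so thinner buckets are left out.
    return {k: (mean(v), stdev(v)) for k, v in buckets.items() if len(v) >= 2}


def is_anomalous(baseline, ts, value, tolerance=3.0, known_special_days=frozenset()):
    """Return True only when value is far from what this weekday/hour usually sees."""
    if ts.date() in known_special_days:
        return False  # assumption: holidays and sales days are handled by hand
    key = (ts.weekday(), ts.hour)
    if key not in baseline:
        return False  # no history for this slot: better silent than noisy
    avg, spread = baseline[key]
    # Floor the spread so a perfectly flat history does not page on tiny changes.
    return abs(value - avg) > tolerance * max(spread, 1.0)


# Hypothetical history: four busy Mondays and four quiet Saturdays at 10:00.
history = [
    (datetime(2024, 1, 1, 10), 1000), (datetime(2024, 1, 8, 10), 1010),
    (datetime(2024, 1, 15, 10), 990), (datetime(2024, 1, 22, 10), 1000),
    (datetime(2024, 1, 6, 10), 200), (datetime(2024, 1, 13, 10), 210),
    (datetime(2024, 1, 20, 10), 190), (datetime(2024, 1, 27, 10), 200),
]
baseline = build_baseline(history)

print(is_anomalous(baseline, datetime(2024, 2, 3, 10), 205))  # Saturday: False
print(is_anomalous(baseline, datetime(2024, 2, 5, 10), 205))  # Monday: True
```

The same 205 requests per minute is business as usual on a Saturday and a real drop on a Monday. A check with a single global baseline would have to pick one of the two and page somebody wrongly for the other.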
False negatives hurt customers, and we don’t like them. What about false positives? They hurt the team: its trust, its mood. We all like to sleep at night, and to be able to go to the cinema without thinking we might need to leave in the middle of the movie because an alarm was triggered. Try to keep your baseline error rate at 0%. Otherwise the background error noise adds to the real value and triggers your alarms when they are not supposed to fire.
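A tiny sketch of why that background noise matters, with made-up numbers: the 5% threshold, the request counts and the helper name `error_rate_alarm` are all assumptions chosen only to show the arithmetic.

```python
THRESHOLD = 0.05  # hypothetical: page when more than 5% of requests fail


def error_rate_alarm(background_errors, new_errors, total_requests):
    """Return True when the combined error rate crosses the threshold."""
    observed = (background_errors + new_errors) / total_requests
    return observed > THRESHOLD


# The same small blip (30 failing requests out of 1000) in two services.
clean_service = error_rate_alarm(background_errors=0, new_errors=30, total_requests=1000)
noisy_service = error_rate_alarm(background_errors=30, new_errors=30, total_requests=1000)

print(clean_service)  # False: 3% stays under the threshold
print(noisy_service)  # True: 3% of "known" noise plus 3% wakes somebody up
```

The blip is identical in both cases. The only difference is the noise everybody had agreed to ignore, and that is what turns it into a page.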
To sum up, care about your alerts and metrics not only because of the product and your customers but also because of your team. Your team is the one that makes the product and keeps the wheel turning every day. It’s one of the most valuable assets you have. Keeping it cool, motivated and happy should be among your top priorities.