It’s 3:30 PM and I am ready to take my lunch break. I am feeling extra daring today and decide to go down the street to grab a sandwich from the local convenience store. I hop in the car and take off as quickly as I can. I almost make it to the parking lot when my heart starts palpitating. This is the Pavlovian effect of an alert from PagerDuty blaring on my phone. Instantly regretting the brave yet foolish notion of leaving the safety of my laptop for sustenance, I grab my phone to see what the issue is. Did all of Amazon Web Services’ (AWS) servers just fall into a giant sinkhole? Did one of our AWS instances decide to take off for a Memorial Day weekend barbecue? Fortunately, it was neither of those things. Much to my relief and frustration, I was being alerted to what I believed to be a non-actionable item.

Sadly, the experience described above isn’t unique to me. It is an ugly but extremely important part of a System Administrator’s life (or insert your super awesome IT title here). Creating a monitoring system is simultaneously rewarding and horrifying. You are having a great time creating something that you know will most likely wake you up in the middle of the night. And it always happens right before that part in the dream where I successfully hack the Gibson [1] and save the world! Monitoring and alerting are an undeniably important part of a healthy infrastructure; however, they do have their limits. Enter alert fatigue.

Alert fatigue is defined by Wikipedia as “…occur[ing] when one is exposed to a large number of frequent alarms (alerts) and consequently becomes desensitized to them” [2]. To put it simply, the more alerts there are, the more likely you are to ignore them. The more jaded you become by a never-ending barrage of alarms, the smaller the chance you will actually investigate and solve the issues creating them. The more unsolved issues pile up, the more likely you are to use shortcuts to solve them. The more shortcuts you use to patch problems, the more technical debt you accumulate, and technical debt left unchecked will wreak havoc on your organization. So yeah… alert fatigue is kind of a big deal.

Are you suffering from alert fatigue? Well, there is hope! PagerDuty has a great article [3] about this very topic with 7 tips to help your team get focused again and maybe even get a full night’s rest.

  1. Commit to action
  2. Cut alerts that aren’t actionable & adjust thresholds
  3. Save non-severe incidents for the morning
  4. Consolidate related alerts
  5. Give alerts relevant names & descriptions
  6. Make sure the right people are getting alerts
  7. Keep it up to date with regular reviews
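
To make tips 2 through 4 a little more concrete, here is a minimal Python sketch of how they might look as a triage step. Everything in it is an assumption made for illustration: the `Alert` fields, the `triage` function, and the check and host names are hypothetical and are not part of PagerDuty or any other tool's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; the field names are illustrative only."""
    check: str        # name of the check that fired
    host: str         # host the check fired on
    severity: str     # "critical" or "warning"
    actionable: bool  # can a human actually do something about it?


def triage(alerts):
    """Return (pages to send now, alerts to review in the morning)."""
    hosts_by_check = defaultdict(list)
    morning_queue = []
    for alert in alerts:
        if not alert.actionable:
            # Tip 2: cut alerts that aren't actionable.
            continue
        if alert.severity != "critical":
            # Tip 3: save non-severe incidents for the morning.
            morning_queue.append(alert)
            continue
        # Tip 4: consolidate related alerts (here, grouped by check name).
        hosts_by_check[alert.check].append(alert.host)

    pages = [
        f"{check} failing on {len(hosts)} host(s): {', '.join(sorted(hosts))}"
        for check, hosts in hosts_by_check.items()
    ]
    return pages, morning_queue


alerts = [
    Alert("disk_full", "web1", "critical", True),
    Alert("disk_full", "web2", "critical", True),
    Alert("cpu_high", "web1", "warning", True),
    Alert("heartbeat_noise", "web3", "critical", False),
]
pages, morning_queue = triage(alerts)
```

In this sketch, the two `disk_full` alerts collapse into a single page, the `cpu_high` warning waits until morning, and the non-actionable alert never reaches a phone. Deciding which alerts truly are non-actionable is still a conversation to have with your team, not something the code can settle.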

In my case, I notified my team that I believed we had a useless metric. They were quick to correct me and it turned out to be an amazing learning opportunity. By openly discussing the check, I learned a lot about the items that surround it and make it critical. Furthermore, we realized that this particular check suffered from a vague name and description, as highlighted by tip 5 above.

Dealing with alert fatigue requires a lot of effort. Going through checks with a fine-tooth comb is no one’s idea of a great time. However, there is a long-term benefit, so make the time. It will make for a happier, more productive you and a grateful, well-rested team.

Sources/References

[1] Hackers (1995) - http://www.imdb.com/title/tt0113243/

[2] Alert Fatigue - https://en.wikipedia.org/wiki/Alarm_fatigue

[3] Let’s Talk About Alert Fatigue - https://www.pagerduty.com/blog/lets-talk-about-alert-fatigue/
