May is Mental Health Awareness Month
Human Impact of On-call
No one wants to be down. Being down is painful on many fronts. There’s the financial aspect, which is substantial, especially in this economic climate. You’ve read how each minute can mean thousands of dollars lost, so seconds count. If that weren’t enough, downtime also damages an organization’s reputation, and nothing feels worse than watching unhappy customers’ tweets pile up.
Yes, that’s enough reason to manage incidents quickly. (You sometimes daydream about avoiding incidents completely, but it is exactly that: a dream.) However, there’s one more thing to consider beyond the financial and reputational costs: the people closest to the incident, the responders. Unfortunately, they’re often overlooked because tackling incidents is considered part of the job, but precisely because it is their job, they’re the ones most invested in handling incidents better.
Sadly, improving the incident process for those tackling the issues hasn’t been discussed enough, but that doesn’t have to remain the case. When organizations look for ways to reduce and prevent incidents, the most underutilized solution is supporting their talent. By looking more closely at the processes and conditions teams work under during on-call, leadership can create gains that seem almost like magic.
First, you might wonder, is there actually something to improve right now? There is. At many companies, the incident responders experience stress at levels that lead to dissatisfaction, burnout, and even leaving the job. A 2019 survey by Catchpoint (inspired by Jaime's presentation at SREcon EMEA on mental health and incident response) demonstrated how widespread this is:
- 79% of survey respondents felt stress after incidents
- 69% of respondents felt moderate or high stress
- 52% said it affected mood
- 48% said it affected concentration
- 38% said it affected sleep
- 38% said it affected their ability to be social
Those numbers suggest that incidents take a toll on responders, and there is a clear link between high stress and low productivity. Beyond the work impact, effective leaders recognize that, when a significant number of respondents say that their mood, concentration, ability to sleep, and ability to be social are all negatively impacted, their organizations are not capable of reaching their full potential.
Why is incident response so stressful? There are four ingredients to stress: novelty, unpredictability, threat to ego, and a low sense of control. Briefly, we get stressed by things that we’ve never encountered, that are difficult to predict, that make us feel lesser, or that leave us feeling we lack agency. A stressful situation can involve more than one ingredient; for incidents, all four apply. Incidents are often new, come at seemingly random moments, can make us feel judged as we respond, and can appear to have a life of their own. It is a perfect storm for stress.
This doesn’t even account for the fact that incidents happen beyond the confines of the workday. Reliability is now an expectation around the clock, which means interruptions occur during dinner time and, even worse, mid-sleep. Having to hop back on the job once at home is bad, but being woken up to respond to an alert is worse. Not getting a proper night’s sleep impairs our cognitive processes (an easy test of whether you’ve slept enough is whether you can wake up without an alarm and not feel sleepy) and, if it happens repeatedly, can lead to health issues.
Some interruptions outside of work hours are inevitable, but everyone has their limit. Organizations may not rigorously track when and how frequently interruptions occur beyond the nine-to-five, but employees do, even if only subconsciously. Studies have shown that the more often people get paged when they’re not at work, the more likely they are to leave their jobs.
Turning a blind eye to the impact on-call has on people doesn’t make the implications disappear. Given how expensive, time-consuming, and morale-deflating departures are to an organization, doesn’t it make more sense to address that impact before people leave?
With all that said, the situation is even more heightened during the pandemic. Many industries are experiencing more incidents during this crisis, as many as eleven times the normal number. Yet the shift to mandated remote work and the increased stress of navigating these challenging times have meant that organizations are slower to respond to incidents. These findings remind us that while the tooling we’ve adopted for incident response matters, what matters most is still the humans who use that tooling.
We have to start watching out for burnout, whose common factors, such as work overload and lack of control, are clearly related to the ingredients for stress. Burnout can lead to emotional exhaustion, cynicism, constant negative responses, and ineffectiveness. No one wants to burn out, and yet we often treat it as inevitable and unpreventable.
Our cognitive abilities are being stretched in ways we’ve never experienced in the modern era. This should be a resounding wake-up call that how we do on-call is not working, and waiting for things to “return to normal” is not a proactive and meaningful strategy. The risk is that although people may not be making any big changes right now during the uncertainty of the pandemic, there is an internal tally occurring, and, once things stabilize, we will see a migration away from roles with too high an on-call burden.
This stress (pre-, during, and post-pandemic) means that people dread on-call. For example, look at how often people talk about on-call “sucking.” You can’t blame them. The issue is that people then hope to avoid handling incidents while on-call. As a result, they have fewer opportunities to be hands-on with their systems and lose the chance to learn how those systems work.
This may seem like a weird angle to take on on-call, or one that feels like pure spin: on-call as a learning opportunity. But it has been championed by leading thinkers like Charity Majors and Cindy Sridharan. On-call can be both things: painful and a learning experience. Our goal should be minimizing the former to allow for the latter.
What can we do about this? We can examine how we approach on-call and rebuild it in a way that reinforces resilience. We have to monitor not only our systems, but also the processes that support those systems. The question on everyone’s mind should be: how are we evaluating our on-call processes and uncovering the ways to make them work best for our culture?
Why is that? Because individuals can only do so much to build their own resilience. Yes, we could also use more mindfulness, more exercise, and better sleep, but these do not make up for the lack of an environment that supports our resilience. A WHO report on mental health states something that is obvious to too many workers: “Many workplaces have opted for attempting to enhance their workers’ resilience rather than modifying risk factors.” Yet research has shown that situational and organizational factors play a bigger role than individual ones.
This is partly what we aim to achieve with the Ovvy on-call survey. We want to see how the industry performs on-call, and what lessons we can teach one another to improve our practices. We need to think about how we can create better processes and stronger feedback loops to improve how we do on-call, because that means we're tackling the situational and organizational factors.
This is important because at the heart of the on-call practice are people. The wonderful thing about people is that we are wildly adaptable, and we believe that even the smallest insights placed in the hands of people will lead to meaningful gains. The survey will close June 5, 2020, and we expect the report to be complete by August. The report will be freely available, and we look forward to the discussions (and maybe even arguments) that come from it.
We believe that healthy and resilient teams handle incidents better, creating a virtuous cycle that benefits everyone. Better incident response means happier customers, reduced costs, and stronger on-call teams. Impact doesn’t have to be negative. Our goal is that in the future when we discuss the impact of on-call, that conversation is about how it builds and strengthens organizations.