My On Call Manifesto

Written by melmoth | Published 2020/09/08

TLDR: Software engineering is a great career, but it has a dark side: on-call. It pulls against your work/life balance by requiring you to tether yourself to work in your off-hours, and if you're salaried, it quietly dilutes your economic worth. On-call is a necessary evil that everyone on the team shares in, and the single most important thing you can do to make it bearable is to make rotation cycles as long as possible.

This has been burning for a while.
Software engineering is a great career. It gives you lots of opportunities to be creative, and the chance to solve tough, interesting problems. Especially in the realm of backend, API-driven systems. But it has a dark side: On-call.
Why is it a dark side? Well, for two reasons. The first is the most important. It pulls negatively against your work/life balance by requiring you to tether yourself to work in your off-hours. Second, if you're salaried, it sneakily dilutes your economic worth. For me the first is paramount, and honestly I've never felt much concern for the latter. Engineers are generally pretty well compensated.
On-call is a necessary and important part of the job. There have been plenty of articles written about how to do it "successfully", but I have yet to find any that emphasizes what, in my opinion, is the absolute most important strategy for success: Long rotation cycles.
To explain why, let's provide some context. There are many different ways to do on-call schedules, but the pattern I'm most familiar with -- and which I sense is probably the most common (at least for the median-sized organization) -- is having one primary and one secondary on-call, rotating each week. There may be a tertiary escalation to the manager, but I think we can all agree that you never want to reach that point.
Now, let's talk about team organization. Again, these are broad strokes, but I think they are fairly representative of the middle.
You have a single, monolithic service, and all devs work on it (more common than you might guess). This generally, but not always, implies a decently large on-call pool.
You drank the microservice Kool-aid and have multiple services that blather to other services over HTTP. In this case, regardless of how you organize your teams, the reality is that there will be varying degrees of expertise for any given service. One engineer will be the lead and know pretty much everything there is to know, while another is just knowledgeable enough to try some things in the runbook. Someone else may have only heard of the service, and glanced at the README once. The important thing here is an engineer's cognitive limits.
A common reality of microservices is that they are frequently not that micro (more like just multi-), and even if they are, they can still have quite a few subtleties and complexities that someone not deep in the code won't understand or remember, especially at 3AM. Plus, code changes.
Humans only have so much brain power and focus. From my own experience, depending on service complexity, a single engineer may reach "expert" in 2-3 services, and have only "runbook knowledge" of 3-4 others, before simply running out of mental bandwidth.
I should also note that for a smaller, limited-purpose service, you rarely need more than a couple of engineers to be experts. They handle the majority of the tickets that come up, and they're aware of all state changes. It would be suboptimal to assign more tickets to more engineers just to ensure that there are enough experts for the on-call schedule. (I also don't believe this is even tenable.)
Micro/multi-service is increasingly the world in which we live, so I'm going to focus on that scenario. How do you arrange your on-call rotation?
Should on-call be "experts only" for a given service or set of services?
Or, should you instead require teams to maintain runbooks and tooling that are substantial enough so that non-experts can be on-call?
I've done both arrangements, and I'm most definitely promoting the latter. Here's why.
If you're a good employee and co-worker, you want to shoulder your share of the load. On-call is a necessary evil, and everyone on the team shares in the suffering. The reason for a two-tiered on-call rotation is so that if the primary misses the alert, the secondary will get it and respond. But, as a decent person in the primary rotation, you want that to happen very rarely.
So what does that mean? Basically, that you tether yourself to your job for the week of primary on-call. You need to reschedule, put things on hold, re-arrange travel, haul your laptop to birthday parties, etc. I work remotely and live in a rural area, where it's very easy to drop out of any kind of cell or network coverage -- not just because the area is remote, but also because of the kind of personal activities I tend to do. Primary on-call means that you're giving over a portion of your personal life to your work.
How much of your life are you actually giving up? Well, that depends on exactly how short the rotation cycle is.
Let's say you decide that only experts should cover on-call for a service. And let's say you (optimistically) have 3 devs that are experts in a single service. This means engineers get a two-week break between being on-call. Or, put another way, they are primary on-call for ~121 days out of the year. Or 1/3 of the year. Put yet another way, the engineer is tethered to their job for 1/3 of their existence.
What about 2-3 services? I would generally expect there to be a larger team in this case, so let's say you have 5 engineers managing 3 services. You now have a higher chance of outages, because you have 3 different services that each interact in different ways with an arbitrary set of other services. But you now have a 5 week rotation. Better, yeah? Well, let's see.
You have a one-month break between primary shifts. This amounts to being primary on-call for 73 days out of a year, or 1/5 of the year. Which means 20% of your existence is tethered to work. An improvement, but remember the benefit is offset by the fact that the potential for issues is higher.
And we haven't even talked about secondary on-call, where you're still on the hook. In a supportive environment where you can count on your fellow engineers, you can probably worry less and step out of coverage here and there. It's still a tether, though, and it still counts. In the case where there are 5 engineers sharing the load, you're technically on the hook for double the time (primary week plus secondary week), so 40% of the year.
The reality is a bit different, I admit, and in a supportive environment let's say it's more like 30%. You still need to be available, because there's still a chance the primary has to reach out for help, and you don't want to leave them hanging. You can play the odds a little, is all.
This is why I say the best thing you can do for an on-call rotation is make it as long as possible. What if you double the rotation from 5 to 10? Well now you're down to ~36 days of job-tethering for the year, or once roughly every 2 months and 1 week. This is starting to feel reasonable.
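To make the arithmetic above concrete, here's a minimal sketch of the calculation (in Python, purely as an illustration; the language choice is mine), assuming one-week primary shifts and the "double it for secondary" logic described above.

```python
def tether_days(pool_size, days_per_year=365):
    """Rough days per year one engineer spends tethered to work,
    assuming a simple weekly rotation through `pool_size` engineers."""
    primary = days_per_year / pool_size   # days spent as primary on-call
    return primary, primary * 2           # doubled to include the secondary week

for pool in (3, 5, 10):
    primary, with_secondary = tether_days(pool)
    print(f"{pool} engineers: ~{primary:.0f} days primary "
          f"({primary / 365:.0%} of the year), "
          f"~{with_secondary:.0f} days counting secondary")
```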
So how can you achieve longer rotations, given the realities of cognitive limits and expertise?
By requiring that services be reliable enough to "share", and then providing tools and documentation robust enough that non-service-experts can take on-call rotations. This has some requirements, and some benefits.
The first requirement is good and up-to-date documentation, generally in the form of a runbook. Annoying, I know, having to write prose rather than code, but well worth the payoff, not just for on-call but for on-boarding, general communication, and time-saving ("here's a link answering your question"), etc.
The second is understandable alerting. We've recently been toying around with providing a link to a runbook#anchor on each alert. Whoever is on-call clicks the link from the alert, opens the runbook, and reads "Getting this alert? Try X, Y, and Z."
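As a minimal sketch of the idea (the alert names, service, and wiki URL below are hypothetical placeholders, not our actual setup), each alert maps to an anchor in its runbook, and the page that goes out carries the deep link:

```python
# Hypothetical example: every alert carries a deep link into the runbook,
# so whoever is on call lands directly on the "Getting this alert?
# Try X, Y, and Z" section.

RUNBOOK = "https://wiki.example.com/runbooks/orders-service"  # made-up URL

ANCHORS = {
    # alert name -> runbook anchor (both hypothetical)
    "QueueDepthHigh": "queue-depth-high",
    "ErrorRateSpike": "error-rate-spike",
}

def format_page(alert_name: str, summary: str) -> str:
    """Build the text of the page/notification for a firing alert."""
    anchor = ANCHORS.get(alert_name)
    link = f"{RUNBOOK}#{anchor}" if anchor else RUNBOOK
    return f"[{alert_name}] {summary}\nRunbook: {link}"

print(format_page("QueueDepthHigh", "orders queue depth > 10k for 5 minutes"))
```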
The third is fixing problems that are fixable. This is the age-old tech-debt vs. features problem, where the company really wants you to spend 100% of the time banging out features, rather than solving existing headaches. Especially when the headaches only affect the engineers, who have likely pieced together workarounds.
The fourth is proper tooling. If a problem continues to exist because of other priorities, and working around it requires a complex set of steps, make a script. Put it in scripts/oncall. Delete it later. Have tools that give easy, quickly understandable insight into the system as a whole. You should have ops tools for running the system, deploying code, looking at logs, etc. The difference between ops tools and on-call tools is that you want to keep the former and delete the latter.
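Purely as an illustration of the shape such a throwaway script might take (the service, admin endpoint, and threshold are all invented for the example, not from any real system):

```python
#!/usr/bin/env python3
"""scripts/oncall/requeue_stuck_jobs.py

Hypothetical throwaway on-call helper: finds jobs stuck in 'processing'
longer than a threshold and requeues them via the service's admin API.
Delete this once the underlying bug is actually fixed.
"""
import json
import sys
import urllib.request

ADMIN_API = "http://localhost:8080/admin"  # made-up admin endpoint
STUCK_AFTER_MINUTES = 30                   # made-up threshold

def get_json(path):
    with urllib.request.urlopen(f"{ADMIN_API}{path}") as resp:
        return json.load(resp)

def post_json(path, payload):
    req = urllib.request.Request(
        f"{ADMIN_API}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def main():
    stuck = get_json(f"/jobs?state=processing&older_than={STUCK_AFTER_MINUTES}m")
    if not stuck:
        print("No stuck jobs. Nothing to do.")
        return 0
    for job in stuck:
        status = post_json("/jobs/requeue", {"job_id": job["id"]})
        print(f"requeued {job['id']}: HTTP {status}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```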
Fifth, your service should not be utterly broken. If it is, then keep it "in house" until it's stable. We've all been on beta projects, and know there are periods in the life of a product where it just simply cannot be expected to be stable. You should not push these services on non-service-experts.
Finally, you must recognize that if you're an expert in the service, you are the final destination when an issue is bad enough. Support can usually be divided into a couple tiers. Things that have clear workarounds are likely to be "Tier 1" and can be handled by a non-service-expert. Database indexes gone bad? Probably can be handled by a non-service-expert. Solution referenced in the runbook? Tier 1. The assumption here is that these are technical, experienced, smart people with on-call chops who simply don't know the ingredients to your particular soup. Any issue that rises to "Tier 2" will require an expert, and should be escalated. A good rule of thumb for non-service-experts: When you exhaust the runbook, escalate.
With the above in place, you should be able to integrate another team with a reasonable expectation that no one is going to dread their turn on-call.
There are still a few things you can do to further improve your chances of success. First, try and pair teams whose services overlap. If you're micro-servicing, it's quite likely that one team's service relies on another team's service. This means a team probably has at least a working knowledge of the service they will support as non-experts, since they consume its APIs.
Second, you should try and intermingle the on-call rotation so that you have a representative from each team on every rotation in the primary and secondary positions. This obviously works best with two teams and a perfectly even number of engineers, but we live in the real world, and there's a good chance you have an odd number of members or teams.
This article is clearly a view from the trenches, not a lofty managerial piece. I linked those above. However, the ideas there also apply: make sure your team is flexible and cognizant of imbalances and burnout, foster the proper culture, and adapt strategies to fit your specific situation.
There is an additional emergent benefit in doing on-call rotations this way. I call it a "service litmus test". Effectively, you can judge the state and quality of a service by how well a non-service-expert will do in an on-call rotation.
Think about it. If a non-service-expert can easily take an on-call rotation without undue stress, it means the service is either trivial, or else
1) It's properly documented.
2) Alerting is tuned and integrated properly.
3) Ops tooling is in place, and on-call tooling is as well, if necessary.
4) Priorities between product needs and tech debt are balanced.
5) It's in production when it is production (not PoC, alpha, beta, etc).
6) It reacts to unexpected situations reasonably well (errant script anyone?).
I never expect a constantly evolving service to nail all of the above points all the time. And I don't expect the above to be used as a rubric, either. You can probably decipher the state of a service by just chatting with the non-service-experts. How much are they sweating on-call for service X? Is the answer pretty consistent between all of them? If so, then you probably need to focus some time on addressing whatever shortcomings exist.
As an engineer currently in the trenches, this is my view up towards blue sky. Work/life balance is a frequently touted perk in HR propaganda, and in my opinion, is actually one of the best "intangible" ways to retain quality engineers. It's tough to quantify, for sure, but loosening the work-tether is a really good place to start. And on-call is probably the tightest noose.
