Triaging Support Tickets the Smart Way

by Sam, October 9th, 2023

Too Long; Didn't Read

Triaging support tickets can be a difficult task, especially when several stakeholders have competing background priorities. This article explains how such decisions can be modelled by expected value calculations as found in decision problems under risk and uncertainty.


Any developer who has been on live support knows the drill. Tickets flood the dashboard and decision pressure builds. There is no way to resolve all issues in one go. Some tickets will have to take priority over the others and receive attention first.

That’s the uncontroversial bit. It’s a fact of life that not everything can be done at once, and it just manifests here on a dashboard of support tickets. Things get a bit more heated when a priority ticket has to be chosen. Assuming that you’re neither a solo entrepreneur nor an all-powerful dictator, you’ll be bound to discuss this with a couple of people who have slightly different background motivations. What ensues is a kind of round-table debate which some team members may find all too political.

There is a lot to say about the psychology and interpersonal dynamics of such exchanges. But how would we approach a triage session if we weren’t bound by so much humanness? In other words, is there an optimal or near-optimal strategy applicable to the process of assigning priorities to a bunch of tickets that need looking at?

There is. The path to such a strategy lies in recognizing that this commonly faced software engineering problem is in fact an instance of a decision problem under risk and uncertainty.

All Action

Let’s get concrete(r).

Given a support ticket, the following actions are available:

no fix
external fix
quick fix
full fix

The seasoned developer might, at this point, quickly associate certain typical scenarios with each of these fixes. Firstly, a no fix can be chosen when the impact of the issue is extremely low. For example, it’s not customer-facing and only occurs after a series of edge-case inputs. Secondly, an external fix is chosen when it is clear that the error is caused by a fault in third-party software. In such cases, it is even imperative to negotiate a fix from the third party so as not to introduce an additional dependency into the existing software system (say system A has a hack to cover for a bug in system B; when system B fixes its bug, system A might have to revert its hack). For example, an API unable to handle requests at a reasonable interval should not be flood-defended on the consumer side. Thirdly, quick fixes can alleviate issues which have a high impact on the users. Lastly, full fixes provide the most robust solution but might take significantly more development time.

There are already a couple of general lessons to draw from just thinking about the scenarios in which these actions occur.

Observation 1: the impact an issue has on the users of the software system is a strong determinant of the chosen action
Observation 2: development time plays a non-negligible role and can be a potential source of fallacy (it’s human to want things now rather than later)

Great observations, but can we do better? I think so. Let’s get formally concrete(r).

Expected Value

For each of the possible actions, calculate the expected return according to the following formula:

E[action] = P(issue | action) x V(issue) + P(no issue | action) x V(no issue)

where

P(): probability function, assigns a numerical value between 0 and 1 representing how likely an outcome (issue or no issue) is

P(A|B): the conditional probability, how likely it is that A occurs given that B occurred

V(): value function, assigns a numerical value to each outcome (issue or no issue) representing the gain/loss that outcome brings to the software system
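
As a minimal sketch, the formula can be written as a small Python function. The function name `expected_value` and its parameter names are my own choices for illustration, and the numbers in the usage line are hypothetical. The sketch assumes the two outcomes are complementary, so P(no issue | action) = 1 - P(issue | action).

```python
def expected_value(p_issue: float, v_issue: float, v_no_issue: float) -> float:
    """E[action] = P(issue | action) x V(issue) + P(no issue | action) x V(no issue)."""
    if not 0.0 <= p_issue <= 1.0:
        raise ValueError("p_issue must be a probability between 0 and 1")
    # Assumption: "issue" and "no issue" are the only two outcomes,
    # so their probabilities sum to 1.
    p_no_issue = 1.0 - p_issue
    return p_issue * v_issue + p_no_issue * v_no_issue


# Hypothetical numbers, for illustration only.
print(expected_value(p_issue=0.5, v_issue=-20, v_no_issue=10))  # -5.0
```

The function takes one probability per action because the other follows from it. What differs between actions is only P(issue | action); the two values stay the same.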

Real Scenario

There is a button on the front page whose font size is wrong. Annoying but not deadly.

We can leave it as is; the site is functional. At first glance there is no external fix, since it’s our button. Or is it? We’re using a library of UI components, so maybe their developer coded an Easter egg that changes the font size of every 5th button on a page. A quick fix can be devised by adding a bespoke stylesheet which is not transpiled or packaged up. A proper fix involves editing the namespaced stylesheet and pushing the change through the CI/CD pipeline and a release cycle.

Now, imagine the following stakeholders (any resemblance to real persons is purely coincidental):

Boss Betty: She thinks customers will notice the odd-one-out button and wants the issue to be resolved as quickly as possible. She cares about the reputation of the company and feels that this issue does not reflect well upon the company values.
External Eddy: Man-in-the-middle between our developers and the UI library developers. Believes his developers more than he believes ours. It’s his job to balance keeping unnecessary work out against maintaining the standing of the UI library as the best of the pack.
Wizard Wendy: Developer known for her creativity in finding quick fixes to hard problems. Has already coded up a fix in her notepad.
Seasoned Steve: Senior developer with a fetish for clean code, regularly wakes up at night after having a dream about loose sheets, stylesheets.
Agile Angie: Delivery lead, always there for the team and for the sprint targets, in reverse order.

Forgive me the tongue-in-cheek depictions of these characters; exaggeration can be a purposeful device to augment an exposition. I am confident that real companies are populated by people far more nuanced.

Before any formal analysis of the decision problem, these 5 people each have their preferred action based on their personal preferences and dispositions:

Boss Betty: quick fix
External Eddy: no external fix
Wizard Wendy: quick fix
Seasoned Steve: full fix
Agile Angie: quick fix

What would be the outcome of a meeting among those people deciding the right course of action regarding this issue? If all people have equal weight in the decision process, it’d be a quick fix. However, people have different standings. For example, Seasoned Steve might be able to argue his case and convince the others that a full fix is really most appropriate.
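
The equal-weight outcome, and the effect of unequal standing, can be sketched as a simple tally. The preferences are the ones listed above. The `weights` argument and the weight of 4 given to Seasoned Steve are hypothetical, chosen only to show how standing can flip the result.

```python
from collections import Counter

# Preferred actions as listed in the article.
preferences = {
    "Boss Betty": "quick fix",
    "External Eddy": "no external fix",
    "Wizard Wendy": "quick fix",
    "Seasoned Steve": "full fix",
    "Agile Angie": "quick fix",
}


def winning_action(prefs, weights=None):
    """Return the action with the highest (optionally weighted) vote total."""
    tally = Counter()
    for person, action in prefs.items():
        # Anyone without an explicit weight counts as one vote.
        tally[action] += 1 if weights is None else weights.get(person, 1)
    return tally.most_common(1)[0][0]


print(winning_action(preferences))                         # quick fix
print(winning_action(preferences, {"Seasoned Steve": 4}))  # full fix
```

With equal weights the quick fix wins three votes to one and one. Once a single voice counts for more than the three quick-fix votes combined, the outcome changes without anyone changing their mind.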

What’s happening in the background is that each of them is calculating their version of the expected value of each action and then communicating that to the others. Depending on their level of inventiveness, a diverse set of reasons might be brought forward as to why they believe their preferred action to be the right action for the company as a whole.

At this point, you might think: what is the value of having a formal method to break down priority decisions when they’re made individually anyway?

Reason 1: a breakdown of each person’s decision into its components can illuminate the real reasons behind their choice
Reason 2: it might be easier / more objective / more truth-conducive to discuss the parts of a decision separately and calculate the expected value of the different actions as a group rather than each coming to the table primed with their preference

Etching Numbers

V(issue) in interval [-100, -10], i.e. some negative number depending on how bad the issue is perceived to be (e.g. Boss Betty, concerned about the company’s reputation, might estimate the negative impact of the issue to be higher than Seasoned Steve does); let’s assume it to be -50

V(no issue) = 100, some positive number

P(issue | no fix) = 1
P(no issue | no fix) = 1 - P(issue | no fix) = 0

P(issue | external fix) = 0.99
P(no issue | external fix) = 1 - P(issue | external fix) = 0.01

P(issue | quick fix) = 0.2
P(no issue | quick fix) = 1 - P(issue | quick fix) = 0.8

P(issue | full fix) = 0.01
P(no issue | full fix) = 1 - P(issue | full fix) = 0.99

E[no fix] = P(issue | no fix) x V(issue) + P(no issue | no fix) x V(no issue)
= 1 x -50 + 0 x 100 = -50

E[external fix] = P(issue | external fix) x V(issue) + P(no issue | external fix) x V(no issue)
= 0.99 x -50 + 0.01 x 100 = -49.5 + 1 = -48.5

E[quick fix] = P(issue | quick fix) x V(issue) + P(no issue | quick fix) x V(no issue)
= 0.2 x -50 + 0.8 x 100 = -10 + 80 = 70
 
E[full fix] = P(issue | full fix) x V(issue) + P(no issue | full fix) x V(no issue)
= 0.01 x -50 + 0.99 x 100 = -0.5 + 99 = 98.5

According to the above calculation, we should prioritize the full fix. The quick fix comes second. No fix and the external fix should be avoided.
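
The calculation can be reproduced, and re-run for other estimates of V(issue), with a short Python sketch. The probabilities and V(no issue) are the numbers assumed above. The names `P_ISSUE`, `expected_values` and `rank_actions` are my own, and the three V(issue) values are sample points from the interval given earlier.

```python
V_NO_ISSUE = 100

# P(issue | action), as assumed in the article.
P_ISSUE = {
    "no fix": 1.0,
    "external fix": 0.99,
    "quick fix": 0.2,
    "full fix": 0.01,
}


def expected_values(v_issue, v_no_issue=V_NO_ISSUE):
    """Expected value of every action for a given V(issue)."""
    return {
        action: p * v_issue + (1 - p) * v_no_issue
        for action, p in P_ISSUE.items()
    }


def rank_actions(v_issue):
    """Actions ordered from highest to lowest expected value."""
    ev = expected_values(v_issue)
    return sorted(ev, key=ev.get, reverse=True)


# Sample points: a mild, the assumed, and a severe estimate of V(issue).
for v_issue in (-10, -50, -100):
    ev = expected_values(v_issue)
    ranking = [(action, round(ev[action], 1)) for action in rank_actions(v_issue)]
    print(f"V(issue) = {v_issue}: {ranking}")
```

Under these probabilities the order of the actions is the same for every V(issue) below V(no issue): a lower P(issue | action) always yields a higher expected value. What a more pessimistic estimate changes is the size of the gaps between the actions, not the ranking.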

Crunchy Conclusion

Whether it is desirable to explicate each triaging decision to this level of detail is questionable. Whether the developer who has read this article has an edge in understanding what’s really going on during a triage call? A certainty.