Making Reliable Distributed Systems in the Absence of Erlang

by Oleksandr Kaleniuk, May 30th, 2023

Too Long; Didn't Read

Making reliable distributed systems in the presence of software errors is possible without Erlang. As soon as Erlang established the principles and proved them feasible in practice, it made itself redundant. The lessons we learned from Erlang were larger than Erlang itself.

It is 2023 now. Twenty years ago Joe Armstrong published a thesis titled “Making reliable distributed systems in the presence of software errors”. The thesis showcased several distributed systems and explained what made them reliable. All the systems were written in Erlang, a language Armstrong coauthored while working at Ericsson. It might have looked as if Erlang was the main topic of the thesis. As it turned out, it wasn’t.

What is the main topic of Armstrong’s dissertation then?

The first time I saw the dissertation, I was working in Erlang too, so I got interested at first, but then its focus on architecture and philosophy drove me away. Back then I already had some experience working in game dev and academia, and in my experience at the time, the very presence of architecture and philosophy was the most important diagnostic criterion for a failing software project. When people focus on architecture, it is a sure sign that nothing works, and when people focus on philosophy, nobody works either.

Back then, I avoided that “architecture and philosophy mark of death” successfully for about a decade, and thus had a good track record. All my projects were successful. They just never failed.

However, like a bad pathologist, I confused the cause and the consequence. The presence of architecture and philosophy is not a mark of death but an attempt to cheat death. My projects were not successful due to the lack of architecture; they didn't need any architectural effort to be successful because they were so simple. They were bound to succeed from the start and needed no remedy. They were too simple to fail.

In Joe's world, the systems are inherently complex so making things simple already requires effort. And he explains how to spend this effort with the most return on investment. It's not "the way" as many other acclaimed architects would have claimed in his place, but "a way". One among the others. Thought out, tested, then proven.

And his dissertation is much more about what it says in the title, making reliable distributed systems in the presence of software errors, than about one specific tool for that, namely Erlang.

In 2013, so ten years ago, I switched from Erlang back to C++, Python, C, and Assembly. I quit an Erlang job because remote work wasn’t working for me, and there were no Erlang jobs where I lived. I was sure the transition was temporary. In ten years, I thought, Erlang would conquer the world and I would come back to writing in Erlang without even having to move.

That didn’t happen. Erlang is reasonably popular compared to other functional languages such as Haskell or F#, but it’s nowhere near mainstream. How come? Don’t we want more reliable distributed systems?

We do. And we do build them. Sometimes in Erlang, more often in other languages. Erlang is a fitting language for distributed systems, indeed, but it’s not really essential. Even Armstrong himself admits that “indeed concurrent programs can be written in languages which are not themselves concurrent”. As it turned out, the main topic of his thesis was not the language at all but the presence of software errors itself.

Can’t we just fix all the errors?

There was one episode that ultimately reshaped my views on software errors. It happened right after I quit Erlang and went back to C++. I had gotten a job in the field of static analysis and had just learned two facts that I found disturbing at the time:

The industry average is about 1 defect per 1000 lines of code. The probability of your code being infallible falls with every new line.
The analyzers we used to establish that fact are not that omnipotent themselves. One method to validate an analyzer was to add defects deliberately and see how many of them the analyzer would catch. Spoiler alert: not that many.
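
To get a feel for how fast that probability falls, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that defects land on lines independently and at a uniform rate. Real defects cluster, so treat the numbers as rough intuition rather than a measurement. The function name and its parameters are made up for this example.

```python
def probability_defect_free(lines_of_code, defects_per_kloc=1.0):
    """Chance that a program of the given size contains no defects at all.

    Assumption (not taken from any study): each line is defective
    independently with probability defects_per_kloc / 1000.
    """
    per_line = defects_per_kloc / 1000.0
    return (1.0 - per_line) ** lines_of_code


if __name__ == "__main__":
    for size in (10, 100, 1000, 10000):
        print(size, "lines:", round(probability_defect_free(size), 4))
```

Under these assumptions, a 1000-line program has only about a 37% chance of being defect-free, and a 10,000-line one has practically none.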

So one day I was walking along the street trying to contemplate all of that thinking: "Is there a way to cheat defects?" Surely, if a program is small enough we can presume that it's defectless. And then to build a large program we only need to build a lot of small ones. Hey! I reinvented microservices before it was cool!

But as I was thinking about all of that, I came to a crossroads. The light was red, so I stopped. But then, as far as I can reconstruct it in retrospect, I must have thought, "I'll wait, but I still have to cross the road, so why don't I cross the road now and wait on the other side?" And so I did. I crossed the road on the red light, came up to the traffic light pole, and stood there until the light turned green again.

And this was my answer, and the answer was "no." There is no way to cheat software defects. People make stupid mistakes all the time and for no good reason. If I can’t even cross a road properly 100% of the time, then surely no program, however small, could possibly be considered 100% reliable. Even 'print "Hello, wolrd!"' may contain an error everyone has simply failed to find so far.

So what do we do with this information?

What do we do? Well, the same thing mechanical engineers or electrical engineers do. We admit that no software component is 100% reliable and:

start measuring the reliability of software components and systems. Unit tests are not good enough, we need tons of data now.
start improving the reliability of software systems by introducing diagnostic, self-repair, and redundancy components.

Improvement is impossible without measurement, but with measurement, improvement is possible. See, I introduced a redundant statement and improved the phrase's reliability. Easy!
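
The two items above can be combined in one small sketch. The Python below is a toy, not anyone's production design: `flaky_component` is a hypothetical part that fails at random, `supervised_call` restarts it on failure while recording diagnostics, and `measure` collects the data needed to talk about reliability as a number. All names and parameters are assumptions made for illustration.

```python
import random


def flaky_component(rng, failure_rate):
    """Hypothetical component that fails at random with the given probability."""
    if rng.random() < failure_rate:
        raise RuntimeError("component failed")
    return "ok"


def supervised_call(component, max_restarts, stats):
    """Call component(); on failure, record the error and try again."""
    for _ in range(max_restarts + 1):
        stats["attempts"] += 1
        try:
            return component()
        except Exception as error:
            stats["failures"] += 1             # measurement
            stats["last_error"] = repr(error)  # diagnostics
    raise RuntimeError("gave up after %d restarts" % max_restarts)


def measure(failure_rate, max_restarts, calls, seed=0):
    """Return (fraction of calls that eventually succeeded, raw counters)."""
    rng = random.Random(seed)
    stats = {"attempts": 0, "failures": 0, "last_error": None}
    succeeded = 0
    for _ in range(calls):
        try:
            supervised_call(lambda: flaky_component(rng, failure_rate),
                            max_restarts, stats)
            succeeded += 1
        except RuntimeError:
            pass  # the supervisor ran out of restarts for this call
    return succeeded / calls, stats
```

If failures are independent and happen with probability p, a call with r restarts fails only with probability p^(r+1). For p = 0.5 and r = 3 that is 1 in 16. Restarting only helps with transient failures, though: a component that fails every time will fail every restart too, and the counters are what tell you which case you are in.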

What about Erlang?

Erlang is designed to accommodate principles for building reliable systems in the presence of software errors, and in such a capacity, it is definitely a worthy instrument. It is one of the few concurrency-oriented programming languages in the world.

But as Joe Armstrong himself pointed out in his dissertation: "...indeed concurrent programs can be written in languages which are not themselves concurrent". You can write concurrent programs in Python if you want. Or even shell. As soon as you can spawn a process and send it a message, you're good.
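
To make "spawn a process and send it a message" concrete, here is a minimal sketch in Python using only the standard `subprocess` module. It starts a second Python interpreter as a separate OS process and talks to it over pipes, one line per message. The helper names (`spawn_echo_process`, `send`, `stop`) and the tiny line-based protocol are inventions for this example, not anything from Armstrong's thesis.

```python
import subprocess
import sys

# Child program: reads one message per line from stdin, replies on stdout.
CHILD_SOURCE = r"""
import sys
for line in sys.stdin:
    message = line.strip()
    if message == "stop":
        break
    print("echo: " + message, flush=True)
"""


def spawn_echo_process():
    """Start a separate OS process running the child program."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def send(process, message):
    """Send one message and wait for the one-line reply."""
    process.stdin.write(message + "\n")
    process.stdin.flush()
    return process.stdout.readline().strip()


def stop(process):
    """Ask the child to exit and return its exit code."""
    process.stdin.write("stop\n")
    process.stdin.flush()
    process.stdin.close()
    code = process.wait(timeout=5)
    process.stdout.close()
    return code


if __name__ == "__main__":
    child = spawn_echo_process()
    print(send(child, "hello"))
    stop(child)
```

The two processes share no memory, so a crash in the child cannot corrupt the parent. The parent only sees a closed pipe and an exit code, which is exactly the kind of signal a restart policy can act on.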

And this explains why Erlang hasn't conquered the world. As soon as it established the principles and proved them feasible in practice, it made itself redundant. The lessons we learned from Erlang were larger than Erlang itself.