Here's How I Scaled A Cryptocurrency Exchange's Trading Engine to 1 Million TPS

Crypto exchange has been in great demand with the adaptability of cryptocurrencies surging and multiple new tokens/coins been put there attracting users by showcasing their great potential.

An exchange platform consists of many components which communicate with each other in order to achieve the functionality, but the main component which is the most critical and heart of the exchange is the matching engine which takes in multiple orders at once, maintains them, does the computations and pushes out trades.

The reason it being the most critical component is because a single failure or any wrong calculation could put the exchange at risk, mostly at the financial end breaking the user's trust.

The initial matching engine that I made was using node.js, typescript and rabbitMq, which ran really smooth with no delays or malfunctioning but when an enterprise-level solution was demanded the order placements were taken to higher levels, delays in trades were noticed, the server crashed consuming 100% CPU capacity, messaging queues started getting stuck.

When these problems were examined, it was understood that this wouldn't just require a small change but will have to be re-designed and built.

The exchange platform was build following a micro-service architecture because of which such kind of modifications and re-designs could be considered and experimented.

Learning that when it comes to heavy computation with high data load and low latency expectancy, node is not the best technology to use for this level of processing.

In general, any CPU intensive operation annuls all the throughput benefits Node offers with its event-driven, non-blocking I/O model as any incoming requests will be blocked while the thread is occupied with your number-crunching.

When it comes to adding concurrency on a multi-core server, there is some work being done by the Node core team in the form of a cluster module.

But still with clustering, one should offload all heavy computation to background processes written in a more appropriate environment for that, and having them communicate with each other via a message broker.

CHOOSING GO LANG

Choosing a multi-core system and increasing concurrency doesn't guarantee proportional speed ups, clumsy controls over concurrency can make it even worse on multi-core.

Go has made some really good design selection for highly scalable concurrency control.

Is it Process? No.
Is it Thread? No.
Is it Go-routine? Yes.

Context Management

When it comes to resource sharing and context management, concurrent tasks share processors and memory, usually, the number of tasks that have to be processed is more than available processors.

All it needs to handle is pausing and resuming the execution, but very efficiently.

Process of context switching requires many expensive operations like:

finding out process to run next, management of waiting/pending process
backup of all current CPU registers, restoring all CPU register to backup of next process
flush virtual memory cache
all operations run in kernel, so managing switching between operating system kernel mode and user mode.

Threads

Threads are similar to processes but they share address space because of which thread context is smaller than process context.

Thread is faster than the process for creation and switching game, but still, context switch overhead exists.

For threads and processes, kernels need to backup/restore entire registers because kernels don't know which registers are in use.

Threads allocate a fixed size for its stack on creation irrespective of the required stack size which limits the number of concurrent tasks running at a time.

GO-ROUTINES

When compared to Threads, Go-routines have minimal context overhead.

Go-routines also have co-operative scheduling, which minimizes context switching itself. They do context switching only in well-defined situations like channel send/receive operations, go statement, blocking system calls (file/network IO), and garbage collection.

If Go-routines aren't co-operative, starvation becomes very likely.

Go compiler emits code for using the register check and backup of them for every context switching event.

Go compiler knows what stack size is required for a given function. It starts with allocating a very small size stack and just before the function call Go checks whether current stack can accommodate the functions stack size requirement if it's not sufficient it dynamically increases the stack size according to the need. Vice a versa it can also shrink the stack size. As a result, Go-routines can keep a necessary stack size and allow maximum concurrent go-routines.

Go is special on multi-core systems owing to its clever design choices, it is super cheap and fast for context management and dynamic stack size management of goroutine allows more concurrency.

COMPARISON NODE ENGINE VS GO ENGINE

After building the new system, it was time to compare the scalability and efficiency of the two systems, the results were exciting.

Go engine results looked like,

high concurrency
low latency
trade speed/throughput calculated was 14 microseconds

CPU UTILISATION WAS MANAGED SO WELL AMONG ALL CORES

CONCLUSION

Choosing simple and efficient things always work. I could have designed a complex system with many queues, background workers, etc but instead decided to leverage the efficiency and simple approach to concurrency that Go-lang provides us out of the box.

There is always the right tool available for the job, it's just knowing about them helps in the best selection.