Scaling to Infinity: Innovating the Worker Queue

Written by paragon | Published 2020/06/08

TLDR: Paragon provides a visual builder for creating APIs and workflows that connect to databases, third-party APIs and services, and logic or custom code for routing requests and transforming data. Traditionally, queues and APIs are kept separate: queues handle long-running or CPU-intensive tasks, while APIs serve quick responses to keep applications snappy. With that context, we had to build a worker queue where users can run arbitrary code (steps) on the server and the system can scale to execute zero, thousands, or millions of steps in parallel.

When building applications, it's common to coordinate frontends, databases, worker queues, and APIs. Traditionally, queues and APIs are kept separate; queues handle long-running or CPU-intensive tasks, and APIs serve quick responses to keep applications snappy.
Here at Paragon, we were faced with the question: how do we build a performant API for a queue that handles all our customers' tasks?
For those not familiar with us, Paragon provides a visual builder for creating APIs and workflows. Users can build cron jobs and API endpoints in minutes that connect to databases, third-party APIs and services, and logic or custom code for routing requests and transforming data. With that context, we had to build a worker queue that could support the following use cases:
  • users can run arbitrary code (steps) on the server
  • the system should be able to scale to execute zero, thousands or millions of steps in parallel
  • a series of steps (workflow) can be triggered by an API request or a scheduled event
  • if triggered by an API request, the output of the workflow should be returned in an API response nearly instantaneously
Sounds like a Herculean feat, right? Spoiler alert: it was. This blurs the lines between a worker queue and an API, and there were no common engineering paradigms to draw from. As you can imagine, the security and performance implications of these product requirements kept our engineering team busy for quite some time.

Introducing: Hermes + Hercules

Due to the complexity, performance, and security requirements of our platform, we've had to innovate on the API + worker queue construct a bit. We created Hermes (our API messaging service) and Hercules (our workflow executor) to solve these problems.
Hermes accepts API requests, sends them to Hercules for execution, waits for the specified workflow and its steps to complete (or fail), then sends back a response to any awaiting clients. They're entirely separate services, but they communicate together to receive, schedule and execute work.
One might think that the added complexity and latency between submitting jobs to a worker queue and waiting for a response might have slowed down the API. We were quite pleased to find out the opposite: our APIs got much faster, particularly when processing large arrays of data.
Thanks to Hercules' ability to self-monitor and autoscale, we can distribute work across processes and run them in parallel. Additionally, if a branch of steps fails, the others can continue to run without terminating the request, adding consistency and reliability to workflows.
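To make that request/response flow concrete, here's a minimal sketch (in TypeScript, and not Paragon's actual code) of the Hermes pattern: accept a request, hand the workflow to the queue, and hold the HTTP response open until the result arrives. The Express server, the in-memory queue and result store, and the polling awaitResult helper are all illustrative stand-ins; in production the queue and results would live in external stores, and the worker side is omitted here.

```typescript
import express from "express";
import { randomUUID } from "crypto";

// Hypothetical in-memory stand-ins for the real queue and result store.
// In production these would be external (e.g. Redis / Postgres), and a
// Hercules-style worker would consume the queue and write the results.
const queue: { executionId: string; workflowId: string; payload: unknown }[] = [];
const results = new Map<string, unknown>();

function enqueueWorkflow(workflowId: string, payload: unknown): string {
  const executionId = randomUUID();
  queue.push({ executionId, workflowId, payload });
  return executionId;
}

// Poll the result store until the workflow finishes or we give up.
async function awaitResult(executionId: string, timeoutMs = 30_000): Promise<unknown> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (results.has(executionId)) return results.get(executionId);
    await new Promise((r) => setTimeout(r, 50));
  }
  throw new Error("workflow timed out");
}

const app = express();
app.use(express.json());

// API-triggered workflow: enqueue, wait, and answer the same HTTP request.
app.post("/workflows/:id/execute", async (req, res) => {
  const executionId = enqueueWorkflow(req.params.id, req.body);
  try {
    const output = await awaitResult(executionId);
    res.json({ executionId, output });
  } catch {
    res.status(504).json({ executionId, error: "timed out" });
  }
});

app.listen(3000);
```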
With that in mind, here are some of the things we thought about or learned while building a resilient, scalable system.

Fault Tolerance

Things will break. That's a truth we all must accept when building systems. Code won't work as expected, memory will leak, APIs won't return the expected responses, upgraded dependencies will introduce new issues, and so on. We're dealing with all of these unknown unknowns as well as the fact that we're running arbitrary code from our users that may fail.
To ensure jobs don't take down other jobs, it's important to run them in isolation. This means:
  • Use a single process per execution. No process should handle more than one job at a time; if a job crashed a process running several jobs, every other in-progress job on that process would fail with it (see the sketch after this list).
  • Isolate code from the core environment. Users shouldn't be allowed to write malicious code that messes with the underlying hardware, environment variables, or other key resources. We use a combination of Docker containers and sandboxed environments within those containers to separate user code from the main processes.
  • Separate the API layer from the worker layer. It's of the utmost importance to us that every job gets executed. Given that workers will periodically go down, we needed to ensure we have a reliable server always available to accept requests to queue jobs even if there's no worker immediately available to process them.
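Here's a rough sketch of the "one process per job" idea in Node/TypeScript. The step-runner.js entry point and the runJobIsolated helper are hypothetical, not Paragon's implementation; the point is that a crash in user code kills only its own child process, never its siblings.

```typescript
import { fork } from "child_process";

interface Job {
  id: string;
  stepCode: string; // arbitrary user-supplied step code
}

// Run each job in its own child process so a crash is contained to that job.
function runJobIsolated(job: Job): Promise<{ id: string; status: "success" | "failed" }> {
  return new Promise((resolve) => {
    // "./step-runner.js" is a hypothetical entry point that evaluates the
    // step inside a sandbox (e.g. a locked-down container or vm context).
    // Passing an empty env avoids leaking secrets into user code.
    const worker = fork("./step-runner.js", [], { env: {} });

    worker.send(job);
    worker.once("message", () => resolve({ id: job.id, status: "success" }));
    // If the process dies first (crash, OOM, non-zero exit), only this job fails.
    worker.once("exit", () => resolve({ id: job.id, status: "failed" }));
  });
}
```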

n-Scale

As most agile teams do, we have multiple environments for testing different versions of our code. This means in one dev environment there may be zero jobs running while in another there could be thousands.
To ensure we're able to meet traffic demands while still optimizing resources, here are a few key infrastructure elements we've employed.
  • Horizontally scale with containers. Given that we're running only one job per process, running code on more powerful servers (vertically scaling) doesn't equate to higher throughput. Thus, the answer for us was horizontally scaling smaller instances of each service to meet demand.
    It's important that all of these machines are running the same versions of the code for consistency, so we package each version of our code into executable images with Docker. We build the images in our CI environment on every deployment, and when new containers are deployed to meet demand, they pull the latest image.
  • Add auto-scale monitors. We monitor a few key metrics to determine scaling logic: the number of jobs in the queue, the age of the oldest job in the queue, memory usage on the machines, response times from requests, and endpoint health checks (a simplified version of this decision logic is sketched after the list).
  • Separate domain-specific code into microservices. Our microservice architecture allows us to auto-scale resources independently, meaning there could be a single instance of Hermes accepting requests but hundreds of Hercules instances processing them. Additionally, we have many more microservices for various functionality required by Hermes or Hercules to access data stores.
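As an illustration of how those signals can feed a scaling decision, here's a simplified sketch. The thresholds, the ten-jobs-per-worker ratio, and the metric names are invented for the example and are not Paragon's actual configuration.

```typescript
interface QueueMetrics {
  queuedJobs: number;           // jobs currently waiting in the queue
  oldestJobAgeMs: number;       // how long the oldest queued job has waited
  avgMemoryUtilization: number; // 0..1, averaged across worker containers
  currentWorkers: number;
}

function desiredWorkerCount(m: QueueMetrics): number {
  // Scale to zero when there's nothing to do.
  if (m.queuedJobs === 0) return 0;

  // One job per process, so throughput scales with worker count:
  // target roughly one worker per ten queued jobs.
  let target = Math.ceil(m.queuedJobs / 10);

  // Jobs waiting too long or hot memory suggest we're under-provisioned.
  if (m.oldestJobAgeMs > 30_000 || m.avgMemoryUtilization > 0.8) {
    target = Math.max(target, m.currentWorkers + 1);
  }

  return Math.min(target, 1_000); // hypothetical hard cap
}
```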

Persisted State

Our users' workflows may run in milliseconds, minutes or days depending on their configuration. We're often deploying new features multiple times a day, meaning servers terminate to pull the latest images, and jobs execute in separate processes across a variable number of servers.
This means workflow state can't be stored in memory; otherwise, other machines wouldn't have access to it.
Steps may need data from previously executed steps in a workflow that ran some unknown time ago. Additionally, jobs should only be executed once, meaning whether there's one processing server or thousands, there should be no race conditions amongst workers.
These are all common scenarios for a worker queue. The solution for all of this is straightforward:
  1. Don't store anything important in memory or on the containers.
  2. Use an ACID-compliant datastore external to the containers. This provides locking mechanisms for claiming jobs and reliability when reading and writing data (see the job-claiming sketch below).
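One common way to get that exactly-once claiming behavior from an ACID datastore is row-level locking, for example Postgres's FOR UPDATE SKIP LOCKED. The sketch below uses the node-postgres (pg) client and an illustrative jobs table; it shows the idea rather than the exact mechanism Paragon uses.

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from PG* env vars

// Claim at most one queued job, guaranteed not to be claimed by any other worker.
async function claimNextJob(): Promise<{ id: string; payload: unknown } | null> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // SKIP LOCKED lets many workers poll concurrently without
    // ever grabbing the same row twice.
    const { rows } = await client.query(
      `SELECT id, payload
         FROM jobs
        WHERE status = 'queued'
        ORDER BY created_at
        LIMIT 1
        FOR UPDATE SKIP LOCKED`
    );
    if (rows.length === 0) {
      await client.query("COMMIT");
      return null;
    }
    await client.query("UPDATE jobs SET status = 'running' WHERE id = $1", [rows[0].id]);
    await client.query("COMMIT");
    return rows[0];
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```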

Cache Rules (Everything Around Me)

Our distributed architecture comprises multiple microservices, each with its own data store; some read from and write to several. A single API call to one microservice may trigger a dozen calls to other services, reads and writes, encryption and decryption, and more, which can add hundreds of milliseconds, if not seconds, of overhead.
Given that a workflow can have any number of steps, every 100 milliseconds added to a step can lead to a poor user experience.
Caches can have huge performance improvements across systems when implemented correctly. One thing to consider here is that in a containerized n-scaled microservice architecture, in-memory caches, while easy to implement and very performant, aren't always the best choice.
If a microservice has 10 instances behind a load balancer each running with in-memory caches implemented in front of a database, the cache will miss nearly every time. Thus, we use Redis as a shared cache between instances.
Additionally, we considered request volumes and read-to-write ratios. Given that our workflows execute, on average, thousands of times more often than they're deployed, they were the very first thing we cached, which has saved us millions of database queries and API calls between microservices, and shaved seconds off API responses.
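A read-through cache over a shared Redis instance might look roughly like this (using ioredis; the key scheme, TTL, and loadWorkflowFromDb helper are placeholders rather than Paragon's code). Because workflow definitions change only on deploy, the write path can simply delete or overwrite the key when a new version is published.

```typescript
import Redis from "ioredis";

const redis = new Redis(); // shared by every instance behind the load balancer

async function getDeployedWorkflow(workflowId: string): Promise<unknown> {
  const key = `workflow:${workflowId}`;

  // Read path: served from Redis on the (vastly more common) cache hit.
  const cached = await redis.get(key);
  if (cached !== null) return JSON.parse(cached);

  // Miss: fall back to the database, then populate the cache with a TTL.
  const workflow = await loadWorkflowFromDb(workflowId);
  await redis.set(key, JSON.stringify(workflow), "EX", 300);
  return workflow;
}

// Hypothetical stand-in for the real persistence layer.
async function loadWorkflowFromDb(workflowId: string): Promise<unknown> {
  return { id: workflowId /* ...workflow definition... */ };
}
```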

Conclusion

No system is perfect, but we're proud of what we've built at Paragon. Hercules does all of the heavy lifting while Hermes does all the talking. They can scale up and down independently of each other to meet demand, and when a process fails, other parts of the system are both unaware and unaffected.
We'd love to hear your thoughts on our implementation! Happy coding.
