Kafka Gotchas



I’ve assisted several large clients in building a microservices-style architecture using Kafka as a messaging backbone, and I have a reasonably good understanding of its capabilities and the use cases that really bring them out. But I’m not a Kafka apologist by any stretch; any technology that has gone through such a rapid adoption curve is bound to polarise its audience and rub some developers up the wrong way, and Kafka is no exception. Like anything else, you need to invest a significant amount of time in coming to grips with Kafka and event streaming in general before you become fully proficient and can harness its might. And be prepared to face one or two frustrations, to put it mildly, along the way.
I’ve compiled a list of shortcomings that may cause developer frustration, or catch out unsuspecting first-timers. In no particular order:

Too many tunable knobs 

The number of configuration parameters in Kafka can be overwhelming, not just for newcomers but also seasoned pros. Possibly with the sole exception of the JVM, I cannot think of another technology that has this many configuration parameters.
This isn’t to say that the config options aren’t necessary, but one does wonder how many of them could instead be replaced by ergonomics, much like Java did with the G1 garbage collector. Rather than specifying a plethora of individual thresholds and tolerances, let the operator set a performance target, and have the system derive an optimal set of values that best meets that target.
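For contrast, this is roughly what target-based ergonomics look like in G1: the operator states a single pause-time goal and the collector sizes its internals to suit. (The flags below are real JVM flags; the target figure and jar name are just examples.)

    # One performance target in lieu of dozens of individual tuning knobs;
    # G1 derives region sizes, thread counts and heap resizing from this goal.
    java -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -jar app.jar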

Unsafe defaults 

This is my biggest gripe with the config options. Kafka authors make several bold claims around the strength of their ordering and delivery guarantees. You would then be forgiven for assuming that the defaults are sensible, insofar as they ought to favour safety over other competing qualities.
Kafka defaults tend to be optimised for performance, and will need to be explicitly overridden on the client when safety is a critical objective.
Fortunately, setting the properties to ensure safety has only a minor impact on performance — Kafka is still a beast. Remember the first rule of optimisation: don’t do it. Kafka would have been even better, had its creators given this more thought.
Some of the specific examples I’m referring to are listed below (a sketch of the safer overrides follows the list):
  • enable.auto.commit — defaults to true, which results in consumers committing offsets every five seconds (configured by auto.commit.interval.ms), irrespective of whether the consumer has finished processing the record. Often, this is not what you want, as it may lead to mixed delivery semantics — in the event of consumer failure, some records might be delivered twice, while others might not be delivered at all. This should have been set to false by default, letting the client application dictate the commit point.
  • max.in.flight.requests.per.connection — defaults to 5, which may result in messages being published out-of-order if one (or more) of the enqueued messages times out and is retried. This should have been defaulted to 1.
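By way of illustration, below is a minimal consumer sketch in Java against the standard kafka-clients API, overriding the unsafe default so that offsets are committed only after the polled records have been processed. (The bootstrap address, group and topic names are placeholders.) On the producer side, the analogous override is setting max.in.flight.requests.per.connection to 1, or enabling the idempotent producer on brokers that support it, which preserves ordering across retries.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class SafeConsumerSketch {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "safe-group");  // placeholder group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Override the unsafe default: no automatic offset commits.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
          consumer.subscribe(List.of("my-topic"));  // placeholder topic
          while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
              handle(record);  // application-specific processing
            }
            // Commit only once every polled record has been handled.
            consumer.commitSync();
          }
        }
      }

      private static void handle(ConsumerRecord<String, String> record) {
        System.out.printf("%s: %s%n", record.key(), record.value());
      }
    }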

Appalling tooling

There is no consistency in the naming of command-line arguments, and the simple act of publishing keyed messages requires you to jump through hoops, passing in obscure, undocumented properties. Some native capabilities, such as record headers, aren’t even supported. The usability of the built-in tools is a well-known source of heartache within the Kafka community.
This is a real shame. It’s like buying a Ferrari, only to have it delivered with plastic hub caps. Most Kafka practitioners have long abandoned the out-of-the-box CLI utilities in favour of other open-source tools such as Kafdrop, Kafkacat and third-party commercial offerings like Kafka Tool.
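As a case in point, here is what publishing a keyed record through the stock console producer looked like at the time of writing. The parse.key and key.separator properties are real, but weren’t surfaced in the tool’s help output; note also that the producer takes --broker-list where the matching console consumer takes --bootstrap-server. (The broker address and topic are placeholders.)

    # Publish keyed records via the stock console producer; keys are only
    # parsed if these properties are passed explicitly.
    kafka-console-producer.sh \
      --broker-list localhost:9092 \
      --topic prices \
      --property parse.key=true \
      --property key.separator=:

    # Records are then typed in as key:value pairs, e.g.
    #   AAPL:279.74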

Complicated bootstrapping process

The bootstrapping and service discovery process used by clients to establish broker connections is complicated and tends to confuse users. Clients will initially be supplied with a list of broker addresses and ports. A client will then connect to an address at random, discovering the remaining broker nodes, before forming new connections directly to the discovered nodes.
This is fairly trivial in a simple, homogeneous network setup where all connections from all clients and peer nodes traverse a single ingress. In a heterogeneous network, there may be several ingress points to segregate broker-to-broker communications, internal clients that live on the same local network, and external clients that might connect over the Internet.
To accommodate this, the bootstrapping/discovery process supports a special configuration: dedicated listeners for each ingress point, and a separate set of advertised listeners that are presented back to the connecting clients.
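A sketch of what this might look like in a broker’s server.properties, assuming a broker reachable as kafka-0.internal on the local network and as broker-0.example.com externally (both hostnames are made up):

    # Bind two listeners: one for the internal network, one for external clients.
    listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9093

    # Addresses handed back to clients during bootstrapping/discovery; a client
    # is given the advertised address matching the listener it connected on.
    advertised.listeners=INTERNAL://kafka-0.internal:9092,EXTERNAL://broker-0.example.com:9093

    # Map each listener name to a security protocol, and nominate the listener
    # used for broker-to-broker traffic.
    listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:SSL
    inter.broker.listener.name=INTERNAL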

Shaky client libraries

The quality/maturity of client libraries in languages other than Java, Python, .NET and C is sub-par. If you are a Java developer, you’ve got it made — that’s where most of the development is concentrated. But Golang and some other communities have struggled in getting access to stable libraries, and while some of these ‘indie’ libraries have been around for several years, the quantity and severity of some of the bugs that I’ve come across in these languages are genuinely concerning.

Lack of true multitenancy

According to its maintainers, Kafka supports multitenancy. The present design is limited to access control lists (ACLs) for segregating topics and maintaining quotas, which creates an illusion of isolation for clients, but does not create isolation in the administrative plane. That’s like saying that your fridge supports multitenancy because it lets you store food on different shelves.
A true multitenancy solution would provide for multiple logical clusters within a larger, physical cluster. These logical clusters could be administered separately; a misconfiguration of an ACL in one logical cluster, for example, would have no effect on other logical clusters.
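For a flavour of the ACL-based segregation on offer today, the sketch below confines a hypothetical tenant-a principal to topics under its own prefix (the principal name, prefix and ZooKeeper address are made up). The data is segregated, but every tenant still shares a single administrative plane:

    # Grant the tenant-a principal access to topics sharing its prefix.
    kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 \
      --add --allow-principal User:tenant-a \
      --operation All \
      --topic 'tenant-a.' --resource-pattern-type prefixed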

Lack of geo-awareness

Geographical replication isn’t built into the brokers, and it is generally accepted that high-performance Kafka clusters and ‘stretch’ topologies don’t mix. There is an open-source project — MirrorMaker — which is effectively a pipeline for pumping records from one cluster to another, without preserving any critical metadata (such as offsets).
Confluent has its proprietary tool — Replicator — that will preserve metadata, but is a part of the licensed Confluent Enterprise suite.
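A minimal MirrorMaker invocation might look like the sketch below, assuming source-cluster.properties holds the consumer settings for the origin cluster and target-cluster.properties the producer settings for the destination (both file names are placeholders):

    # Pump every topic from the source cluster to the target cluster.
    # Offsets and other critical metadata are not carried across.
    kafka-mirror-maker.sh \
      --consumer.config source-cluster.properties \
      --producer.config target-cluster.properties \
      --whitelist '.*'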
On balance, and in spite of the points above, I wouldn’t say that Kafka is rubbish — quite the opposite. Of course, Kafka isn’t without its flaws. The tooling is sub-par, to put it mildly; the breadth of Kafka’s configuration options is overwhelming, with defaults that are riddled with gotchas, ready to shock the unsuspecting first-time user.
But as an event streaming platform, Kafka has shaped the way we now architect and build complex systems. It’s given us choices, and that’s a good thing. Its benefits go beyond the superficial, and they dwarf any of the niggles that are bound to exist in a technology that has undergone such aggressive adoption.
Was this article useful to you? Take a moment to bookmark it, so others might spot it too. I’d love to hear your feedback, so don’t hold back! If you are interested in Kafka or event streaming, or just have any questions, follow me on Twitter.

Written by Koutanov | Event-driven architecture and microservices evangelist
Published by HackerNoon on 2019/11/21