
Building with Kafka: the Good, the Bad & the Ugly


Here at Globality we are always excited to adopt great technologies that improve our platform’s functionality. Over the past year, a number of our engineering teams have been steadily embracing Kafka. As the business has grown, so has our need to scale, and that was one of the big motivators to adopt Kafka and gradually shift away from our existing message-brokering architecture. Another reason is the broad community support and widespread adoption that the Kafka ecosystem enjoys.

In this post, I’ll take you through the work we did building a new feature for the platform using Kafka Streams and what we learned in the process. We’ll cover the business case for what we built, an overview of how the architecture ended up looking on the backend, and finally lessons learnt and things we could improve upon next time.

Before we dive into the technical detail, I’ll quickly explain a few key terms we use within our platform. We operate a marketplace with users on both sides of the equation. Companies who are buying or procuring services are referred to as Clients. Providers, as the name suggests, are responsible for providing services to the clients operating on our platform. These range in size from global consultancies and accounting firms right down to highly specialist and niche suppliers operating in a much more targeted industry.

Finally, projects are discrete units of work tendered by the clients on the platform to the providers. A project encapsulates all of the information the client would like to share with providers in terms of their requests and requirements for a piece of work they would like to get done. We use information gathered from all three entities (clients, providers and projects), leveraging our AI technology to predict providers that might be a particularly good fit for a project on the platform and might never have been considered otherwise.

Glo Insights: The Feature

At Globality, we are building an AI-driven marketplace for clients to procure services from providers. Often, when evaluating providers for a given project, these clients must parse extensive amounts of information manually to determine if providers are a fit for the project at hand.

The aim of our feature is to surface information (e.g., case studies, services offered, past experience) indicating that a provider is uniquely relevant to the client’s project.

Glo Insights for a project on our platform

Glo presents these insights at the top of each provider’s profile, enabling clients to quickly identify providers that best match their project requirements and allowing them to make a well-informed decision when choosing a provider for their project.

The focus here was not on the mechanics of how we actually predict good insights. Those types of questions (and ultimately AI models) will evolve and improve over time. What we wanted was a robust mechanism to pre-compute these insights and display them to the user without incurring a read-time penalty.

Glo Insights: The Architecture

The way we have implemented Kafka at Globality utilises a concept known as change-data-capture (CDC). This is because our backend architecture has historically consisted of a number of PostgreSQL databases distributed across our services. Jumping straight in and attempting to migrate them to event sourcing all in one big-bang would be a bad idea.

So instead we capture the changes from our databases needed to pre-compute these highlights. In this case, there are two main sets of data we care about:

  • Data about providers (e.g., capabilities they have, case studies, locations they operate in, etc.)

  • Data about projects (e.g., the sector, what service the client is looking for, language requirements, etc.)

Ultimately these two collections of data reside in two different services. Hence, we have two downstream Kafka Streams services responsible for reshaping, aggregating, and finally joining this data. A third service acts as a query service, making the underlying KTable queryable using OpenAPI and HTTP, a pattern we were already intimately familiar with in our gateway layer.

An example of the two “Bounded Contexts” we’ve used, joined by a Kafka topic

At a high level, this follows our long-term architectural direction of using the Bounded Context pattern to de-couple the various business functions of our platform.
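Coming back to the query service: the materialised state that Kafka Streams builds up can be exposed to HTTP clients via interactive queries. Here is a minimal sketch of that idea in Java, assuming a hypothetical provider-insights topic and insights-store store name; the OpenAPI/HTTP layer and serde configuration are omitted.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

import java.util.Properties;

public class InsightsQueryServiceSketch {

    public static KafkaStreams start(Properties props) {
        StreamsBuilder builder = new StreamsBuilder();

        // Materialise the (hypothetical) joined insights topic into a named, queryable store.
        builder.table("provider-insights",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.as("insights-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        return streams;
    }

    // The HTTP handler would do something like this per request,
    // once the streams instance has reached the RUNNING state.
    public static String insightsFor(KafkaStreams streams, String providerId) {
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType(
                        "insights-store", QueryableStoreTypes.keyValueStore()));
        return store.get(providerId);
    }
}
```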

Provider Data

We store data about the various providers on our platform in a format known as assertions. This is a relationship between a subject and an object and is useful as a very generalisable concept. For example, we might have a piece of data that says:

Provider ACME_INC (subject) has GEOGRAPHIC_EXPERIENCE (relationship) in LONDON (object)

For some use cases, however, this generality becomes inefficient. It requires us to compute a large number of joins on some ID before we can build a provider profile, which is a projection of all the assertions we hold on a given provider into a single object. Think of this as our version of a social media profile, but for a company rather than an individual. Our prior implementation was essentially a read-time rollup of all the assertion data for a given provider, which can be slow due to the size of the joins involved; a typical provider will have hundreds, or even thousands, of associated assertions. As you can imagine, this is not ideal from the perspective of a user accessing our platform.
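As a rough illustration (with hypothetical topic, store, and class names rather than our actual schema), this is roughly how assertions can be rolled up into a provider profile with Kafka Streams, so the projection is maintained incrementally rather than joined together at read time.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

import java.util.ArrayList;
import java.util.List;

public class ProviderProfileSketch {

    // "Provider ACME_INC (subject) has GEOGRAPHIC_EXPERIENCE (relationship) in LONDON (object)"
    public record Assertion(String subjectId, String relationship, String objectId) {}

    // The read-optimised projection of every assertion we hold about one provider.
    public record ProviderProfile(String providerId, List<Assertion> assertions) {}

    public static void build(StreamsBuilder builder) {
        // CDC stream of assertions, keyed by provider id; serdes omitted for brevity.
        KStream<String, Assertion> assertions = builder.stream("cdc.provider_assertions");

        KTable<String, ProviderProfile> profiles = assertions
                .groupByKey()
                .aggregate(
                        () -> new ProviderProfile(null, new ArrayList<>()),   // empty profile
                        (providerId, assertion, profile) -> {
                            List<Assertion> next = new ArrayList<>(profile.assertions());
                            next.add(assertion);
                            return new ProviderProfile(providerId, next);
                        },
                        Materialized.as("provider-profiles-store"));          // queryable store

        // Downstream services read the pre-computed profiles from this topic.
        profiles.toStream().to("provider-profiles");
    }
}
```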

One special case is how case studies (typically .pdfs or rich HTML produced by the provider) are handled. These objects are considered separate from the other assertions about a provider, since they can themselves have assertions about them. It’s worth noting that when the case studies are initially streamed out of the tables, they are not hydrated with their associated assertions. Essentially, this step de-normalises the case studies so that each is stored as an object that also contains any associated assertions.

For example, we might have a case study assertion that looks something like:

Case Study FORTUNE_500_DIGITAL_TRANSFORMATION (subject) is ABOUT_INDUSTRY (relationship) called IT_CONSULTING (object)

We now precompute this (for highlights) as soon as the data changes on the provider side, which means there’s no read-time penalty on the client-side.
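A hedged sketch of what that hydration step could look like in Kafka Streams, assuming case studies and their assertions arrive on separate CDC topics keyed by case-study id (the topic names and value types here are illustrative):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

import java.util.ArrayList;
import java.util.List;

public class CaseStudyHydrationSketch {

    public record Assertion(String subjectId, String relationship, String objectId) {}
    public record CaseStudy(String caseStudyId, String providerId, String title) {}
    // The de-normalised object: a case study carrying its own assertions.
    public record HydratedCaseStudy(CaseStudy caseStudy, List<Assertion> assertions) {}

    public static KTable<String, HydratedCaseStudy> build(StreamsBuilder builder) {
        // Both inputs keyed by case-study id; serdes omitted for brevity.
        KTable<String, CaseStudy> caseStudies = builder.table("cdc.case_studies");

        KTable<String, List<Assertion>> assertionsByCaseStudy = builder
                .<String, Assertion>stream("cdc.case_study_assertions")
                .groupByKey()
                .aggregate(
                        ArrayList::new,
                        (caseStudyId, assertion, list) -> { list.add(assertion); return list; });

        // Equi-join on case-study id: hydrate each case study with its assertions.
        return caseStudies.join(
                assertionsByCaseStudy,
                (caseStudy, assertions) -> new HydratedCaseStudy(caseStudy, assertions));
    }
}
```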

Project Data

Similar to the provider side, we stream the data using CDC from the backend service responsible for matching projects to providers. In this case it’s matching_plans (which contain the relevant matching data for a given project) and match_suggestions (which tell us Provider X is a potential match for Project Y).
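In Kafka Streams terms, both tables arrive as changelog streams that can be consumed directly as KTables. A minimal sketch, where the topic names mirror the table names and the value types are assumptions:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class MatchingCdcSketch {

    // Illustrative value types; the real CDC payloads are richer.
    public record MatchingPlan(String projectId, String serviceRequested) {}
    public record MatchSuggestion(String suggestionId, String projectId, String providerId) {}

    public static void build(StreamsBuilder builder) {
        // Keyed by project id: the matching data for a given project.
        KTable<String, MatchingPlan> matchingPlans =
                builder.table("cdc.matching_plans");

        // Keyed by suggestion id: "Provider X is a potential match for Project Y".
        KTable<String, MatchSuggestion> matchSuggestions =
                builder.table("cdc.match_suggestions");

        // Downstream, these two tables are joined with the provider profiles
        // (see the refactored topology sketch further below).
    }
}
```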

In our original architecture, we called the endpoint responsible for computing insights within a for-loop. This is because insights are computed for each provider individually, and we have multiple providers per project to loop over (denoted by the aggregate step in the diagram below). As it turns out, this for-loop is particularly inefficient (more on that in a second)…

Aggregating before the external call

What’s wrong with this approach? Well, as in any Kafka-Streams application, an external call is going to be the most expensive part of this streams processor. Imagine a case where we have a single provider that changes on a project with 5 matched providers. This is a fairly conservative estimate; we have projects in production with 10 or more matched providers. We would unnecessarily recompute highlights for 4 out of the 5 matched providers each time a single provider changes.

Hence, we have to very inefficiently recompute highlights for all providers for a given project, even if only one has changed. This is further exacerbated by the fact that providers change a single assertion at a time, so there might also be some intermediate provider profiles that are computed if a large number of assertions are changed in a short period of time.
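For contrast, here is a hedged sketch of the shape we started with (all names are illustrative): providers were aggregated per project first, and the external insights call was then made for every provider inside a single mapping step.

```java
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.ArrayList;
import java.util.List;

public class AggregateThenCallSketch {

    public record ProviderMatch(String providerId, String projectId) {}
    public record ProviderHighlights(String providerId, String highlights) {}
    public record ProjectHighlights(String projectId, List<ProviderHighlights> highlights) {}

    /** Stand-in for the HTTP client hitting the external insights endpoint. */
    public interface InsightsClient {
        ProviderHighlights computeHighlights(ProviderMatch match);
    }

    public static KStream<String, ProjectHighlights> build(
            KTable<String, List<ProviderMatch>> matchesByProject, InsightsClient client) {
        return matchesByProject.toStream().mapValues((projectId, matches) -> {
            List<ProviderHighlights> results = new ArrayList<>();
            for (ProviderMatch match : matches) {              // the problematic for-loop:
                results.add(client.computeHighlights(match));  // one external call per provider,
            }                                                  // re-run whenever anything on the project changes
            return new ProjectHighlights(projectId, results);
        });
    }
}
```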

Calls to the external endpoint per 10 minute interval

When we initially released the feature to production, this was the traffic generated as the backend reprocessed all existing provider profiles in order to compute insights for them: around 16,000 requests/minute, hitting a service that would typically see tens of requests per minute. Note that the Kafka cluster takes this in its stride; however, the backend service being called ends up getting absolutely hosed.

Three-way join to create highlight_plan

The original topology was functionally correct, but its performance characteristics were poor, so we refactored this part of it. We now compute a three-way join that gives us all the data needed to compute highlights before aggregating by project.

Aggregating after the external call

Since the refactor, we don’t need to use a for-loop before the external call, as we are computing the highlights provider-by-provider as opposed to project-by-project. Therefore, if a single provider changes, only the highlights for that single provider are recomputed rather than all providers for a given project. If a single project changes, all provider highlights for that project will obviously need to be recomputed, but that was the case in the previous architecture as well. As you can see in the diagram above, the aggregation step happens after the external call.
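To make that concrete, here is a hedged sketch of the refactored topology in Java. The class names, topic names, and the InsightsClient interface are assumptions for illustration, not our actual service contracts.

```java
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

import java.util.ArrayList;
import java.util.List;

public class HighlightTopologySketch {

    public record MatchSuggestion(String providerId, String projectId) {}
    public record MatchingPlan(String projectId, String requirements) {}
    public record ProviderProfile(String providerId, String profileJson) {}
    public record HighlightPlan(MatchSuggestion suggestion, MatchingPlan plan, ProviderProfile profile) {}
    public record ProviderHighlights(String projectId, String providerId, String highlights) {}

    /** Stand-in for the HTTP client that calls the external insights endpoint. */
    public interface InsightsClient {
        ProviderHighlights computeHighlights(HighlightPlan plan);
    }

    public static void build(StreamsBuilder builder, InsightsClient client) {
        // Inputs, keyed as noted; serdes omitted for brevity.
        KTable<String, MatchSuggestion> suggestions = builder.table("cdc.match_suggestions"); // key: suggestion id
        KTable<String, MatchingPlan> plans = builder.table("cdc.matching_plans");             // key: project id
        KTable<String, ProviderProfile> profiles = builder.table("provider-profiles");        // key: provider id

        // Three-way join: suggestion -> its project's matching plan -> that provider's profile.
        KTable<String, HighlightPlan> highlightPlans = suggestions
                .join(plans, MatchSuggestion::projectId,
                        (suggestion, plan) -> new HighlightPlan(suggestion, plan, null))
                .join(profiles, hp -> hp.suggestion().providerId(),
                        (hp, profile) -> new HighlightPlan(hp.suggestion(), hp.plan(), profile));

        // External call once per changed (project, provider) pair, not once per project.
        KTable<String, ProviderHighlights> highlights =
                highlightPlans.mapValues(client::computeHighlights);

        // Aggregation by project happens *after* the external call.
        KTable<String, List<ProviderHighlights>> byProject = highlights
                .groupBy((key, value) -> KeyValue.pair(value.projectId(), value))
                .aggregate(
                        ArrayList::new,
                        (projectId, added, agg) -> { agg.add(added); return agg; },
                        (projectId, removed, agg) -> {
                            agg.removeIf(h -> h.providerId().equals(removed.providerId()));
                            return agg;
                        });

        byProject.toStream().to("project-insights");
    }
}
```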

Glo Insights: Lessons Learnt

Bringing a hypersonic missile to a knife fight

Kafka is built for scale, BIG scale. One particularly double-edged sword is that it’s likely to be able to process far more messages than other services you might have previously built. Depending on how you use Kafka, the volume of messages produced and consumed can quickly overwhelm your infrastructure.

Storage usage for the cluster

In the above example, we had a bug that erroneously generated ~150MB of traffic per minute. This was released to our dev cluster, and within a few hours it nearly tripled the storage usage of the cluster. Of course, Kafka is able to achieve this amount of throughput without breaking a sweat. Had we left such a bug out there in the dev environment, our Kafka brokers would eventually have hit their storage limits, causing the cluster to fall over. That wouldn’t have been a great look.

Without robust integration and end-to-end tests and performance monitoring, it’s easy for issues like this to slip through. We did have unit tests that ensured the functional correctness of our code. However, although the final result computed by Kafka Streams was correct, the topology also generated an overwhelming number of intermediate messages. Since Kafka Streams computes intermediate states, it’s possible to arrive at the correct result, just computed inefficiently. These intermediate states are only cleaned up when log compaction runs (if you have that enabled for your Kafka cluster).
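One common mitigation is Kafka Streams’ record caching: a larger cache and a longer commit interval allow repeated updates to the same key to be coalesced before they are forwarded downstream, and compacted output and changelog topics let the broker eventually discard superseded values. A sketch of the relevant knobs (the values here are purely illustrative, not a recommendation):

```java
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class StreamsTuningSketch {

    public static Properties props() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "glo-insights-sketch");   // hypothetical id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Bigger record cache: more updates to the same key are coalesced in memory
        // before being forwarded downstream and to changelog topics.
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 50 * 1024 * 1024L);

        // Longer commit interval: the cache is flushed (and intermediate records emitted)
        // less frequently.
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10_000);

        // Output and changelog topics should also use cleanup.policy=compact so that
        // superseded intermediate values are eventually removed by the broker.
        return props;
    }
}
```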

On the whole, we’ve found Kafka’s data persistence to be a good thing. You get good visibility into what is happening in your cluster, since messages don’t disappear after being consumed as they do with some other technologies; this is particularly true when using tools such as Kowl to monitor your topics. The trade-off is that whenever erroneous messages pollute your topics, in the best case you must bring in the clean-up crew for some broad-ranging topic deletion in a dev environment. In the unfortunate case that this has slipped out to production, you’ll probably end up writing code to ignore the illegitimately generated messages.

The boundary between old and new

Often the most error-prone aspects of Kafka Streams occur on the boundary between the Kafka eco-system and external services. In most cases, effort should be put towards ensuring that such boundaries are minimised, if not entirely eliminated from the get-go, using tools such as Kafka Connect as a data integration vehicle between different parts of the system. However, for legacy architectural reasons, this may not always be feasible. Your infrastructure is only as performant as its weakest link (and we already know that Kafka can be very performant). This is similar to the concept that any effort optimising outside of the bottleneck is wasted effort (see The Phoenix Project).

In our case, we were seeing traffic in the region of 16,000 requests per minute against a legacy service that would previously have received 10-15 requests per minute. Though the service didn’t end up falling over, it did log each HTTP request. This became problematic, as the volume of requests quickly overran the daily allowance of our logging platform. This is a good demonstration of why you should design your architecture to eliminate calls from Kafka Streams to an external service.

We resolved the issue by making our queries to the external service more efficient. Instead of making the calls inside a for-loop, we remodelled the domain object so that the for-loop was no longer required. Reshaping the data earlier on would have made our lives easier; as with any problem, there are pitfalls we could have avoided.

As an example, we shifted a lot of the heavy lifting out of our database models and onto the back of Kafka Streams. A more robust strategy would have been to refactor the database models to better represent the data we care about in our business domain. However, this is always a bit of a Herculean task to undertake retrospectively. Your data should represent the business objects you care about within your bounded context; don’t overcomplicate it.

Complexity: the eternal enemy

As everyone knows, complexity is bad. Using a new technology adds an indeterminate amount of additional complexity compared to using established methods. Since this is a relatively new string in our bow at Globality, there is still a high initial bar for creating a new Kafka Streams service. As we gain experience with these technologies, we are building more of a consensus on how we apply Kafka Streams to fit our needs.

The increasing number of developers at Globality using Kafka has presented its own set of challenges. Many of the best practices and guidelines we applied informally when first using Kafka internally must now be shared with a wider audience. In my experience, it’s better to enforce such rules by focusing on outcomes rather than prescribing exactly how they are achieved. Sets of rules used across multiple teams that heavy-handedly enforce how a specific problem should be solved tend to be brittle and inflexible to the specific needs of a team and any particular skillsets they might possess. Across Globality Engineering, there has been a real push towards autonomy as the only feasible way to relieve this particular growing pain.

As with any complex system, there is no silver-bullet solution. Kafka was never going to solve all our problems in one fell swoop. Whatever you build with, success ultimately comes down to mastery of your tools; mastery takes practice, and practice takes time.

To Infinity and Beyond!

Although we’ve faced many challenges introducing the technologies I’ve described here, we’re excited to have overcome them thus far, and we’ve learned a lot on this journey. We are miles ahead of where we were before in terms of our competency with Kafka-Streams. Previously, the service responsible for the Insights feature was quite literally DoSing other services. This is because of the sheer number of joins it was computing across multiple microservices, which could easily cause them to fall over. That hasn’t been the case with our new architecture; it has gone from a feature that was actively hitting scalability limits to one that is probably the most scalable on the platform.

Interested in joining us on our journey to solve problems such as the above? We are always on the look-out for talented engineers. You can find a list of our open positions here.

Author

Marshall Bradley

Marshall Bradley is a Senior Software Engineer at Globality, working on team Data-Pipeline. He has spent his career so far building software in a variety of different environments, from global companies tens of thousands strong (such as Expedia) down to tiny five-person startups (such as Blacksmiths Group). He's currently developing primarily in Java and Python, with a particular focus on the Kafka eco-system. After work he can sometimes be found at the local climbing gym, alongside other team members from Data-Pipeline.