20 Mar 2022

Apache Kafka® key concepts

A glossary of terms related to Apache Kafka®

crabby

Crabby

|

RSS Feed

Written by the Aiven team

Apache Kafka® key concepts illustration

An individual Apache Kafka server. This could be stand-alone, but is usually one of a larger cluster of nodes. The broker is the server that is running Apache Kafka itself, as opposed to any of the surrounding tools (e.g. Apache Kafka Connect).

An application which is reading data from Apache Kafka, and often acting upon it in some way. Even the various tooling that is often used with Apache Kafka is ultimately a simple producer or consumer when it’s actually communicating with Apache Kafka.

Much like how many Apache Kafka brokers are often clustered together to make a scalable platform, consumers will also usually need to scale beyond a single instance of an application. When multiple copies of an application are running, they somehow need to coordinate which instance is handling which messages - coordination is generally thought of as a hard problem to solve in software. Thankfully, Apache Kafka has a built-in concept of groups of consumers, which can be used to allocate different copies of an application to different partitions and ensure that workloads are spread evenly.

Application architecture built around

(record, message) A single discrete chunk of data, the smallest unit which can be produced onto or consumed from Apache Kafka. A message will always have a “value” (i.e. the body of the message, to use an email analogy) but will often also have a “key” (which should be a quick way to identify something which the message relates to) and “headers” (metadata about the message, similar to email headers, which could be e.g. “From” and “To” and “Date”)

see Broker

See Broker

see Event

When dealing with huge amounts of data, often a single server is not enough. This could either be because of total quantity (sometimes disks just aren’t big enough), or because there are so many applications that a single server would get bogged-down with handling all of them. Partitioning is how Apache Kafka splits-up a given topic across multiple servers, whilst always treating a given server as the “leader” of a partition (which is where producers send data for that partition). Partitions is fundamental to how data is “sharded” across multiple servers, and has some important upshots replication happens per partition, and preserving the order of messages is only guaranteed within a partition. When choosing which partition to publish to, an application will almost always use the message’s key (this is usually hidden inside the library), so care must be taken to make the key represent something which you want to keep in the same order (e.g. something like “customer ID” is probably a good choice, so all messages relevant to that customer will be kept in order). Changing the number of partitions may have consequences, as there are consequences both of increasing and decreasing the partition count, so care should be taken when creating a new topic to consider future needs.

An application which is writing data into Apache Kafka, and doesn’t care who or what is reading the data. The data could be well-structured, or simple strings of text, and will often have additional metadata with it to help describe the data itself.

A messaging architecture, where some applications are publishing messages which are then copied to any other listening applications. This is similar to a radio broadcast, where anyone who is tuned to the correct frequency will hear the audio being sent from a central point. Contrast this with a point-to-point architecture, where the producer and consumer would be directly coupled together and aware of each other. Often the subscribers would need to be listening at the exact moment that a message is published, although not with Apache Kafka, which is near-realtime.

A messaging architecture where messages are sent by a producer in a given order, and then at some point they are received by a consumer in the same order that they are sent. In some message brokers, once a message is consumed it is then removed and can never be consumed again, but this is not the case with Apache Kafka - instead, a watermark is kept for each consumer representing the most recent message being read. If the consumer gets interrupted or restarted then it simply picks up from where it left off. This “state” (i.e. “queue position”) can be kept within Apache Kafka itself.

see Event

When it comes to data, maximum risk happens when there’s only one copy of it. Built-in to Apache Kafka is the ability to replicate data across multiple servers, and keep track of which servers have which data, so that even if a server fails then the data is still preserved. This is configured per topic. Additional controls decide how many copies are in sync when producing messages.

Messages on Apache Kafka are organised into logical channels, or topics. An application will decide which topic it is producing onto or consuming from, and normally have expectations about the message depending on which topic is used. An example could be “sensor-readings” or “kubernetes-logs”. Topics will often be named something human-readable.

Related blogs

All things open source, plus our product updates and news in a monthly newsletter.

Subscribe to the Aiven newsletter

Loading...