Apache Kafka: an Essential Overview

Written by Clifford

To use an old term to describe something relatively new, Apache Kafka is messaging middleware.

In performance tests it has handled two million writes per second.

In addition to being software, Kafka also defines its own wire protocol.

That protocol runs on top of TCP, so in OSI terms it sits at the application layer rather than the transport layer.

You can think of it as an extremely high-capacity syslog.

Kafka was written by LinkedIn and is now an open source Apache product.

They wrote it in Scala.

LinkedIn and other companies use Kafka to read data feeds and organize them into topics.

It is widely used today, because it scales almost without limit and is highly fault-tolerant.

As shown below, it is often used with other open source tools that are likewise very popular.

It sends its output on a socket just like syslog, using its own protocol on top of TCP.

So it would easily work as a syslog replacement.

Then you might figure out how it would fit with your other platform tools.

You can also output Kafka data to Hadoop or a data warehouse for analysis and reporting.

Lots of companies, like Pinterest, use Kafka for processing streams.

Later in this post we give examples of products that work with Kafka.

That is its real value: connecting it to something else.

This simple example uses stdin and stdout to illustrate.
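As a rough sketch of that idea, the Python below uses an in-memory list as a stand-in for a Kafka topic. No real broker or client library is involved, and the names ToyTopic, produce, and consume are made up for illustration; they are not part of any Kafka API. A producer reads lines from an input stream, appends them to the topic, and a consumer writes them to an output stream.

```python
import io
import sys


class ToyTopic:
    """In-memory stand-in for a Kafka topic: an append-only list of messages."""

    def __init__(self, name):
        self.name = name
        self.messages = []

    def append(self, message):
        """Store a message and return its position (offset) in the log."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read_from(self, offset):
        """Return every message at or after the given offset."""
        return self.messages[offset:]


def produce(source, topic):
    """Append each non-empty line of a text stream to the topic; return the count."""
    count = 0
    for line in source:
        line = line.rstrip("\n")
        if line:
            topic.append(line)
            count += 1
    return count


def consume(topic, sink, offset=0):
    """Write messages from `offset` onward to a text stream; return the next offset."""
    for message in topic.read_from(offset):
        sink.write(message + "\n")
    return len(topic.messages)


# Demo: a StringIO plays the role of stdin so the sketch runs unattended.
demo_topic = ToyTopic("test")
produce(io.StringIO("hello\nworld\n"), demo_topic)
consume(demo_topic, sys.stdout)
```

Passing sys.stdin to produce instead of the StringIO gives the interactive version: type a few lines, send end-of-file, and the consumer echoes them back. With a real Kafka installation the list would be replaced by a broker and the two functions by a producer and a consumer client.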

You can also use tail to write data to it using Tail2Kafka.

So you could, for example, use that to write Apache webserver logs to Kafka.
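The tailing idea can be sketched in plain Python. This is not Tail2Kafka's code, only a hypothetical illustration: read_new_lines and ship_log are invented names, and the send callback is a placeholder for whatever call a real Kafka client would use to publish a message.

```python
import os
import tempfile


def read_new_lines(path, position):
    """Return (lines, new_position) for complete lines added after byte `position`."""
    with open(path, "rb") as handle:
        handle.seek(position)
        chunk = handle.read()
    # Only hand over complete lines; a partial last line waits for the next poll.
    end = chunk.rfind(b"\n") + 1
    text = chunk[:end].decode("utf-8", errors="replace")
    lines = text.split("\n")[:-1]
    return lines, position + end


def ship_log(path, position, send):
    """Pass each new log line to `send` and return the updated position."""
    lines, position = read_new_lines(path, position)
    for line in lines:
        send(line)
    return position


# Demo with a throwaway file; `sent.append` stands in for a real producer call.
with tempfile.TemporaryDirectory() as folder:
    log_path = os.path.join(folder, "access.log")
    with open(log_path, "wb") as log:
        log.write(b"GET /index.html 200\nGET /missing 404\n")
    sent = []
    offset = ship_log(log_path, 0, sent.append)
    print(sent, offset)
```

Calling ship_log in a loop with the returned position, with a short sleep between polls, gives behaviour similar to tail -f. Keeping the byte position outside the function means a restart can resume where it left off, provided the position is saved somewhere.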

The full list of clients is here.

Products used with Kafka

Below are some other open source products that have been integrated with Kafka.

Storm

If you use Apache Storm, you will usually use Kafka as a component of that setup.

Storm is for processing streams.

Everyone from Yahoo to Twitter uses it.

Storm says it …does for real-time processing what Hadoop did for batch processing.

Storm produces its output in a graph.

Graphs are used to model many kinds of data, like the Friends relationships between Facebook users.

The Storm architecture is a kind of graph too, as shown in the illustration below.

Bolts aggregate data and run join operations on it. Spouts are the source.
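To make the two roles concrete, here is a toy pipeline in plain Python. It does not use Storm's API; the names sentence_spout, split_bolt, and count_bolt are invented for this sketch. A generator plays the spout, one function plays a bolt that transforms the stream, and another plays a bolt that aggregates it.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the source of the stream, emitting one item at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: transform each incoming sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream):
    """Bolt: aggregate the word stream into counts."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts


# Wire the pieces together: spout -> split bolt -> count bolt.
feed = ["Kafka feeds Storm", "Storm feeds Hadoop"]
totals = count_bolt(split_bolt(sentence_spout(feed)))
print(totals)
```

Chaining the functions this way mirrors how a topology connects a spout to a series of bolts. A real Storm topology runs these stages in parallel across machines and on an unbounded stream, which this single-process sketch does not attempt.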

If you turn Storm off, you lose the data.

That is because it is streaming data.

Usually people do not store streaming data, since its value is as a live feed used to run analytics.

Here is a graphic from Hortonworks showing how Kafka can sit in front of Storm and Hadoop.

In this case, Storm writes its data to Hadoop.

Camus/Hadoop

Camus is another open source product from LinkedIn.

It is used to dump Kafka data to Hadoop.

Logstash

Logstash is a tool for processing log events.

But it's more than that: it also feeds data into Elasticsearch.

Elasticsearch is the company behind three open source products: Elasticsearch, Logstash, and Kibana.

Elasticsearch is called a real-time distributed search and analytics engine.

It does full-text searches, aggregation, and analytics.

Wikipedia, for example, uses it for searches.

Elasticsearch is one of the consumers as well.

So is Apache Spark.

Spark is one of the consumers in the architecture above.

Spark does its processing in memory, which is the main reason it runs much faster.

So there you have a basic overview of Kafka.

Then you can see what it does and convince your management to use it elsewhere.
