
Lamport clocks and the Idempotent Producer pattern (Kafka)

Read our recap of the Devoxx talk "Distributed algorithms for Big Data, season 3: a history of time"

By Yannick Decourtray, Software Engineer
23/05/2018

Do you know about Lamport clocks? Devoxx France 2018 was an opportunity, during the very interesting talk by DuyHai DOAN, to discover or rediscover this algorithm, formalized by Leslie Lamport in 1978, more widely used today than ever in the field of distributed systems, and said to have inspired the Kafka developers in their implementation of the Idempotent Producer pattern.


Lamport clocks: what are they?

In a centralized system, there is only one process, and dating events is trivial: it is based on the physical clock of that single process, and ordering these events poses no problem. In other words, it is naturally possible to say which event occurred before another.

In a distributed system, however, there are several interconnected processes, each with its own physical clock. The dating of events is therefore relative to each process, and ordering events is no longer trivial. In other words, it is difficult to say with certainty that an event on one process occurred before an event on another process.

Lamport’s algorithm proposes a solution for establishing a partial order over a sequence of events in a distributed system, based on a mechanism of logical clocks. To what end? Well, in the field of distributed computing, a number of problems are solved by logically dating events across processes… We will see later the case of Kafka, which drew on Lamport’s clocks to solve a problem relating to the Exactly once delivery mode. But before that, let’s study the basic principle.


Lamport clocks: how do they work?

Imagine a system composed of n processes communicating with each other. If a local event occurs "at the same time" on process a and on process b, everyone understands that the two do not actually happen at the same instant. It is not possible to define a precedence relationship between these two events, and their exact order of execution is undefined from a logical point of view. These events are said to be concurrent, and they have no causal link. In fact, still from a logical point of view, the order in which these events execute does not matter.

On the other hand, if a transmits a message to b, this produces a send event on a which necessarily occurs before the corresponding receive event on b. These two events are then linked by a causal relationship, and their logical order is perfectly defined. By extension, any event on process a prior to the sending of that message is de facto also prior to its reception on b.

This is the basic postulate of Lamport.

In summary, we have three types of events:

  • Events local to a process,
  • Message send events,
  • Message receive events.

In order to logically date each of these events on each process, and to be able to order them together, Lamport proposes to use a simple counter on each process, which he calls the process Clock. The Clock of a process at any moment always holds the date of the last event that took place on that process. And when the system starts, the clocks of all processes are synchronized to 0.

When an event occurs, it is dated as follows:

  • If it is a local event, its logical date will be Clock + 1.
  • If it is a send event, its logical date will also be Clock + 1. This date is carried in the message, to help the receiver logically date its own receive event.
  • If it is a receive event, its logical date will be max(Clock, date carried by the message) + 1.
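
To make these three rules concrete, here is a minimal sketch of such a clock in Java (the class and method names are ours, for illustration; a real implementation would also have to handle concurrent access within a process):

    // Minimal Lamport clock sketch. One instance lives on each process;
    // all clocks start at 0 when the system starts.
    public final class LamportClock {

        private long clock = 0;

        // Local event: the date is Clock + 1.
        public long onLocalEvent() {
            return ++clock;
        }

        // Send event: the date is Clock + 1, and it is carried in the
        // message so the receiver can date its own receive event.
        public long onSend() {
            return ++clock;
        }

        // Receive event: the date is max(Clock, date carried by the message) + 1.
        public long onReceive(long dateInMessage) {
            clock = Math.max(clock, dateInMessage) + 1;
            return clock;
        }
    }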

And that’s all! With this algorithm, the logical date of every receive event is greater than the dates of all events preceding it, whether those events took place on the receiving process or on the sending process. Causal relationships are respected and ordering is possible.

…Partially…

Indeed, there is still the problem of the concurrent events mentioned above. Between two given processes, there are events that have the same logical date and yet did not occur at the same time. Since we want to establish a chronology of what happened on a distributed system, which priority should we apply to events with an identical logical date? To this thorny question, Lamport simply answers that events with the same logical date are concurrent and have no causal link between them, and therefore that the exact order of their execution does not affect the system.

Nevertheless, by convention, so that all processes share the same timeline, concurrent events are ordered using the process identifier. For example, we will always consider that event ‘2’ of process a occurred before event ‘2’ of process b, simply because a comes before b in the alphabet. Without further complication. Quite simply.
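
In code, this convention amounts to comparing events by logical date first, then by process identifier. A sketch with a comparator (the Event record and field names are ours, assuming each event carries both values):

    import java.util.Comparator;

    record Event(long logicalDate, String processId) {}

    // Total order: logical date first, process identifier as the tie-breaker.
    Comparator<Event> timeline =
            Comparator.comparingLong(Event::logicalDate)
                      .thenComparing(Event::processId);
    // With this ordering, new Event(2, "a") sorts before new Event(2, "b").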


Application case: Kafka and Exactly once delivery

Kafka is a highly popular distributed message broker, originally created by LinkedIn and now maintained by the Apache community, known for its performance and stability from low to very high volumes.

Exactly once delivery: what is it? Let’s first explain the default mode, which is At least once delivery.

Like any self-respecting message broker, Kafka must ensure that the messages it receives are delivered to its clients. A problem then arises: even if a message is dispatched by Kafka, how can it be sure that the client has received it? A simple acknowledgment from the client in the opposite direction will do the job very well, you say. But what if a network cut occurs at that very moment, or if the client crashes before being able to send this famous ACK?

Well, in a very traditional way for Kafka, any unacknowledged message is considered not received, and must be sent again. And this at the risk of the client actually receiving and processing the message twice (or more)… This is called At least once delivery. In other words, in this mode, the same message can be received several times by the client. And in this mode, therefore, the client must itself be "idempotent", i.e. insensitive to receiving the same event n times.

At least once delivery is a good start…

It turns out that Kafka offers even better: an Exactly once delivery mode. That is, a message will be delivered exactly once to each client. Guaranteed! Magic! This much more technical feature, considered a feat by some, rests on three concepts: an inter-partition transaction mechanism, a persisted offset system, and an idempotence pattern applied between Kafka broker instances and… the producers.

Complex? Do not panic: we will not dwell on the first two concepts, because they are beyond the scope of this article! It is the third that interests us here, and that we will detail below.
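
For reference, here is how these guarantees are typically switched on in the Java client configuration (these configuration keys exist in Kafka 0.11 and later; the property objects and the transactional id below are placeholders):

    // Producer side: idempotence, plus a transactional id for
    // the inter-partition transaction mechanism.
    producerProps.put("enable.idempotence", "true");
    producerProps.put("transactional.id", "my-app-producer-1");

    // Consumer side: only read messages from committed transactions.
    consumerProps.put("isolation.level", "read_committed");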

Indeed… let’s think for a moment. For a client to receive a given event only once, it must be certain that Kafka records it only once in one of its partitions. Although there are other issues to consider (addressed by the two other concepts mentioned above, which we will not come back to in this article), this one is the most fundamental. In Exactly once mode, an event sent by a Producer, even across multiple attempts, must be recorded once, and only once, by Kafka.

But Producers are also exposed to acknowledgment problems, this time on Kafka’s side (micro network cuts, the crash of a Kafka broker instance…). How can we guarantee, on the one hand, that an event will be recorded only once, and on the other hand that causality will be respected, that is to say that the order of events will be preserved, even if one of them is momentarily lost and then replayed?

This is where the Idempotent Producer pattern comes in. And it is strongly reminiscent of Lamport’s clocks…


The Idempotent Producer pattern

To understand what follows, you need to know a minimum about how Kafka works.

Kafka is a distributed publish/subscribe system, which organizes message families into Topics. A Topic is segmented into n partitions distributed across the cluster, and a node (called a broker) may manage several partitions. When a Producer sends a message to a given Topic, it is routed, depending on its key, to a particular node of the cluster, to be stored in a single partition. Messages with the same key will always be stored on the same partition, hence on the same Kafka node. Reread this paragraph carefully if it is not perfectly clear.
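
A minimal illustration of this key-based routing with the standard Kafka Java client (the broker address, topic name, keys, and values are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
        // Same key "user-42": both messages go to the same partition,
        // hence to the same broker, in send order.
        producer.send(new ProducerRecord<>("orders", "user-42", "order-created"));
        producer.send(new ProducerRecord<>("orders", "user-42", "order-paid"));
    }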

Each node keeps, for each Producer, a value called the High Watermark. It is the equivalent of the Clock in Lamport’s algorithm. In addition, each Producer maintains a separate sequence for each partition, to logically date its events.

Kafka is built for speed: Producers are designed to send bursts of messages without waiting for an acknowledgment between each message; the acknowledgments arrive asynchronously, as they come. But what happens if one of these acknowledgments is lost? The Producer will obviously retry sending the message concerned, but this must not introduce any inconsistency into the order of the messages actually recorded on the broker side.

Thanks to the Idempotent Producer algorithm, Kafka is able to detect these inconsistencies in the sequence of received messages, and thus to reject the messages that follow them. Specifically, each time a Producer publishes something, the message carries its own logical identifier, incremented one by one, and the broker will only record it if this identifier is equal to Watermark + 1.
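
A hedged sketch of that broker-side check (illustrative names only; the real Kafka implementation tracks sequences per producer id and per partition, and also handles producer epochs):

    // watermarks: last sequence number recorded for each producer.
    long watermark = watermarks.getOrDefault(producerId, 0L);

    if (message.sequence() == watermark + 1) {
        append(message);                                // expected next message: record it
        watermarks.put(producerId, message.sequence()); // advance the watermark
    } else if (message.sequence() <= watermark) {
        acknowledgeDuplicate(message);  // already recorded: ack again, do not re-append
    } else {
        rejectOutOfSequence(message);   // a gap: reject until the missing message is retried
    }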

Even if it is not an exact application, we find here the essence of the Lamport Clocks algorithm, even though this theory was born in 1978, at a time when few systems were distributed.


Conclusion

Amusingly, Confluent has published a 67-page document [1] explaining in great detail how Exactly once was implemented in Kafka. And while the notion of producer idempotence is mentioned repeatedly, a strict application of the Lamport Clocks algorithm for implementing this pattern is never mentioned. What should we conclude?

Well, maybe the Kafka developers did not have Lamport Clocks in mind when they designed their solution. Most of the problems we face today in the field of software architecture were probably all theorized years ago, at a time when infrastructures were much less complex. Even without systematic knowledge of these theories, common sense leads us to solutions that converge with them.

It would have been interesting for DuyHai DOAN, during his presentation at Devoxx, to point out that Kafka’s Idempotent Producer pattern may not be, as he implied, a strict and mechanical application of the Lamport Clocks algorithm, but rather the result of careful reflection that proves this theory correct and more relevant than ever.


DuyHai DOAN’s presentation, like this post, draws heavily on this article: https://cwiki.apache.org/confluence/display/KAFKA/Idempotent+Producer. The synthesis of the pattern given here is deliberately brief; we invite you to read that article if you need more technical details.

[1] https://docs.google.com/document/d/11Jqy_GjUGtdXJK94XGsEIK7CP1SnQGdp2eF0wSw9ra8/edit#

