ZooKeeper leader election


1.ZooKeeper cluster

Check the cluster diagram:

1.jpg

As the diagram shows, there is only one leader in a ZooKeeper cluster. The leader node is the core of the cluster.

Generally, ZooKeeper provides three methods to elect a leader:

  • LeaderElection  

  • AuthFastLeaderElection

  • FastLeaderElection

The default method is FastLeaderElection, so we will introduce how FastLeaderElection works.

2.Election process

For example, suppose there are 5 servers with no transaction data. They are numbered 1, 2, 3, 4, and 5, and they are started in sequence according to their numbers. The election process is as follows:

  • Server 1 starts, votes for itself, and then sends its voting information. Since the other machines have not started yet, it receives no feedback. The status of server 1 is LOOKING (election status).

  • Server 2 starts, votes for itself, and exchanges results with server 1. Server 2 wins because its number is larger, but the number of votes is not yet more than half of the 5 servers, so both servers remain in the LOOKING state.

  • Server 3 starts, votes for itself, and exchanges information with servers 1 and 2, which are already running. Because server 3 has the largest number, server 3 wins. The number of votes (3 of 5) is now more than half, so server 3 becomes the leader and servers 1 and 2 become followers.

  • Server 4 starts, votes for itself, and exchanges information with servers 1, 2, and 3. Although server 4 has a larger number, server 3 is already the leader, so server 4 can only become a follower.

  • Server 5 starts and, like server 4, becomes a follower.
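The start-up sequence above can be replayed with a small simulation. This is only a sketch of the example in this section, not ZooKeeper code: the function name simulate_startup and the shape of its return value are made up for illustration, and it assumes every server has the same (empty) transaction data so that only the server number decides each exchange.

```python
def simulate_startup(total_servers, start_order):
    """Replay a sequential start-up where all servers hold no transaction data.

    Hypothetical helper for illustration; it only models the outcome of each
    exchange, not the real network protocol.
    """
    quorum = total_servers // 2 + 1   # strictly more than half of the cluster
    running = []                      # numbers of the servers started so far
    leader = None
    history = []                      # (started server, leader, states) per step

    for sid in start_order:
        running.append(sid)
        if leader is None and len(running) >= quorum:
            # With identical zxid values the largest server number wins,
            # and enough servers are now running to confirm the result.
            leader = max(running)

        states = {}
        for s in running:
            if leader is None:
                states[s] = "LOOKING"
            elif s == leader:
                states[s] = "LEADING"
            else:
                states[s] = "FOLLOWING"
        history.append((sid, leader, states))
    return history


for started, leader, states in simulate_startup(5, [1, 2, 3, 4, 5]):
    print(f"server {started} started -> leader={leader}, states={states}")
```

Servers 1 and 2 stay in LOOKING, server 3 becomes the leader as soon as three servers are running, and servers 4 and 5 join as followers.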

3.Key information in election

  • Serverid:

The number of the server. For example, if there are three servers, the numbers are 1, 2, and 3.

The greater the number, the greater the weight in the election algorithm.

  • Zxid:

The largest data ID (transaction ID) stored on the server.

The larger the zxid, the newer the data and the greater the weight in the election algorithm.

  • Epoch:

The logical clock. Its value is the same for all servers within one round of voting. It increases with each voting round and is then compared with the value in the voting information received from other servers.

  • Server status:

LOOKING, the node is looking for a leader (election status).

FOLLOWING, this node is a follower; it can vote in the next election.

OBSERVING, this node is an observer; it synchronizes state from the leader and takes no part in voting.

LEADING, this node is the leader.

After voting is completed, all of the voting information above needs to be sent to all servers in the cluster.
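The key information above can be pictured as one small record that each server sends to its peers. The sketch below is an illustration only: the names ServerState, Notification, and participates_in_voting are assumptions made for this example and do not mirror ZooKeeper's actual classes.

```python
from dataclasses import dataclass
from enum import Enum


class ServerState(Enum):
    LOOKING = "LOOKING"        # still looking for a leader
    FOLLOWING = "FOLLOWING"    # follower, can vote in the next election
    OBSERVING = "OBSERVING"    # syncs from the leader, never votes
    LEADING = "LEADING"        # the current leader


@dataclass(frozen=True)
class Notification:
    """Voting information exchanged between servers (hypothetical layout)."""
    sender_id: int             # Serverid of the server sending the message
    sender_state: ServerState  # state of the sender
    leader_id: int             # Serverid of the leader the sender votes for
    zxid: int                  # largest transaction ID known for that leader
    epoch: int                 # logical clock of the voting round


def participates_in_voting(state):
    # Per the list above, only observers are excluded from voting.
    return state is not ServerState.OBSERVING


first_vote = Notification(sender_id=1, sender_state=ServerState.LOOKING,
                          leader_id=1, zxid=0, epoch=1)
print(first_vote, participates_in_voting(first_vote.sender_state))
```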

4.Election in detail

When a node has the votes of more than half of the servers in the cluster, it becomes the leader.

1.First, the election phase starts and each server reads its own zxid.

2.Send voting information

    a. In the first round, each server votes for itself.

    b. Voting information includes the Serverid, Zxid, and Epoch of the elected leader. Epoch increases as the number of election rounds increases.

3.Receive voting information

If server B receives data from server A, and A is in the LOOKING state:

1)First, compare the logical clock values:

If the received logical clock Epoch is greater than the current logical clock, first update the logical clock Epoch and clear the election data collected from other servers. Then determine whether the current leader Serverid needs to be updated. The judgment rules are:

First compare zxid: the larger zxid wins. If the zxid values are equal, compare the leader Serverid: the larger Serverid wins. Then broadcast the latest election result (the three values mentioned above: leader Serverid, Zxid, Epoch) to the other servers.

If the received logical clock Epoch is less than the current logical clock, the other server is in an earlier Epoch. In this case B only needs to send its own three values (leader Serverid, Zxid, Epoch).

If the received logical clock Epoch is equal to the current logical clock, apply the judgment rules above to choose the leader, and then broadcast the latest election result (leader Serverid, Zxid, Epoch) to the other servers.
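The three logical clock cases can be sketched as follows. This is a simplified model of the rules as described in this article, not the FastLeaderElection source: Vote, Elector, beats, and on_looking_vote are hypothetical names, and the "reply" action assumes that B sends its data back to the lagging sender.

```python
from dataclasses import dataclass, field


@dataclass
class Vote:
    leader_id: int   # leader Serverid being voted for
    zxid: int        # zxid associated with that leader
    epoch: int       # logical clock of the voting round


@dataclass
class Elector:
    my_id: int
    vote: Vote                                   # current vote of this server
    recvset: dict = field(default_factory=dict)  # sender id -> Vote received


def beats(new, cur):
    """Judgment rule from the text: larger zxid wins, then larger leader Serverid."""
    return (new.zxid, new.leader_id) > (cur.zxid, cur.leader_id)


def on_looking_vote(me, sender_id, received):
    """Handle a vote from a LOOKING peer; return (action, vote to send)."""
    if received.epoch < me.vote.epoch:
        # The sender is in an earlier round: only send our own data.
        return "reply", me.vote

    if received.epoch > me.vote.epoch:
        # Newer round: adopt its logical clock and drop stale election data.
        me.recvset.clear()
        me.vote = Vote(me.vote.leader_id, me.vote.zxid, received.epoch)

    # Same round (or just caught up): apply the judgment rule.
    if beats(received, me.vote):
        me.vote = Vote(received.leader_id, received.zxid, received.epoch)
    me.recvset[sender_id] = received
    return "broadcast", me.vote


b = Elector(my_id=1, vote=Vote(leader_id=1, zxid=0, epoch=1))
print(on_looking_vote(b, 2, Vote(leader_id=2, zxid=0, epoch=1)))
```

In the usage line, both votes carry the same zxid and Epoch, so the larger Serverid wins and server 1 switches its vote to server 2.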

Secondly, check whether the server has collected the election status of all servers. If so, it sets its own role (FOLLOWING or LEADING) according to the election result and exits the election process.

Finally, if the election status of all servers has not been received, the server can still check whether the leader is supported by more than half of the servers. If it is, the server waits up to 200ms for further data. If no update arrives, every server has accepted this result, so the server sets its role and exits the election process.
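A minimal sketch of this finishing step is shown below, assuming the collected votes are kept as a mapping from sender Serverid to the leader Serverid it voted for (including the server's own vote). The helper names has_quorum, settle, and decide_role are illustrative, and a standard library queue stands in for the network inbox.

```python
import queue


def has_quorum(votes, proposed_leader, cluster_size):
    """True if strictly more than half of the cluster supports proposed_leader."""
    supporters = sum(1 for leader in votes.values() if leader == proposed_leader)
    return supporters > cluster_size // 2


def settle(inbox, wait_seconds=0.2):
    """Wait up to 200ms by default; True means no further update arrived."""
    try:
        inbox.get(timeout=wait_seconds)
    except queue.Empty:
        return True
    return False


def decide_role(my_id, proposed_leader):
    return "LEADING" if proposed_leader == my_id else "FOLLOWING"


# Server 1 has heard from servers 2 and 3; all three vote for server 3.
votes = {1: 3, 2: 3, 3: 3}
inbox = queue.Queue()   # empty: nobody sends a newer vote
if has_quorum(votes, 3, cluster_size=5) and settle(inbox):
    print("server 1 becomes", decide_role(1, 3))
```

A real implementation would re-examine any message that arrives during the wait instead of discarding it; the sketch only shows the timeout-based decision.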

If server A is in another state (FOLLOWING or LEADING):

a) If the logical clock Epoch is equal to the current logical clock, the data is saved to RECVSET. If server A claims to be the leader (it is already in the LEADING state), B checks whether more than half of the servers voted for it. If so, B sets its election status and exits the election process.

b) Otherwise, the message does not match the current logical clock, which means another election process has already produced a result. The result is added to the OUTOFELECTION set, and the server then checks, based on OUTOFELECTION, whether the election can be ended. If so, it saves the logical clock, sets its election status, and exits the election process.
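Cases a) and b) can be sketched together, since they differ only in which set receives the data. As before, this is a simplified illustration under the assumptions of this article: on_decided_peer and its parameters are hypothetical, and both sets are modeled as mappings from sender Serverid to the leader Serverid it reports.

```python
def on_decided_peer(my_id, my_epoch, cluster_size, recvset, outofelection,
                    sender_id, sender_epoch, leader_id):
    """Handle a message from a FOLLOWING or LEADING peer.

    Returns (role, epoch to save) when the election can end, otherwise None.
    """
    if sender_epoch == my_epoch:
        # Case a): same logical clock, record the data in RECVSET.
        recvset[sender_id] = leader_id
        votes = recvset
    else:
        # Case b): result of another election round, record it in OUTOFELECTION.
        outofelection[sender_id] = leader_id
        votes = outofelection

    supporters = sum(1 for chosen in votes.values() if chosen == leader_id)
    if supporters > cluster_size // 2:   # more than half back this leader
        role = "LEADING" if leader_id == my_id else "FOLLOWING"
        return role, sender_epoch
    return None


# Server 4 starts late (logical clock 1) and hears from servers 1, 2 and 3,
# which finished an earlier election (logical clock 3) with server 3 as leader.
recvset, outofelection = {}, {}
results = [on_decided_peer(4, 1, 5, recvset, outofelection, sid, 3, 3)
           for sid in (1, 2, 3)]
print(results)
```

After the third message, more than half of the 5 servers are known to follow server 3, so server 4 adopts their logical clock and becomes a follower, matching the start-up example earlier in the article.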