ZK currentEpoch & acceptedEpoch

Author: Alen_ab56 | Published 2021-11-09 15:23

While running a multi-datacenter drill that switched Kafka over to a different ZooKeeper cluster, we found that when a zk node from the original cluster joined the new cluster, it failed with:
Leaders epoch, 6 is less than accepted epoch, 9

Checking the /data/zookeeper/data/version-2 directory confirmed that there are indeed two such files, acceptedEpoch and currentEpoch, and both contained the value 9.
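
Both files store the epoch as a plain decimal string, so they can be inspected directly. A minimal sketch (the data directory below is the one from this incident; adjust as needed):

import java.nio.file.Files;
import java.nio.file.Path;

// Print the two epoch files from a ZooKeeper data directory; each file
// holds a single decimal integer written as plain text.
public class EpochFiles {
    public static void main(String[] args) throws Exception {
        Path dir = Path.of("/data/zookeeper/data/version-2");
        for (String name : new String[] { "acceptedEpoch", "currentEpoch" }) {
            String value = Files.readString(dir.resolve(name)).trim();
            System.out.println(name + " = " + value);
        }
    }
}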

Why is that, and what are these two files for?
They record, respectively, the epoch number this server process has seen (acceptedEpoch) and the epoch it has participated in (currentEpoch). Although they contain no application-level data, they are important for data consistency, and they determine whether leader election in the cluster can succeed.

https://issues.apache.org/jira/browse/ZOOKEEPER-335?focusedCommentId=16975961&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16975961

These two variables exist mainly to handle the cluster failure-recovery scenario:
As mentioned, the implementation up to version 3.3.3 has not included epoch variables acceptedEpoch and currentEpoch. This omission has generated problems [5] (issue ZOOKEEPER-335 in Apache's issue tracking system) in a production version and was noticed by many ZooKeeper clients. The origin of this problem is at the beginning of Recovery Phase (Algorithm 4 line 2), when the leader increments its epoch (contained in lastZxid) even before acquiring a quorum of successfully connected followers (such leader is called false leader). Since a follower goes back to FLE if its epoch is larger than the leader's epoch (line 25), when a false leader drops leadership and becomes a follower of a leader from a previous epoch, it finds a smaller epoch (line 26) and goes back to FLE. This behavior can loop, switching from Recovery Phase to FLE.
Quoted from: http://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf

In short: acceptedEpoch and currentEpoch used not to be distinguished; the epoch was extracted directly from the high 32 bits of the zxid (see the packing sketch after this list). That leads to the following problem. Suppose there are three servers s1, s2, s3, where s1 and s2 are in contact, s1 is the leader, and s3 is LOOKING:

1. s2 restarts and, together with s3's own vote, elects s3 as the new leader.
2. s3 considers itself leader and increments its epoch, but cannot reach the other servers. Meanwhile, s1 still believes it is the leader (why is explained below).
3. s2 cannot reach s3 and, on receiving s1's LEADING notification, rejoins s1's old cluster.
4. s3 cannot reach anyone, gives up leadership, falls back to FLE, then receives messages from the old-cluster leader s1 and rejoins the old cluster as a follower.
5. As a follower, s3 finds that its own epoch is larger than the old leader's epoch, so it falls back to FLE again.

From then on, s3 keeps oscillating between steps 4 and 5, looping between the FLE phase and the Recovery phase.
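
For reference, here is how the epoch and counter are packed into a 64-bit zxid; this mirrors the helpers in ZooKeeper's ZxidUtils (the class below is a standalone sketch, not the real one):

// A zxid is a 64-bit value: the high 32 bits hold the epoch and the low
// 32 bits hold a per-epoch transaction counter.
public final class ZxidSketch {
    public static long epochOf(long zxid) {
        return zxid >> 32L;
    }

    public static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static long makeZxid(long epoch, long counter) {
        return (epoch << 32L) | (counter & 0xffffffffL);
    }
}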

As for why s1 still considers itself the leader: the leader has a grace period that keeps transient faults from ending its term.
This grace period is built on heartbeats: within the heartbeat window, leader s1 has not yet detected that the LearnerHandler threads serving s2 and s3 have died, so its leader state stays in effect. That state is only a flag, though, and cannot compromise writes, because a write must be acknowledged by more than half of the nodes, a requirement that cannot be met during this window.
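
As a rough illustration of that grace period (a hypothetical sketch, not ZooKeeper's actual Leader.lead() loop; all names here are invented), a leader that checks once per tick for a quorum of responsive learners might look like this:

import java.util.Set;

// The leader keeps its LEADING state as long as each tick still sees a
// quorum of live follower connections; one missed heartbeat from a single
// follower does not end the term.
class LeaderTickSketch {
    private final int clusterSize;
    private boolean leading = true;

    LeaderTickSketch(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    // called once per tick with the ids of followers that answered a ping
    void onTick(Set<Long> respondingFollowers) {
        // +1 counts the leader itself toward the quorum
        boolean hasQuorum = respondingFollowers.size() + 1 > clusterSize / 2;
        if (!hasQuorum) {
            leading = false; // step down and return to leader election
        }
    }

    boolean isLeading() {
        return leading;
    }
}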

So how do acceptedEpoch and currentEpoch fix the failure-recovery problem? The follower-side half of the answer is the check in the learner's registration path (ZooKeeper's Learner#registerWithLeader):
if (newEpoch > self.getAcceptedEpoch()) {
    wrappedEpochBytes.putInt((int) self.getCurrentEpoch());
    self.setAcceptedEpoch(newEpoch);
} else if (newEpoch == self.getAcceptedEpoch()) {
    // since we have already acked an epoch equal to the leaders, we cannot ack
    // again, but we still need to send our lastZxid to the leader so that we can
    // sync with it if it does assume leadership of the epoch.
    // the -1 indicates that this reply should not count as an ack for the new epoch
    wrappedEpochBytes.putInt(-1);
} else {
    throw new IOException("Leaders epoch, " + newEpoch
            + " is less than accepted epoch, " + self.getAcceptedEpoch());
}
The connection fails outright: a node whose persisted acceptedEpoch is greater than the epoch proposed by the leader is simply not allowed to join the cluster. This is exactly the error from the drill, where the rejoining node carried acceptedEpoch 9 against the new cluster's epoch 6.
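
The leader-side half is that a prospective leader no longer bumps its epoch unilaterally: it proposes newEpoch = max(acceptedEpoch over connected servers) + 1 and blocks until a quorum has checked in. A simplified sketch of that idea (modeled loosely on Leader#getEpochToPropose; names and details here are illustrative, not the actual implementation):

import java.util.HashSet;
import java.util.Set;

// The new epoch is only established once a quorum of servers (leader
// included) has connected, so a "false leader" can never advance the
// epoch on its own.
class EpochProposalSketch {
    private final int clusterSize;
    private final Set<Long> connected = new HashSet<>();
    private long proposedEpoch = -1;

    EpochProposalSketch(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    // called for the leader itself and for every learner that connects,
    // passing that server's persisted acceptedEpoch
    synchronized long epochToPropose(long sid, long lastAcceptedEpoch)
            throws InterruptedException {
        if (lastAcceptedEpoch >= proposedEpoch) {
            proposedEpoch = lastAcceptedEpoch + 1;
        }
        connected.add(sid);
        while (connected.size() <= clusterSize / 2) {
            wait(); // block until a majority has checked in
        }
        notifyAll();
        return proposedEpoch;
    }
}

Together with the follower-side check above, this breaks the FLE/Recovery loop: epochs only advance with quorum agreement, and any node whose acceptedEpoch is ahead of the leader's is rejected outright.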
