raft study

作者: 帆子_8c3a | 来源:发表于2019-09-30 20:31 被阅读0次

raft study
深入剖析共识性算法 Raft
Raft算法可视化
raft算法笔记
fabric1.4.1新特性 — raft排序服务
Raft论文学习
2018-05-24RAFT分布式动画原理实现
分布式系统 - Raft一致性算法
RocketMQ主从切换
etcd raft 代码阅读

Paper

名词

replicated state machine
拜占庭问题
apply a log, commit a lo: commted的log最终会反应到每个server

和其他一致性算法的区别

Strong Leader
Leader election : 引用随机的timer,避免竞选冲突
Membership changes : cluster中添加、删除时候，使用new joint consensus来过渡

其他replicated state machine例子

Chubby
ZooKeeper

一致性算法需要满足的特点

Safty(永远不返回不一致结果): 在非拜占庭条件下，包括network delay,partitions等等
Available : 当多数server可用时，即可用

raft从3个角度分解一致性问题

Leader election
Log replication
Safty : 每个server run log以后一定保证状态机一致

三个状态

Leader
Follower
Candidate

image.png

Term

image.png

Term起到了logical clock的作用

Timeout

Election timout: 150~300ms, timeout后，从Follower=>Candidate
Heartbeat timeout

RPC

RequestVote
AppendEntries
InstallSnapshot

Leader election

Leader周期的发送不带log的AppendEntries RPC
Follower如果在Heartbeat timeout后没收到心跳，follower转换成candidate
1. term自增
2. 为自己投一票
3. 发送RequestVoteRPC给其他server
Follower遇到下列状况变换状态
1. 赢得竞选
2. 其他人赢得竞选
3. 一定时间内没有人赢得竞选

Follower赢得选票

Majority rule保证在某一个term中，最多有一个candidate赢得竞选

其他人赢得竞选

Follower接到>=的term的AppendEntries，立刻从candidate=>Follower
Follower接到<的term的AppendEntries，立刻reject请求

没有人赢得竞选

Election timout后，会开启新的一轮term,并发起竞选
Election timout对于每个server是随机值，避免冲突（150~300ms）

Raft guarantees

image.png

Log内容

状态机的更新值
Term号(用来检测是否一致)
每条Log以log index标识

image.png

Log replication

Client发到Leader端，leader先自己写log
Leader发送带log的AppendEntries
the leader applies the entry to its state machine and returns the result of that execution to the client

Leader决定何时apply log entry

这个log entry的状态被称作committed。
committed过的log entry，一定会反应到所有状态机上。
Leader创建的log entry一旦有majority的数量的复制，就可以commit了。
上图中log1~7都已经commmited了
leader记录着下一条要被commit的log entry,在以后的AppendEntries会带上这个值。
follower会根据这个值，来apply相应的log entry
commit会保证前面log entry也commit，如果不满足，先commit前面的log entry

Leader端保存的变量

自己下一个要commit的log的位置
其他Follower要复制的nextIndex

Log Matching Property

有相同的index和term的两个log entry,存储相同命令
有相同的index和term的两个log entry,这之前内容一定一样

image.png

AppendEntries Consistency Check

AppendEntries发送log的同时，会带着上一条log的index和term，如果follower不能找到index和term相同的log,则reject。reject后，Lead会试图发一个带着前一个log的AppendEntries，如果再失败，就再找更前一个log

image.png
Leader向follower发送AppendEntries返回成功，一定说明这个append操作log entry(包括之前的)一定严格和Leader的log一致。

如何解决log不一致

当leader没有复制完所有log entry时候，其他follower竞选成新的leader后，会产生log 不一致问题
Leader端为每个follower维护一个position,nextIndex。初始时候，每个follower的这个值设置为Leader的最新log entry position。每次发送AppendEntries的时候，用这个position的log entry。如果被reject了，则减一再试。
AppendEntries Consistency Check会保证AppendEntries的log entry(包括之前)一定和leader一致
Leader一定不会删除、修改自己的log，只会append log

Leader Completeness Property

Log一旦commit, 以后的Leader一定有这条log
有imcomplted log的Server不能当选Leader

Election restriction

Candidate发送RquestVote给其他server时候，会发送其最近的log，如果这个log比server的还晚，server会拒绝掉这个RPC
RquestVote会携带lastLogIndex，
先判断lastLogTerm，看谁的term最新。如果都一样，再继续看lastLogIndex，看谁的最长。
image.png

如何commit

对于log entry是自己任期内的term，达到majority就可以commit
对于log entry不是自己任期内的term，不去判断是否达到majority
当前任期的term的log entry被commit，一定保证以前任期的log entry被commit
每个server都会有 commitIndex,如果AppendEntries中的 leaderCommit > commitIndex，则更新commitIndex， set commitIndex =min(leaders commit, index of last new entry)
每个server都会有 lastApplied, Server会不停的判断commitIndex > lastApplied，一旦为真就去apply log