The Raft protocol is divided into three main parts: Leader election, Log replication, and Safety.
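Raft divides server nodes into three roles: Leader, Candidate, and Follower. The coordinator is called the Leader, and the participants are called Followers. Compared with other protocols, the Leader in Raft is stronger, which shows in the following ways: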
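- The Leader is unique: there is at most one Leader per term.
- Log entries flow only from the Leader to the other servers. Followers never initiate requests; they only respond to requests from the Leader and from Candidates.
- Clients interact only with the Leader. If a client first connects to a Follower, the Follower forwards the request to the Leader.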
Raft is also distinctive in that it uses randomized timers for timeouts during Leader election. In addition, Raft provides a joint consensus algorithm to handle membership changes.
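The randomized timer can be illustrated with a minimal sketch. It assumes each node picks its election timeout uniformly from [base, 2*base), so that nodes rarely time out at the same moment and split votes become unlikely. The function names and the 150 ms base are hypothetical choices for this example, not etcd's API.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// newRNG returns a seeded random source so the demo is reproducible.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// randomizedElectionTimeout returns a timeout in [base, 2*base).
// The name and the range are assumptions made for this sketch.
// base must be positive, otherwise Int63n panics.
func randomizedElectionTimeout(base time.Duration, rng *rand.Rand) time.Duration {
	return base + time.Duration(rng.Int63n(int64(base)))
}

func main() {
	rng := newRNG(1)
	base := 150 * time.Millisecond
	for node := 1; node <= 3; node++ {
		fmt.Printf("node %d election timeout: %v\n", node, randomizedElectionTimeout(base, rng))
	}
}
```

A Follower that hears nothing from a Leader within its own timeout becomes a Candidate; because the timeouts differ, one node usually starts its election first and wins before the others time out.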
In raft, Progress represents the leader's view of each follower's progress. Three state types are used to track a follower.
// file: raft/tracker/state.go

// StateType is the state of a tracked follower.
type StateType uint64

const (
	// StateProbe indicates a follower whose last index isn't known. Such a
	// follower is "probed" (i.e. an append sent periodically) to narrow down
	// its last index. In the ideal (and common) case, only one round of probing
	// is necessary as the follower will react with a hint. Followers that are
	// probed over extended periods of time are often offline.
	StateProbe StateType = iota
	// StateReplicate is the state steady in which a follower eagerly receives
	// log entries to append to its log.
	StateReplicate
	// StateSnapshot indicates a follower that needs log entries not available
	// from the leader's Raft log. Such a follower needs a full snapshot to
	// return to StateReplicate.
	StateSnapshot
)
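To make the three states concrete, here is a simplified sketch of how a leader might move a follower between them. The `followerProgress` type and its methods are hypothetical stand-ins written for this example; etcd's real tracker has more fields and rules (flow control, pending snapshot index, and so on).

```go
package main

import "fmt"

type StateType uint64

const (
	StateProbe StateType = iota
	StateReplicate
	StateSnapshot
)

func (s StateType) String() string {
	return [...]string{"StateProbe", "StateReplicate", "StateSnapshot"}[s]
}

// followerProgress is a hypothetical, stripped-down view of one follower.
type followerProgress struct {
	State StateType
	Match uint64 // highest index known to be stored on the follower
	Next  uint64 // index of the next entry to send
}

// onAppendAccepted: the follower acknowledged entries up to index.
func (p *followerProgress) onAppendAccepted(index uint64) {
	if index > p.Match {
		p.Match = index
	}
	p.Next = p.Match + 1
	if p.State == StateProbe {
		p.State = StateReplicate // last index is now known
	}
}

// onAppendRejected: the follower rejected an append and hinted at its last index.
func (p *followerProgress) onAppendRejected(hint uint64) {
	p.State = StateProbe
	p.Next = hint + 1
	if p.Next <= p.Match {
		p.Next = p.Match + 1 // never go back below what is already matched
	}
}

// onNeedSnapshot: the entries the follower needs were compacted away.
func (p *followerProgress) onNeedSnapshot() { p.State = StateSnapshot }

// onSnapshotApplied: the follower installed a snapshot ending at index.
// This sketch returns to probing first; the next accepted append then
// moves the follower to StateReplicate.
func (p *followerProgress) onSnapshotApplied(index uint64) {
	p.Match, p.Next, p.State = index, index+1, StateProbe
}

func main() {
	p := &followerProgress{State: StateProbe, Next: 11}
	p.onAppendRejected(4)
	fmt.Println(p.State, p.Next)
	p.onAppendAccepted(5)
	fmt.Println(p.State, p.Match, p.Next)
	p.onNeedSnapshot()
	fmt.Println(p.State)
	p.onSnapshotApplied(20)
	fmt.Println(p.State, p.Next)
}
```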
Learners and Voters never overlap: a node belongs to one set or the other, never both.
Joint consensus
joint config
term
is a logical clock in Raft
quorum
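As a rough illustration of quorum and joint consensus, the sketch below counts votes against a majority and, for a joint configuration, requires a majority in both the old and the new voter sets. All function names are hypothetical. Treating an empty voter set as always satisfied is a convention adopted for this sketch so that a joint config with one empty half behaves like a plain majority config.

```go
package main

import "fmt"

// majority returns the quorum size for a voter set with n members.
func majority(n int) int { return n/2 + 1 }

// hasQuorum reports whether the granted votes reach a majority of voters.
// Convention for this sketch: an empty voter set is always satisfied.
func hasQuorum(voters []uint64, granted map[uint64]bool) bool {
	if len(voters) == 0 {
		return true
	}
	count := 0
	for _, id := range voters {
		if granted[id] {
			count++
		}
	}
	return count >= majority(len(voters))
}

// hasJointQuorum requires a majority in BOTH the old and the new config,
// which is the defining rule of joint consensus.
func hasJointQuorum(oldVoters, newVoters []uint64, granted map[uint64]bool) bool {
	return hasQuorum(oldVoters, granted) && hasQuorum(newVoters, granted)
}

func main() {
	oldCfg := []uint64{1, 2, 3}
	newCfg := []uint64{3, 4, 5}
	granted := map[uint64]bool{1: true, 2: true}
	fmt.Println("old only:", hasQuorum(oldCfg, granted))
	fmt.Println("joint:", hasJointQuorum(oldCfg, newCfg, granted))
	granted[4], granted[5] = true, true
	fmt.Println("joint after 4 and 5 agree:", hasJointQuorum(oldCfg, newCfg, granted))
}
```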
In the Raft protocol every node keeps a local log; etcd represents this local log with raftLog.
// file: raft/log.go

type raftLog struct {
	// storage contains all stable entries since the last snapshot.
	storage Storage

	// unstable contains all unstable entries and snapshot.
	// they will be saved into storage.
	unstable unstable

	// committed is the highest log position that is known to be in
	// stable storage on a quorum of nodes.
	committed uint64
	// applied is the highest log position that the application has
	// been instructed to apply to its state machine.
	// Invariant: applied <= committed
	applied uint64

	logger Logger

	// maxNextEntsSize is the maximum number aggregate byte size of the messages
	// returned from calls to nextEnts.
	maxNextEntsSize uint64
}
applied <= committed
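The invariant can be enforced with two small guards, sketched below. The `logCursor` type is hypothetical and tracks only the two indexes; this version returns an error where a real implementation might choose to panic.

```go
package main

import "fmt"

// logCursor is a hypothetical type holding just the two positions.
type logCursor struct {
	committed uint64
	applied   uint64
}

// commitTo advances committed; it never moves backwards.
func (l *logCursor) commitTo(i uint64) {
	if i > l.committed {
		l.committed = i
	}
}

// appliedTo advances applied, refusing anything that would break
// applied <= committed or move applied backwards.
func (l *logCursor) appliedTo(i uint64) error {
	if i > l.committed || i < l.applied {
		return fmt.Errorf("applied(%d) is out of range [prevApplied(%d), committed(%d)]",
			i, l.applied, l.committed)
	}
	l.applied = i
	return nil
}

func main() {
	var l logCursor
	l.commitTo(5)
	fmt.Println(l.appliedTo(3)) // allowed: 3 <= 5
	fmt.Println(l.appliedTo(6)) // rejected: 6 is not committed yet
}
```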
Deep Dive: etcd
Consensus and Quorum
Replicated state machine
Leader election
- Candidate, Follower, Leader
- Term
- Election
- Heartbeat
Log replication
- Only leader manages the replicated logs.
- The leader only appends to its log.
- Leader keeps trying to replicate its logs to followers.
- Committed index
- Applied index (never greater than the committed index)
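One way to picture how the committed index follows from replication: the leader tracks, per node, the highest index known to be stored there, and the committed index is the largest index present on a majority. The sketch below assumes that model; `committedIndex` is a hypothetical helper, and it deliberately omits Raft's extra rule that a leader only commits entries from its own term by counting replicas.

```go
package main

import (
	"fmt"
	"sort"
)

// committedIndex returns the largest index stored on a majority of nodes.
// match holds one entry per voting node, the leader included.
func committedIndex(match []uint64) uint64 {
	if len(match) == 0 {
		return 0
	}
	sorted := append([]uint64(nil), match...) // copy so the input is untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	// With n nodes the quorum is n/2+1, so the quorum-th largest value
	// sits at position n/2 in descending order.
	return sorted[len(sorted)/2]
}

func main() {
	fmt.Println(committedIndex([]uint64{5, 9, 7}))
	fmt.Println(committedIndex([]uint64{9, 7, 5, 3}))
}
```

With match indexes 5, 9, 7, index 7 is stored on two of three nodes, so it is the highest index that can be committed.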
Raft in etcd
Raft implementation
- Minimalistic design for flexibility, determinism, and performance
  - Raft package does not implement network transport between peers.
  - Raft package does not implement storage to persist log and state.
- Raft is modeled as a state machine
  - State
  - Input, output
  - Transition between states
Server's handling loop
for {
	select {
	...
	case rd := <-r.Ready():
		r.storage.Save(rd.HardState, rd.Entries, rd.Snapshot)
		r.transport.Send(rd.Messages)
		s.Apply(rd.CommittedEntries)
	....
	}
}
Request lifecycle
- Send the proposal to Raft: r.Propose(ctx, data)
- If successfully committed, the data will appear in rd.CommittedEntries
- Apply committed entries to MVCC
- Return apply result to client
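The lifecycle above can be traced with a toy, single-node model. Everything in it (`toyNode`, `ready`, `server`, the "key=value" encoding) is hypothetical and only mimics the shape of the loop: persist new entries first, then apply entries that are already committed. It does not use the real etcd raft package, and it assumes that in a one-node cluster an entry counts as committed once it has been persisted.

```go
package main

import (
	"fmt"
	"strings"
)

// ready mimics the shape of rd in the handling loop (hypothetical type).
type ready struct {
	Entries          []string // new entries that must be persisted
	CommittedEntries []string // entries that are safe to apply
}

// toyNode is a one-node stand-in for Raft.
type toyNode struct {
	unstable  []string // proposed, not yet handed out for persistence
	committed []string // persisted by the one-node quorum, not yet applied
}

func (n *toyNode) Propose(data string) { n.unstable = append(n.unstable, data) }

// Ready hands out pending work. Entries returned in one batch come back
// as CommittedEntries in the next, once the caller has persisted them.
func (n *toyNode) Ready() ready {
	rd := ready{Entries: n.unstable, CommittedEntries: n.committed}
	n.committed, n.unstable = n.unstable, nil
	return rd
}

type server struct {
	node toyNode
	wal  []string          // stand-in for stable storage
	kv   map[string]string // stand-in for the MVCC store
}

func newServer() *server { return &server{kv: map[string]string{}} }

func (s *server) handleReady() {
	rd := s.node.Ready()
	s.wal = append(s.wal, rd.Entries...) // 1. persist before anything else
	// 2. send messages to peers (none in this toy)
	for _, e := range rd.CommittedEntries { // 3. apply committed entries
		if k, v, ok := strings.Cut(e, "="); ok {
			s.kv[k] = v
		}
	}
}

func main() {
	s := newServer()
	s.node.Propose("foo=bar")
	s.handleReady()
	fmt.Println("after first Ready: persisted =", len(s.wal), "applied =", len(s.kv))
	s.handleReady()
	fmt.Println("after second Ready: foo =", s.kv["foo"])
}
```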
Add/Remove a node
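When the Leader receives a Configuration Change message, it sends the new configuration (called C-new below; the old one is C-old) to the other Followers as a special Raft entry. Any node that receives this entry starts using C-new immediately, and once the C-new entry is committed the Configuration Change is complete. TiKV and etcd, however, do not work this way: a node learns the latest configuration only after the C-new entry has been committed and applied. This approach is simpler, but a few points need attention: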
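- While the log contains a Configuration Change that has not yet been committed, new Configuration Change requests are not accepted, mainly to prevent a situation with multiple Leaders.
- If there are only two nodes and one of them must be removed, and the other node goes down after the Leader issues the command, the system cannot recover.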
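The first point can be sketched as a guard on the leader: remember the log index of the latest configuration change and refuse another one until that index has been applied. The `leader` type, its fields, and the error are hypothetical names for this example. It checks against the applied index because, as described above, nodes learn a new configuration only after the entry is applied.

```go
package main

import (
	"errors"
	"fmt"
)

var errConfChangePending = errors.New("a configuration change is still pending")

// leader is a hypothetical, minimal view of the leader's bookkeeping.
type leader struct {
	lastIndex        uint64 // index of the last entry in the log
	applied          uint64 // highest applied index
	pendingConfIndex uint64 // index of the latest configuration change entry
}

// proposeConfChange appends a configuration change entry unless an earlier
// one has not been applied yet.
func (l *leader) proposeConfChange() (uint64, error) {
	if l.pendingConfIndex > l.applied {
		return 0, errConfChangePending
	}
	l.lastIndex++
	l.pendingConfIndex = l.lastIndex
	return l.lastIndex, nil
}

// applyTo records that entries up to i have been applied.
func (l *leader) applyTo(i uint64) {
	if i > l.applied && i <= l.lastIndex {
		l.applied = i
	}
}

func main() {
	l := &leader{}
	idx, err := l.proposeConfChange()
	fmt.Println(idx, err)
	_, err = l.proposeConfChange()
	fmt.Println(err) // refused while the first change is pending
	l.applyTo(idx)
	idx, err = l.proposeConfChange()
	fmt.Println(idx, err)
}
```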
WAL
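To keep data safe (recoverable after a crash or an outage), storage systems use a WAL (write-ahead log), and etcd is no exception. Every transaction in etcd (that is, every write operation) is first written to the log file before it takes effect.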
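The write-ahead idea can be sketched in a few lines: append the record to a file and sync it before touching the in-memory state, and rebuild the state on startup by replaying the file. The `store` type and the one-line "key=value" record format are assumptions for this example and are unrelated to etcd's actual WAL format; keys containing "=" or newlines are not handled.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

type store struct {
	wal *os.File
	kv  map[string]string
}

// openStore replays an existing log (if any) and opens it for appending.
func openStore(path string) (*store, error) {
	s := &store{kv: map[string]string{}}
	if f, err := os.Open(path); err == nil {
		sc := bufio.NewScanner(f)
		for sc.Scan() { // replay: re-apply every logged write in order
			if k, v, ok := strings.Cut(sc.Text(), "="); ok {
				s.kv[k] = v
			}
		}
		f.Close()
		if err := sc.Err(); err != nil {
			return nil, err
		}
	} else if !os.IsNotExist(err) {
		return nil, err
	}
	wal, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		return nil, err
	}
	s.wal = wal
	return s, nil
}

// put logs the write and syncs it to disk BEFORE updating memory.
func (s *store) put(key, value string) error {
	if _, err := fmt.Fprintf(s.wal, "%s=%s\n", key, value); err != nil {
		return err
	}
	if err := s.wal.Sync(); err != nil {
		return err
	}
	s.kv[key] = value
	return nil
}

func (s *store) Close() error { return s.wal.Close() }

// tempPath returns a fresh log path in a temporary directory (demo helper).
func tempPath() string {
	dir, err := os.MkdirTemp("", "wal-demo")
	if err != nil {
		panic(err)
	}
	return filepath.Join(dir, "wal.log")
}

func main() {
	path := tempPath()
	s, err := openStore(path)
	if err != nil {
		panic(err)
	}
	if err := s.put("foo", "bar"); err != nil {
		panic(err)
	}
	s.Close()

	recovered, err := openStore(path) // simulates a restart
	if err != nil {
		panic(err)
	}
	defer recovered.Close()
	fmt.Println("recovered foo =", recovered.kv["foo"])
}
```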
Snapshot
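As a highly available KV store, etcd cannot rely on log replay alone to recover data. It therefore also provides snapshots. A snapshot periodically saves the entire database into a separate snapshot file. This both shortens log replay time and reduces the amount of WAL that must be kept, because WAL entries older than the snapshot can be deleted.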
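A small in-memory sketch shows why a snapshot shortens replay: after the snapshot is taken, the log before it is dropped, and recovery only replays entries written since. The `db`, `entry`, and `snapshot` types are hypothetical and keep everything in memory; a real system writes both the log and the snapshot to disk.

```go
package main

import "fmt"

type entry struct {
	Index      uint64
	Key, Value string
}

type snapshot struct {
	Index uint64            // last log index covered by the snapshot
	Data  map[string]string // full copy of the state at Index
}

type db struct {
	snap snapshot
	log  []entry // only entries with Index > snap.Index
}

func (d *db) lastIndex() uint64 {
	if n := len(d.log); n > 0 {
		return d.log[n-1].Index
	}
	return d.snap.Index
}

func (d *db) put(key, value string) {
	d.log = append(d.log, entry{Index: d.lastIndex() + 1, Key: key, Value: value})
}

// restore rebuilds the state: load the snapshot, then replay the remaining
// log. It also returns how many entries had to be replayed.
func (d *db) restore() (map[string]string, int) {
	state := make(map[string]string, len(d.snap.Data))
	for k, v := range d.snap.Data {
		state[k] = v
	}
	for _, e := range d.log {
		state[e.Key] = e.Value
	}
	return state, len(d.log)
}

// takeSnapshot saves the full state and discards the log it covers.
func (d *db) takeSnapshot() {
	state, _ := d.restore()
	d.snap = snapshot{Index: d.lastIndex(), Data: state}
	d.log = nil
}

func main() {
	d := &db{}
	d.put("a", "1")
	d.put("b", "2")
	d.put("a", "3")
	_, replayed := d.restore()
	fmt.Println("entries replayed before snapshot:", replayed)

	d.takeSnapshot()
	d.put("c", "4")
	state, replayed := d.restore()
	fmt.Println("entries replayed after snapshot:", replayed, "state size:", len(state))
}
```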
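Suppose there are 3 nodes and a new node joins. If another Follower goes down while the Leader is sending a snapshot to the new Follower, the whole system cannot make progress: with 4 voting members a quorum is 3, but only 2 nodes can acknowledge new entries. It recovers only after the new Follower has received the complete snapshot. To solve this, we can introduce the Learner state: a newly added Learner node does not count toward the quorum and cannot vote. Only after the Leader confirms that the Learner has finished receiving the snapshot and can replicate the Raft log normally does it consider promoting the Learner to a regular voting node.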
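The arithmetic behind this scenario can be checked with a short sketch. It assumes a node can acknowledge new entries only if it is up and caught up, and that only voters count toward the quorum. The `member` type and `canCommit` are hypothetical names for this example.

```go
package main

import "fmt"

// member is a hypothetical description of one cluster node.
type member struct {
	name      string
	voter     bool // false means Learner
	available bool // up and caught up, so it can acknowledge new entries
}

// canCommit reports whether the available voters form a majority of all voters.
func canCommit(members []member) bool {
	voters, ready := 0, 0
	for _, m := range members {
		if !m.voter {
			continue // Learners never count toward the quorum
		}
		voters++
		if m.available {
			ready++
		}
	}
	return ready >= voters/2+1
}

func main() {
	// D has just joined and is still receiving a snapshot; C has crashed.
	asVoter := []member{
		{"A", true, true}, {"B", true, true}, {"C", true, false}, {"D", true, false},
	}
	asLearner := []member{
		{"A", true, true}, {"B", true, true}, {"C", true, false}, {"D", false, false},
	}
	fmt.Println("D added as voter, can commit:", canCommit(asVoter))
	fmt.Println("D added as learner, can commit:", canCommit(asLearner))
}
```

With D as a voter, 2 of 4 voters are available and the quorum of 3 is not met. With D as a Learner, 2 of 3 voters are available, which is a quorum, so the cluster keeps working while D catches up.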