basic

一致性是分布式系统容错的基本问题。一致性涉及多个服务器状态（Values）达成一致。

首先需要明确的是一致性算法的目标是什么，主要面对的问题是在只使用单个服务器时由于发生错误导致数据丢失等事情发生。解决这个问题的思路也很简单，就是备份，将操作重复到多个机器上就不怕单个机器出错了。但随之而来的就是，数据不一致、乱序等问题，一致性算法想要做到的是即使有结点出错，对外仍是一个完整的可以正常工作的整体。

paxos很难理解，为了达到易于理解的目标，raft做了很多努力，其中最主要是两件事情：

问题分解
状态简化

问题分解是将"复制集中节点一致性"这个复杂的问题划分为数个可以被独立解释、理解、解决的子问题。在raft，子问题包括，leader election， log replication，safety，membership changes。而状态简化更好理解，就是对算法做出一些限制，减少需要考虑的状态数，使得算法更加清晰，更少的不确定性（比如，保证新选举出来的leader会包含所有commited log entry）。

Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.

上面的引文对raft协议的工作原理进行了高度的概括：raft会先选举出leader，leader完全负责replicated log的管理。leader负责接受所有客户端更新请求，然后复制到follower节点，并在“安全”的时候执行这些请求。如果leader故障，followes会重新选举出新的leader。

这就涉及到raft最新的两个子问题： leader election和log replication。

leader election

raft协议中，一个节点任一时刻处于以下三个状态之一：

leader
follower
candidate

可以看出所有节点启动时都是follower状态；在一段时间内如果没有收到来自leader的心跳，从follower切换到candidate，发起选举；如果收到majority的造成票（含自己的一票）则切换到leader状态；如果发现其他节点比自己更新，则主动切换到follower。

总之，系统中最多只有一个leader，如果在一段时间里发现没有leader，则大家通过选举-投票选出leader。leader会不停的给follower发心跳消息，表明自己的存活状态。如果leader故障，那么follower会转换成candidate，重新选出leader。

选举中两个超时

election timeout

The election timeout is the amount of time a follower waits until becoming a candidate.

The election timeout is randomized to be between 150ms and 300ms.

这个选举超时后，follower成为candidate，开始一个选举任期。先给自己投一票，向其他node请求投票。其他node收到后，如果还没有vote，则vote一次。candidate收到大多数投票后，成为leader。

log replication

当有了leader，系统应该进入对外工作期了。客户端的一切请求来发送到leader，leader来调度这些并发请求的顺序，并且保证leader与followers状态的一致性。raft中的做法是，将这些请求以及执行顺序告知followers。leader和followers以相同的顺序来执行这些请求，保证状态一致。

Replicated state machines

共识算法的实现一般是基于复制状态机（Replicated state machines），何为复制状态机：

f two identical, deterministic processes begin in the same state and get the same inputs in the same order, they will produce the same output and end in the same state.

简单来说：相同的初识状态 + 相同的输入 = 相同的结束状态。引文中有一个很重要的词deterministic，就是说不同节点要以相同且确定性的函数来处理输入，而不要引入一下不确定的值，比如本地时间等。如何保证所有节点 get the same inputs in the same order，使用replicated log是一个很不错的主意，log具有持久化、保序的特点，是大多数分布式系统的基石。

因此，可以这么说，在raft中，leader将客户端请求（command）封装到一个个log entry，将这些log entries复制（replicate）到所有follower节点，然后大家按相同顺序应用（apply）log entry中的command，则状态肯定是一致的。

请求完整流程

当系统（leader）收到一个来自客户端的写请求，到返回给客户端，整个过程从leader的视角来看会经历以下步骤：