美文网首页
Paxos 没有歧义的解释(尽量)

Paxos 没有歧义的解释(尽量)

作者: affe | 来源:发表于2020-03-20 09:24 被阅读0次

本文质量: 4 / 10(自己还没读过)
本文修改次数: 0
阅读时长: 30 min

前言

笔者之前为了学Java去看Cassandra源码,偶然之间看到了Paxos这样一个算法。出于好奇前前后后看了4,5遍,仍然感觉不得要领。最近又翻出来看感觉似乎有点懂了,于是打算写一篇博客讲讲Paxos。中文社区里的Paxos讲解几乎都不准确或者有瑕疵,我觉得最根本的原因是没有严格定义所用到的术语,我会严格按照我所定义的术语来讲解,并配上几个例子,希望最后能让我室友看懂Paxos到底是干嘛的。

不过如果只是为了学习共识算法,没有必要太纠结于Paxos。一是就算搞懂了Simple Paxos,也很难在工程上实现可行的Multi-Paxos,二是有Raft存在的世界里,我想不到Paxos优于Raft的理由。我强烈推荐读Raft的paper,超级容易懂,并且在工程上已经有很好的实现了(比如TiKV就在用的etcdraft,RocketMQ的Nameserver似乎也是用的Raft)

另外为了练习一下English Technical Writing(以及要交的一个作业),接下来内容会先用英语写,之后有空了会把中文版的放上来。刚开始联系英文写作肯定会有很多语法行文上不够地道的地方,还望读者谅解。
(推荐一下Google Technical Writing Course我觉得无论对中文还是英文技术文档写作都有很大的帮助,而且是免费的)

Before we begin

The audience is required to

  • Have Computer Science Background
  • Know what problem Paxos algorithm is trying to solve
  • Already read some related materials about Paxos but are still confusing

If you are new to Paxos, I would recommend covering the following materials first

  • Paxos Made Simple: an explanation by the Paxos algorithm author. This paper includes some proof, which I recommend skip for the first time
  • Wikipedia about Paxos : this material provides some examples

If you still don't understand or needs to verify your understanding, this article might help you.

What you will possibly get (not guaranteed)

  • Another explanation of Paxos, which might help you understand Paxos better
  • How a simple Paxos algorithm works with examples

Motivation and learning plan

Tons of blogs are trying to explain Paxos algorithm. But we all know it's difficult to understand or explain it. After reading 10+ blogs, I found most blogs (even the Paxos Made Simple itself) do not use terms consistently, or use ambiguous references (like this xxx, that node xxx). I believe by using terms consistently and eliminate ambiguity can make Paxos more understandable. And even if I'm wrong, people who understand Paxos can easily point out why or where I am wrong.

I will first describe the simple Paxos algorithm (so multi-Paxos not included in this article). This process is hard to understand so it's totally fine if you don't know "why he design it like this?". It's unnatural according to many people.

Then I will walk through some examples by applying the algorithm. You might need to go back to the paper or my description a couple of times before you understand it. Once you feel confident, you can try to the proof why Paxos is working mathematically(in the Paxos Made Simple paper). This is my recommended process that worked for me. Gl, hf!

Description of Paxos

A simple scenario problem

Let's assume in a cluster running Paxos algorithm, each node store a copy of a variable named tax. Initially tax was assigned a value 5%. Now we have a couple of nodes, some of the nodes want to set tax to a different value on all nodes. If there is only one node that wants to do so, things would be easy: the node simply sends the new value, let's say 10%, to all other nodes in the cluster, and other nodes will set the copy of tax on themselves to 10%. But if 2 node simultaneously wants to set tax to 2 different value on all nodes, let's say "10%" and "12%" respectively, problems arise.

Before we discuss what is the problem here, we first need to name the nodes that want to set tax value to a new value on all nodes and the nodes that accept the new value only.

  • We name node that wants to set tax value to a new value on all nodes proposer(s). Because a proposer node is "proposing" new value for tax, and by using the verb "proposing" we know this proposal could be accepted or rejected. We use p1, p2, ... to refer to proposer 1, proposer 2, ....

  • We name the other nodes that don't want to set tax to a new value acceptor(s). We use a1, a2, a3, ... to refer to acceptor 1, acceptor 2, acceptor 3, ...

In real life, a node can be both a proposer and an acceptor, but for simplicity, we assume a node can only be either a proposer or an acceptor. This trivial change won't affect the Paxos algorithm.

Let's go back to the problem we didn't discuss earlier. If 2 proposer wants to set tax to 2 different value on 5 acceptors , let's say 10% (from p1)and 12%(from p2). If we don't do anything, due to network uncertainty, the process would be like:

  • a1, a2, a3 receives 10% first, they set tax to 10%. a4, a5 receives 12% first, they set tax to 12%
  • then a1, a2, a3 receives 12%, they set tax to 12%.a4, a5 receives 10%, they set tax to 10%.
  • in the end, a1,a2,a3 have tax value set to 12% and a4, a5 have tax value set to 10%, this leads to inconsistency.
  • what's worse, even another proposer p3 sets the tax value to 15% eventually, it's still not acceptable because the tax history on a1,a2,a3 and a4,a5 are different. (If you don't understand why this is not acceptable, you can send me a private message to me to share your thoughts)

Here we introduce Paxos to solve the problem.

Paxos specification

Start a round of Paxos algorithm triggers 2 phases (and will result in one value be chosen, or no value are chosen in case of failure). The first phase is called prepare phase, the second phase is called accept phase. If a proposer wants to set tax to a new value, proposer creates a proposal. A proposal contains the value to be set, like 10%, and a global unique timestamp (we'll use ts for short).

Proposer: Prepare Phase

proposer :
rule 1. a proposer sends a prepare request message to a majority of the acceptors ( majority means more than half of total acceptors) bringing the proposal. proposal's ts is N, and value is v. Then proposer expects to receive prepare response messages from those acceptors
rule 2. if the proposer didn't receive prepare response messages from a majority of acceptors, the proposer abort its proposal.
rule 3. if the proposer receives prepare response messages from a majority of acceptors, it can proceed to accept phase. But before proceeding, the proposer will check every prepare response messages it received. A prepare response message contains multiple fields. Specifications on these fields will are in later sections, for now just skip it.
- a) if lastAcceptedProposalTimestamp field is empty in all prepare response message, proposer keeps the value of its proposal intact. Then proposer proceeds to accept phase
- b) if at least one of the prepare response message have a non-empty lastAcceptedProposalTimestamp field, proposer will find the prepare response message with the largest non-empty lastAcceptedProposalTimestamp field. This prepare response message contains a field lastAcceptedProposalValue, let's assums it's u. Then proposer will set the value of its proposal to from v to u. Finally proposer proceeds to accept phase

Proposer: Accept Phase

proposer

rule 4. a proposer sends an accept request to a majority of the acceptors (Notice this majority acceptors doesn't have to be the acceptors as in prepare phase step 1 ) bringing the proposal (noticed the value of this proposal might be modified as described in prepare phase step 3.b)
rule 5. if proposer receives more than half of accept response, then proposer wins. proposer will inform all acceptors that it wins and send the value to acceptors. So far we have one chosen value for all nodes in the cluster, we will use examples to illustrate why only one value can be chosen at this point.
rule 6. if proposer didn't receive more than half of accept response, it knows someone else probably won this round or no one wins, so it will wait until next round (with a new timestamp)

Acceptors

acceptor can receive prepare request and accept request. Each acceptor will keep some fields. These fields are set to empty initially.

  • lastPreparedProposalTimestamp :
  • lastacceptedProposalTimestamp :
  • lastAcceptedProposalValue

On receiving prepare request:
rule 1. acceptorchecks thetimestampincluded in theprepare request, let's sayN`, then

  • acceptor will compare N with lastPreparedProposalTimestamp, if N > lastPreparedProposalTimestamp, then acceptor will response, but it needs to check
  • a) if lastAcceptedProposalTimestamp is empty (haven't accept any), it will send prepare response to the proposer with lastacceptedProposalTimestamp set empty as well.
  • b) if lastAcceptedProposalTimestamp is not empty, it will accept this request then send a prepare response with lastAcceptedProposalTimestamp and lastAcceptedProposalValue

On receiving accept request:
rule 2. acceptor checks if the timestamp value N is larger than lastPreparedProposalTimestamp, if N < lastPreparedProposalTimestamp, the acceptor will either not response or response a reject message(it's trivial)
rule 3. Otherwise (N >= lastPreparedProposalTimestamp), acceptor will then check its own lastAcceptedProposalTimestamp and lastAcceptedProposalValue

  • a) if lastAcceptedProposalTimestamp is empty (means the acceptor hasn't accepted any response in this round), it must accept this proposal and send a accept response to the proposer.
  • b) if lastAcceptedProposalTimestamp is not empty, it will further compare the value in the proposal, let's say v, against the lastAcceptedProposalValue, let's say, z. If v == z, then it must accepts this proposal and send a accept response to the proposer. Otherwise, it either ignores the message or sends a reject message.

in acceptor rule 3, I used "must". Actually, if an acceptor is allowed to accept an accept request but rejects it, this behavior won't cause safety issues (having two value chosen). But this might stop the Paxos process(have no value chosen). If all healthy acceptors don't respond to or reject a "should be accepted" accept request, then they should be called "unhealthy" because they are being lazy !

When a round completes ( a single value was chosen in the cluster), the acceptors set all the fields to initial value so the next round can begin. Notice in a single round, an acceptor can accept multiple proposals according to the rule (as long as the proposals follow the rule)

Explanation of rules (TODO)

Some key points that worth notice.

  1. An acceptor only sends prepare response if the proposal ts is larger than the acceptor's lastPreparedProposalTimestamp. This ensures outdated proposals never proceed to accept phase.

  2. The weird "modify proposal value" in proposer rule 3-b. This happens when a proposerproposed a very new timestamp proposal, but there is already an on-going accept phase. proposer rule 3-b and acceptor rule 1-b is like an acceptor saying: "Hey, I know you have a very new timestamp value, and I can give you my prepare response, but there is already an accept phase voting process, and I've already accepted an accept request. So to make this round finishes as soon as possible, you should change your proposal value to the max of proposal values that are in accept phase. However since I don't know what values other acceptors accepted, I can only send you mine and you need to find out the max yourself." These two rules procedure is the key of Paxos

Examples (Not finished yet)

Suppose we have 2 proposers, p1 and p2, with 5 acceptors, a1, a2, a3, a4, a5. We use ts(p1) for timestamp of p1's proposal and value(p1) for value of p1's proposal. For example, if p1 wants to set tax to 20% at timestamp 1, then ts(p1) is 10, value(p1) is20%.

we will use

  • lastPrepareTs to represent lastPreparedProposalTimestamp
  • lastAcceptedTs to represent lastAcceptedProposalTimestamp
  • lastAcceptedValue to represent lastAcceptedProposalValue

Myth 1:
Can there be more than 1 proposers enter accept phase? Yes
Can there be more than 2 different value in accept phase? Yes
Example 1 :

  1. p1 sends a prepare request with ts(p1)timestamp = 1, value(p1) = 10%, and get 3 prepare response from a1,a2,a3, it can procced to accept phase but due to some latency (or network issue), it doesn't start sending accept request immediately.
acceptor lastPrepareTs lastAcceptedTs lastAcceptedValue
a1 1
a2 1
a3 1
a4
a5
  1. p2 sends a prepare request with ts(p2) = 2, value(p2) = 15%, and get 3 prepare response from a3,a4,a5. Notice a3 can send prepare response according to acceptor rule 1
acceptor lastPrepareTs lastAcceptedTs lastAcceptedValue
a1 1
a2 1
a3 2
a4 2
a5 2
  1. p1 start to sends a accept request with ts(p1) = 1, value(p1) = 10%, but only a1, a2 can give p1 a accept response, a3 cannot because of acceptor rule 1. Even a3 sent a prepare reponse to p1, that doesn't always mean a3 is a dedicated supporter of p1.
acceptor lastPrepareTs lastAcceptedTs lastAcceptedValue
a1 1 1 10%
a2 1 1 10%
a3 2
a4 2
a5 2
  1. p2 then starts sends a accept request with ts(p2) = 2, value(p2) = 15% to a3,a4,a5,and get 3 accept response, everything goes smoothly. p2's proposal won the round. tax is set to 15%

Example 2: p2 modified its proposal value

  1. p1 sends prepare requests with ts(p1)timestamp = 1, value(p1) = 10%, and get 3 prepare response from a1,a2,a3, it can procced to accept phase but due to some latency (or network issue), it doesn't start sending accept request immediately.
acceptor lastPrepareTs lastAcceptedTs lastAcceptedValue
a1 1
a2 1
a3 1
a4
a5
  1. p2 sends prepare requests with ts(p2) = 2, value(p2) = 15%, but only get 2 prepare response from a4,a5. Due to network issues, the prepare request send to a3 was delayed. (But it will be received by a3 sometime in the near future)
acceptor lastPrepareTs lastAcceptedTs lastAcceptedValue
a1 1
a2 1
a3 1
a4 2
a5 2
  1. p1 start to sends accept requests with ts(p1) = 1, value(p1) = 10%, this time a2, a3 give p1 accept response, due to network issues, a1 cannot receive the accept request of p1.
acceptor lastPrepareTs lastAcceptedTs lastAcceptedValue
a1 1
a2 1 1 10%
a3 1 1 10%
a4 2
a5 2
  1. the late prepare request of p2 finally arrives a3, a3 checks the ts of p2 proposal, and gives prepare request to p2, however, it also tells p2 that "I've already accepted a proposal, its ts is 1, and value is 10%".

  2. p2 receives prepare response message from a3, along with 2 other prepare response message from a4,a5. Then p2 follows proposer rule 3-b, sees a non-empty lastAcceptedTs field ifrom a3 response. Then p2 modifies its proposal value to 10%.

6.p2 then starts sends accept requests with ts(p2) = 2, value(p2) = 10% (value is no longer 15%) to a3,a4,a5, a3 receives the accept request, find 1) ts(p2) >= lastAcceptedTs 2) value(p2) == lastAcceptedValue meets the accept requirement. So a3 accept the request and send a accept response to p2. Then a4, a5 also accepts.

acceptor lastPrepareTs lastAcceptedTs lastAcceptedValue
a1 1
a2 1 1 10%
a3 2 2 10%
a4 2 2 10%
a5 2 2 10%
  1. p2 wins the round, and set tax to 10% on all acceptors. In this case, even p1 didn't win, but the value p1 wants to set finally wins. Paxos guarantees there is only one value in each round, but that value might be proposed by multiple proposers. Anyone of those proposers wins the round is ok.

Example 3 : The most interesting example. Will it be possible that a1, a2 accept the proposal with value = 10%, a3 accept the proposal with value = 15% and a4, a5 accepts the proposal with 20% ? Paxos can guarantee this won't happen. See examples,

  1. p1 sends prepare requests with ts(p1)timestamp = 1, value(p1) = 10%, and get 3 prepare response from a1,a2,a3, it can procced to accept phase but due to some latency (or network issue), it doesn't start sending accept request immediately.
acceptor lastPrepareTs lastAcceptedTs lastAcceptedValue
a1 1
a2 1
a3 1
a4
a5
  1. p2 sends prepare requests with ts(p2) = 2, value(p2) = 15%, and get 3 prepare response from a3,a4,a5.
acceptor lastPrepareTs lastAcceptedTs lastAcceptedValue
a1 1
a2 1
a3 2
a4 2
a5 2
  1. p1 start to send a accept requests with ts(p1) = 1, value(p1) = 10%, but only a1 received the message, the result would be
acceptor lastPrepareTs lastAcceptedTs lastAcceptedValue
a1 1 1 10%
a2 1
a3 2
a4 2
a5 2
  1. p2 start to send accept requests with ts(p2) = 2, value(p1) = 15%, but only a5 received the message, the result would be
acceptor lastPrepareTs lastAcceptedTs lastAcceptedValue
a1 1 1 10%
a2 1
a3 2
a4 2
a5 2 2 15%

Interesting things are coming. Suppose now we have a new proposer , p3 who wants to propose with ts(p3) = 3, value(p3) = 20%. And p3 just begin it's prepare phase.

Consider what nodes it sends to:

  1. If p3 send prepare requests to a1,a3,a4 or a2,a4,a5 or any majority of acceptors that includes a1 or a5, p3 would be forced to modify its value. If p3 really want's to win
    with p3 original value = 20%, it can only sends prepare request to a3,a4,a5, or the acceptors that haven't accepted any proposal

But will this always be possible? No

  • if a majority of acceptors already have accepted a proposal, according to Pigeonhole principle, p3 will at least received a non-empty lastAcceptedTs response from one of the acceptors, then p3 must change its value to lastAcceptedValue associated with the largest lastAcceptedTs. For example, if p3 send prepare requests to a1, a3, a5, because the response from a5 has the largest lastAcceptedTs = 2 , p3 set its value to 15%.
acceptor lastPrepareTs lastAcceptedTs lastAcceptedValue
a1 3 1 10%
a2 1
a3 3
a4 2
a5 3 2 15%

( because a1, a3, a4, a5 will reject p1' s proposal, p3 will eventually win)

If less than half of acceptors have accepted a proposal already, then p3 must be lucky enough to send prepare requests to a majority of acceptors, such as a2, a3, a4 in our case.

acceptor lastPrepareTs lastAcceptedTs lastAcceptedValue
a1 1 1 10%
a2 3
a3 3
a4 3
a5 2 2 15%

Luckily, a3, a4, a5 won't accept any accept request from p1 or p2 because a2,a3,a4 has a lastPrepareTs = 3, according to acceptor rule 2. Then the result would be

acceptor lastPrepareTs lastAcceptedTs lastAcceptedValue
a1 1 1 10%
a2 3 3 20%
a3 3 3 20%
a4 3 3 20%
a5 2 2 15%

So Paxos guarantees in a given round, and more than half of acceptors already accepted a proposal, then no new proposer can propose a new (new means never appeared in any previous proposals) value in accept phase. This property ensures there will always a value be chosen in each round.

Conclusion

Sure enough, we only discussed the most simple Paxos scenario. In our example, we assume a majority of acceptors have 3 acceptors, but 4, 5 are the same. We use number 3 because it reduces network traffic (we don't necessarily communicate all nodes to reach consensus, we only need just more than half). This is why some cluster running Paxos requires an odd number of nodes --- 4-nodes or 5-nodes cluster all need at least 3 nodes to form a majority.

In the case of node failures, Paxos also guarantees that, as long as at least more than half nodes are alive, Paxos can work properly. However, there are more complicated edge cases and performance issues related to Paxos. I'm not interested in digging too deep into those industrial rabbit holes. But understanding Paxos definitely helped me understand Raft better, which might be more useful if I want to find a distributed system-related job these days.

References

p.s. I just read another article about Paxos using the same term accept request and prepare request after I finish writing this blog. links here -> Paxos算法介绍—Basic Paxos. Great minds think alike! :XD

相关文章

网友评论

      本文标题:Paxos 没有歧义的解释(尽量)

      本文链接:https://www.haomeiwen.com/subject/amsdyhtx.html