本文质量: 4 / 10(自己还没读过)
本文修改次数: 0
阅读时长: 30 min
前言
笔者之前为了学Java去看Cassandra源码,偶然之间看到了Paxos这样一个算法。出于好奇前前后后看了4,5遍,仍然感觉不得要领。最近又翻出来看感觉似乎有点懂了,于是打算写一篇博客讲讲Paxos。中文社区里的Paxos讲解几乎都不准确或者有瑕疵,我觉得最根本的原因是没有严格定义所用到的术语,我会严格按照我所定义的术语来讲解,并配上几个例子,希望最后能让我室友看懂Paxos到底是干嘛的。
不过如果只是为了学习共识算法,没有必要太纠结于Paxos。一是就算搞懂了Simple Paxos,也很难在工程上实现可行的Multi-Paxos,二是有Raft存在的世界里,我想不到Paxos优于Raft的理由。我强烈推荐读Raft的paper,超级容易懂,并且在工程上已经有很好的实现了(比如TiKV就在用的etcdraft,RocketMQ的Nameserver似乎也是用的Raft)
另外为了练习一下English Technical Writing(以及要交的一个作业),接下来内容会先用英语写,之后有空了会把中文版的放上来。刚开始联系英文写作肯定会有很多语法行文上不够地道的地方,还望读者谅解。
(推荐一下Google Technical Writing Course我觉得无论对中文还是英文技术文档写作都有很大的帮助,而且是免费的)
Before we begin
The audience is required to
- Have Computer Science Background
- Know what problem Paxos algorithm is trying to solve
- Already read some related materials about Paxos but are still confusing
If you are new to Paxos, I would recommend covering the following materials first
- Paxos Made Simple: an explanation by the Paxos algorithm author. This paper includes some proof, which I recommend skip for the first time
- Wikipedia about Paxos : this material provides some examples
If you still don't understand or needs to verify your understanding, this article might help you.
What you will possibly get (not guaranteed)
- Another explanation of Paxos, which might help you understand Paxos better
- How a simple Paxos algorithm works with examples
Motivation and learning plan
Tons of blogs are trying to explain Paxos algorithm. But we all know it's difficult to understand or explain it. After reading 10+ blogs, I found most blogs (even the Paxos Made Simple itself) do not use terms consistently, or use ambiguous references (like this xxx, that node xxx). I believe by using terms consistently and eliminate ambiguity can make Paxos more understandable. And even if I'm wrong, people who understand Paxos can easily point out why or where I am wrong.
I will first describe the simple Paxos algorithm (so multi-Paxos not included in this article). This process is hard to understand so it's totally fine if you don't know "why he design it like this?". It's unnatural according to many people.
Then I will walk through some examples by applying the algorithm. You might need to go back to the paper or my description a couple of times before you understand it. Once you feel confident, you can try to the proof why Paxos is working mathematically(in the Paxos Made Simple paper). This is my recommended process that worked for me. Gl, hf!
Description of Paxos
A simple scenario problem
Let's assume in a cluster running Paxos algorithm, each node store a copy of a variable named tax
. Initially tax
was assigned a value 5%
. Now we have a couple of nodes, some of the nodes want to set tax
to a different value on all nodes. If there is only one node that wants to do so, things would be easy: the node simply sends the new value, let's say 10%
, to all other nodes in the cluster, and other nodes will set the copy of tax
on themselves to 10%
. But if 2 node simultaneously wants to set tax
to 2 different value on all nodes, let's say "10%" and "12%" respectively, problems arise.
Before we discuss what is the problem here, we first need to name the nodes that want to set tax
value to a new value on all nodes and the nodes that accept the new value only.
-
We name node that wants to set
tax
value to a new value on all nodesproposer(s)
. Because aproposer
node is "proposing" new value fortax
, and by using the verb "proposing" we know this proposal could be accepted or rejected. We usep1, p2, ...
to refer toproposer 1, proposer 2, ...
. -
We name the other nodes that don't want to set
tax
to a new valueacceptor(s)
. We usea1, a2, a3, ...
to refer toacceptor 1, acceptor 2, acceptor 3, ...
In real life, a node can be both a proposer
and an acceptor
, but for simplicity, we assume a node can only be either a proposer
or an acceptor
. This trivial change won't affect the Paxos algorithm.
Let's go back to the problem we didn't discuss earlier. If 2 proposer
wants to set tax
to 2 different value on 5 acceptor
s , let's say 10%
(from p1
)and 12%
(from p2
). If we don't do anything, due to network uncertainty, the process would be like:
-
a1, a2, a3
receives10%
first, they settax
to10%
.a4, a5
receives12%
first, they settax
to12%
- then
a1, a2, a3
receives12%
, they settax
to12%
.a4, a5
receives10%
, they settax
to10%
. - in the end,
a1,a2,a3
havetax
value set to12%
anda4, a5
havetax
value set to10%
, this leads to inconsistency. - what's worse, even another proposer
p3
sets thetax
value to15%
eventually, it's still not acceptable because thetax
history ona1,a2,a3
anda4,a5
are different. (If you don't understand why this is not acceptable, you can send me a private message to me to share your thoughts)
Here we introduce Paxos to solve the problem.
Paxos specification
Start a round of Paxos algorithm triggers 2 phases (and will result in one value be chosen, or no value are chosen in case of failure). The first phase is called prepare phase
, the second phase is called accept phase
. If a proposer
wants to set tax
to a new value, proposer
creates a proposal
. A proposal
contains the value
to be set, like 10%
, and a global unique timestamp
(we'll use ts
for short).
Proposer: Prepare Phase
proposer
:
rule 1. a proposer
sends a prepare request
message to a majority of the acceptors
( majority means more than half of total acceptors
) bringing the proposal
. proposal
's ts
is N
, and value
is v
. Then proposer
expects to receive prepare response
messages from those acceptors
rule 2. if the proposer
didn't receive prepare response
messages from a majority of acceptors
, the proposer
abort its proposal
.
rule 3. if the proposer
receives prepare response
messages from a majority of acceptors
, it can proceed to accept phase
. But before proceeding, the proposer
will check every prepare response
messages it received. A prepare response
message contains multiple fields. Specifications on these fields will are in later sections, for now just skip it.
- a) if lastAcceptedProposalTimestamp
field is empty in all prepare response
message, proposer
keeps the value
of its proposal
intact. Then proposer
proceeds to accept phase
- b) if at least one of the prepare response
message have a non-empty lastAcceptedProposalTimestamp
field, proposer
will find the prepare response
message with the largest non-empty lastAcceptedProposalTimestamp
field. This prepare response
message contains a field lastAcceptedProposalValue
, let's assums it's u
. Then proposer
will set the value
of its proposal
to from v
to u
. Finally proposer
proceeds to accept phase
Proposer: Accept Phase
proposer
rule 4. a proposer
sends an accept request
to a majority of the acceptors
(Notice this majority acceptors
doesn't have to be the acceptors
as in prepare phase
step 1 ) bringing the proposal (noticed the value
of this proposal
might be modified as described in prepare phase
step 3.b)
rule 5. if proposer
receives more than half of accept response
, then proposer
wins. proposer
will inform all acceptors
that it wins and send the value
to acceptors
. So far we have one chosen value for all nodes in the cluster, we will use examples to illustrate why only one value can be chosen at this point.
rule 6. if proposer
didn't receive more than half of accept response
, it knows someone else probably won this round or no one wins, so it will wait until next round (with a new timestamp
)
Acceptors
acceptor
can receive prepare request
and accept request
. Each acceptor
will keep some fields. These fields are set to empty initially.
-
lastPreparedProposalTimestamp
: -
lastacceptedProposalTimestamp
: lastAcceptedProposalValue
On receiving prepare request
:
rule 1. acceptorchecks the
timestampincluded in the
prepare request, let's say
N`, then
-
acceptor
will compareN
withlastPreparedProposalTimestamp
, ifN
>lastPreparedProposalTimestamp
, thenacceptor
will response, but it needs to check - a) if
lastAcceptedProposalTimestamp
is empty (haven't accept any), it will sendprepare response
to theproposer
withlastacceptedProposalTimestamp
set empty as well. - b) if
lastAcceptedProposalTimestamp
is not empty, it willaccept
this request then send aprepare response
withlastAcceptedProposalTimestamp
andlastAcceptedProposalValue
On receiving accept request
:
rule 2. acceptor
checks if the timestamp
value N is larger than lastPreparedProposalTimestamp
, if N < lastPreparedProposalTimestamp
, the acceptor
will either not response or response a reject message(it's trivial)
rule 3. Otherwise (N >= lastPreparedProposalTimestamp
), acceptor
will then check its own lastAcceptedProposalTimestamp
and lastAcceptedProposalValue
- a) if
lastAcceptedProposalTimestamp
is empty (means theacceptor
hasn't accepted any response in this round), it mustaccept
thisproposal
and send aaccept response
to the proposer. - b) if
lastAcceptedProposalTimestamp
is not empty, it will further compare thevalue
in theproposal
, let's sayv
, against thelastAcceptedProposalValue
, let's say,z
. Ifv == z
, then it mustaccepts
thisproposal
and send aaccept response
to the proposer. Otherwise, it either ignores the message or sends a reject message.
in acceptor rule 3, I used "must". Actually, if an
acceptor
is allowed to accept anaccept request
but rejects it, this behavior won't cause safety issues (having two value chosen). But this might stop the Paxos process(have no value chosen). If all healthyacceptors
don't respond to or reject a "should be accepted"accept request
, then they should be called "unhealthy" because they are being lazy !
When a round completes ( a single value was chosen in the cluster), the acceptors
set all the fields to initial value so the next round can begin. Notice in a single round, an acceptor
can accept multiple proposals according to the rule (as long as the proposals
follow the rule)
Explanation of rules (TODO)
Some key points that worth notice.
-
An
acceptor
only sendsprepare response
if the proposalts
is larger than theacceptor
'slastPreparedProposalTimestamp
. This ensures outdated proposals never proceed toaccept phase
. -
The weird "modify proposal
value
" inproposer
rule 3-b. This happens when aproposer
proposed a very newtimestamp
proposal, but there is already an on-goingaccept phase
.proposer
rule 3-b andacceptor
rule 1-b is like an acceptor saying: "Hey, I know you have a very new timestamp value, and I can give you myprepare response
, but there is already anaccept phase
voting process, and I've alreadyaccepted
anaccept request
. So to make this round finishes as soon as possible, you should change your proposal value to the max of proposalvalues
that are inaccept phase
. However since I don't know what values otheracceptors
accepted, I can only send you mine and you need to find out the max yourself." These two rules procedure is the key of Paxos
Examples (Not finished yet)
Suppose we have 2 proposers
, p1 and p2
, with 5 acceptors
, a1, a2, a3, a4, a5
. We use ts(p1)
for timestamp
of p1's proposal
and value(p1)
for value
of p1's proposal
. For example, if p1
wants to set tax
to 20%
at timestamp 1
, then ts(p1)
is 10, value(p1)
is20%
.
we will use
-
lastPrepareTs
to representlastPreparedProposalTimestamp
-
lastAcceptedTs
to representlastAcceptedProposalTimestamp
-
lastAcceptedValue
to representlastAcceptedProposalValue
Myth 1:
Can there be more than 1 proposers
enter accept phase
? Yes
Can there be more than 2 different value
in accept phase
? Yes
Example 1 :
-
p1
sends aprepare request
withts(p1)timestamp = 1, value(p1) = 10%
, and get 3prepare response
froma1,a2,a3
, it can procced toaccept phase
but due to some latency (or network issue), it doesn't start sendingaccept request
immediately.
acceptor | lastPrepareTs | lastAcceptedTs | lastAcceptedValue |
---|---|---|---|
a1 | 1 | ||
a2 | 1 | ||
a3 | 1 | ||
a4 | |||
a5 |
-
p2
sends aprepare request
withts(p2) = 2, value(p2) = 15%
, and get 3prepare response
froma3,a4,a5
. Noticea3
can sendprepare response
according toacceptor
rule 1
acceptor | lastPrepareTs | lastAcceptedTs | lastAcceptedValue |
---|---|---|---|
a1 | 1 | ||
a2 | 1 | ||
a3 | 2 | ||
a4 | 2 | ||
a5 | 2 |
-
p1
start to sends aaccept request
withts(p1) = 1, value(p1) = 10%
, but onlya1, a2
can givep1
aaccept response
,a3
cannot because ofacceptor
rule 1. Evena3
sent aprepare reponse
top1
, that doesn't always meana3
is a dedicated supporter ofp1
.
acceptor | lastPrepareTs | lastAcceptedTs | lastAcceptedValue |
---|---|---|---|
a1 | 1 | 1 | 10% |
a2 | 1 | 1 | 10% |
a3 | 2 | ||
a4 | 2 | ||
a5 | 2 |
-
p2
then starts sends aaccept request
withts(p2) = 2, value(p2) = 15%
toa3,a4,a5
,and get 3accept response
, everything goes smoothly.p2
's proposal won the round.tax
is set to15%
Example 2: p2 modified its proposal value
-
p1
sendsprepare requests
withts(p1)timestamp = 1, value(p1) = 10%
, and get 3prepare response
froma1,a2,a3
, it can procced toaccept phase
but due to some latency (or network issue), it doesn't start sendingaccept request
immediately.
acceptor | lastPrepareTs | lastAcceptedTs | lastAcceptedValue |
---|---|---|---|
a1 | 1 | ||
a2 | 1 | ||
a3 | 1 | ||
a4 | |||
a5 |
-
p2
sendsprepare requests
withts(p2) = 2, value(p2) = 15%
, but only get 2prepare response
froma4,a5
. Due to network issues, theprepare request
send toa3
was delayed. (But it will be received bya3
sometime in the near future)
acceptor | lastPrepareTs | lastAcceptedTs | lastAcceptedValue |
---|---|---|---|
a1 | 1 | ||
a2 | 1 | ||
a3 | 1 | ||
a4 | 2 | ||
a5 | 2 |
-
p1
start to sendsaccept requests
withts(p1) = 1, value(p1) = 10%
, this timea2, a3
givep1
accept response
, due to network issues,a1
cannot receive theaccept request
ofp1
.
acceptor | lastPrepareTs | lastAcceptedTs | lastAcceptedValue |
---|---|---|---|
a1 | 1 | ||
a2 | 1 | 1 | 10% |
a3 | 1 | 1 | 10% |
a4 | 2 | ||
a5 | 2 |
-
the late
prepare request
ofp2
finally arrivesa3
,a3
checks thets
ofp2
proposal, and givesprepare request
top2
, however, it also tellsp2
that "I've already accepted a proposal, itsts
is 1, andvalue
is 10%". -
p2
receivesprepare response
message froma3
, along with 2 otherprepare response
message froma4,a5
. Thenp2
followsproposer
rule 3-b, sees a non-emptylastAcceptedTs
field ifroma3
response. Thenp2
modifies its proposal value to10%
.
6.p2
then starts sends accept requests
with ts(p2) = 2, value(p2) = 10%
(value is no longer 15%) to a3,a4,a5
, a3
receives the accept request
, find 1) ts(p2) >= lastAcceptedTs
2) value(p2) == lastAcceptedValue
meets the accept requirement. So a3
accept the request and send a accept response
to p2
. Then a4, a5
also accepts.
acceptor | lastPrepareTs | lastAcceptedTs | lastAcceptedValue |
---|---|---|---|
a1 | 1 | ||
a2 | 1 | 1 | 10% |
a3 | 2 | 2 | 10% |
a4 | 2 | 2 | 10% |
a5 | 2 | 2 | 10% |
-
p2
wins the round, and settax
to10%
on allacceptors
. In this case, evenp1
didn't win, but thevalue
p1
wants to set finally wins. Paxos guarantees there is only one value in each round, but that value might be proposed by multipleproposers
. Anyone of thoseproposers
wins the round is ok.
Example 3 : The most interesting example. Will it be possible that a1, a2
accept the proposal with value = 10%
, a3
accept the proposal with value = 15%
and a4, a5
accepts the proposal with 20%
? Paxos can guarantee this won't happen. See examples,
-
p1
sendsprepare requests
withts(p1)timestamp = 1, value(p1) = 10%
, and get 3prepare response
froma1,a2,a3
, it can procced toaccept phase
but due to some latency (or network issue), it doesn't start sendingaccept request
immediately.
acceptor | lastPrepareTs | lastAcceptedTs | lastAcceptedValue |
---|---|---|---|
a1 | 1 | ||
a2 | 1 | ||
a3 | 1 | ||
a4 | |||
a5 |
-
p2
sendsprepare requests
withts(p2) = 2, value(p2) = 15%
, and get 3prepare response
froma3,a4,a5
.
acceptor | lastPrepareTs | lastAcceptedTs | lastAcceptedValue |
---|---|---|---|
a1 | 1 | ||
a2 | 1 | ||
a3 | 2 | ||
a4 | 2 | ||
a5 | 2 |
-
p1
start to send aaccept requests
withts(p1) = 1, value(p1) = 10%
, but onlya1
received the message, the result would be
acceptor | lastPrepareTs | lastAcceptedTs | lastAcceptedValue |
---|---|---|---|
a1 | 1 | 1 | 10% |
a2 | 1 | ||
a3 | 2 | ||
a4 | 2 | ||
a5 | 2 |
-
p2
start to sendaccept requests
withts(p2) = 2, value(p1) = 15%
, but onlya5
received the message, the result would be
acceptor | lastPrepareTs | lastAcceptedTs | lastAcceptedValue |
---|---|---|---|
a1 | 1 | 1 | 10% |
a2 | 1 | ||
a3 | 2 | ||
a4 | 2 | ||
a5 | 2 | 2 | 15% |
Interesting things are coming. Suppose now we have a new proposer
, p3
who wants to propose with ts(p3) = 3, value(p3) = 20%
. And p3
just begin it's prepare phase
.
Consider what nodes it sends to:
- If
p3
sendprepare requests
toa1,a3,a4
ora2,a4,a5
or any majority ofacceptors
that includesa1
ora5
,p3
would be forced to modify itsvalue
. Ifp3
really want's to win
withp3
originalvalue = 20%
, it can only sendsprepare request
toa3,a4,a5
, or theacceptors
that haven't accepted any proposal
But will this always be possible? No
- if a majority of
acceptors
already have accepted a proposal, according to Pigeonhole principle,p3
will at least received a non-emptylastAcceptedTs
response from one of theacceptors
, thenp3
must change its value tolastAcceptedValue
associated with the largestlastAcceptedTs
. For example, ifp3
send prepare requests toa1, a3, a5
, because the response froma5
has the largestlastAcceptedTs = 2
,p3
set its value to15%
.
acceptor | lastPrepareTs | lastAcceptedTs | lastAcceptedValue |
---|---|---|---|
a1 | 3 | 1 | 10% |
a2 | 1 | ||
a3 | 3 | ||
a4 | 2 | ||
a5 | 3 | 2 | 15% |
( because a1, a3, a4, a5
will reject p1
' s proposal, p3
will eventually win)
If less than half of acceptors
have accepted a proposal already, then p3
must be lucky enough to send prepare requests
to a majority of acceptors
, such as a2, a3, a4
in our case.
acceptor | lastPrepareTs | lastAcceptedTs | lastAcceptedValue |
---|---|---|---|
a1 | 1 | 1 | 10% |
a2 | 3 | ||
a3 | 3 | ||
a4 | 3 | ||
a5 | 2 | 2 | 15% |
Luckily, a3, a4, a5
won't accept any accept request
from p1
or p2
because a2,a3,a4
has a lastPrepareTs = 3
, according to acceptor
rule 2. Then the result would be
acceptor | lastPrepareTs | lastAcceptedTs | lastAcceptedValue |
---|---|---|---|
a1 | 1 | 1 | 10% |
a2 | 3 | 3 | 20% |
a3 | 3 | 3 | 20% |
a4 | 3 | 3 | 20% |
a5 | 2 | 2 | 15% |
So Paxos guarantees in a given round, and more than half of acceptors
already accepted a proposal, then no new proposer
can propose a new (new means never appeared in any previous proposals) value
in accept phase
. This property ensures there will always a value be chosen in each round.
Conclusion
Sure enough, we only discussed the most simple Paxos scenario. In our example, we assume a majority of acceptors
have 3 acceptors
, but 4, 5 are the same. We use number 3 because it reduces network traffic (we don't necessarily communicate all nodes to reach consensus, we only need just more than half). This is why some cluster running Paxos requires an odd number of nodes --- 4-nodes or 5-nodes cluster all need at least 3 nodes to form a majority.
In the case of node failures, Paxos also guarantees that, as long as at least more than half nodes are alive, Paxos can work properly. However, there are more complicated edge cases and performance issues related to Paxos. I'm not interested in digging too deep into those industrial rabbit holes. But understanding Paxos definitely helped me understand Raft better, which might be more useful if I want to find a distributed system-related job these days.
References
- Paxos Made Simple
- Wikipedia about Paxos
- Zhihu : Explain Paxos the easy way
- Some other blogs I read but I cannot remember where they are ...
p.s. I just read another article about Paxos using the same term accept request
and prepare request
after I finish writing this blog. links here -> Paxos算法介绍—Basic Paxos. Great minds think alike! :XD
网友评论