The replicas move through a succession of configurations called views.
In a view one replica is the primary and the others are backups.
Views are numbered consecutively. The primary of a view is replica
p such that p = v mod |R|, where v is the view
number. View changes are carried out when it appears that the primary has
failed. Viewstamped Replication [26]
and Paxos [18] used a similar approach
to tolerate benign faults (as discussed in Section 8.)
副本在一系列称为 views 的配置中移动。在视图中,一个副本是主副本,其他副本是备份副本。视图是连续编号的。视图的主对象是副本 p,使得 p = v mod |R|,其中 v 是视图编号。当主数据库似乎失败时,将执行视图更改。Viewstamped Replication [26] 和 Paxos [18] 使用类似的方法来容忍良性故障(如 Section 8 所述)。
The algorithm works roughly as follows:
该算法的工作原理大致如下:
The remainder of this section describes a simplified version of the
algorithm. We omit discussion of how nodes recover from faults due to lack
of space. We also omit details related to message retransmissions. Furthermore,
we assume that message authentication is achieved using digital signatures
rather than the more efficient scheme based on message authentication codes;
Section 5 discusses this issue further.
A detailed formalization of the algorithm using the I/O automaton model
[21] is presented in [4].
本节的其余部分介绍该算法的简化版本。我们省略了对节点如何从由于空间不足而导致的故障中恢复的讨论。我们还省略了与消息重传相关的详细信息。此外,我们假设消息身份验证是使用数字签名实现的,而不是基于消息身份验证代码的更高效方案;第 5 节进一步讨论了这个问题。[4] 中介绍了使用 I/O 自动机模型 [21] 的算法的详细形式化。
Each message sent by the replicas to the client includes the current
view number, allowing the client to track the view and hence the current
primary. A client sends a request to what it believes is the current primary
using a point-to-point message. The primary atomically multicasts the request
to all the backups using the protocol described in the next section.
副本发送到客户端的每条消息都包含当前视图编号,从而允许客户端跟踪视图,从而跟踪当前主视图。客户端使用点对点消息向它认为是当前主数据库的内容发送请求。主数据库使用下一节中描述的协议以原子方式将请求多播到所有备份。
A replica sends the reply to the request directly to the client. The
reply has the form REPLY, v, t, c, i, r,
where v is the
current view number, t is the timestamp of the corresponding request,
i is the replica number, and r is the result of executing the requested operation.
副本将请求的回复直接发送到客户端。回复的形式 为 REPLY, v, t, c, i, r ,其中 v 是当前视图编号,t 是相应请求的时间戳,i 是副本编号,r 是执行请求的操作的结果。
The client waits for f+1
replies with valid signatures from different replicas, and with the same t
and r, before
accepting the result r.
This ensures that the result is valid, since at most f
replicas can be faulty.
客户端在接受结果 r 之前,会等待来自不同副本的具有有效签名的 f+1 回复,并且具有相同的 t 和 r。这可确保结果有效,因为最多 f 个副本可能是错误的。
If the client does not receive replies soon enough, it broadcasts the
request to all replicas. If the request has already been processed, the
replicas simply re-send the reply; replicas remember the last reply message
they sent to each client. Otherwise, if the replica is not the primary,
it relays the request to the primary. If the primary does not multicast
the request to the group, it will eventually be suspected to be faulty
by enough replicas to cause a view change.
如果客户端没有足够快地收到回复,它会将请求广播到所有副本。如果请求已被处理,则副本只需重新发送回复;副本会记住它们发送给每个客户端的最后一条回复消息。否则,如果副本不是主副本,它会将请求中继到主副本。如果主数据库没有对组的请求进行多播,则最终会因足够多的副本而怀疑它是错误的,从而导致视图更改。
In this paper we assume that the client waits for one request to complete
before sending the next one. But we can allow a client to make asynchronous
requests, yet preserve ordering constraints on them.
在本文中,我们假设客户端在发送下一个请求之前等待一个请求完成。但是我们可以允许 Client 端发出异步请求,但保留对它们的 Sequences 约束。
When the primary, p,
receives a client request, m,
it starts a three-phase protocol to atomically multicast the request to
the replicas. The primary starts the protocol immediately unless the number
of messages for which the protocol is in progress exceeds a given maximum.
In this case, it buffers the request. Buffered requests are multicast later
as a group to cut down on message traffic and CPU overheads under heavy
load; this optimization is similar to a group commit in transactional systems
[11]. For simplicity, we ignore
this optimization in the description below.
当主数据库 p 收到客户端请求 m 时,它会启动一个三阶段协议,以原子方式将请求多播到副本。主数据库立即启动协议,除非正在进行的协议的消息数超过给定的最大值。在这种情况下,它会缓冲请求。缓冲请求稍后将作为一个组进行多播,以减少高负载下的消息流量和 CPU 开销;这种优化类似于事务系统中的 group commit [11]。为简单起见,我们在下面的描述中忽略了此优化。
The three phases are pre-prepare, prepare, and commit.
The pre-prepare and prepare phases are used to totally order requests sent
in the same view even when the primary, which proposes the ordering of
requests, is faulty. The prepare and commit phases are used to ensure that
requests that commit are totally ordered across views.
这三个阶段是 pre-prepare、prepare 和 commit。pre-prepare 和 prepare 阶段用于对在同一视图中发送的请求进行完全排序,即使建议请求排序的主数据库出现故障。prepare 和 commit 阶段用于确保提交的请求在视图之间完全排序。
In the pre-prepare phase, the primary assigns a sequence number, n,
to the request, multicasts a pre-prepare message with m
piggybacked to all the backups, and appends the message to its log. The
message has the form
PRE-PREPARE, v, n, d, m,
where v indicates the view in which the message is being sent, m
is the client's request message, and d
is m's digest.
在预准备阶段,主数据库为请求分配一个序列号 n,将带有 m 的预准备消息多播到所有备份,并将该消息附加到其日志中。消息的格式为 PRE-PREPARE, v, n, d , m ,其中 v 表示发送消息的视图, m 是客户端的请求消息, d 是 m 的摘要。
Requests are not included in pre-prepare messages to keep them small.
This is important because pre-prepare messages are used as a proof that
the request was assigned sequence number n
in view v in view
changes. Additionally, it decouples the protocol to totally order requests
from the protocol to transmit the request to the replicas; allowing us
to use a transport optimized for small messages for protocol messages and
a transport optimized for large messages for large requests.
请求不包含在预准备消息中,以保持其较小。这一点很重要,因为预准备消息用作证明请求在视图更改的视图 v 中被分配了序列号 n。此外,它还将协议解耦,以完全对来自协议的请求进行排序,以将请求传输到副本;允许我们将针对小消息优化的传输方式用于协议消息,并使用针对大消息优化的传输方式来处理大请求。
A backup accepts a pre-prepare message provided:
备份接受提供的 pre-prepare 消息:
If backup i
accepts the PRE-PREPARE, v, n, d, m
message, it enters the prepare phase by multicasting a
PREPARE, v, n, d, i
message to all other replicas and adds both messages to its log. Otherwise,
it does nothing.
如果备份 i 接受 PRE-PREPARE, v, n, d , m 消息,它将通过将 PREPARE, v, n, d, i 消息多播 到所有其他副本来进入准备阶段,并将两条消息添加到其日志中。否则,它不执行任何操作。
A replica (including the primary) accepts prepare messages and adds
them to its log provided their signatures are correct, their view number
equals the replica's current view, and their sequence number is between
h and H.
副本(包括主副本)接受 prepare 消息并将其添加到其日志中,前提是它们的签名正确,它们的视图号等于副本的当前视图,并且它们的序列号在 h 和 H 之间。
We define the predicate prepared(m, v, n, i)
to be true if and only if replica i
has inserted in its log: the request m,
a pre-prepare for m in view v with
sequence number n, and 2f prepares
from different backups that match the pre-prepare. The replicas verify
whether the prepares match the pre-prepare by checking that they have the
same view, sequence number, and digest.
当且仅当副本 i 已插入到其日志中时,我们将谓词 prepared(m, v, n, i) 定义为 true:请求 m、视图 v 中 m 的预准备(序列号为 n)和 2f 从与预准备匹配的不同备份中准备。副本通过检查它们是否具有相同的视图、序列号和摘要来验证 prepares 是否与 pre-prepare 匹配。
The pre-prepare and prepare phases of the algorithm guarantee that non-faulty
replicas agree on a total order for the requests within a view. More precisely,
they ensure the following invariant: if prepared(m, v, n, i)
is true then prepared(m', v, n, j)
is false for any non-faulty replica j
(including i = j)
and any m' such
that D(m')D(m). This
is true because prepared(m, v, n, i) and |R| = 3f+1
imply that
at least f+1 non-faulty
replicas have sent a pre-prepare or prepare for m
in view v with
sequence number n.
Thus, for prepared(m', v, n, j)
to be true at least one of these replicas needs to have sent two conflicting
prepares (or pre-prepares if it is the primary for v),
i.e., two prepares with the same view and sequence number and a different
digest. But this is not possible because the replica is not faulty. Finally,
our assumption about the strength of message digests ensures that the probability
that mm' and D(m) = D(m')
is negligible.
算法的 pre-prepare 和 prepare 阶段可保证无故障副本就视图中请求的总顺序达成一致。更准确地说,它们确保以下不变量:如果 prepared(m, v, n, i) 为 true,则 prepared(m', v, n, j) 对于任何无错误的副本 j(包括 i = j)和任何 m' 为 false,使得 D(m') D(m)。这是真的,因为 prepared(m, v, n, i) 和 |R|= 3f+1 表示至少有 f+1 个无故障副本在视图 v 中发送了序列号为 n 的 m 的预准备或准备。因此,要使 prepared(m', v, n, j) 为 true,这些副本中至少有一个需要发送两个冲突的 prepares(如果它是 v 的主要 pre-prepares),即两个具有相同视图和序列号的 prepares 以及不同的摘要。但这是不可能的,因为副本没有故障。最后,我们对消息摘要强度的假设确保 m m' 和 D(m) = D(m') 的概率可以忽略不计。
Replica i
multicasts a COMMIT, v, n, D(m), i
to the other replicas when prepared(m, v, n, i)
becomes true. This starts the commit phase. Replicas accept commit messages
and insert them in their log provided they are properly signed, the view
number in the message is equal to the replica's current view, and the sequence
number is between h and H.
当 prepared(m, v, n, i) 变为 true 时,副本 i 将 COMMIT, v, n, D(m), i 多播到其他副本。这将启动提交阶段。副本接受提交消息并将其插入到其日志中,前提是它们已正确签名,消息中的视图号等于副本的当前视图,并且序列号介于 h 和 H 之间。
We define the committed and committed-local predicates
as follows:
committed(m, v, n)
is true if and only if prepared(m, v, n, i)
is true for all i
in some set of f+1
non-faulty replicas; and committed-local(m, v, n, i)
is true if and only if prepared(m, v, n, i)
is true and i
has accepted 2f+1
commits (possibly including its own) from different replicas that match
the pre-prepare for m;
a commit matches a pre-prepare if they have the same view, sequence number,
and digest.
我们按如下方式定义 committed 和 committed-local 谓词:当且仅当 prepared(m, v, n, i) 对于一组 f+1 非故障副本中的所有 i 都为 true,则 committed(m, v, n) 为 true;当且仅当 prepared(m, v, n, i) 为 true,并且 I 已经接受了来自不同副本的 2f+1 提交(可能包括自己的提交)时,这些提交与 m 的预准备匹配;如果 pre-prepare 具有相同的视图、序列号和摘要,则提交会匹配它们。
The commit phase ensures the following invariant: if committed-local(m, v, n, i) is true for some non-faulty i
then committed(m, v, n) is true. This invariant and the view-change
protocol described in Section
4.4 ensure that non-faulty replicas
agree on the sequence numbers of requests that commit locally even if they
commit in different views at each replica. Furthermore, it ensures that
any request that commits locally at a non-faulty replica will commit at
f+1 or more non-faulty replicas eventually.
提交阶段确保以下不变性:如果 committed-local(m, v, n, i) 对于某些无故障的 i 为 true,则 committed(m, v, n) 为 true。此不变量和第 4.4 节中描述的视图更改协议可确保无故障副本就本地提交的请求的序列号达成一致,即使它们在每个副本的不同视图中提交也是如此。此外,它还确保在非故障副本本地提交的任何请求最终将在 f+1 或多个非故障副本处提交。
Each replica i
executes the operation requested by m
after committed-local(m, v, n, i) is true and i's
state reflects the sequential execution of all requests with lower sequence
numbers. This ensures that all non-faulty replicas execute requests in
the same order as required to provide the safety property. After executing
the requested operation, replicas send a reply to the client. Replicas
discard requests whose timestamp is lower than the timestamp in the last
reply they sent to the client to guarantee exactly-once semantics.
在 committed-local(m, v, n, i) 为 true 后,每个副本 i 执行 m 请求的操作,并且 i 的状态反映了所有序列号较低的请求的顺序执行。这可确保所有无故障副本都按照提供 safety 属性所需的相同顺序执行请求。执行请求的操作后,副本会向客户端发送回复。副本会丢弃时间戳低于它们发送给客户端的最后一次回复中的时间戳的请求,以保证 Exactly-once 语义。
We do not rely on ordered message delivery, and therefore it is possible
for a replica to commit requests out of order. This does not matter since
it keeps the pre-prepare, prepare, and commit messages logged until the
corresponding request can be executed.
我们不依赖于有序消息传输,因此副本可能会不按顺序提交请求。这无关紧要,因为它会记录 pre-prepare、prepare 和 commit 消息,直到可以执行相应的请求为止。
Figure 1 shows the operation of
the algorithm in the normal case of no primary faults. Replica 0 is the
primary, replica 3 is faulty, and C
is the client.
图 1 显示了该算法在没有主要故障的正常情况下的操作。副本 0 是主副本,副本 3 是故障副本,C 是客户端。
This section discusses the mechanism used to discard messages from the
log. For the safety condition to hold, messages must be kept in a replica's
log until it knows that the requests they concern have been executed by
at least f+1 non-faulty
replicas and it can prove this to others in view changes. In addition,
if some replica misses messages that were discarded by all non-faulty replicas,
it will need to be brought up to date by transferring all or a portion
of the service state. Therefore, replicas also need some proof that the
state is correct.
本节讨论用于从日志中丢弃消息的机制。为了保持安全条件,消息必须保存在副本的日志中,直到它知道它们所关注的请求已经被至少 f+1 个无故障的副本执行,并且它可以在视图变化中向其他人证明这一点。此外,如果某些副本错过了被所有无故障副本丢弃的消息,则需要通过传输全部或部分服务状态来使其保持最新状态。因此,副本还需要一些证明来证明状态是正确的。
Generating these proofs after executing every operation would be expensive.
Instead, they are generated periodically, when a request with a sequence
number divisible by some constant (e.g., 100) is executed. We will refer
to the states produced by the execution of these requests as checkpoints
and we will say that a checkpoint with a proof is a stable checkpoint.
在执行每个操作后生成这些证明将非常昂贵。相反,它们是在执行序列号可被某个常量(例如 100)整除的请求时定期生成的。我们将执行这些请求产生的状态称为 checkpoint,我们将说带有证明的 checkpoint 是一个稳定的 checkpoint。
A replica maintains several logical copies of the service state: the
last stable checkpoint, zero or more checkpoints that are not stable, and
a current state. Copy-on-write techniques can be used to reduce the space
overhead to store the extra copies of the state, as discussed in Section
6.3.
副本维护服务状态的多个逻辑副本:最后一个稳定检查点、零个或多个不稳定的检查点以及当前状态。write时复制技术可用于减少存储状态的额外副本的空间开销,如Section 6.3中所述。
The proof of correctness for a checkpoint is generated as follows. When
a replica i produces
a checkpoint, it multicasts a message
CHECKPOINT, n, d, i
to the other replicas, where n
is the sequence number of the last request whose execution is reflected
in the state and d
is the digest of the state. Each replica collects checkpoint messages in
its log until it has 2f+1
of them for sequence number n
with the same digest d
signed by different replicas (including possibly its own such message).
These 2f+1 messages
are the proof of correctness for the checkpoint.
检查点的正确性证明生成如下。当副本 i 生成检查点时,它会将消息 CHECKPOINT, n, d, i 多播到其他副本,其中 n 是最后一个请求的序列号,其执行反映在状态中,d 是状态的摘要。每个副本在其日志中收集检查点消息,直到它有 2f+1 个用于序列号 n 的检查点消息,其中相同的摘要 d 由不同的副本签名(可能包括它自己的此类消息)。这些 2f+1 消息是检查点正确性的证明。
A checkpoint with a proof becomes stable and the replica discards all
pre-prepare, prepare, and commit messages with sequence number less than
or equal to n
from its log; it also discards all earlier checkpoints and checkpoint messages.
带有证明的 checkpoint 变得稳定,副本从其日志中丢弃所有序列号小于或等于 n 的 pre-prepare、prepare 和 commit 消息;它还会丢弃所有早期的 checkpoint 和 checkpoint 消息。
Computing the proofs is efficient because the digest can be computed
using incremental cryptography [1]
as discussed in Section 6.3, and
proofs are generated rarely.
计算证明是有效的,因为摘要可以使用增量加密 [1] 进行计算,如 Section 6.3 中所述,并且很少生成证明。
The checkpoint protocol is used to advance the low and high water marks
(which limit what messages will be accepted). The low-water mark
h
is equal to the sequence number of the last stable checkpoint. The high
water mark H = h + k,
where k is big
enough so that replicas do not stall waiting for a checkpoint to become
stable. For example, if checkpoints are taken every 100 requests, k
might be 200.
检查点协议用于推进低水位线和高水位线(这限制了将接受的消息)。低水位线 h 等于最后一个稳定检查点的序列号。高水位线 H = h + k,其中 k 足够大,因此副本不会停滞等待检查点变得稳定。例如,如果每 100 个请求执行一次检查点,则 k 可能是 200。
If the timer of backup i expires in view v,
the backup starts a view change to move the system to view v+1.
It stops accepting messages (other than checkpoint, view-change, and new-view
messages) and multicasts a
VIEW-CHANGE, v+1, n, C, P, i
message to all replicas. Here n
is the sequence number of the last stable checkpoint s
known to i, C
is a set of 2f+1
valid checkpoint messages proving the correctness of s,
and P is a set
containing a set Pm
for each request m
that prepared at i
with a sequence number higher than n.
Each set Pm contains
a valid pre-prepare message (without the corresponding client message)
and 2f matching,
valid prepare messages signed by different backups with the same view,
sequence number, and the digest of m.
如果 view v 中 backup i 的定时器过期,则 backup 将启动视图更改,将系统移动到 view v+1。它停止接受消息(checkpoint、view-change 和 new-view 消息除外),并将 VIEW-CHANGE、v+1、n、C、P、i 消息多播 到所有副本。其中 n 是 i 已知的最后一个稳定检查点 s 的序列号,C 是一组 2f+1 个有效的检查点消息,证明 s 的正确性,P 是一个集合,其中包含在 i 准备的每个请求 m 的一组 Pm,序列号大于 n。每个 Pm 集都包含一条有效的 pre-prepare 消息(没有相应的 Client 端消息)和 2f 匹配,这些消息由具有不同备份签名的有效 prepare 消息,具有相同的视图、序列号和 m 的摘要。
When the primary p
of view v+1 receives 2f
valid view-change messages for view v+1
from other replicas, it multicasts a
NEW-VIEW, v+1, V, O
message to all other replicas, where V
is a set containing the valid view-change messages received by the primary
plus the view-change message for v+1
the primary sent (or would have sent), and O
is a set of pre-prepare messages (without the piggybacked request).
O is computed as follows:
当视图 v+1 的主 p 从其他副本收到视图 v+1 的 2f 有效视图更改消息时,它会将 NEW-VIEW、v+1、V、O 消息多播到所有其他副本,其中 V 是一个包含主副本接收的有效视图更改消息以及主副本发送(或将要发送)的 v+1 视图更改消息的集合, O 是一组预先准备的消息(没有背负的请求)。O 的计算方式如下:
A backup accepts a new-view message for view v+1
if it is signed properly, if the view-change messages it contains are valid
for view v+1, and
if the set O
is correct; it verifies the correctness of O
by performing a computation similar to the one used by the primary to create
O.
Then it adds the new information to its log as described for the primary,
multicasts a prepare for each message in O
to all the other replicas, adds these prepares to its log, and enters view
v+1.
如果备份已正确签名,则备份将接受视图 v+1 的新视图消息,如果它包含的视图更改消息对视图 v+1 有效,并且设置的 O 正确;它通过执行类似于主数据库用于创建 O 的计算来验证 O 的正确性。然后,它将新信息添加到其日志中,如主副本所述,将 O 中每条消息的 prepare 组播到所有其他副本,将这些 prepare 添加到其日志中,并进入视图 v+1。
Thereafter, the protocol proceeds as described in Section 4.2.
Replicas redo the protocol for messages between min-s and max-s
but they avoid re-executing client requests (by using their stored information
about the last reply sent to each client).
此后,协议按照第 4.2 节中的描述进行。副本会重做 min-s 和 max-s 之间的消息的协议,但它们会避免重新执行客户端请求(通过使用它们存储的有关发送到每个客户端的最后一个回复的信息)。
A replica may be missing some request message
m
or a stable checkpoint (since these are not sent in new-view messages.)
It can obtain missing information from another replica. For example,
replica i
can obtain a missing checkpoint state s
from one of the replicas whose checkpoint messages certified its correctness
in V. Since f+1
of those replicas are correct, replica i
will always obtain s
or a later certified stable checkpoint. We can avoid sending the entire
checkpoint by partitioning the state and stamping each partition with the
sequence number of the last request that modified it. To bring a replica
up to date, it is only necessary to send it the partitions where it is
out of date, rather than the whole checkpoint.
副本可能缺少一些请求消息 m 或稳定的 checkpoint(因为这些不会在 new-view 消息中发送)。它可以从另一个副本获取缺失的信息。例如,副本 i 可以从其中一个副本中获取缺失的检查点状态 s,该副本的检查点消息在 V 中证明了其正确性。由于这些副本中的 f+1 是正确的,因此副本 i 将始终获得 s 或更高版本的认证稳定检查点。我们可以通过对 state 进行分区并使用修改它的最后一个请求的序列号标记每个 partition 来避免发送整个 checkpoint。要使副本保持最新状态,只需向其发送过期的分区,而不是整个检查点。
In Section 4.2, we showed that
if prepared(m, v, n, i)
is true, prepared(m', v, n, j)
is false for any non-faulty replica j
(including i = j)
and any m' such
that D(m')D(m). This
implies that two non-faulty replicas agree on the sequence number of requests
that commit locally in the same view at the two replicas.
在第 4.2 节中,我们展示了如果 prepared(m, v, n, i) 为 true,则 prepared(m', v, n, j) 对于任何无错误的副本 j(包括 i = j)和任何 m' 为 false,使得 D(m') D(m)。这意味着两个无故障副本在两个副本的同一视图中本地提交的请求序列号上达成一致。
The view-change protocol ensures that non-faulty replicas also agree
on the sequence number of requests that commit locally in different views
at different replicas. A request m
commits locally at a non-faulty replica with sequence number n
in view v only
if committed(m, v, n) is true. This means that there is a set
R1
containing at least f+1
non-faulty replicas such that prepared(m, v, n, i)
is true for every replica i
in the set.
视图更改协议可确保无故障副本也就在不同副本的不同视图中本地提交的请求序列号达成一致。仅当 committed(m, v, n) 为 true 时,请求 m 才会在视图 v 中序列号为 n 的无故障副本本地提交。这意味着有一个集合 R1 包含至少 f+1 个无故障副本,使得集合中的每个副本 i 的 prepared(m, v, n, i) 都为 true。
Non-faulty replicas will not accept a pre-prepare for view
v' > v
without having received a new-view message for v'
(since only at that point do they enter the view). But any correct new-view
message for view v' > v
contains correct view-change messages from every replica
i
in a set R2 of 2f+1
replicas. Since there are 3f+1
replicas, R1 and R2
must intersect in at least one replica k
that is not faulty. k's
view-change message will ensure that the fact that m
prepared in a previous view is propagated to subsequent views, unless the
new-view message contains a view-change message with a stable checkpoint
with a sequence number higher than n.
In the first case, the algorithm redoes the three phases of the atomic
multicast protocol for m
with the same sequence number n
and the new view number. This is important because it prevents any different
request that was assigned the sequence number n
in a previous view from ever committing. In the second case no replica
in the new view will accept any message with sequence number lower than
n.
In either case, the replicas will agree on the request that commits locally
with sequence number n.
未收到 v' 的新视图消息时,非故障副本将不接受视图 v' > v 的预准备(因为只有在此时它们才会进入视图)。但是,视图 v' > v 的任何正确新视图消息都包含来自一组 2f+1 副本的 R2 中每个副本 i 的正确视图更改消息。由于存在 3f+1 副本,因此 R1 和 R2 必须在至少一个没有故障的副本 k 中相交。k 的视图更改消息将确保在前一个视图中准备的 m 这一事实传播到后续视图,除非新视图消息包含具有稳定检查点且序列号大于 n 的视图更改消息。在第一种情况下,该算法使用相同的序列号 n 和新的视图编号重做 m 的原子组播协议的三个阶段。这很重要,因为它可以防止在上一个视图中分配了序列号 n 的任何不同请求提交。在第二种情况下,新视图中的任何副本都不会接受序列号小于 n 的任何消息。在任何一种情况下,副本都将同意在本地提交序列号为 n 的请求。
First, to avoid starting a view change too soon, a replica that multicasts
a view-change message for view v+1
waits for 2f+1 view-change
messages for view v+1
and then starts its timer to expire after some time T.
If the timer expires before it receives a valid new-view message for
v+1
or before it executes a request in the new view that it had not executed
previously, it starts the view change for view v+2
but this time it will wait 2T
before starting a view change for view v+3.
首先,为了避免过早地开始视图更改,为视图 v+1 多播视图更改消息的副本等待视图 v+1 的 2f+1 视图更改消息,然后启动其计时器在一段时间 T 后过期。如果计时器在收到 v+1 的有效新视图消息之前过期,或者在新视图中执行之前未执行的请求之前过期,它会启动视图 v+2 的视图更改,但这次它将等待 2T 后再开始视图 v+3 的视图更改。
Second, if a replica receives a set of f+1
valid view-change messages from other replicas for views greater than its
current view, it sends a view-change message for the smallest view in the
set, even if its timer has not expired; this prevents it from starting
the next view change too late.
其次,如果副本从其他副本收到一组 f+1 有效视图更改消息,这些消息的视图大于其当前视图,则它会为集中的最小视图发送视图更改消息,即使其计时器尚未过期;这样可以防止它太晚地开始下一个视图更改。
Third, faulty replicas are unable to impede progress by forcing frequent
view changes. A faulty replica cannot cause a view change by sending a
view-change message, because a view change will happen only if at least
f+1
replicas send view-change messages, but it can cause a view change when
it is the primary (by not sending messages or sending bad messages). However,
because the primary of view v
is the replica p
such that p = v mod |R|, the
primary cannot be faulty for more than f
consecutive views.
第三,有故障的副本无法通过强制频繁的视图更改来阻碍进度。有故障的副本不能通过发送视图更改消息来导致视图更改,因为只有当至少 f+1 个副本发送视图更改消息时,才会发生视图更改,但是当它是主副本时,它可能会导致视图更改(通过不发送消息或发送错误消息)。但是,由于视图 v 的主视图是副本 p,因此 p = v mod |R|,主数据库不能连续超过 f 个视图出错。
These three techniques guarantee liveness unless message delays grow
faster than the timeout period indefinitely, which is unlikely in a real
system.
这三种技术保证了活动性,除非消息延迟无限期地增长到超过超时时间的速度,这在实际系统中是不太可能的。
State machine replicas must be deterministic but many services involve
some form of non-determinism. For example, the time-last-modified in NFS
is set by reading the server's local clock; if this were done independently
at each replica, the states of non-faulty replicas would diverge. Therefore,
some mechanism to ensure that all replicas select the same value is needed.
In general, the client cannot select the value because it does not have
enough information; for example, it does not know how its request will
be ordered relative to concurrent requests by other clients. Instead, the
primary needs to select the value either independently or based on values
provided by the backups.
状态机副本必须是确定性的,但许多服务涉及某种形式的非确定性。例如,NFS 中的 time-last-modified 是通过读取服务器的本地时钟来设置的;如果在每个副本上独立执行此操作,则非故障副本的状态将发散。因此,需要某种机制来确保所有副本都选择相同的值。通常,客户端无法选择该值,因为它没有足够的信息;例如,它不知道相对于其他客户端的并发请求,其请求将如何排序。相反,主服务器需要独立选择值或根据备份提供的值选择值。
If the primary selects the non-deterministic value independently, it
concatenates the value with the associated request and executes the three
phase protocol to ensure that non-faulty replicas agree on a sequence number
for the request and value. This prevents a faulty primary from causing
replica state to diverge by sending different values to different replicas.
However, a faulty primary might send the same, incorrect, value to all
replicas. Therefore, replicas must be able to decide deterministically
whether the value is correct (and what to do if it is not) based only on
the service state.
如果主副本独立选择非确定性值,它会将该值与关联的请求连接起来,并执行三阶段协议,以确保非故障副本就请求和值的序列号达成一致。这可以防止有故障的主节点通过向不同的副本发送不同的值来导致副本状态发散。但是,有故障的主副本可能会向所有副本发送相同的错误值。因此,副本必须能够仅根据服务状态确定性地决定该值是否正确(如果不正确该怎么办)。
This protocol is adequate for most services (including NFS) but occasionally
replicas must participate in selecting the value to satisfy a service's
specification. This can be accomplished by adding an extra phase to the
protocol: the primary obtains authenticated values proposed by the backups,
concatenates 2f+1
of them with the associated request, and starts the three phase protocol
for the concatenated message. Replicas choose the value by a deterministic
computation on the 2f+1
values and their state, e.g., taking the median. The extra phase can be
optimized away in the common case. For example, if replicas need a value
that is ``close enough'' to that of their local clock, the extra phase
can be avoided when their clocks are synchronized within some delta.
此协议适用于大多数服务(包括 NFS),但偶尔会
副本必须参与选择值以满足服务的
规范。这可以通过向
protocol:主数据库获取备份建议的 authenticated 值,
将其中的 2F+1 与关联的请求连接起来,并启动连接消息的三阶段协议。副本通过在 2f+1 上进行确定性计算来选择值
值及其状态,例如,取 medin.额外相位可以是
optimized away 在常见情况下。例如,如果副本需要一个值
这与他们当地的时钟 Extra Phase 的 Clock 的 Far Phase “足够接近”
当它们的 clocks 在某个 delta 内同步时可以避免。
Next: Optimizations
Up:Contents Previous: Service
Properties
Next: 优化 Up – 目录 Previous: 服务属性