这是用户在 2024-11-1 22:32 为 https://pmg.csail.mit.edu/papers/osdi99_html/node4.html 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?
Next: Optimizations Up:Contents Previous: Service Properties
下一篇:优化上一篇:目录上一篇:服务属性

4 The Algorithm 4 算法

Our algorithm is a form of state machine replication [34, 17]: the service is modeled as a state machine that is replicated across different nodes in a distributed system. Each state machine replica maintains the service state and implements the service operations. We denote the set of replicas by R and identify each replica using an integer in {0, ..., |R| - 1}. For simplicity, we assume |R| = 3f+1 where f is the maximum number of replicas that may be faulty; although there could be more than 3f+1 replicas, the additional replicas degrade performance (since more and bigger messages are being exchanged) without providing improved resiliency.
我们的算法是状态机复制的一种形式 [34,17]:该服务被建模为在分布式系统的不同节点之间复制的状态机。每个状态机副本都维护服务状态并实现服务操作。我们用 R 表示副本集,并使用 {0, ..., |R|- 1}.为简单起见,我们假设 |R|= 3f+1,其中 f 是可能出错的最大副本数;尽管可能有多个 3F+1 个副本,但额外的副本 降低性能(因为交换的消息越来越多、更大) 没有提供改进的弹性。

The replicas move through a succession of configurations called views. In a view one replica is the primary and the others are backups. Views are numbered consecutively. The primary of a view is replica p such that p = v mod |R|, where v is the view number. View changes are carried out when it appears that the primary has failed. Viewstamped Replication [26] and Paxos [18] used a similar approach to tolerate benign faults (as discussed in Section 8.)
副本在一系列称为 views 的配置中移动。在视图中,一个副本是副本,其他副本是备份副本。视图是连续编号的。视图的主对象是副本 p,使得 p = v mod |R|,其中 v 是视图编号。当主数据库似乎失败时,将执行视图更改。Viewstamped Replication [26] 和 Paxos [18] 使用类似的方法来容忍良性故障(如 Section 8 所述)。

The algorithm works roughly as follows:
该算法的工作原理大致如下:

  1. A client sends a request to invoke a service operation to the primary
    客户端向主数据库发送请求以调用服务操作
  2. The primary multicasts the request to the backups
    主数据库将请求多播到备份
  3. Replicas execute the request and send a reply to the client
    副本执行请求并向客户端发送回复
  4. The client waits for f+1 replies from different replicas with the same result; this is the result of the operation.
    客户端等待来自不同副本的 f+1 回复,结果相同;这是操作的结果。
Like all state machine replication techniques [34], we impose two requirements on replicas: they must be deterministic (i.e., the execution of an operation in a given state and with a given set of arguments must always produce the same result) and they must start in the same state. Given these two requirements, the algorithm ensures the safety property by guaranteeing that all non-faulty replicas agree on a total order for the execution of requests despite failures.
与所有状态机复制技术一样 [34],我们对副本提出了两个要求:它们必须是确定性的(即,在给定状态下使用给定的参数集执行操作必须始终产生相同的结果),并且它们必须以相同的状态开始。鉴于这两个要求,该算法通过保证所有无故障副本在失败的情况下就执行请求的总顺序达成一致,从而确保安全属性。

The remainder of this section describes a simplified version of the algorithm. We omit discussion of how nodes recover from faults due to lack of space. We also omit details related to message retransmissions. Furthermore, we assume that message authentication is achieved using digital signatures rather than the more efficient scheme based on message authentication codes; Section 5 discusses this issue further. A detailed formalization of the algorithm using the I/O automaton model [21] is presented in [4].
本节的其余部分介绍该算法的简化版本。我们省略了对节点如何从由于空间不足而导致的故障中恢复的讨论。我们还省略了与消息重传相关的详细信息。此外,我们假设消息身份验证是使用数字签名实现的,而不是基于消息身份验证代码的更高效方案;第 5 节进一步讨论了这个问题。[4] 中介绍了使用 I/O 自动机模型 [21] 的算法的详细形式化。

4.1 The Client 4.1 客户端

A client c requests the execution of state machine operation o by sending a REQUEST, o, t, c message to the primary. Timestamp t is used to ensure exactly-once semantics for the execution of client requests. Timestamps for c's requests are totally ordered such that later requests have higher timestamps than earlier ones; for example, the timestamp could be the value of the client's local clock when the request is issued.
客户端 c 通过向主数据库发送 REQUEST, o, t, c 消息来请求执行状态机操作 o。时间戳 t 用于确保执行客户端请求的语义为恰好一次c 的时间戳 请求是完全有序的,因此后面的请求具有更高的时间戳 比早期的;例如,timestamp 可以是 发出请求时 client's local clock。

Each message sent by the replicas to the client includes the current view number, allowing the client to track the view and hence the current primary. A client sends a request to what it believes is the current primary using a point-to-point message. The primary atomically multicasts the request to all the backups using the protocol described in the next section.
副本发送到客户端的每条消息都包含当前视图编号,从而允许客户端跟踪视图,从而跟踪当前主视图。客户端使用点对点消息向它认为是当前主数据库的内容发送请求。主数据库使用下一节中描述的协议以原子方式将请求多播到所有备份。

A replica sends the reply to the request directly to the client. The reply has the form REPLY, v, t, c, i, r, where v is the current view number, t is the timestamp of the corresponding request, i is the replica number, and r is the result of executing the requested operation.
副本将请求的回复直接发送到客户端。回复的形式 为 REPLY, v, t, c, i, r ,其中 v 是当前视图编号,t 是相应请求的时间戳,i 是副本编号,r 是执行请求的操作的结果。

The client waits for f+1 replies with valid signatures from different replicas, and with the same t and r, before accepting the result r. This ensures that the result is valid, since at most f replicas can be faulty.
客户端在接受结果 r 之前,会等待来自不同副本的具有有效签名的 f+1 回复,并且具有相同的 tr。这可确保结果有效,因为最多 f 个副本可能是错误的。

If the client does not receive replies soon enough, it broadcasts the request to all replicas. If the request has already been processed, the replicas simply re-send the reply; replicas remember the last reply message they sent to each client. Otherwise, if the replica is not the primary, it relays the request to the primary. If the primary does not multicast the request to the group, it will eventually be suspected to be faulty by enough replicas to cause a view change.
如果客户端没有足够快地收到回复,它会将请求广播到所有副本。如果请求已被处理,则副本只需重新发送回复;副本会记住它们发送给每个客户端的最后一条回复消息。否则,如果副本不是主副本,它会将请求中继到主副本。如果主数据库没有对组的请求进行多播,则最终会因足够多的副本而怀疑它是错误的,从而导致视图更改。

In this paper we assume that the client waits for one request to complete before sending the next one. But we can allow a client to make asynchronous requests, yet preserve ordering constraints on them.
在本文中,我们假设客户端在发送下一个请求之前等待一个请求完成。但是我们可以允许 Client 端发出异步请求,但保留对它们的 Sequences 约束。

4.2 Normal-Case Operation
4.2 正常情况下的操作

The state of each replica includes the state of the service, a message log containing messages the replica has accepted, and an integer denoting the replica's current view. We describe how to truncate the log in Section 4.3.
每个副本的状态包括服务的状态、包含副本已接受的消息的消息日志以及表示副本当前视图的整数。我们在 Section 4.3 中描述了如何截断日志。

When the primary, p, receives a client request, m, it starts a three-phase protocol to atomically multicast the request to the replicas. The primary starts the protocol immediately unless the number of messages for which the protocol is in progress exceeds a given maximum. In this case, it buffers the request. Buffered requests are multicast later as a group to cut down on message traffic and CPU overheads under heavy load; this optimization is similar to a group commit in transactional systems [11]. For simplicity, we ignore this optimization in the description below.
当主数据库 p 收到客户端请求 m 时,它会启动一个三阶段协议,以原子方式将请求多播到副本。主数据库立即启动协议,除非正在进行的协议的消息数超过给定的最大值。在这种情况下,它会缓冲请求。缓冲请求稍后将作为一个组进行多播,以减少高负载下的消息流量和 CPU 开销;这种优化类似于事务系统中的 group commit [11]。为简单起见,我们在下面的描述中忽略了此优化。

The three phases are pre-prepare, prepare, and commit. The pre-prepare and prepare phases are used to totally order requests sent in the same view even when the primary, which proposes the ordering of requests, is faulty. The prepare and commit phases are used to ensure that requests that commit are totally ordered across views.
这三个阶段是 pre-prepare、preparecommit。pre-prepare 和 prepare 阶段用于对在同一视图中发送的请求进行完全排序,即使建议请求排序的主数据库出现故障。prepare 和 commit 阶段用于确保提交的请求在视图之间完全排序。

In the pre-prepare phase, the primary assigns a sequence number, n, to the request, multicasts a pre-prepare message with m piggybacked to all the backups, and appends the message to its log. The message has the form PRE-PREPARE, v, n, d, m, where v indicates the view in which the message is being sent, m is the client's request message, and d is m's digest.
在预准备阶段,主数据库为请求分配一个序列号 n,将带有 m 的预准备消息多播到所有备份,并将该消息附加到其日志中。消息的格式为 PRE-PREPARE, v, n, d , m ,其中 v 表示发送消息的视图, m 是客户端的请求消息, dm 的摘要。

Requests are not included in pre-prepare messages to keep them small. This is important because pre-prepare messages are used as a proof that the request was assigned sequence number n in view v in view changes. Additionally, it decouples the protocol to totally order requests from the protocol to transmit the request to the replicas; allowing us to use a transport optimized for small messages for protocol messages and a transport optimized for large messages for large requests.
请求不包含在预准备消息中,以保持其较小。这一点很重要,因为预准备消息用作证明请求在视图更改的视图 v 中被分配了序列号 n。此外,它还将协议解耦,以完全对来自协议的请求进行排序,以将请求传输到副本;允许我们将针对小消息优化的传输方式用于协议消息,并使用针对大消息优化的传输方式来处理大请求。

A backup accepts a pre-prepare message provided:
备份接受提供的 pre-prepare 消息:

The last condition prevents a faulty primary from exhausting the space of sequence numbers by selecting a very large one. We discuss how h and H advance in Section 4.3.
最后一个条件可防止有故障的主数据库耗尽空间 的序列号。我们将在 4.3 节中讨论 hH 是如何前进的。

If backup i accepts the PRE-PREPARE, v, n, d, m message, it enters the prepare phase by multicasting a PREPARE, v, n, d, i message to all other replicas and adds both messages to its log. Otherwise, it does nothing.
如果备份 i 接受 PRE-PREPARE, v, n, d , m 消息,它将通过将 PREPARE, v, n, d, i 消息多播 到所有其他副本来进入准备阶段,并将两条消息添加到其日志中。否则,它不执行任何操作。

A replica (including the primary) accepts prepare messages and adds them to its log provided their signatures are correct, their view number equals the replica's current view, and their sequence number is between h and H.
副本(包括主副本)接受 prepare 消息并将其添加到其日志中,前提是它们的签名正确,它们的视图号等于副本的当前视图,并且它们的序列号在 hH 之间。

We define the predicate prepared(m, v, n, i) to be true if and only if replica i has inserted in its log: the request m, a pre-prepare for m in view v with sequence number n, and 2f prepares from different backups that match the pre-prepare. The replicas verify whether the prepares match the pre-prepare by checking that they have the same view, sequence number, and digest.
当且仅当副本 i 已插入到其日志中时,我们将谓词 prepared(m, v, n, i) 定义为 true:请求 m、视图 vm 的预准备(序列号为 n)和 2f 从与预准备匹配的不同备份中准备。副本通过检查它们是否具有相同的视图、序列号和摘要来验证 prepares 是否与 pre-prepare 匹配。

The pre-prepare and prepare phases of the algorithm guarantee that non-faulty replicas agree on a total order for the requests within a view. More precisely, they ensure the following invariant: if prepared(m, v, n, i) is true then prepared(m', v, n, j) is false for any non-faulty replica j (including i = j) and any m' such that D(m')D(m). This is true because prepared(m, v, n, i) and |R| = 3f+1 imply that at least f+1 non-faulty replicas have sent a pre-prepare or prepare for m in view v with sequence number n. Thus, for prepared(m', v, n, j) to be true at least one of these replicas needs to have sent two conflicting prepares (or pre-prepares if it is the primary for v), i.e., two prepares with the same view and sequence number and a different digest. But this is not possible because the replica is not faulty. Finally, our assumption about the strength of message digests ensures that the probability that mm' and D(m) = D(m') is negligible.
算法的 pre-prepare 和 prepare 阶段可保证无故障副本就视图中请求的总顺序达成一致。更准确地说,它们确保以下不变量:如果 prepared(m, v, n, i) 为 true,则 prepared(m', v, n, j) 对于任何无错误的副本 j(包括 i = j)和任何 m' 为 false,使得 D(m') D(m)。这是真的,因为 prepared(m, v, n, i)|R|= 3f+1 表示至少有 f+1 个无故障副本在视图 v 中发送了序列号为 nm 的预准备或准备。因此,要使 prepared(m', v, n, j) 为 true,这些副本中至少有一个需要发送两个冲突的 prepares(如果它是 v 的主要 pre-prepares),即两个具有相同视图和序列号的 prepares 以及不同的摘要。但这是不可能的,因为副本没有故障。最后,我们对消息摘要强度的假设确保 m m'D(m) = D(m') 的概率可以忽略不计。

Replica i multicasts a COMMIT, v, n, D(m), i to the other replicas when prepared(m, v, n, i) becomes true. This starts the commit phase. Replicas accept commit messages and insert them in their log provided they are properly signed, the view number in the message is equal to the replica's current view, and the sequence number is between h and H.
prepared(m, v, n, i) 变为 true 时,副本 i COMMIT, v, n, D(m), i 多播到其他副本。这将启动提交阶段。副本接受提交消息并将其插入到其日志中,前提是它们已正确签名,消息中的视图号等于副本的当前视图,并且序列号介于 hH 之间。

We define the committed and committed-local predicates as follows: committed(m, v, n) is true if and only if prepared(m, v, n, i) is true for all i in some set of f+1 non-faulty replicas; and committed-local(m, v, n, i) is true if and only if prepared(m, v, n, i) is true and i has accepted 2f+1 commits (possibly including its own) from different replicas that match the pre-prepare for m; a commit matches a pre-prepare if they have the same view, sequence number, and digest.
我们按如下方式定义 committedcommitted-local 谓词:当且仅当 prepared(m, v, n, i) 对于一组 f+1 非故障副本中的所有 i 都为 true,则 committed(m, v, n) 为 true;且仅当 prepared(m, v, n, i) 为 true,并且 I 已经接受了来自不同副本的 2f+1 提交(可能包括自己的提交)时,这些提交与 m 的预准备匹配;如果 pre-prepare 具有相同的视图、序列号和摘要,则提交会匹配它们。

The commit phase ensures the following invariant: if committed-local(m, v, n, i) is true for some non-faulty i then committed(m, v, n) is true. This invariant and the view-change protocol described in Section 4.4 ensure that non-faulty replicas agree on the sequence numbers of requests that commit locally even if they commit in different views at each replica. Furthermore, it ensures that any request that commits locally at a non-faulty replica will commit at f+1 or more non-faulty replicas eventually.
提交阶段确保以下不变性:如果 committed-local(m, v, n, i) 对于某些无故障的 i 为 true,则 committed(m, v, n) 为 true。此不变量和第 4.4 节中描述的视图更改协议可确保无故障副本就本地提交的请求的序列号达成一致,即使它们在每个副本的不同视图中提交也是如此。此外,它还确保在非故障副本本地提交的任何请求最终将在 f+1 或多个非故障副本处提交。

Each replica i executes the operation requested by m after committed-local(m, v, n, i) is true and i's state reflects the sequential execution of all requests with lower sequence numbers. This ensures that all non-faulty replicas execute requests in the same order as required to provide the safety property. After executing the requested operation, replicas send a reply to the client. Replicas discard requests whose timestamp is lower than the timestamp in the last reply they sent to the client to guarantee exactly-once semantics.
committed-local(m, v, n, i) 为 true 后,每个副本 i 执行 m 请求的操作,并且 i 的状态反映了所有序列号较低的请求的顺序执行。这可确保所有无故障副本都按照提供 safety 属性所需的相同顺序执行请求。执行请求的操作后,副本会向客户端发送回复。副本会丢弃时间戳低于它们发送给客户端的最后一次回复中的时间戳的请求,以保证 Exactly-once 语义。

We do not rely on ordered message delivery, and therefore it is possible for a replica to commit requests out of order. This does not matter since it keeps the pre-prepare, prepare, and commit messages logged until the corresponding request can be executed.
我们不依赖于有序消息传输,因此副本可能会不按顺序提交请求。这无关紧要,因为它会记录 pre-prepare、prepare 和 commit 消息,直到可以执行相应的请求为止。

Figure 1 shows the operation of the algorithm in the normal case of no primary faults. Replica 0 is the primary, replica 3 is faulty, and C is the client.
1 显示了该算法在没有主要故障的正常情况下的操作。副本 0 是主副本,副本 3 是故障副本,C 是客户端。

  


Figure 1: Normal Case Operation
图 1:正常情况下的操作

4.3 Garbage Collection 4.3 垃圾回收

This section discusses the mechanism used to discard messages from the log. For the safety condition to hold, messages must be kept in a replica's log until it knows that the requests they concern have been executed by at least f+1 non-faulty replicas and it can prove this to others in view changes. In addition, if some replica misses messages that were discarded by all non-faulty replicas, it will need to be brought up to date by transferring all or a portion of the service state. Therefore, replicas also need some proof that the state is correct.
本节讨论用于从日志中丢弃消息的机制。为了保持安全条件,消息必须保存在副本的日志中,直到它知道它们所关注的请求已经被至少 f+1 个无故障的副本执行,并且它可以在视图变化中向其他人证明这一点。此外,如果某些副本错过了被所有无故障副本丢弃的消息,则需要通过传输全部或部分服务状态来使其保持最新状态。因此,副本还需要一些证明来证明状态是正确的。

Generating these proofs after executing every operation would be expensive. Instead, they are generated periodically, when a request with a sequence number divisible by some constant (e.g., 100) is executed. We will refer to the states produced by the execution of these requests as checkpoints and we will say that a checkpoint with a proof is a stable checkpoint.
在执行每个操作后生成这些证明将非常昂贵。相反,它们是在执行序列号可被某个常量(例如 100)整除的请求时定期生成的。我们将执行这些请求产生的状态称为 checkpoint,我们将说带有证明的 checkpoint 是一个稳定的 checkpoint

A replica maintains several logical copies of the service state: the last stable checkpoint, zero or more checkpoints that are not stable, and a current state. Copy-on-write techniques can be used to reduce the space overhead to store the extra copies of the state, as discussed in Section 6.3.
副本维护服务状态的多个逻辑副本:最后一个稳定检查点、零个或多个不稳定的检查点以及当前状态。write时复制技术可用于减少存储状态的额外副本的空间开销,如Section 6.3中所述。

The proof of correctness for a checkpoint is generated as follows. When a replica i produces a checkpoint, it multicasts a message CHECKPOINT, n, d, i to the other replicas, where n is the sequence number of the last request whose execution is reflected in the state and d is the digest of the state. Each replica collects checkpoint messages in its log until it has 2f+1 of them for sequence number n with the same digest d signed by different replicas (including possibly its own such message). These 2f+1 messages are the proof of correctness for the checkpoint.
检查点的正确性证明生成如下。当副本 i 生成检查点时,它会将消息 CHECKPOINT, n, d, i 多播到其他副本,其中 n 是最后一个请求的序列号,其执行反映在状态中,d 是状态的摘要。每个副本在其日志中收集检查点消息,直到它有 2f+1 个用于序列号 n 的检查点消息,其中相同的摘要 d 由不同的副本签名(可能包括它自己的此类消息)。这些 2f+1 消息是检查点正确性的证明。

A checkpoint with a proof becomes stable and the replica discards all pre-prepare, prepare, and commit messages with sequence number less than or equal to n from its log; it also discards all earlier checkpoints and checkpoint messages.
带有证明的 checkpoint 变得稳定,副本从其日志中丢弃所有序列号小于或等于 n 的 pre-prepare、prepare 和 commit 消息;它还会丢弃所有早期的 checkpoint 和 checkpoint 消息。

Computing the proofs is efficient because the digest can be computed using incremental cryptography [1] as discussed in Section 6.3, and proofs are generated rarely.
计算证明是有效的,因为摘要可以使用增量加密 [1] 进行计算,如 Section 6.3 中所述,并且很少生成证明。

The checkpoint protocol is used to advance the low and high water marks (which limit what messages will be accepted). The low-water mark h is equal to the sequence number of the last stable checkpoint. The high water mark H = h + k, where k is big enough so that replicas do not stall waiting for a checkpoint to become stable. For example, if checkpoints are taken every 100 requests, k might be 200.
检查点协议用于推进低水位线和高水位线(这限制了将接受的消息)。低水位线 h 等于最后一个稳定检查点的序列号。高水位线 H = h + k,其中 k 足够大,因此副本不会停滞等待检查点变得稳定。例如,如果每 100 个请求执行一次检查点,则 k 可能是 200。

4.4 View Changes 4.4 查看更改

The view-change protocol provides liveness by allowing the system to make progress when the primary fails. View changes are triggered by timeouts that prevent backups from waiting indefinitely for requests to execute. A backup is waiting for a request if it received a valid request and has not executed it. A backup starts a timer when it receives a request and the timer is not already running. It stops the timer when it is no longer waiting to execute the request, but restarts it if at that point it is waiting to execute some other request.
view-change 协议通过允许系统使 进度。视图更改由超时触发 ,以防止备份无限期地等待请求执行。 如果备份收到了有效的请求,并且 没有执行它。备份在收到请求时启动计时器,并且 计时器尚未运行。当计时器不再时,它会停止计时器 等待执行请求,但如果此时请求为 等待执行其他请求。

If the timer of backup i expires in view v, the backup starts a view change to move the system to view v+1. It stops accepting messages (other than checkpoint, view-change, and new-view messages) and multicasts a VIEW-CHANGE, v+1, n, C, P, i message to all replicas. Here n is the sequence number of the last stable checkpoint s known to i, C is a set of 2f+1 valid checkpoint messages proving the correctness of s, and P is a set containing a set Pm for each request m that prepared at i with a sequence number higher than n. Each set Pm contains a valid pre-prepare message (without the corresponding client message) and 2f matching, valid prepare messages signed by different backups with the same view, sequence number, and the digest of m.
如果 view v 中 backup i 的定时器过期,则 backup 将启动视图更改,将系统移动到 view v+1。它停止接受消息(checkpoint、view-change 和 new-view 消息除外),并将 VIEW-CHANGE、v+1、n、C、P、i 消息多播 到所有副本。其中 ni 已知的最后一个稳定检查点 s 的序列号,C 是一组 2f+1 个有效的检查点消息,证明 s 的正确性,P 是一个集合,其中包含在 i 准备的每个请求 m 的一组 Pm,序列号大于 n。每个 Pm 集都包含一条有效的 pre-prepare 消息(没有相应的 Client 端消息)和 2f 匹配,这些消息由具有不同备份签名的有效 prepare 消息,具有相同的视图、序列号和 m 的摘要。

When the primary p of view v+1 receives 2f valid view-change messages for view v+1 from other replicas, it multicasts a NEW-VIEW, v+1, V, O message to all other replicas, where V is a set containing the valid view-change messages received by the primary plus the view-change message for v+1 the primary sent (or would have sent), and O is a set of pre-prepare messages (without the piggybacked request). O is computed as follows:
当视图 v+1 的主 p 从其他副本收到视图 v+12f 有效视图更改消息时,它会将 NEW-VIEW、v+1、V、O 消息多播到所有其他副本,其中 V 是一个包含主副本接收的有效视图更改消息以及主副本发送(或将要发送)的 v+1 视图更改消息的集合, O 是一组预先准备的消息(没有背负的请求)。O 的计算方式如下:

  1. The primary determines the sequence number min-s of the latest stable checkpoint in V and the highest sequence number max-s in a prepare message in V.
    primary 确定 V 中最新稳定检查点的序列号 min-sV 中 prepare 消息中的最高序列号 max-s
  2. The primary creates a new pre-prepare message for view v+1 for each sequence number n between min-s and max-s. There are two cases: (1) there is at least one set in the P component of some view-change message in V with sequence number n; or (2) there is no such set. In the first case, the primary creates a new message PRE-PREPARE, v+1, n, d, where d is the request digest in the pre-prepare message for sequence number n with the highest view number in V. In the second case, it creates a new pre-prepare message PRE-PREPARE, v+1, n, dnull, where dnull is the digest of a special null request; a null request goes through the protocol like other requests, but its execution is a no-op. (Paxos [18] used a similar technique to fill in gaps.)
    主数据库为 min-smax-s 之间的每个序列号 n 为视图 v+1 创建一个新的预准备消息。有两种情况:(1) 在 V 中至少有一个序号 n 的某个视图更改消息的 P 组件集合;或者 (2) 没有这样的集合。在第一种情况下,主数据库会创建新消息 PRE-PREPARE, v+1, n, d ,其中 d 是序号 n 的预准备消息中的请求摘要,视图编号在 V 中最高。在第二种情况下,它会创建一个新的预准备消息 PRE-PREPARE, v+1, n, dnull ,其中 dnull 是特殊 null 请求的摘要;与其他请求一样,Null 请求也会通过协议,但其执行是无操作。(Paxos [18] 使用类似的技术来填补空白。
Next the primary appends the messages in O to its log. If min-s is greater than the sequence number of its latest stable checkpoint, the primary also inserts the proof of stability for the checkpoint with sequence number min-s in its log, and discards information from the log as discussed in Section 4.3. Then it enters view v+1: at this point it is able to accept messages for view v+1.
接下来,主数据库将 O 中的消息附加到其日志中。如果 min-s 大于其最新稳定 checkpoint 的序列号,则主数据库还会在其日志中插入序列号为 min-s 的 checkpoint 的稳定性证明,并丢弃日志中的信息,如 Section 4.3 中所述。然后它进入视图 v+1:此时它能够接受视图 v+1 的消息。

A backup accepts a new-view message for view v+1 if it is signed properly, if the view-change messages it contains are valid for view v+1, and if the set O is correct; it verifies the correctness of O by performing a computation similar to the one used by the primary to create O. Then it adds the new information to its log as described for the primary, multicasts a prepare for each message in O to all the other replicas, adds these prepares to its log, and enters view v+1.
如果备份已正确签名,则备份将接受视图 v+1 的新视图消息,如果它包含的视图更改消息对视图 v+1 有效,并且设置的 O 正确;它通过执行类似于主数据库用于创建 O 的计算来验证 O 的正确性。然后,它将新信息添加到其日志中,如主副本所述,将 O 中每条消息的 prepare 组播到所有其他副本,将这些 prepare 添加到其日志中,并进入视图 v+1

Thereafter, the protocol proceeds as described in Section 4.2. Replicas redo the protocol for messages between min-s and max-s but they avoid re-executing client requests (by using their stored information about the last reply sent to each client).
此后,协议按照第 4.2 节中的描述进行。副本会重做 min-smax-s 之间的消息的协议,但它们会避免重新执行客户端请求(通过使用它们存储的有关发送到每个客户端的最后一个回复的信息)。

A replica may be missing some request message m or a stable checkpoint (since these are not sent in new-view messages.) It can obtain missing information from another replica. For example, replica i can obtain a missing checkpoint state s from one of the replicas whose checkpoint messages certified its correctness in V. Since f+1 of those replicas are correct, replica i will always obtain s or a later certified stable checkpoint. We can avoid sending the entire checkpoint by partitioning the state and stamping each partition with the sequence number of the last request that modified it. To bring a replica up to date, it is only necessary to send it the partitions where it is out of date, rather than the whole checkpoint.
副本可能缺少一些请求消息 m 或稳定的 checkpoint(因为这些不会在 new-view 消息中发送)。它可以从另一个副本获取缺失的信息。例如,副本 i 可以从其中一个副本中获取缺失的检查点状态 s,该副本的检查点消息在 V 中证明了其正确性。由于这些副本中的 f+1 是正确的,因此副本 i 将始终获得 s 或更高版本的认证稳定检查点。我们可以通过对 state 进行分区并使用修改它的最后一个请求的序列号标记每个 partition 来避免发送整个 checkpoint。要使副本保持最新状态,只需向其发送过期的分区,而不是整个检查点。

4.5 Correctness 4.5 正确性

This section sketches the proof that the algorithm provides safety and liveness; details can be found in [4].
本节概述了算法提供安全性和 活性;详细信息可以在 [4] 中找到。

4.5.1 Safety 4.5.1 治理

As discussed earlier, the algorithm provides safety if all non-faulty replicas agree on the sequence numbers of requests that commit locally.
如前所述,该算法在所有无故障副本 就本地提交的请求的序列号达成一致。

In Section 4.2, we showed that if prepared(m, v, n, i) is true, prepared(m', v, n, j) is false for any non-faulty replica j (including i = j) and any m' such that D(m')D(m). This implies that two non-faulty replicas agree on the sequence number of requests that commit locally in the same view at the two replicas.
在第 4.2 节中,我们展示了如果 prepared(m, v, n, i) 为 true,则 prepared(m', v, n, j) 对于任何无错误的副本 j(包括 i = j)和任何 m' 为 false,使得 D(m') D(m)。这意味着两个无故障副本在两个副本的同一视图中本地提交的请求序列号上达成一致。

The view-change protocol ensures that non-faulty replicas also agree on the sequence number of requests that commit locally in different views at different replicas. A request m commits locally at a non-faulty replica with sequence number n in view v only if committed(m, v, n) is true. This means that there is a set R1 containing at least f+1 non-faulty replicas such that prepared(m, v, n, i) is true for every replica i in the set.
视图更改协议可确保无故障副本也就在不同副本的不同视图中本地提交的请求序列号达成一致。仅当 committed(m, v, n) 为 true 时,请求 m 才会在视图 v 中序列号为 n 的无故障副本本地提交。这意味着有一个集合 R1 包含至少 f+1 个无故障副本,使得集合中的每个副本 iprepared(m, v, n, i) 都为 true。

Non-faulty replicas will not accept a pre-prepare for view v' > v without having received a new-view message for v' (since only at that point do they enter the view). But any correct new-view message for view v' > v contains correct view-change messages from every replica i in a set R2 of 2f+1 replicas. Since there are 3f+1 replicas, R1 and R2 must intersect in at least one replica k that is not faulty. k's view-change message will ensure that the fact that m prepared in a previous view is propagated to subsequent views, unless the new-view message contains a view-change message with a stable checkpoint with a sequence number higher than n. In the first case, the algorithm redoes the three phases of the atomic multicast protocol for m with the same sequence number n and the new view number. This is important because it prevents any different request that was assigned the sequence number n in a previous view from ever committing. In the second case no replica in the new view will accept any message with sequence number lower than n. In either case, the replicas will agree on the request that commits locally with sequence number n.
未收到 v' 的新视图消息时,非故障副本将不接受视图 v' > v 的预准备(因为只有在此时它们才会进入视图)。但是,视图 v' > v 的任何正确新视图消息都包含来自一组 2f+1 副本的 R2 中每个副本 i 的正确视图更改消息。由于存在 3f+1 副本,因此 R1R2 必须在至少一个没有故障的副本 k 中相交。k 的视图更改消息将确保在前一个视图中准备的 m 这一事实传播到后续视图,除非新视图消息包含具有稳定检查点且序列号大于 n 的视图更改消息。在第一种情况下,该算法使用相同的序列号 n 和新的视图编号重做 m 的原子组播协议的三个阶段。这很重要,因为它可以防止在上一个视图中分配了序列号 n 的任何不同请求提交。在第二种情况下,新视图中的任何副本都不会接受序列号小于 n 的任何消息。在任何一种情况下,副本都将同意在本地提交序列号为 n 的请求。

4.5.2 Liveness 4.5.2 活跃度

To provide liveness, replicas must move to a new view if they are unable to execute a request. But it is important to maximize the period of time when at least 2f+1 non-faulty replicas are in the same view, and to ensure that this period of time increases exponentially until some requested operation executes. We achieve these goals by three means.
为了提供活动性,如果副本无法提供活动性,则必须移动到新视图 以执行请求。但最大限度地利用时间段很重要 当至少 2f+1 非故障副本都在同一 view 中,并保证此时间段 的时间呈指数级增长,直到执行某些请求的操作。 我们通过三种方式实现这些目标。

First, to avoid starting a view change too soon, a replica that multicasts a view-change message for view v+1 waits for 2f+1 view-change messages for view v+1 and then starts its timer to expire after some time T. If the timer expires before it receives a valid new-view message for v+1 or before it executes a request in the new view that it had not executed previously, it starts the view change for view v+2 but this time it will wait 2T before starting a view change for view v+3.
首先,为了避免过早地开始视图更改,为视图 v+1 多播视图更改消息的副本等待视图 v+12f+1 视图更改消息,然后启动其计时器在一段时间 T 后过期。如果计时器在收到 v+1 的有效新视图消息之前过期,或者在新视图中执行之前未执行的请求之前过期,它会启动视图 v+2 的视图更改,但这次它将等待 2T 后再开始视图 v+3 的视图更改。

Second, if a replica receives a set of f+1 valid view-change messages from other replicas for views greater than its current view, it sends a view-change message for the smallest view in the set, even if its timer has not expired; this prevents it from starting the next view change too late.
其次,如果副本从其他副本收到一组 f+1 有效视图更改消息,这些消息的视图大于其当前视图,则它会为集中的最小视图发送视图更改消息,即使其计时器尚未过期;这样可以防止它太晚地开始下一个视图更改。

Third, faulty replicas are unable to impede progress by forcing frequent view changes. A faulty replica cannot cause a view change by sending a view-change message, because a view change will happen only if at least f+1 replicas send view-change messages, but it can cause a view change when it is the primary (by not sending messages or sending bad messages). However, because the primary of view v is the replica p such that p = v mod |R|, the primary cannot be faulty for more than f consecutive views.
第三,有故障的副本无法通过强制频繁的视图更改来阻碍进度。有故障的副本不能通过发送视图更改消息来导致视图更改,因为只有当至少 f+1 个副本发送视图更改消息时,才会发生视图更改,但是当它是主副本时,它可能会导致视图更改(通过不发送消息或发送错误消息)。但是,由于视图 v 的主视图是副本 p,因此 p = v mod |R|,主数据库不能连续超过 f 个视图出错。

These three techniques guarantee liveness unless message delays grow faster than the timeout period indefinitely, which is unlikely in a real system.
这三种技术保证了活动性,除非消息延迟无限期地增长到超过超时时间的速度,这在实际系统中是不太可能的。

4.6 Non-Determinism 4.6 非确定性

State machine replicas must be deterministic but many services involve some form of non-determinism. For example, the time-last-modified in NFS is set by reading the server's local clock; if this were done independently at each replica, the states of non-faulty replicas would diverge. Therefore, some mechanism to ensure that all replicas select the same value is needed. In general, the client cannot select the value because it does not have enough information; for example, it does not know how its request will be ordered relative to concurrent requests by other clients. Instead, the primary needs to select the value either independently or based on values provided by the backups.
状态机副本必须是确定性的,但许多服务涉及某种形式的非确定性。例如,NFS 中的 time-last-modified 是通过读取服务器的本地时钟来设置的;如果在每个副本上独立执行此操作,则非故障副本的状态将发散。因此,需要某种机制来确保所有副本都选择相同的值。通常,客户端无法选择该值,因为它没有足够的信息;例如,它不知道相对于其他客户端的并发请求,其请求将如何排序。相反,主服务器需要独立选择值或根据备份提供的值选择值。

If the primary selects the non-deterministic value independently, it concatenates the value with the associated request and executes the three phase protocol to ensure that non-faulty replicas agree on a sequence number for the request and value. This prevents a faulty primary from causing replica state to diverge by sending different values to different replicas. However, a faulty primary might send the same, incorrect, value to all replicas. Therefore, replicas must be able to decide deterministically whether the value is correct (and what to do if it is not) based only on the service state.
如果主副本独立选择非确定性值,它会将该值与关联的请求连接起来,并执行三阶段协议,以确保非故障副本就请求和值的序列号达成一致。这可以防止有故障的主节点通过向不同的副本发送不同的值来导致副本状态发散。但是,有故障的主副本可能会向所有副本发送相同的错误值。因此,副本必须能够仅根据服务状态确定性地决定该值是否正确(如果不正确该怎么办)。

This protocol is adequate for most services (including NFS) but occasionally replicas must participate in selecting the value to satisfy a service's specification. This can be accomplished by adding an extra phase to the protocol: the primary obtains authenticated values proposed by the backups, concatenates 2f+1 of them with the associated request, and starts the three phase protocol for the concatenated message. Replicas choose the value by a deterministic computation on the 2f+1 values and their state, e.g., taking the median. The extra phase can be optimized away in the common case. For example, if replicas need a value that is ``close enough'' to that of their local clock, the extra phase can be avoided when their clocks are synchronized within some delta.
此协议适用于大多数服务(包括 NFS),但偶尔会 副本必须参与选择值以满足服务的 规范。这可以通过向 protocol:主数据库获取备份建议的 authenticated 值, 将其中的 2F+1 与关联的请求连接起来,并启动连接消息的三阶段协议。副本通过在 2f+1 上进行确定性计算来选择值 值及其状态,例如,取 medin.额外相位可以是 optimized away 在常见情况下。例如,如果副本需要一个值 这与他们当地的时钟 Extra Phase 的 Clock 的 Far Phase “足够接近” 当它们的 clocks 在某个 delta 内同步时可以避免。

Next: Optimizations Up:Contents Previous: Service Properties
Next: 优化 Up – 目录 Previous: 服务属性


Miguel Castro and Barbara Liskov,  "Practical Byzantine Fault Tolerance", in Proceedings of the Third Symposium on Operating Systems Design and Implementation, New Orleans, USA, February 1999.
Miguel Castro 和 Barbara Liskov,“实用拜占庭容错”,第三届操作系统设计和实现研讨会论文集,美国新奥尔良,1999 年 2 月。