
Preshing on Programming

An Introduction to Lock-Free Programming

Lock-free programming is a challenge, not just because of the complexity of the task itself, but because of how difficult it can be to penetrate the subject in the first place.

I was fortunate in that my first introduction to lock-free (also known as lockless) programming was Bruce Dawson’s excellent and comprehensive white paper, Lockless Programming Considerations. And like many, I’ve had the occasion to put Bruce’s advice into practice developing and debugging lock-free code on platforms such as the Xbox 360.

Since then, a lot of good material has been written, ranging from abstract theory and proofs of correctness to practical examples and hardware details. I’ll leave a list of references in the footnotes. At times, the information in one source may appear orthogonal to other sources: For instance, some material assumes sequential consistency, and thus sidesteps the memory ordering issues which typically plague lock-free C/C++ code. The new C++11 atomic library standard throws another wrench into the works, challenging the way many of us express lock-free algorithms.

In this post, I’d like to re-introduce lock-free programming, first by defining it, then by distilling most of the information down to a few key concepts. I’ll show how those concepts relate to one another using flowcharts, then we’ll dip our toes into the details a little bit. At a minimum, any programmer who dives into lock-free programming should already understand how to write correct multithreaded code using mutexes, and other high-level synchronization objects such as semaphores and events.

What Is It?

People often describe lock-free programming as programming without mutexes, which are also referred to as locks. That’s true, but it’s only part of the story. The generally accepted definition, based on academic literature, is a bit more broad. At its essence, lock-free is a property used to describe some code, without saying too much about how that code was actually written.

Basically, if some part of your program satisfies the following conditions, then that part can rightfully be considered lock-free. Conversely, if a given part of your code doesn’t satisfy these conditions, then that part is not lock-free.

In this sense, the lock in lock-free does not refer directly to mutexes, but rather to the possibility of “locking up” the entire application in some way, whether it’s deadlock, livelock – or even due to hypothetical thread scheduling decisions made by your worst enemy. That last point sounds funny, but it’s key. Shared mutexes are ruled out trivially, because as soon as one thread obtains the mutex, your worst enemy could simply never schedule that thread again. Of course, real operating systems don’t work that way – we’re merely defining terms.

Here’s a simple example of an operation which contains no mutexes, but is still not lock-free. Initially, X = 0. As an exercise for the reader, consider how two threads could be scheduled in a way such that neither thread exits the loop.

while (X == 0)
{
    X = 1 - X;
}

Nobody expects a large application to be entirely lock-free. Typically, we identify a specific set of lock-free operations out of the whole codebase. For example, in a lock-free queue, there might be a handful of lock-free operations such as push, pop, perhaps isEmpty, and so on.

Herlihy & Shavit, authors of The Art of Multiprocessor Programming, tend to express such operations as class methods, and offer the following succinct definition of lock-free (see slide 150): “In an infinite execution, infinitely often some method call finishes.” In other words, as long as the program is able to keep calling those lock-free operations, the number of completed calls keeps increasing, no matter what. It is algorithmically impossible for the system to lock up during those operations.

One important consequence of lock-free programming is that if you suspend a single thread, it will never prevent other threads from making progress, as a group, through their own lock-free operations. This hints at the value of lock-free programming when writing interrupt handlers and real-time systems, where certain tasks must complete within a certain time limit, no matter what state the rest of the program is in.

One final clarification: operations that are designed to block do not disqualify the algorithm. For example, a queue’s pop operation may intentionally block when the queue is empty. The remaining codepaths can still be considered lock-free.

Lock-Free Programming Techniques

It turns out that when you attempt to satisfy the non-blocking condition of lock-free programming, a whole family of techniques fall out: atomic operations, memory barriers, avoiding the ABA problem, to name a few. This is where things quickly become diabolical.

So how do these techniques relate to one another? To illustrate, I’ve put together the following flowchart. I’ll elaborate on each one below.

Atomic Read-Modify-Write Operations

Atomic operations are ones which manipulate memory in a way that appears indivisible: No thread can observe the operation half-complete. On modern processors, lots of operations are already atomic. For example, aligned reads and writes of simple types are usually atomic.
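As a minimal C++11 sketch (the names here are illustrative, not from any particular codebase), wrapping the shared value in std::atomic makes the guarantee explicit, rather than relying on a platform’s alignment behavior:

```cpp
#include <atomic>

// A shared value written by one thread and read by others.
// std::atomic<int> guarantees each load and store is indivisible,
// so a reader sees either the old value or the new one -- never a
// half-written ("torn") value. In C++, a plain int shared between
// threads without synchronization would be a data race.
std::atomic<int> g_sharedValue(0);

void writeValue(int v) { g_sharedValue.store(v); }
int  readValue()       { return g_sharedValue.load(); }
```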

Read-modify-write (RMW) operations go a step further, allowing you to perform more complex transactions atomically. They’re especially useful when a lock-free algorithm must support multiple writers, because when multiple threads attempt an RMW on the same address, they’ll effectively line up in a row and execute those operations one-at-a-time. I’ve already touched upon RMW operations in this blog, such as when implementing a lightweight mutex, a recursive mutex and a lightweight logging system.

Examples of RMW operations include _InterlockedIncrement on Win32, OSAtomicAdd32 on iOS, and std::atomic<int>::fetch_add in C++11. Be aware that the C++11 atomic standard does not guarantee that the implementation will be lock-free on every platform, so it’s best to know the capabilities of your platform and toolchain. You can call std::atomic<>::is_lock_free to make sure.
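To make this concrete, here’s a small sketch using std::atomic<int>::fetch_add (the counter and function names are mine, chosen for illustration). Two threads perform 100,000 increments each; because each increment is an atomic RMW, no updates are lost:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> g_count(0);

// Each call performs 100000 atomic increments. fetch_add is an RMW:
// the read, the add, and the write happen as one indivisible step,
// so concurrent threads effectively line up and take turns.
void incrementMany()
{
    for (int i = 0; i < 100000; i++)
        g_count.fetch_add(1);
}

int runCounterDemo()
{
    std::thread t1(incrementMany);
    std::thread t2(incrementMany);
    t1.join();
    t2.join();
    return g_count.load();
}
```

With a plain int and `++`, some increments would typically be lost, since the read-modify-write steps of the two threads could interleave.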

Different CPU families support RMW in different ways. Processors such as PowerPC and ARM expose load-link/store-conditional instructions, which effectively allow you to implement your own RMW primitive at a low level, though this is not often done. The common RMW operations are usually sufficient.

As illustrated by the flowchart, atomic RMWs are a necessary part of lock-free programming even on single-processor systems. Without atomicity, a thread could be interrupted halfway through the transaction, possibly leading to an inconsistent state.

Compare-And-Swap Loops

Perhaps the most often-discussed RMW operation is compare-and-swap (CAS). On Win32, CAS is provided via a family of intrinsics such as _InterlockedCompareExchange. Often, programmers perform compare-and-swap in a loop to repeatedly attempt a transaction. This pattern typically involves copying a shared variable to a local variable, performing some speculative work, and attempting to publish the changes using CAS:

void LockFreeQueue::push(Node* newHead)
{
    for (;;)
    {
        // Copy a shared variable (m_Head) to a local.
        Node* oldHead = m_Head;

        // Do some speculative work, not yet visible to other threads.
        newHead->next = oldHead;

        // Next, attempt to publish our changes to the shared variable.
        // If the shared variable hasn't changed, the CAS succeeds and we return.
        // Otherwise, repeat.
        if (_InterlockedCompareExchange(&m_Head, newHead, oldHead) == oldHead)
            return;
    }
}
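For comparison, the same push operation can be sketched with C++11 atomics instead of the Win32 intrinsic (the Node and m_Head names mirror the example above; treat this as an illustrative sketch, not a complete queue). On failure, compare_exchange_weak stores the freshly observed head back into oldHead, so the loop retries with up-to-date data:

```cpp
#include <atomic>

struct Node { Node* next; };

std::atomic<Node*> m_Head(nullptr);

void push(Node* newHead)
{
    // Copy the shared variable (m_Head) to a local.
    Node* oldHead = m_Head.load();
    do
    {
        // Speculative work, not yet visible to other threads.
        newHead->next = oldHead;
        // Attempt to publish. On failure, compare_exchange_weak
        // writes the current head into oldHead and we try again.
    } while (!m_Head.compare_exchange_weak(oldHead, newHead));
}
```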

Such loops still qualify as lock-free, because if the test fails for one thread, it means it must have succeeded for another – though some architectures offer a weaker variant of CAS where that’s not necessarily true. Whenever implementing a CAS loop, special care must be taken to avoid the ABA problem.

Sequential Consistency

Sequential consistency means that all threads agree on the order in which memory operations occurred, and that order is consistent with the order of operations in the program source code. Under sequential consistency, it’s impossible to experience memory reordering shenanigans like the one I demonstrated in a previous post.

A simple (but obviously impractical) way to achieve sequential consistency is to disable compiler optimizations and force all your threads to run on a single processor. A processor never sees its own memory effects out of order, even when threads are pre-empted and scheduled at arbitrary times.

Some programming languages offer sequential consistency even for optimized code running in a multiprocessor environment. In C++11, you can declare all shared variables as C++11 atomic types with default memory ordering constraints. In Java, you can mark all shared variables as volatile. Here’s the example from my previous post, rewritten in C++11 style:

std::atomic<int> X(0), Y(0);
int r1, r2;

void thread1()
{
    X.store(1);
    r1 = Y.load();
}

void thread2()
{
    Y.store(1);
    r2 = X.load();
}

Because the C++11 atomic types guarantee sequential consistency, the outcome r1 = r2 = 0 is impossible. To achieve this, the compiler outputs additional instructions behind the scenes – typically memory fences and/or RMW operations. Those additional instructions may make the implementation less efficient compared to one where the programmer has dealt with memory ordering directly.

Memory Ordering

As the flowchart suggests, any time you do lock-free programming for multicore (or any symmetric multiprocessor), and your environment does not guarantee sequential consistency, you must consider how to prevent memory reordering.

On today’s architectures, the tools to enforce correct memory ordering generally fall into three categories, which prevent both compiler reordering and processor reordering:

  • A lightweight sync or fence instruction, which I’ll talk about in future posts;
  • A full memory fence instruction, which I’ve demonstrated previously;
  • Memory operations which provide acquire or release semantics.

Acquire semantics prevent memory reordering of operations which follow it in program order, and release semantics prevent memory reordering of operations preceding it. These semantics are particularly suitable in cases when there’s a producer/consumer relationship, where one thread publishes some information and the other reads it. I’ll also talk about this more in a future post.
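Here’s a minimal producer/consumer sketch under those semantics (assuming C++11; the g_payload and g_ready names are illustrative). The payload itself can be a plain variable, as long as the flag that publishes it is written with release ordering and read with acquire ordering:

```cpp
#include <atomic>
#include <thread>

int g_payload = 0;                       // plain data, not atomic
std::atomic<bool> g_ready(false);

// The release store "publishes" g_payload: writes before it cannot
// be reordered past it. The acquire load "receives" it: reads after
// it cannot be reordered before it. Together they guarantee the
// consumer sees the payload once it sees the flag.
void producer()
{
    g_payload = 42;
    g_ready.store(true, std::memory_order_release);
}

int consumer()
{
    while (!g_ready.load(std::memory_order_acquire)) { }  // spin
    return g_payload;                    // guaranteed to be 42
}

int runPublishDemo()
{
    std::thread t(producer);
    int result = consumer();
    t.join();
    return result;
}
```

On x86/64 these orderings typically compile to plain loads and stores; on processors with more relaxed memory models, such as ARM or PowerPC, they emit the appropriate barrier instructions.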

Different Processors Have Different Memory Models

Different CPU families have different habits when it comes to memory reordering. The rules are documented by each CPU vendor and followed strictly by the hardware. For instance, PowerPC and ARM processors can change the order of memory stores relative to the instructions themselves, but normally, the x86/64 family of processors from Intel and AMD do not. We say the former processors have a more relaxed memory model.

There’s a temptation to abstract away such platform-specific details, especially with C++11 offering us a standard way to write portable lock-free code. But currently, I think most lock-free programmers have at least some appreciation of platform differences. If there’s one key difference to remember, it’s that at the x86/64 instruction level, every load from memory comes with acquire semantics, and every store to memory provides release semantics – at least for non-SSE instructions and non-write-combined memory. As a result, it’s been common in the past to write lock-free code which works on x86/64, but fails on other processors.

If you’re interested in the hardware details of how and why processors perform memory reordering, I’d recommend Appendix C of Is Parallel Programming Hard. In any case, keep in mind that memory reordering can also occur due to compiler reordering of instructions.

In this post, I haven’t said much about the practical side of lock-free programming, such as: When do we do it? How much do we really need? I also haven’t mentioned the importance of validating your lock-free algorithms. Nonetheless, I hope for some readers, this introduction has provided a basic familiarity with lock-free concepts, so you can proceed into the additional reading without feeling too bewildered. As usual, if you spot any inaccuracies, let me know in the comments.

[This article was featured in Issue #29 of Hacker Monthly.]

Additional References

Comments (29)

Fantastic article!

The flow charts are very useful. I'm going to save copies.


And the text explanations are top-notch!
Reply

pikachu · 669 weeks ago

Ever heard of Communicating Sequential Processes or the Go programming language?
Reply

extralongpants · 669 weeks ago

Thank you for taking the time to write this article. I found it informative and well written. The flow charts were a good idea. :)
Reply

Wix · 669 weeks ago

Excellent post! It's like a chapter of a well-written book. I can't wait to see future articles, and in the meantime I'm going to read all the past ones.
Reply
2 replies · active 580 weeks ago
Thanks, you guys. Glad people liked this one. I found it tough to write.
Reply

Mairaaj · 580 weeks ago

Most of the time, well-written things are tough to write. Jazak Allah (may God reward you best)
Reply

Ian · 669 weeks ago

A lock-free queue should never have a member called isEmpty(). It's a misleading name, since that condition can be fleeting and not relied on. A better name is wasEmpty(), which is a gentle reminder of this fact.
Reply
After working on the JActor project, this article feels so wrong-headed. Once you have ultra-high performance actors, it becomes clear that the problem all along was the focus on sharing state across threads--which you say is a prerequisite for lock-free programming. We just didn't have a workable alternative.
Reply
4 replies · active 603 weeks ago
Your JActor project uses java.util.concurrent.ConcurrentLinkedQueue. Thus, it too incorporates lock-free programming.
Reply
You are quite correct. There is also one semaphore in the implementation of the thread pool, but then it is important to block idle threads, eh? So applying your own reasoning and definitions, there is no such thing as a lock-free program.  

The real attention should be directed to the application logic. Once you have really fast actors to build on, in my experience, sharing state between threads is not an appropriate focus. Yeah, there are times when it is appropriate, but those are the exceptions.  
Reply 
I wholeheartedly agree that sharing state between threads should not be encouraged! The same advice is given in Bruce Dawson's paper and elsewhere. Thanks for your comments.  
Reply 
I've actually come of late to a better understanding of a significant difference between JActor actors-- http://jactorconsulting.com/product/jactor/ --and Scala actors, and that has to do with the elimination of intermediate or process state.  

In JActor you define a callback (usually an anonymous class) which is invoked on the requesting actor's "thread" when a response message is received. Member variables in this callback are the equivalent to method variables as they are not accessible when processing other incoming requests.  

Contrast this to Scala/Akka actors, which use 1-way messages and which must use shared state for intermediate data, and consequently must delay processing of other requests lest this intermediate data be overwritten.  

With JActor then, actors do not have state that is shared across multiple requests except when that state is updated atomically. And I think that is key.  
Reply 
Nicely written overview of lock-free programming, especially in that it gives a sense of how tricky it really is (especially if cross-platform compatibility is needed.)  

Given how many pitfalls there are with lock-free approaches, would you recommend implementing something with plain old mutexes first, and then benchmarking against that? Lock-free data structures seem like a great avenue for premature optimization.  
Reply 
1 reply · active 603 weeks ago
Yea for sure. I even gave respect to locks (mutexes) in a previous post.  

Where I work, nobody has gone overboard with lock-free programming. We've got some base lock-free systems in place. Aside from that, it usually happens as you suggest. When some part of the game is too slow, we profile it. Once in a while, a locking section shows up as a bottleneck. We first try to reduce data sharing, but when that's not possible, we might apply a lock-free technique. The latter choice usually requires careful testing.  
Reply 

Mica   · 649 weeks ago 

Hi Jeff,  

Thanks for this useful information.  

Are you courageous enough to go a step further and teach us how to program "wait-free" algorithms?  

No one seems to know how to ;-)  

Regards  
Mica  
Reply 

adrien   · 647 weeks ago 

Thank you for this great article.  

I've read several articles on DrDobbs and the like, plus the C++11 docs, but this one is a great, very clear overview of the whole thing, well written, with very nice drawings.  

Keep up the great work  
Reply 

code2live   · 645 weeks ago 

Thanks for the very useful article/links.  

I had a question regarding the flowchart - "Are there producers and consumers only? Yes -> Acquire and release semantics are sufficient" - Where can I find more information about this, especially involving multiple producers/consumers? Maybe it's already there in a particular section of the listed links? It would be great if you have more information.  
Reply 

drewlu   · 643 weeks ago 

Wonderful article especially for beginners!  
Reply 
Hi... Thanks for this useful article. I am doing my FYP on lock-free synchronization. I have studied a lot of articles and papers, but I could not understand the basic concept of how a single resource can be utilized by two processors simultaneously. No one has given pictorial/graphical examples.  
I need information about lock-free linked lists. What is the flow of making them lockless?  
Kindly help me.  
Reply 
1 reply · active 603 weeks ago
Any change should be made to a (deep?) copy of the data structure and, when updating the reference, if it was changed since the copy was made (see java atomic reference or just use a CAS instruction), just repeat the process.  

This is a heavy handed approach, but will work for any type of structure. But there are many structures (java's concurrent linked queue for example) that likely work more efficiently, i.e. without having to copy the entire structure. Alternately, you can use an immutable structure which returns a different reference with each update as the immutable structure will cleverly avoid having to copy the entire structure.  
Reply
I had a question. Intel allows reordering of loads with earlier stores, ie: StoreLoad can be reordered to LoadStore. However for that very reordering there is no fence - ie no StoreLoad Barrier. One must use a full barrier - MFENCE. Did I get that right?
Reply
1 reply · active 577 weeks ago
You got that right, though "mfence" is not the only instruction which can act as a full barrier. Any locked instruction, such as "xchg", "lock or", and others, will also act as a full memory barrier – provided you don’t use SSE instructions or write-combined memory in the neighboring operations you wish to affect. Microsoft C++ generates "xchg" when you use the MemoryBarrier intrinsic, at least in Visual Studio 2008. Mintomic implements mint_thread_fence_seq_cst using "lock or".
Reply
Thanks for the reply! As I read further into the Intel Manuals I realized that "StoreLoads may be reordered for different mem locations" but "StoreLoads are NOT reordered for same mem location". So in-fact one doesn't need a StoreLoad Barrier at all on x86 if they point to same memory location, and I really cant think of an example where a StoreLoad barrier would be needed if they point to different memory locations. At that point there is no option other than a full barrier.
Reply
Are the parameters of _InterlockedCompareExchange misplaced?

I mean, should the following piece of code
if (_InterlockedCompareExchange(&m_Head, newHead, oldHead) == oldHead)
return;

be modified as

if (_InterlockedCompareExchange(&m_Head, oldHead, newHead) == oldHead)
return;

Thanks.
Reply
Very well written. Thank you. Will bookmark and come back when needed.
Reply

Uma Mahesh · 422 weeks ago

Thanks for writing about a complex topic in a very detailed and easy-to-understand manner. Hats off to you. The flow charts are amazing.
Reply
I have bookmarked the site (not only this article) as a research blog for myself. Thanks for sharing these goodies!!!
Reply
There is a set of headers for lock free programming. It is based on a paradigm of nuclear chain reactions. For example Fission reactions fan out one process (upon completion) to many new processes. Fusion fans in multiple threads to one computational thread.
Check it out here : https://github.com/flatmax/nuclear-processing
Reply
