
Preshing on Programming

Memory Ordering at Compile Time

Between the time you type in some C/C++ source code and the time it executes on a CPU, the memory interactions of that code may be reordered according to certain rules. Changes to memory ordering are made both by the compiler (at compile time) and by the processor (at run time), all in the name of making your code run faster.

The cardinal rule of memory reordering, which is universally followed by compiler developers and CPU vendors, could be phrased as follows:

Thou shalt not modify the behavior of a single-threaded program.

As a result of this rule, memory reordering goes largely unnoticed by programmers writing single-threaded code. It often goes unnoticed in multithreaded programming, too, since mutexes, semaphores and events are all designed to prevent memory reordering around their call sites. It’s only when lock-free techniques are used – when memory is shared between threads without any kind of mutual exclusion – that the cat is finally out of the bag, and the effects of memory reordering can be plainly observed.

Mind you, it is possible to write lock-free code for multicore platforms without the hassles of memory reordering. As I mentioned in my introduction to lock-free programming, one can take advantage of sequentially consistent types, such as volatile variables in Java or C++11 atomics – possibly at the price of a little performance. I won’t go into detail about those here. In this post, I’ll focus on the impact of the compiler on memory ordering for regular, non-sequentially-consistent types.

Compiler Instruction Reordering

As you know, the job of a compiler is to convert human-readable source code into machine-readable code for the CPU. During this conversion, the compiler is free to take many liberties.

One such liberty is the reordering of instructions – again, only in cases where single-threaded program behavior does not change. Such instruction reordering typically happens only when compiler optimizations are enabled. Consider the following function:

int A, B;

void foo()
{
    A = B + 1;
    B = 0;
}

If we compile this function using GCC 4.6.1 without compiler optimization, it generates the following machine code, which we can view as an assembly listing using the -S option. The memory store to global variable B occurs right after the store to A, just as it does in the original source code.

$ gcc -S -masm=intel foo.c
$ cat foo.s
        ...
        mov     eax, DWORD PTR _B
        add     eax, 1
        mov     DWORD PTR _A, eax
        mov     DWORD PTR _B, 0
        ...

Compare that to the resulting assembly listing when optimizations are enabled using -O2:

$ gcc -O2 -S -masm=intel foo.c
$ cat foo.s
        ...
        mov     eax, DWORD PTR B
        mov     DWORD PTR B, 0
        add     eax, 1
        mov     DWORD PTR A, eax
        ...

This time, the compiler has chosen to exercise its liberties, and reordered the store to B before the store to A. And why shouldn’t it? The cardinal rule of memory ordering is not broken. A single-threaded program would never know the difference.

On the other hand, such compiler reorderings can cause problems when writing lock-free code. Here’s a commonly-cited example, where a shared flag is used to indicate that some other shared data has been published:

int Value;
int IsPublished = 0;
 
void sendValue(int x)
{
    Value = x;
    IsPublished = 1;
}

Imagine what would happen if the compiler reordered the store to IsPublished before the store to Value. Even on a single-processor system, we’d have a problem: a thread could very well be pre-empted by the operating system between the two stores, leaving other threads to believe that Value has been updated when in fact, it hasn’t.

Of course, the compiler might not reorder those operations, and the resulting machine code would work fine as a lock-free operation on any multicore CPU having a strong memory model, such as an x86/64 – or in a single-processor environment, any type of CPU at all. If that’s the case, we should consider ourselves lucky. Needless to say, it’s much better practice to recognize the possibility of memory reordering for shared variables, and to ensure that the correct ordering is enforced.

Explicit Compiler Barriers

The minimalist approach to preventing compiler reordering is by using a special directive known as a compiler barrier. I’ve already demonstrated compiler barriers in a previous post. The following is a full compiler barrier in GCC. In Microsoft Visual C++, _ReadWriteBarrier serves the same purpose.

int A, B;

void foo()
{
    A = B + 1;
    asm volatile("" ::: "memory");
    B = 0;
}

With this change, we can leave optimizations enabled, and the memory store instructions will remain in the desired order.

$ gcc -O2 -S -masm=intel foo.c
$ cat foo.s
        ...
        mov     eax, DWORD PTR _B
        add     eax, 1
        mov     DWORD PTR _A, eax
        mov     DWORD PTR _B, 0
        ...

Similarly, if we want to guarantee our sendValue example works correctly, and we only care about single-processor systems, then at an absolute minimum, we must introduce compiler barriers here as well. Not only does the sending operation require a compiler barrier, to prevent the reordering of stores, but the receiving side needs one between the loads as well.

#define COMPILER_BARRIER() asm volatile("" ::: "memory")

int Value;
int IsPublished = 0;

void sendValue(int x)
{
    Value = x;
    COMPILER_BARRIER();          // prevent reordering of stores
    IsPublished = 1;
}

int tryRecvValue()
{
    if (IsPublished)
    {
        COMPILER_BARRIER();      // prevent reordering of loads
        return Value;
    }
    return -1;  // or some other value to mean not yet received
}

As I mentioned, compiler barriers are sufficient to prevent memory reordering on a single-processor system. But it’s 2012, and these days, multicore computing is the norm. If we want to ensure our interactions happen in the desired order in a multiprocessor environment, and on any CPU architecture, then a compiler barrier is not enough. We need either to issue a CPU fence instruction, or perform any operation which acts as a memory barrier at runtime. I’ll write more about those in the next post, Memory Barriers Are Like Source Control Operations.

The Linux kernel exposes several CPU fence instructions through preprocessor macros such as smp_rmb, and those macros are reduced to simple compiler barriers when compiling for a single-processor system.

Implied Compiler Barriers

There are other ways to prevent compiler reordering. Indeed, the CPU fence instructions I just mentioned act as compiler barriers, too. Here’s an example CPU fence instruction for PowerPC, defined as a macro in GCC:

#define RELEASE_FENCE() asm volatile("lwsync" ::: "memory")

Anywhere we place RELEASE_FENCE throughout our code, it will prevent certain kinds of processor reordering in addition to compiler reordering. For example, it can be used to make our sendValue function safe in a multiprocessor environment.

void sendValue(int x)
{
    Value = x;
    RELEASE_FENCE();
    IsPublished = 1;
}

In the new C++11 (formerly known as C++0x) atomic library standard, every non-relaxed atomic operation acts as a compiler barrier as well.

int Value;
std::atomic<int> IsPublished(0);

void sendValue(int x)
{
    Value = x;
    // <-- reordering is prevented here!
    IsPublished.store(1, std::memory_order_release);
}

And as you might expect, every function containing a compiler barrier must act as a compiler barrier itself, even when the function is inlined. (However, Microsoft’s documentation suggests that may not have been the case in earlier versions of the Visual C++ compiler. Tsk, tsk!)

void doSomeStuff(Foo* foo)
{
    foo->bar = 5;
    sendValue(123);       // prevents reordering of neighboring assignments
    foo->bar2 = foo->bar;
}

In fact, the majority of function calls act as compiler barriers, whether they contain their own compiler barrier or not. This excludes inline functions, functions declared with the pure attribute, and cases where link-time code generation is used. Other than those cases, a call to an external function is even stronger than a compiler barrier, since the compiler has no idea what the function’s side effects will be. It must forget any assumptions it made about memory that is potentially visible to that function.

When you think about it, this makes perfect sense. In the above code snippet, suppose our implementation of sendValue exists in an external library. How does the compiler know that sendValue doesn’t depend on the value of foo->bar? How does it know sendValue will not modify foo->bar in memory? It doesn’t. Therefore, to obey the cardinal rule of memory ordering, it must not reorder any memory operations around the external call to sendValue. Similarly, it must load a fresh value for foo->bar from memory after the call completes, rather than assuming it still equals 5, even with optimization enabled.

$ gcc -O2 -S -masm=intel dosomestuff.c
$ cat dosomestuff.s
        ...
        mov    ebx, DWORD PTR [esp+32]
        mov    DWORD PTR [ebx], 5            // Store 5 to foo->bar
        mov    DWORD PTR [esp], 123
        call    sendValue                     // Call sendValue
        mov    eax, DWORD PTR [ebx]          // Load fresh value from foo->bar
        mov    DWORD PTR [ebx+4], eax
        ...

As you can see, there are many instances where compiler instruction reordering is prohibited, and even cases where the compiler must reload certain values from memory. I believe these hidden rules form a big part of the reason why people have long been saying that volatile data types in C are not usually necessary in correctly-written multithreaded code.

Out-Of-Thin-Air Stores

Think instruction reordering makes lock-free programming tricky? Before C++11 was standardized, there was technically no rule preventing the compiler from getting up to even worse tricks. In particular, compilers were free to introduce stores to shared memory in cases where there previously was none. Here’s a very simplified example, inspired by the examples provided in multiple articles by Hans Boehm.

int A, B;

void foo()
{
    if (A)
        B++;
}

Though it’s rather unlikely in practice, nothing prevents a compiler from promoting B to a register before checking A, resulting in machine code equivalent to the following:

void foo()
{
    register int r = B;    // Promote B to a register before checking A.
    if (A)
        r++;
    B = r;          // Surprise! A new memory store where there previously was none.
}

Once again, the cardinal rule of memory ordering is still followed. A single-threaded application would be none the wiser. But in a multithreaded environment, we now have a function which can wipe out any changes made concurrently to B in other threads – even when A is 0. The original code didn’t do that. This type of obscure, technical non-impossibility is part of the reason why people have been saying that C++ doesn’t support threads, despite the fact that we’ve been happily writing multithreaded and lock-free code in C/C++ for decades.

I don’t know anyone who ever fell victim to such “out-of-thin-air” stores in practice. Maybe it’s just because for the type of lock-free code we tend to write, there aren’t a whole lot of optimization opportunities fitting this pattern. I suppose if I ever caught this type of compiler transformation happening, I would search for a way to wrestle the compiler into submission. If it’s happened to you, let me know in the comments.

In any case, the new C++11 standard explicitly prohibits such behavior from the compiler in cases where it would introduce a data race. The wording can be found in and around §1.10.22 of the most recent C++11 working draft:

Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard.

Why Compiler Reordering?

As I mentioned at the start, the compiler modifies the order of memory interactions for the same reason that the processor does it – performance optimization. Such optimizations are a direct consequence of modern CPU complexity.

I may be going out on a limb, but I somehow doubt that compilers did a whole lot of instruction reordering in the early 80’s, when CPUs had only a few hundred thousand transistors at most. I don’t think there would have been much point. But since then, Moore’s Law has provided CPU designers with about 10000 times the number of transistors to play with, and those transistors have been spent on tricks such as pipelining, memory prefetching, ILP and more recently, multicore. As a result of some of those features, we’ve seen architectures where the order of instructions in a program can make a significant difference in performance.

The first Intel Pentium released in 1993, with its so-called U and V-pipes, was the first processor where I really remember people talking about pipelining and the significance of instruction ordering. More recently, though, when I step through x86 disassembly in Visual Studio, I’m actually surprised how little instruction reordering there is. On the other hand, out of the times I’ve stepped through SPU disassembly on Playstation 3, I’ve found that the compiler really went to town. These are just anecdotal experiences; it may not reflect the experience of others, and certainly should not influence the way we enforce memory ordering in our lock-free code.

Comments (18)

I don’t know anyone who ever fell victim to such “out-of-thin-air” stores in practice.


See: http://www.airs.com/blog/archives/79

Before the C++11 / C11 memory model formalization work, the Linux kernel and gcc developers had arguments about these problems on a regular basis.
Nice example! Looks like the author chose to patch GCC and eliminate the optimization shortly after that post was written. At least in the case where the compiler is told to generate multithreaded code.


They even discuss the same example I gave in the mailing list discussion.
What about clang? What is the compiler intrinsic for it to prevent reordering?
Clang supports gcc's extended inline assembler syntax[1], so:

__asm__ volatile("" : : : "memory");
also works as a compiler-only memory barrier with clang.


But most of the time, what you really want are the intrinsics that correspond to the C11 memory model, see for example FreeBSD's implementation of C11's stdatomic.h[2]. The "order" parameter is explained in LLVM's Atomic Instructions and Concurrency Guide[3].


[1]: http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.ht...
[2]: http://svnweb.freebsd.org/base/head/include/stdat...
[3]: http://llvm.org/docs/Atomics.html
Man, reading your blog should be mandatory !
Theo:

No, it should be taught in primary school!
There's a 2010 report of an "out of thin air store" issue here:

http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html
We all kind of luck out because most of the time we develop on x86 CPUs and, for historical reasons, x86 CPUs pin their speculative fetches to caches lines, invalidating the speculative fetch if the cache line is invalidated. Things get really ugly really quickly if you try to develop on any other platform, such as Itanium, where this is not the case. And the longer we keep working on x86 CPUs, the longer we'll keep learning that things "just work" even when they're not guaranteed.
Kevin:

You mentioned in another posting that x86 is strongly ordered, where the order of writes is preserved in visibility to readers. Does that imply that if you're only targeting x86, a compiler barrier is all that's needed to ensure consistency?
Travis Downs:

No, it doesn't. The strength of ordering is a spectrum, and while x86 is towards the stronger end, it is definitely not "sequentially consistent" (fully ordered). Loads can pass earlier stores (the "store buffer" effect), and CPUs can see their own stores out of order with respect to the global order (the "store forwarding" effect). So you often need hardware memory barriers even on x86. You don't actually see the primary explicit hardware barrier, mfence, much in x86 code, however - because all lock-prefixed instructions for atomic operations such as exchange-add and compare-and-exchange also imply a full barrier, and in most cases where a barrier is needed you are also using one of those.
Jithu:

Great write-up. One thing that is not clear to me, though, is how you reach this conclusion midway:

"As I mentioned, compiler barriers are sufficient to prevent memory reordering on a single-processor system"


I get the following things:


1. CPU vendors (and compiler vendors) honor single-threaded program behavior.

2. Compiler barriers are necessary to ensure that this behavior is maintained in a multithreaded environment (on a single-processor system).

3. How do you conclude that compiler barriers (2) are sufficient for correctness in a single-processor multithreaded environment?


I would assume that CPU barriers are required too.
Joel Croteau:

On a single core system without any hardware parallelism, all multi-threading is implemented in software. As far as the CPU is concerned, everything is run in a single thread. Since the CPU is bound not to break single-threaded behavior, you can assume that any instruction re-ordering the CPU does will not affect your result, and so CPU barriers are unnecessary.
Dirk:

Thanks Joel. Is this true for hyperthreaded CPUs as well?
LearningLockFreee:

In the "Similarly, if we want to guarantee our sendMessage example works correctly" example above, the two stores cannot be re-ordered relative to one another because of the memory barrier. This is clear.


But there is only one load in the example -- "// prevent reordering of loads". I'm confused -- prevent reordering of the load relative to what?
LearningLockFree:

oh, nvm--I see now that IsPublished is being read too, as part of the conditional.
Poxma:

"The Linux kernel exposes several CPU fence instructions through preprocessor macros such as smp_rmb, and those macros are reduced to simple compiler barriers when compiling for a single-processor system."

When "compiling", is it possible for the compiler to know that the program will run in "a single-processor system"?
preshing:
Yes, if you make it a configuration option in your own project! Linux has its own configuration option for this: CONFIG_SMP. It's checked throughout the Linux source code to know whether it's compiling a kernel that will run on single processor system or multiple processor system.
