Compute Shader
Core in version | 4.6
---|---
Core since version | 4.3
Core ARB extension | ARB_compute_shader
A Compute Shader is a Shader Stage that is used entirely for computing arbitrary information. While it can do rendering, it is generally used for tasks not directly related to drawing triangles and pixels.
Execution model
Compute shaders operate differently from other shader stages. All of the other shader stages have a well-defined set of input values, some built-in and some user-defined. The frequency at which a shader stage executes is specified by the nature of that stage; vertex shaders execute once per input vertex, for example (though some executions can be skipped via caching). Fragment shader execution is defined by the fragments generated from the rasterization process.
Compute shaders work very differently. The "space" that a compute shader operates on is largely abstract; it is up to each compute shader to decide what the space means. The number of compute shader executions is defined by the function used to execute the compute operation. Most important of all, compute shaders have no user-defined inputs and no outputs at all. The built-in inputs only define where in the "space" of execution a particular compute shader invocation is.
Therefore, if a compute shader wants to take some values as input, it is up to the shader itself to fetch that data, via texture access, arbitrary image load, shader storage blocks, or other forms of interface. Similarly, if a compute shader is to actually compute anything, it must explicitly write to an image or shader storage block.
Compute space
The space that compute shaders operate within is abstract. There is the concept of a work group; this is the smallest amount of compute operations that the user can execute. Or to put it another way, the user can execute some number of work groups.
The number of work groups that a compute operation is executed with is defined by the user when they invoke the compute operation. The space of these groups is three dimensional, so it has a number of "X", "Y", and "Z" groups. Any of these can be 1, so you can perform a two-dimensional or one-dimensional compute operation instead of a 3D one. This is useful for processing image data or linear arrays such as the particles of a particle system.
When the system actually computes the work groups, it can do so in any order. So if it is given a work group set of (3, 1, 2), it could execute group (0, 0, 0) first, then skip to group (1, 0, 1), then jump to (2, 0, 0), etc. So your compute shader should not rely on the order in which individual groups are processed.
Do not think that a single work group is the same thing as a single compute shader invocation; there's a reason why it is called a "group". Within a single work group, there may be many compute shader invocations. How many is defined by the compute shader itself, not by the call that executes it. This is known as the local size of the work group.
Every compute shader has a three-dimensional local size (again, sizes can be 1 to allow 2D or 1D local processing). This defines the number of invocations of the shader that will take place within each work group.
Therefore, if the local size of a compute shader is (128, 1, 1), and you execute it with a work group count of (16, 8, 64), then you will get 1,048,576 separate shader invocations. Each invocation will have a set of inputs that uniquely identifies that specific invocation.
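The invocation count quoted above is just the product of the local size and the work group count. A minimal host-side sketch of that arithmetic in C follows; the function name `total_invocations` is hypothetical and is not part of OpenGL.

```c
#include <stdint.h>

/* Hypothetical helper (not an OpenGL function): total number of shader
 * invocations produced by one dispatch. 64-bit math avoids overflow for
 * large dispatches. */
static uint64_t total_invocations(uint32_t local_x, uint32_t local_y, uint32_t local_z,
                                  uint32_t groups_x, uint32_t groups_y, uint32_t groups_z) {
    /* Invocations inside a single work group (the local size). */
    uint64_t per_group = (uint64_t)local_x * local_y * local_z;
    /* Number of work groups in the whole dispatch. */
    uint64_t group_total = (uint64_t)groups_x * groups_y * groups_z;
    return per_group * group_total;
}
```

With a local size of (128, 1, 1) and a work group count of (16, 8, 64), `total_invocations(128, 1, 1, 16, 8, 64)` evaluates to 1,048,576.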
This distinction is useful for doing various forms of image compression or decompression; the local size would be the size of a block of image data (8x8, for example), while the group count will be the image size divided by the block size. Each block is processed as a single work group.
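As a rough illustration of "image size divided by the block size", the C sketch below computes a per-axis group count. The helper name `group_count_for` is hypothetical, and rounding up for images that are not a multiple of the block size is an assumption of this sketch; the text above only covers the evenly divisible case.

```c
#include <stdint.h>

/* Hypothetical helper (not an OpenGL function): number of work groups
 * needed along one axis when each work group covers `block_size` pixels.
 * Rounds up so a partial trailing block still gets a work group; this
 * rounding is an assumption of the sketch. block_size must be nonzero. */
static uint32_t group_count_for(uint32_t image_size, uint32_t block_size) {
    uint32_t full_blocks = image_size / block_size;
    uint32_t has_partial = (image_size % block_size != 0) ? 1u : 0u;
    return full_blocks + has_partial;
}
```

For a 1024-pixel axis and 8x8 blocks this gives 128 groups. If the count is rounded up for a partial block, the shader itself would have to skip invocations that fall outside the image.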
The individual invocations within a work group will be executed "in parallel". The main purpose of the distinction between work group count and local size is that the different compute shader invocations within a work group can communicate through a set of shared variables and special functions. Invocations in different work groups (within the same compute shader dispatch) cannot effectively communicate, at least not without potentially deadlocking the system.
Dispatch
Compute shaders are not part of the regular rendering pipeline. So when executing a Drawing Command, the compute shader linked into the current program or pipeline is not involved.
There are two functions to initiate compute operations. They will use whichever compute shader is currently active (via glBindProgramPipeline or glUseProgram, following the usual rules for determining the active program for a stage). Though they are not Drawing Commands, they are Rendering Commands, so they can be conditionally executed.
void glDispatchCompute(GLuint num_groups_x, GLuint num_groups_y, GLuint num_groups_z);
The num_groups_* parameters define the work group count, in three dimensions. These numbers cannot be zero. There are limitations on the number of work groups that can be dispatched.
It is possible to execute dispatch operations where the work group counts come from information stored in a Buffer Object. This is similar to indirect drawing for vertex data:
void glDispatchComputeIndirect(GLintptr indirect);
The indirect parameter is the byte-offset into the buffer currently bound to the GL_DISPATCH_INDIRECT_BUFFER target. Note that the same limitations on the work group counts still apply; however, indirect dispatch bypasses OpenGL's usual error checking. As such, attempting to dispatch with out-of-bounds work group counts can cause a crash or even a GPU hard-lock, so be careful when generating this data.
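Because indirect dispatch skips the usual error checking, it can help to validate the counts on the CPU whenever the buffer contents are generated there. The C sketch below assumes the indirect data is three tightly packed unsigned 32-bit counts in X, Y, Z order; the function `dispatch_command_in_range` and the `max_count` array are hypothetical, with `max_count[i]` standing in for the value of GL_MAX_COMPUTE_WORK_GROUP_COUNT queried for index i.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of the data read from the indirect buffer: three
 * unsigned 32-bit work group counts, in X, Y, Z order. */
typedef struct {
    uint32_t num_groups_x;
    uint32_t num_groups_y;
    uint32_t num_groups_z;
} DispatchIndirectCommand;

/* Hypothetical check (not an OpenGL function): returns true when every
 * count is within the per-axis limit. max_count[i] is assumed to hold the
 * queried GL_MAX_COMPUTE_WORK_GROUP_COUNT value for axis i. */
static bool dispatch_command_in_range(const DispatchIndirectCommand *cmd,
                                      const uint32_t max_count[3]) {
    return cmd->num_groups_x <= max_count[0]
        && cmd->num_groups_y <= max_count[1]
        && cmd->num_groups_z <= max_count[2];
}
```

A check like this only helps when the counts are written by the CPU; counts generated on the GPU would need equivalent clamping in the shader that writes them.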
Inputs
Compute shaders cannot have any user-defined input variables. If you wish to provide input to a CS, you must use the implementation-defined inputs coupled with resources like storage buffers or Textures. You can use the shader's invocation and work group indices to decide which data to fetch and process.
Compute Shaders have the following built-in input variables.
in uvec3 gl_NumWorkGroups;
in uvec3 gl_WorkGroupID;
in uvec3 gl_LocalInvocationID;
in uvec3 gl_GlobalInvocationID;
in uint gl_LocalInvocationIndex;
- gl_NumWorkGroups
- This variable contains the number of work groups passed to the dispatch function.
- gl_WorkGroupID
- This is the current work group for this shader invocation. Each of the XYZ components will be in the half-open range [0, gl_NumWorkGroups.XYZ).
- gl_LocalInvocationID
- This is the current invocation of the shader within the work group. Each of the XYZ components will be in the half-open range [0, gl_WorkGroupSize.XYZ).
- gl_GlobalInvocationID
- This value uniquely identifies this particular invocation of the compute shader among all invocations of this compute dispatch call. It is shorthand for this computation:
gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID;
- gl_LocalInvocationIndex
- This is a 1D version of gl_LocalInvocationID. It identifies this invocation's index within the work group. It is shorthand for this computation:
gl_LocalInvocationIndex =
    gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y +
    gl_LocalInvocationID.y * gl_WorkGroupSize.x +
    gl_LocalInvocationID.x;
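The two shorthand formulas can be checked on the CPU. The C sketch below mirrors them with a hypothetical `UVec3` struct standing in for GLSL's uvec3; the function names are illustrative and are not part of GLSL or OpenGL.

```c
#include <stdint.h>

/* Hypothetical stand-in for GLSL's uvec3. */
typedef struct { uint32_t x, y, z; } UVec3;

/* Mirrors: gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
 * (component-wise). */
static UVec3 global_invocation_id(UVec3 work_group_id, UVec3 work_group_size,
                                  UVec3 local_id) {
    UVec3 g = {
        work_group_id.x * work_group_size.x + local_id.x,
        work_group_id.y * work_group_size.y + local_id.y,
        work_group_id.z * work_group_size.z + local_id.z
    };
    return g;
}

/* Mirrors the gl_LocalInvocationIndex formula: flattens the 3D local ID
 * into a 1D index, with X varying fastest. */
static uint32_t local_invocation_index(UVec3 local_id, UVec3 work_group_size) {
    return local_id.z * work_group_size.x * work_group_size.y
         + local_id.y * work_group_size.x
         + local_id.x;
}
```

For a local size of (8, 8, 1), work group (2, 3, 0) and local ID (5, 7, 0), the global ID is (21, 31, 0) and the local index is 61.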
Local size
The local size of a compute shader is defined within the shader, using a special layout input declaration:
layout(local_size_x = X, local_size_y = Y, local_size_z = Z) in;
By default, the local sizes are 1, so if you only want a 1D or 2D work group space, you can specify just the X or the X and Y components. They must be integral constant expressions of value greater than 0. Their values must abide by the limitations imposed below; if they do not, a compiler or linker error occurs.
The local size is available to the shader as a compile-time constant variable, so you don't need to define it yourself:
const uvec3 gl_WorkGroupSize;
Outputs
Compute shaders do not have output variables. If you wish to have a CS generate some output, you must use a resource to do so. Shader storage buffers and Image Load Store operations are useful ways to output data from a CS.
Shared variables
Global variables in compute shaders can be declared with the shared storage qualifier. The values of such variables are shared between all invocations within a work group. You cannot declare any opaque types as shared, but aggregates (arrays and structs) are fine.
At the beginning of a work group, these values are uninitialized. Also, the variable declaration cannot have initializers, so this is illegal:
shared uint foo = 0; // No initializers for shared variables.
If you want to initialize a shared variable to a particular value, then one of the invocations must explicitly set the variable to that value. Only one invocation should do so, for the reasons given below.
Shared memory coherency
Main article: Memory Model#Incoherent memory access
Shared variable access uses the rules for incoherent memory access. This means that the user must perform certain synchronization in order to ensure that shared variables are visible.
Shared variables are all implicitly declared coherent, so you don't need (and can't use) that qualifier. However, you still need to provide an appropriate memory barrier.
The usual set of memory barriers is available to compute shaders, but they also have access to memoryBarrierShared(); this barrier is specifically for shared variable ordering. groupMemoryBarrier() acts like memoryBarrier(), ordering memory writes for all kinds of variables, but it only orders read/writes for the current work group.
While all invocations within a work group are said to execute "in parallel", that doesn't mean that you can assume that all of them are executing in lock-step. If you need to ensure that an invocation has written to some variable so that you can read it, you need to synchronize execution with the invocations, not just issue a memory barrier (you still need the memory barrier though).
To synchronize reads and writes between invocations within a work group, you must employ the barrier() function. This forces an explicit synchronization between all invocations in the work group. Execution within the work group will not proceed until all other invocations have reached this barrier. Once past the barrier(), all shared variables previously written across all invocations in the group will be visible.
There are limitations on how you can call barrier(). However, compute shaders are not as limited as Tessellation Control Shaders in their use of this function. barrier() can be called from flow-control, but it can only be called from uniform flow control. All expressions that lead to the evaluation of a barrier() must be dynamically uniform.
In short, no matter how different the data they fetch, all invocations of the same compute shader must hit the exact same set of barrier() calls in the exact same order. Otherwise, serious errors may occur.
Atomic operations
Main article: Shader Storage Buffer Object#Atomic operations
A number of atomic operations can be performed on shared variables of integral type (and vectors/arrays/structs of them). These functions are shared with Shader Storage Buffer Object atomics.
All of the atomic functions return the original value. The term "nint" can be int or uint.
nint atomicAdd(inout nint mem, nint data)
Adds data to mem.
nint atomicMin(inout nint mem, nint data)
Sets mem to the minimum of mem and data, so mem's value ends up no greater than data.
nint atomicMax(inout nint mem, nint data)
Sets mem to the maximum of mem and data, so mem's value ends up no lower than data.
nint atomicAnd(inout nint mem, nint data)
mem becomes the bitwise-and between mem and data.
nint atomicOr(inout nint mem, nint data)
mem becomes the bitwise-or between mem and data.
nint atomicXor(inout nint mem, nint data)
mem becomes the bitwise-xor between mem and data.
nint atomicExchange(inout nint mem, nint data)
Sets mem's value to data.
nint atomicCompSwap(inout nint mem, nint compare, nint data)
If the current value of mem is equal to compare, then mem is set to data. Otherwise it is left unchanged.
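The "returns the original value" convention can be modelled on the CPU with C11 atomics. The sketch below is only an analogy for three of the functions and is not how a GLSL implementation works; the `model_*` names are hypothetical, and only the `uint` variant is shown.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Models atomicAdd: adds data to *mem and returns the value *mem held
 * before the addition. */
static uint32_t model_atomic_add(_Atomic uint32_t *mem, uint32_t data) {
    return atomic_fetch_add(mem, data);
}

/* Models atomicMin: *mem becomes min(*mem, data); returns the original
 * value. C11 has no fetch-min, so this sketch uses a compare-exchange loop. */
static uint32_t model_atomic_min(_Atomic uint32_t *mem, uint32_t data) {
    uint32_t old = atomic_load(mem);
    /* On failure, atomic_compare_exchange_weak reloads `old` for the retry. */
    while (data < old && !atomic_compare_exchange_weak(mem, &old, data)) {
    }
    return old;
}

/* Models atomicCompSwap: stores data only if *mem equals compare; returns
 * the original value in either case. */
static uint32_t model_atomic_comp_swap(_Atomic uint32_t *mem, uint32_t compare,
                                       uint32_t data) {
    uint32_t expected = compare;
    /* On failure, `expected` is overwritten with the current value of *mem. */
    atomic_compare_exchange_strong(mem, &expected, data);
    return expected;
}
```

Because the original value is always returned, a caller of the compare-and-swap can tell whether the swap happened by comparing the result with `compare`.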
Limitations
The number of work groups that can be dispatched in a single dispatch call is defined by GL_MAX_COMPUTE_WORK_GROUP_COUNT. This must be queried with glGetIntegeri_v, with the index being in the closed range [0, 2], representing the X, Y and Z components of the maximum work group count. Attempting to call glDispatchCompute with values that exceed this range is an error. Attempting to call glDispatchComputeIndirect with such values is much worse; it may result in program termination or other serious failures, such as a GPU hard-lock.
Note that the required minimum for these values is 65535 in all three axes. So you've probably got a lot of room to work with.
There are limits on the local size as well; indeed, there are two sets of limitations. There is a general limitation on the local size dimensions, queried with GL_MAX_COMPUTE_WORK_GROUP_SIZE in the same way as above. Note that the minimum requirements here are much smaller: 1024 for X and Y, and a mere 64 for Z.
There is another limitation: the total number of invocations within a work group. That is, the product of the X, Y and Z components of the local size must not exceed GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS. The minimum value here is 1024.
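The two sets of local size limits can be combined into one check. The C sketch below is a hypothetical host-side helper: `max_size[i]` stands in for GL_MAX_COMPUTE_WORK_GROUP_SIZE queried for index i, and `max_invocations` for GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS. It treats a product equal to the limit as valid, which is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not an OpenGL function): checks a local size
 * against the per-dimension limits and the total-invocation limit.
 * Assumes the limit values were queried from the implementation. */
static bool local_size_is_valid(const uint32_t local_size[3],
                                const uint32_t max_size[3],
                                uint32_t max_invocations) {
    uint64_t product = 1;
    for (int i = 0; i < 3; ++i) {
        /* Each dimension must be at least 1 and within its own limit. */
        if (local_size[i] == 0 || local_size[i] > max_size[i]) {
            return false;
        }
        /* Checking after every multiply keeps the 64-bit product from
         * overflowing. */
        product *= local_size[i];
        if (product > max_invocations) {
            return false;
        }
    }
    return true;
}
```

With the required minimums (1024, 1024, 64) and 1024 invocations, a local size of (32, 32, 1) passes, while (64, 64, 1) fails on the total and (1, 1, 128) fails on the Z limit.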
There is also a limit on the total storage size for all shared variables in a compute shader. This is GL_MAX_COMPUTE_SHARED_MEMORY_SIZE, which is in bytes. The OpenGL-required minimum is 32KB. OpenGL does not specify the exact mapping between GL types and shared variable storage, though you could use the std140 layout rules and UBO/SSBO sizes as a general guideline.