Home
This series of wiki pages is originally from http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm.
Copyright by Prof. Robert van de Geijn (rvdg@cs.utexas.edu).
Adapted to a Github Markdown Wiki by Jianyu Huang (jianyu@cs.utexas.edu).
- The GotoBLAS/BLIS Approach to Optimizing Matrix-Matrix Multiplication - Step-by-Step
- NOTICE ON ACADEMIC HONESTY
- References
- Set Up
- Step-by-step optimizations
  - Computing four elements of C at a time
  - Computing a 4 x 4 block of C at a time
- Acknowledgement
The GotoBLAS/BLIS Approach to Optimizing Matrix-Matrix Multiplication - Step-by-Step
This page leads one, step by step, through the optimization of matrix-matrix multiplication. For now, it is assumed that the reader has a Linux account on an Intel-processor-based computer. We will use the gcc compiler as part of the exercise.

Please send comments to rvdg@cs.utexas.edu.
NOTICE ON ACADEMIC HONESTY

If you use these materials for a class project, you MUST disclose where you found this information. Otherwise you risk being accused of academic dishonesty.
References

This work is based on two publications. You will want to read these when you are done with this exercise. If you use information on this page in other research, please reference these papers.
- Anatomy of high-performance matrix multiplication. Kazushige Goto, Robert A. van de Geijn. ACM Transactions on Mathematical Software (TOMS), 2008. (Available without charge at http://www.cs.utexas.edu/users/flame/FLAMEPublications.html)
- BLIS: A Framework for Rapidly Instantiating BLAS Functionality. Field G. Van Zee, Robert A. van de Geijn. ACM Transactions on Mathematical Software (TOMS), 2015. (Available without charge at http://www.cs.utexas.edu/users/flame/FLAMEPublications.html)
For more advanced exercises with recent architectures (Intel Sandy/Ivy Bridge, Haswell, etc.), you may want to try BLISlab.

- BLISlab: A Sandbox for Optimizing GEMM (Available at: https://github.com/flame/blislab)
Set Up

This wiki page assumes that you have access to an Intel-based processor, the gnu-cc compiler, and octave (an open-source alternative to MATLAB that is part of a typical Linux or Unix install).

To be able to follow along with the examples below, you will want to download some routines, as described on the Set Up page.
Make sure that the makefile starts with the following lines:

OLD := MMult0
NEW := MMult0
This indicates that the performance of the version of matrix-matrix multiplication in MMult0.c is measured (by virtue of the statement NEW := MMult0).
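The routine in MMult0.c is the simple triple loop. A minimal sketch of that starting point, using the column-major storage with leading dimensions that the exercise's routines use throughout (the function name here is illustrative, not the exact routine):

```c
/* Column-major indexing with a leading dimension, in the style of the
   exercise's A(i,j) / B(i,j) / C(i,j) macros. */
#define A(i, j) a[(j) * lda + (i)]
#define B(i, j) b[(j) * ldb + (i)]
#define C(i, j) c[(j) * ldc + (i)]

/* Naive matrix-matrix multiply: C := A B + C, where A is m x k,
   B is k x n, and C is m x n. */
void MMult_naive(int m, int n, int k,
                 double *a, int lda,
                 double *b, int ldb,
                 double *c, int ldc)
{
  for (int j = 0; j < n; j++)        /* loop over columns of C */
    for (int i = 0; i < m; i++)      /* loop over rows of C */
      for (int p = 0; p < k; p++)    /* inner product */
        C(i, j) += A(i, p) * B(p, j);
}
```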
Next, to make sure that the graphs are properly scaled when plotting, set certain parameters in the file proc_parameters.m. See the comments in that file. (Setting these parameters will ensure that when plotting, the y-axis ranges from 0 to the peak performance of the architecture.)

Picking the right clock speed is a bit tricky, given that modern architectures have something called 'turbo boost', which changes the clock speed. For example, the Intel i5 core in my laptop has a clock speed of 1.7 GHz, but a turbo boost rate of 2.6 GHz. I chose to indicate in proc_parameters.m that the processor has a clock speed of 2.6 GHz, since otherwise some of the results would show the implementation attaining greater than the peak speed of the processor...
Execute
- make run

  This will compile, link, and execute the test driver, linking to the implementation in MMult0.c. The performance data is saved in the file output_MMult0.m.

- more output_MMult0.m

  This will display the contents of the output file output_MMult0.m. It should look something like
version = 'MMult0';
MY_MMult = [
40 1.163636e+00 0.000000e+00
80 8.827586e-01 0.000000e+00
120 1.289071e+00 0.000000e+00
160 1.200469e+00 0.000000e+00
200 1.195100e+00 0.000000e+00
240 1.211038e+00 0.000000e+00
[ lines deleted ]
720 2.096185e-01 0.000000e+00
760 2.116985e-01 0.000000e+00
800 2.115609e-01 0.000000e+00
];
The first column equals the problem size. The second column reports the performance (in Gflops) when a matrix-matrix multiply with the indicated problem size m=n=k is executed. The last column reports the maximum absolute difference encountered between the implementations in REF_MMult.c and MMult0.c. It should be close to 0.000000e+00, although as different optimizations are added the difference may not be perfectly zero.
- octave

  This will start up octave. Then, in octave,

  octave:1> PlotAll % this will create the plot

I usually start up a separate xterm session in which I keep octave running, so that every time I want to make a new graph, I can just execute 'PlotAll' in that session.
The performance graph (on my 1.7GHz Intel Core i5 MacBook Air) looks something like

Notice that the two curves are right on top of each other, because data for the same implementation are being compared. Since the top of the graph represents the peak performance of the architecture, it is obvious that this simple implementation achieves only a fraction of the ideal performance.

A question, of course, is: is this the best we can do? We are going to walk through a sequence of optimizations, culminating in the performance marked "NEW" in the following graph:

Step-by-step optimizations

We will now lead the visitor through a series of optimizations. In some cases, a new implementation (optimization) is merely a small step in the right direction. We change the code a little at a time in order to make sure it remains correct.
Computing four elements of C at a time

- We first rewrite the basic implementation to hide the inner loop in a subroutine, AddDot:

This does not yield better performance, but it does set us up for the next step.
- We compute C four elements at a time in a subroutine, AddDot1x4, which performs four inner products at a time:
- Now we inline the four separate inner products and fuse the loops into one, thereby computing the four inner products simultaneously in one loop:

At this point, we are starting to see some performance improvements:
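The fused version might look like the following sketch (illustrative, assuming the exercise's column-major storage conventions): one pass over a row of A updates four elements of C at once.

```c
#define A(i, j) a[(j) * lda + (i)]
#define B(i, j) b[(j) * ldb + (i)]
#define C(i, j) c[(j) * ldc + (i)]

/* Compute C(0,0), C(0,1), C(0,2), C(0,3) in one fused loop: four inner
   products of row 0 of A with four consecutive columns of B. */
void AddDot1x4(int k, double *a, int lda,
               double *b, int ldb,
               double *c, int ldc)
{
  for (int p = 0; p < k; p++) {
    C(0, 0) += A(0, p) * B(p, 0);
    C(0, 1) += A(0, p) * B(p, 1);
    C(0, 2) += A(0, p) * B(p, 2);
    C(0, 3) += A(0, p) * B(p, 3);
  }
}
```

The benefit is that each element of A is now loaded once and reused four times, rather than reloaded for each inner product.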
- We accumulate the elements of C in registers and use a register for the elements of A.
- We use pointers to address elements in B.
- We unroll the loop by four (a relatively arbitrary choice of unrolling factor).
- We use indirect addressing to reduce the number of times the pointers need to be updated.
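Combined, those four optimizations give something like the following sketch (a hedged reconstruction; it assumes k is a multiple of 4, which holds for the problem sizes used in the exercise):

```c
#define A(i, j) a[(j) * lda + (i)]
#define B(i, j) b[(j) * ldb + (i)]
#define C(i, j) c[(j) * ldc + (i)]

void AddDot1x4_opt(int k, double *a, int lda,
                   double *b, int ldb,
                   double *c, int ldc)
{
  /* Accumulate the four elements of C in registers, and keep the current
     element of A in a register. */
  register double c00 = 0.0, c01 = 0.0, c02 = 0.0, c03 = 0.0;
  register double a0p;
  /* Use pointers to address the four columns of B. */
  double *bp0 = &B(0, 0), *bp1 = &B(0, 1), *bp2 = &B(0, 2), *bp3 = &B(0, 3);

  /* Unroll by four, using indirect addressing (bp0[p], bp0[p+1], ...) so
     the pointers themselves never need to be updated. */
  for (int p = 0; p < k; p += 4) {
    a0p = A(0, p);
    c00 += a0p * bp0[p];     c01 += a0p * bp1[p];
    c02 += a0p * bp2[p];     c03 += a0p * bp3[p];

    a0p = A(0, p + 1);
    c00 += a0p * bp0[p + 1]; c01 += a0p * bp1[p + 1];
    c02 += a0p * bp2[p + 1]; c03 += a0p * bp3[p + 1];

    a0p = A(0, p + 2);
    c00 += a0p * bp0[p + 2]; c01 += a0p * bp1[p + 2];
    c02 += a0p * bp2[p + 2]; c03 += a0p * bp3[p + 2];

    a0p = A(0, p + 3);
    c00 += a0p * bp0[p + 3]; c01 += a0p * bp1[p + 3];
    c02 += a0p * bp2[p + 3]; c03 += a0p * bp3[p + 3];
  }

  C(0, 0) += c00; C(0, 1) += c01; C(0, 2) += c02; C(0, 3) += c03;
}
```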
There is considerable improvement for problem sizes that fit (at least partially) in the L2 cache. Still, there is a lot of room for improvement.
Computing a 4 x 4 block of C at a time

We now compute a 4 x 4 block of C at a time in order to use vector instructions and vector registers effectively. The idea is as follows: there are special instructions as part of the SSE3 instruction set that allow one to perform two 'multiply accumulate' operations (two multiplies and two adds) per clock cycle, for a total of four floating point operations per clock cycle. To use these, one has to place data in 'vector registers'. There are sixteen of these, each of which can hold two double-precision numbers. So, we can keep 32 double-precision numbers in registers. We will use sixteen of those numbers to hold the elements of C, a 4 x 4 block.
- We compute a 4 x 4 block of C at a time in a subroutine, AddDot4x4, which performs sixteen inner products at a time:
- Now we inline the sixteen separate inner products and fuse the loops into one, thereby computing the sixteen inner products simultaneously in one loop:

At this point, we are again starting to see some performance improvements:
- We accumulate the elements of C in registers and use a register for the elements of A.
- We use pointers to address elements in B.

We now start optimizing differently than we did for the 1x4 case.
- We store a row of the k x 4 matrix B in registers.

Notice that we now use MANY more regular registers than are physically available...
- We rearrange the computation so that two rows of the 4 x 4 block of C are computed at a time.
- We use vector registers and vector operations.

We notice a considerable performance boost:

Still, there is a lot of room for improvement.
- In order to maintain the performance attained for smaller problem sizes, we block matrix C (and A and B correspondingly):

Now, performance is maintained:
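The blocking can be sketched as follows (block sizes mc and kc are illustrative; the inner kernel here is written naively only to keep the sketch self-contained, where the exercise builds it around AddDot4x4):

```c
#define A(i, j) a[(j) * lda + (i)]
#define B(i, j) b[(j) * ldb + (i)]
#define C(i, j) c[(j) * ldc + (i)]

#define mc 256
#define kc 128
#define min(x, y) ((x) < (y) ? (x) : (y))

/* Inner kernel: multiplies an ib x pb block of A by a pb x n panel of B. */
static void InnerKernel(int m, int n, int k, double *a, int lda,
                        double *b, int ldb, double *c, int ldc)
{
  for (int j = 0; j < n; j++)
    for (int i = 0; i < m; i++)
      for (int p = 0; p < k; p++)
        C(i, j) += A(i, p) * B(p, j);
}

/* Block the computation so that each mc x kc block of A (and the
   corresponding panel of B) stays in cache while it is reused. */
void MMult_blocked(int m, int n, int k, double *a, int lda,
                   double *b, int ldb, double *c, int ldc)
{
  for (int p = 0; p < k; p += kc) {
    int pb = min(kc, k - p);
    for (int i = 0; i < m; i += mc) {
      int ib = min(mc, m - i);
      InnerKernel(ib, n, pb, &A(i, p), lda, &B(p, 0), ldb, &C(i, 0), ldc);
    }
  }
}
```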
- First, we pack the block of A so that we march through it contiguously.

This yields a surprisingly large performance boost:
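Packing A can be sketched like this (a hedged reconstruction): each 4 x k slice of the block of A is copied into contiguous memory, column by column, so the micro-kernel can march through it with unit stride.

```c
#define A(i, j) a[(j) * lda + (i)]

/* Copy a 4 x k slice of A into contiguous memory, column by column. */
void PackMatrixA(int k, double *a, int lda, double *a_to)
{
  for (int j = 0; j < k; j++) {
    double *a_0j = &A(0, j);   /* top of column j of the slice */
    *a_to++ = a_0j[0];
    *a_to++ = a_0j[1];
    *a_to++ = a_0j[2];
    *a_to++ = a_0j[3];
  }
}
```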
- Finally, we pack the block of B so that we march through it contiguously.

We now attain 90% of the turbo boost peak of the processor!
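Packing B is analogous: the four columns of a k x 4 panel of B are interleaved row by row, matching the order in which the micro-kernel consumes them. A sketch (again a reconstruction, not the exercise's exact code):

```c
#define B(i, j) b[(j) * ldb + (i)]

/* Copy a k x 4 panel of B into contiguous memory, interleaving the four
   columns row by row. */
void PackMatrixB(int k, double *b, int ldb, double *b_to)
{
  double *b_i0 = &B(0, 0), *b_i1 = &B(0, 1),
         *b_i2 = &B(0, 2), *b_i3 = &B(0, 3);
  for (int i = 0; i < k; i++) {
    *b_to++ = *b_i0++;
    *b_to++ = *b_i1++;
    *b_to++ = *b_i2++;
    *b_to++ = *b_i3++;
  }
}
```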
Acknowledgement

This material was partially sponsored by grants from the National Science Foundation (Awards ACI-1148125/1340293).

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).