Approximate computing is a promising alternative for improving the energy efficiency of IoT devices on the edge. This work proposes a piecewise-linearly-approximated and unbiased floating-point approximate multiplier with run-time configurability. We provide a theoretically sound formulation that turns multiplication approximation into an optimization problem. With the formulation and findings, a multi-level architecture is proposed to easily incorporate run-time configurability and module execution parallelism. Finally, the proposed multiplier is further optimized to reduce the circuit implementation complexity, making the multiplier linearly dependent on the precision requirement, instead of quadratically or exponentially as in prior work. When compared to the prior state-of-the-art approximate floating-point multiplier, ApproxLP (M. Imani et al., "ApproxLP: Approximate multiplication with linearization and iterative error control," in Proc. ACM/IEEE Des. Autom. Conf., 2019, pp. 1-6), the proposed multiplier performs better in all aspects, including accuracy, area, and delay. By replacing a full-precision floating-point multiplier in a GPU, the proposed design can improve the energy efficiency of various edge computing tasks. Even with Level 1 approximation, the proposed multiplier improves energy efficiency by up to $20\times$ for machine learning on CIFAR-10, with almost negligible accuracy loss.
Index Terms: Approximate computing, multiplier, energy efficiency, floating-point
1 Introduction
Due to the rapid growth of Artificial Intelligence (AI) and the Internet-of-Things (IoT), energy efficiency has become a critical concern for IoT devices with constrained resources [2], [3], [4], [5]. There have been various research efforts on IoT energy efficiency optimization, ranging from algorithm and architecture to circuit [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]. Among such efforts, approximate computing has emerged as a promising alternative for designers to trade computational accuracy for energy efficiency. This is especially applicable to human sensory or machine learning tasks where a small amount of inaccuracy is tolerable [17], [18], [19], [20].
Chuangtao Chen is with the College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China. E-mail: chtchen@zju.edu.cn.
Weikang Qian is with the University of Michigan-Shanghai Jiao Tong University Joint Institute and MoE Key Laboratory of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai 200240, China. E-mail: qianwk@sjtu.edu.cn.
Mohsen Imani is with the Department of Computer Science and Engineering, University of California, Irvine, Irvine, CA 92697 USA. E-mail: moimani@ucsd.edu.
Xunzhao Yin and Cheng Zhuo are with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China. E-mail: {xzyin1, czhuo}@zju.edu.cn.
Manuscript received 26 April 2021; revised 12 November 2021; accepted 21 November 2021. Date of publication 1 December 2021; date of current version 8 September 2022.
This work was supported in part by the National Key R&D Program of China under Grant 2018YFE0126300 and in part by the NSFC under Grants 62034007 and 61974133.
(Corresponding author: Cheng Zhuo.)
Recommended for acceptance by F. Lamberti.
Digital Object Identifier no. 10.1109/TC.2021.3131850
At the edge, IoT devices are designed to consume the minimum resources needed to achieve the desired accuracy. However, conventional processors, such as CPUs or GPUs, can only conduct all computations with predetermined but sometimes unnecessary precision, inevitably degrading their energy efficiency. When running data-intensive applications, due to the large range of input operands, most conventional processors rely heavily on floating-point units (FPUs) [8], [21]. To cover the same dynamic range, a fixed-point unit has to consume up to $5\times$ larger area than its floating-point (FP) counterpart and hence is a far less common option [22]. Among the different FP operations, multiplication is widely used but possibly the most energy-consuming operation in various data-intensive scenarios, such as streaming, neural networks, and image processing. In other words, when running inaccuracy-tolerant applications on conventional processors, significant energy and time are spent on FP multipliers computing outputs that are more accurate than necessary. Thus, for FP multiplication in IoT devices, there is a need to optimize energy efficiency by providing sufficient, rather than excessive, computational precision.
The FP multiplier is a common arithmetic component that has been studied for decades [23], [24], with past work focusing mainly on area and performance. Recently, aware of the trade-off between stringent resource constraints and the accuracy tolerance of edge applications, researchers have shown growing interest in designing approximate FP multipliers to improve energy efficiency. For example, Camus et al. redesigned the arithmetic components to reduce circuit complexity [25], where the approximation error is controlled by construction. Prior work in [26], [27] used a hybrid method that employs both accurate and inaccurate multipliers for run-time configurable approximation. However, such FP multipliers can hardly guarantee an unbiased error distribution with near-zero average error, which risks accumulating error in applications that perform multiple multiplications in series.
To address these issues, several works proposed designing approximate multipliers at the algorithmic level, achieving configurability by combining different product sizes or truncating unwanted bits [28], [29]. Recently, ApproxLP was proposed to improve computational efficiency and configurability by directly approximating the product of two FP inputs with linear fitting [1]. However, because of their focus on the algorithmic level, these approaches suffer from rapidly increasing circuit complexity and degraded efficiency at higher precision requirements, eventually impairing energy efficiency and computation time. Moreover, many prior designs rely on hand-crafted structures or heuristics. How to achieve an optimal approximation with an unbiased error distribution remains an open question. Thus, it is highly desirable to develop a systematic methodology for designing an unbiased, configurable, and circuit-implementation-friendly FP multiplier with optimal piecewise-linear approximation.
This is not a trivial task, for the following reasons.
Unlike many approximations in prior work that stem from heuristic findings [1], [14], [26], [28], we need to formally define the problem, including the objective function and constraints, to provide a theoretically sound basis for optimal piecewise-linear approximation.
When ensuring configurability, the underlying architecture should facilitate the circuit implementation instead of introducing implementation-unfriendly logic or operations, thus preventing area complexity from growing exponentially with higher precision requirements.
Ensuring both unbiasedness and run-time configurability for the optimally approximated FP multiplier is not straightforward. It is hard to achieve all of these features in one design.
In this paper, to address the aforementioned challenges, we propose a Piecewise-Linearly-Approximated FP Multiplier, PAM, which is run-time configurable and unbiased in its error distribution. The major contributions of our work are as follows.
A theoretically sound optimization formulation is proposed to minimize the approximation error of the approximate multiplier; it acts as the basis for the multiplier architecture design. With the proposed formulation, the error can be symmetrically distributed, yielding an unbiased error distribution.
Based on the optimization formulation and findings, we propose a multi-level FP multiplier architecture that can easily incorporate run-time configurability. The accuracy is configured by adding up different levels of error compensation, while each level is designed with circuit-implementation-friendly operations, such as addition and inversion. Moreover, the modules at different levels are independent and hence support parallel execution to achieve higher performance.
Fig. 1. An example of a 32-bit FP number and FP multiplication in a general-purpose processor according to IEEE 754.
A common issue of prior approximate FP multiplier designs is the rapidly growing area complexity as the precision requirement increases. With the proposed architecture, we further optimize the circuit implementation to reduce the complexity from $O(4^{n})$ to $O(n)$, where $n$ is the number of approximation levels, while ensuring the same accuracy.
Our experimental results show that, with the proposed formulation to determine the optimal approximation, we can implement an energy-efficient, run-time configurable approximate FP multiplier. The proposed multiplier, PAM, is found to be superior to many prior works [1], [14], [26], [27], [28]. When compared with a state-of-the-art (SOTA) multiplier, ApproxLP [1], as well as other representative approximate multipliers [28], [30], PAM achieves accuracy improvements of up to 37% in terms of mean square error (MSE) with a 5.5%-22.2% smaller area-delay product for various benchmarks. Moreover, PAM can be easily configured to support higher precision requirements, reducing area cost by 10.4% and delay by 31.5% while achieving accuracy comparable to that of a 32-bit full-precision FP multiplier. Finally, when replacing a full-precision FP multiplier with PAM in a GPU and evaluating on various machine learning tasks, we achieve $3.19\times$ to $7.12\times$ energy improvement and $7.33\times$ to $20.46\times$ energy efficiency improvement, while the accuracy loss is almost negligible.
2 Background
2.1 Floating-Point Multiplication
Compared to integer or fixed-point computing, FP arithmetic is usually more energy-consuming due to its capability to represent numbers in a wider range. According to the IEEE 754 standard [31], the technical standard for FP arithmetic, an FP number consists of a sign, an exponent, and a mantissa, as shown in Fig. 1. The mantissa of a normalized FP number is a fraction with its value between 1 and 2. In a general-purpose processor, the multiplication of the 32-bit FP numbers in Fig. 1 follows different rules for the different parts of an FP number. The sign bits are XORed together and the exponents are summed by an adder. Then, a bias of $2^{\text{exponent\_width}-1}-1$ (127 for the 8-bit exponent of a 32-bit FP number) is subtracted from the sum to allow both negative and positive values for the exponent. Finally, the two mantissas are multiplied and the product is shifted back into the range between 1 and 2 to produce the normalized representation. The exponent is adjusted if a shift happens. For an FP multiplication, the mantissa part consumes much more energy and delay than the other two parts. Thus, similar to prior approximate FP multiplier studies [1], [14], [15], [26], in this work we focus only on approximating the mantissa multiplication of the normalized FP number representation.$^{1}$
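The field-by-field procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the circuit of Fig. 1: it assumes normalized 32-bit inputs, uses the IEEE 754 single-precision bias of 127, truncates the mantissa product instead of rounding, and ignores overflow, underflow, and other special cases. The names `fields` and `fp32_multiply` are hypothetical helpers introduced only for this example.

```python
import struct

EXP_WIDTH = 8                       # exponent bits in a 32-bit FP number
MAN_WIDTH = 23                      # stored fraction bits
BIAS = (1 << (EXP_WIDTH - 1)) - 1   # 127 for single precision


def fields(value):
    """Split a Python float (stored as FP32) into sign, exponent, fraction."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    exponent = (bits >> MAN_WIDTH) & ((1 << EXP_WIDTH) - 1)
    fraction = bits & ((1 << MAN_WIDTH) - 1)
    return sign, exponent, fraction


def fp32_multiply(a, b):
    """Multiply two normalized FP32 values part by part (truncating sketch)."""
    sa, ea, fa = fields(a)
    sb, eb, fb = fields(b)

    sign = sa ^ sb                   # sign bits are XORed
    exponent = ea + eb - BIAS        # exponents are added, bias removed once

    # Restore the hidden leading 1: each mantissa is in [1, 2), scaled by 2^23.
    ma = (1 << MAN_WIDTH) | fa
    mb = (1 << MAN_WIDTH) | fb
    product = ma * mb                # in [1, 4), scaled by 2^46

    # If the product reached [2, 4), shift it back to [1, 2) and bump the exponent.
    if product >> (2 * MAN_WIDTH + 1):
        product >>= 1
        exponent += 1

    # Keep the top 23 fraction bits (truncation; real hardware rounds).
    fraction = (product >> MAN_WIDTH) & ((1 << MAN_WIDTH) - 1)
    bits = (sign << 31) | (exponent << MAN_WIDTH) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]


if __name__ == "__main__":
    print(fp32_multiply(1.5, 2.5))
    print(fp32_multiply(-3.0, 3.0))
```

The only wide operation in this sketch is the 24-bit by 24-bit integer product `ma * mb`, which is consistent with the observation above that the mantissa part dominates the cost of an FP multiplication.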
2.2 Approximate Multiplier
Approximate arithmetic has been a popular research area in the past decade. Most prior work on approximate multipliers tackles the problem at either the gate level or the algorithmic level to reduce the product bit-width or the critical path delay. For example, some works used approximate components, such as adders and compressors, to build the multiplier and thereby speed up addition or partial product generation [14], [25], [29], [32], [33], [34], [35]. Kulkarni et al. proposed constructing an approximate multiplier from a modified $2 \times 2$ multiply block [14], which acts as both a partial product generator and an approximate compressor. However, such a design fails to produce an unbiased error distribution, as the approximation is always smaller than the accurate value. Zervakis et al. further proposed a general framework to synthesize such approximate components through heuristic optimization and netlist approximation [36]. At a higher design level, logarithm-based approximation was proposed, using the classical Mitchell multiplier with an iterative procedure or compensation module to improve the accuracy [30], [37], [38]. To satisfy the various accuracy requirements of different scenarios, another alternative is to include both approximate and accurate multipliers and adjust the computational accuracy by selecting the appropriate multiplier or appropriately truncating the inputs, thereby trading off accuracy against cost [26], [27], [39]. However, these methods rely heavily on the accurate multiplier, which significantly increases the circuit area. Moreover, it is difficult to predict whether approximate or accurate computation should be used, so the maximum working frequency is limited by the slower accurate multiplier.
In addition, several issues arise when directly applying the prior works to IoT devices at the edge. While these methods can precisely control the error, many of them can hardly guarantee unbiased output with a zero-mean error distribution, which is essential in applications relying on multiple multiplications in series. On the other hand, configurability is in high demand for versatile edge scenarios. The limitation of many prior approaches is either the lack of configurability or the notably high cost of implementing such configurability at higher precision requirements.
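To make the bias concern concrete, the sketch below evaluates the textbook form of the logarithm-based Mitchell approximation on mantissas in $[1, 2)$. It is an illustration under stated assumptions, not a reproduction of the designs in [30], [37], [38]: the function `mitchell_mantissa_product`, the sampling grid, and the error helper are hypothetical names chosen for this example, and no iterative or compensation stage is modeled.

```python
def mitchell_mantissa_product(x, y):
    """Textbook Mitchell approximation of x*y for mantissas x, y in [1, 2).

    With x = 1 + a and y = 1 + b, log2(x) is approximated by a and log2(y)
    by b, so the product becomes 1 + a + b when a + b < 1 and 2 * (a + b)
    otherwise.
    """
    a, b = x - 1.0, y - 1.0
    s = a + b
    return 1.0 + s if s < 1.0 else 2.0 * s


def signed_errors(approx, steps=64):
    """Signed error approx(x, y) - x*y on a uniform grid of mantissa pairs."""
    grid = [1.0 + (i + 0.5) / steps for i in range(steps)]
    return [approx(x, y) - x * y for x in grid for y in grid]


errors = signed_errors(mitchell_mantissa_product)
mean_error = sum(errors) / len(errors)
worst_error = min(errors)

if __name__ == "__main__":
    print("mean signed error :", mean_error)
    print("worst signed error:", worst_error)
    print("all errors negative:", all(e < 0 for e in errors))
```

Because the uncompensated approximation drops the term $ab$ when $a+b<1$ and the term $(1-a)(1-b)$ otherwise, every sampled error has the same sign. A mean error that is not zero is what allows errors to accumulate across multiplications in series.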
Recently, ApproxLP was proposed to approximate the mantissa product using linear fitting [1]. The design shows much higher performance at a given error rate than prior approximate multiplier solutions and is hence considered a state-of-the-art (SOTA) FP multiplier with significant advantages over prior approximation methods. Fig. 2 depicts the basic idea of ApproxLP. As shown in the figure, the ranges of the two mantissas are first partitioned into multiple sub-domains, with a linear function introduced to fit each sub-domain. The partitioning can continue to deeper levels to improve the overall accuracy at the cost of area and delay. The sum of the outputs at each level gradually approaches the exact product. However, while ApproxLP improves efficiency compared with prior work, it still does not fully address the aforementioned challenges of large implementation cost and biased output error. Furthermore, the branch implementation in the tree structure of Fig. 2, which decides $x \geq y$, contributes significant latency overhead. Since the number of branches for sub-domain selection grows exponentially with the level, the hardware cost of branch decision and constant storage also increases rapidly. Finally, as shown in Fig. 2, the fitting functions of ApproxLP are heuristically designed, which raises the natural question of whether a better approximation can be achieved through a more theoretically sound formulation. We are thus motivated to overcome these issues in the prior work and design an optimally approximated and unbiased FP multiplier with low hardware cost and run-time configurability.
Fig. 2. Flow of the approximate FP multiplier ApproxLP in [1].
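The cost of sub-domain partitioning can be illustrated with a generic piecewise-linear fit. The sketch below is an assumption-laden example and does not reproduce the partitioning or the fitting functions of ApproxLP in Fig. 2: it splits each mantissa range into $2^{\text{level}}$ equal segments and, in each rectangular sub-domain with center $(c_x, c_y)$, uses the plane $c_y x + c_x y - c_x c_y$, whose residual against $xy$ is $-(x-c_x)(y-c_y)$. The function names are hypothetical.

```python
def piecewise_linear_product(x, y, level):
    """Approximate x*y for x, y in [1, 2) with one plane per sub-domain.

    Each axis is cut into 2**level equal segments, so there are 4**level
    rectangular sub-domains, each needing its own stored constants.
    """
    segments = 2 ** level
    width = 1.0 / segments
    # Locate the segment that contains each mantissa.
    ix = min(int((x - 1.0) / width), segments - 1)
    iy = min(int((y - 1.0) / width), segments - 1)
    # Centers of the selected sub-domain.
    cx = 1.0 + (ix + 0.5) * width
    cy = 1.0 + (iy + 0.5) * width
    # Plane through the center; residual is -(x - cx) * (y - cy).
    return cy * x + cx * y - cx * cy


def error_stats(level, samples=64):
    """Return (number of sub-domains, mean signed error, MSE) on a grid."""
    grid = [1.0 + (i + 0.5) / samples for i in range(samples)]
    errors = [piecewise_linear_product(x, y, level) - x * y
              for x in grid for y in grid]
    mean = sum(errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return 4 ** level, mean, mse


if __name__ == "__main__":
    for level in range(4):
        print(error_stats(level))
```

In this uniform-grid sketch the mean signed error stays at zero and the MSE shrinks with every level, but the number of sub-domains, and therefore the number of stored constants and selection branches, grows as $4^{\text{level}}$. This is the kind of exponential growth in hardware cost that the discussion above refers to.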
3 Approximate Multiplier Design
This section discusses the theoretical basis and the multi-level architecture of the proposed approximate FP multiplier. For clarity, Table 1 summarizes the symbols used in this section and their definitions.
3.1 Problem Formulation
As discussed in Section 2, the key operation of an FP multiplication is the mantissa part. Since we focus on normalized FP numbers, we can define the multiplication as a function
$$f(x, y) = xy$$
where $x$ and $y$ are two mantissas within the range $[1, 2)$. Our idea is to project the complex multiplication function onto another space $V$ with lower dimension. Specifically, we select a group of bases $1, x, y, x^{2}, y^{2}$ to decouple the two inputs. We then can define
$^{1}$ Special cases, such as overflow or underflow, are processed after the mantissa multiplication [31] and hence are not the focus of our work.