UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
Xiaohan Ding$^{1*}$  Yiyuan Zhang$^{2*}$  Yixiao Ge$^{1}$  Sijie Zhao$^{1}$  Lin Song$^{1}$  Xiangyu Yue$^{2}$  Ying Shan$^{1}$
$^{1}$ Tencent AI Lab  $^{2}$ The Chinese University of Hong Kong
https://github.com/AILab-CVC/UniRepLKNet
Abstract
Large-kernel convolutional neural networks (ConvNets) have recently received extensive research attention, but two unresolved and critical issues demand further investigation. 1) The architectures of existing large-kernel ConvNets largely follow the design principles of conventional ConvNets or transformers, while the architectural design for large-kernel ConvNets remains under-addressed. 2) As transformers have dominated multiple modalities, it remains to be investigated whether ConvNets also have a strong universal perception ability in domains beyond vision. In this paper, we contribute from two aspects. 1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep. Following such guidelines, our proposed large-kernel ConvNet shows leading performance in image recognition (ImageNet accuracy of $88.0\%$, ADE20K mIoU of $55.6\%$, and COCO box AP of $56.4\%$), demonstrating better performance and higher speed than the recent powerful competitors. 2) We discover large kernels are the key to unlocking the exceptional performance of ConvNets in domains where they were originally not proficient. With certain modality-related preprocessing approaches, the proposed model achieves state-of-the-art performance on time-series forecasting and audio recognition tasks even without modality-specific customization to the architecture. All the code and models are publicly available on GitHub and Huggingface.
1. Introduction
The design paradigm of convolutional neural networks (ConvNets) with very large kernels originated from RepLKNet [19] when the status of ConvNets was challenged by Vision Transformers (ViTs) [20, 22, 50, 74, 80]. Inspired
by ViTs that use global attention [20, 67, 80] or attention with large windows [50, 62, 78], RepLKNet proposed to use very large conv kernels. In contrast to the common practice of using small kernels (e.g., $3\times 3$) [28, 31, 34, 61, 66, 71, 92], which fails to obtain a large Effective Receptive Field (ERF) [55] even with numerous small-kernel layers, RepLKNet realizes a large ERF and impressive performance, especially on tasks such as object detection and semantic segmentation.
Nowadays, ConvNets with very large kernels have become popular; most works focus on making the large kernels even larger [48], on ways to apply them to multiple tasks [9, 54, 90], etc. However, we note that most architectures of the existing large-kernel ConvNets simply follow other models, e.g., RepLKNet [19] follows the architecture of Swin Transformer [49], and SLaK [48] follows ConvNeXt, which is a powerful architecture with medium-sized ($7\times 7$) kernels. The architectural design for large-kernel ConvNets remains under-explored.
We explore the large-kernel ConvNet architecture by rethinking the design of conventional models that employ a deep stack of small kernels. As we add a $3\times 3$ conv to a small-kernel ConvNet, we expect it to produce three effects simultaneously - 1) make the receptive field larger, 2) increase the abstract hierarchy of spatial patterns (e.g., from angles and textures to shapes of objects), and 3) improve the model's general representational capability by making it deeper, bringing in more learnable parameters and nonlinearities. In contrast, we argue that these three effects should be decoupled in a large-kernel architecture, as the model should utilize the substantial strength of a large kernel - the ability to see wide without going deep. Since increasing the kernel size is much more effective than stacking more layers in enlarging the ERF [55], a sufficient ERF can be built up with a small number of large-kernel layers, so that the compute budget can be saved for other efficient structures that are more effective in increasing the abstract hierarchy of spatial patterns or generally increasing the depth. For example, when the objective is to extract higher-level local
spatial patterns from lower-level ones, a $3\times 3$ conv might be a more suitable option than a large-kernel conv layer. The reason is that the latter demands more computations and may result in patterns no longer restricted to smaller local regions, which could be undesirable in specific scenarios.
Concretely, we propose four architectural guidelines for large-kernel ConvNets - 1) use efficient structures such as SE Blocks [33] to increase the depth, 2) use a proposed Dilated Reparam Block to re-parameterize the large-kernel conv layer to improve the performance without inference costs, 3) decide the kernel size by the downstream task and usually use large kernels only in the middle- and high-level layers, and 4) add $3\times 3$ convs instead of more large kernels while scaling up the model's depth. A ConvNet built up following such guidelines (Fig. 1) realizes the aforementioned three effects separately, as it uses a modest number of large kernels to guarantee a large ERF, small kernels to extract more complicated spatial patterns more efficiently, and multiple lightweight blocks to further increase the depth to enhance the representational capacity.
Our architecture achieves leading performance on ImageNet classification [12], ADE20K semantic segmentation [98], and COCO object detection [44], outperforming the existing large-kernel ConvNets such as RepLKNet [19], SLaK [48], and recent powerful architectures including ConvNeXt V2 [85], FastViT [77], Swin V2 [51] and DeiT III [75] in terms of both accuracy and efficiency. Moreover, our architecture demonstrates significantly higher shape bias [3, 76] than existing ConvNets and ViTs, i.e., it makes predictions more based on the overall shapes of objects than the textures, which agrees with the human visual system and results in better generalization. This may explain its superiority in downstream tasks. See the Appendix for details.
RepLKNet [19] was proposed partly “in defense of ConvNets” as ViTs dominated multiple image recognition tasks that were once dominated by ConvNets. Moreover, considering transformers have shown universal perception capability in multiple modalities [93, 94], in this work, we seek to not only reclaim the leading position in image recognition tasks by surpassing ViTs’ performance but also contribute to areas where ConvNets were not traditionally dominant. Specifically, on audio, video, point cloud, and time-series tasks, we achieve impressive performance with amazingly universal and simple solutions. We use modality-specific preprocessing approaches to transform all the data into 3D embedding maps, just like what we do with images, and use the same architecture as the backbone to process the embedding maps. Our model shows universal perception ability across multiple modalities with a unified architecture, so it is named UniRepLKNet.
Impressively, UniRepLKNet achieves remarkable results even on modalities that were not considered the stronghold of ConvNets, e.g., audio and temporal data.
Figure 1. Architectural design of UniRepLKNet. A LarK Block comprises a Dilated Reparam Block proposed in this paper, an SE Block [33], an FFN, and Batch Normalization (BN) [37] layers. The only difference between a SmaK Block and a LarK Block is that the former uses a depth-wise $3\times 3$ conv layer in replacement of the Dilated Reparam Block in the latter. Stages are connected by downsampling blocks implemented by stride-2 dense $3\times 3$ conv layers. We may flexibly arrange the blocks in different stages and the details of our provided instances are shown in Table 5.
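For concreteness, below is a rough PyTorch-style sketch of how the components listed in the caption might be composed. The residual placement, the FFN expansion ratio, and the plain depth-wise large-kernel conv standing in for the Dilated Reparam Block are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal Squeeze-and-Excitation block [33]."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.down = nn.Conv2d(channels, channels // reduction, 1)
        self.up = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)                  # global average pooling
        s = torch.sigmoid(self.up(torch.relu(self.down(s))))
        return x * s

class LarKBlock(nn.Module):
    """Illustrative LarK Block: large-kernel depth-wise conv + BN, SE Block, and FFN."""
    def __init__(self, channels, kernel_size=13, ffn_ratio=4):
        super().__init__()
        # Stand-in for the Dilated Reparam Block: a plain depth-wise large-kernel conv
        # (its parallel dilated branches, Sec. 3.1, would be merged into it for inference).
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        self.norm1 = nn.BatchNorm2d(channels)
        self.se = SEBlock(channels)
        self.norm2 = nn.BatchNorm2d(channels)
        self.ffn = nn.Sequential(                             # two 1x1 convs with GELU in between
            nn.Conv2d(channels, ffn_ratio * channels, 1),
            nn.GELU(),
            nn.Conv2d(ffn_ratio * channels, channels, 1),
        )

    def forward(self, x):
        x = x + self.se(self.norm1(self.dw(x)))               # spatial aggregation + channel re-weighting
        return x + self.ffn(self.norm2(x))                    # position-wise feed-forward
```

Per the caption, a SmaK Block would simply set the depth-wise kernel size to 3 in place of the large-kernel branch.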
On a huge-scale time-series forecasting task that predicts the global temperature and wind speed, UniRepLKNet, a generalist model originally designed for image recognition, even outperforms the latest state-of-the-art transformer customized for the task. Such results not only signify a “comeback” for ConvNet in its original domain but also showcase large-kernel ConvNet’s potential to “conquer” new territories, expanding its applicability and versatility in various tasks.
2. Related Work
Large kernels in early ConvNets. Classic ConvNets such as AlexNet [42] and Inceptions [68-70] used $7\times 7$ or $11\times 11$ kernels in the low-level layers, but large kernels fell out of favor after VGG-Net [66]. Global Convolution Network (GCN) [57] used very large conv layers ($1\times K$ followed by $K\times 1$) for semantic segmentation. Local Relation Networks (LR-Net) [32] adopted a spatial aggregation operator (LRLayer) to replace the standard conv layer, which can be viewed as a dynamic convolution. LR-Net benefited from a kernel size of $7\times 7$ but degraded with $9\times 9$. With a kernel size as large as the feature map, its top-1 accuracy dropped significantly from $75.7\%$ to $68.4\%$.
Explorations with large kernels. The concept of kernel may be generalized beyond spatial convolution. Swin Transformer [50] used shifted attention with window sizes ranging from 7 to 12, which can be seen as a dynamic kernel. Han et al. [27] replaced the attention layers in Swin with static or dynamic $7\times 7$ conv and still maintained comparable results. MetaFormer [91] suggested that a large-kernel pooling layer was an alternative to self-attention. Another representative work was Global Filter Network
(GFNet) [63], which optimized the spatial connection weights in the Fourier domain. It is equivalent to circular global convolutions in the spatial domain.
Modern ConvNets with very large kernels. RepLKNet first proposed that simply scaling up the kernel size of existing ConvNets resulted in improvements, especially on downstream tasks [19]. It proposed several guidelines for using large kernels, which were focused on the micro-structural design (e.g., using a shortcut alongside the large kernel) and application (large-kernel ConvNets should be evaluated on downstream tasks). In terms of the architecture, RepLKNet merely followed Swin Transformer for simplicity. In the past two years, large-kernel ConvNets have been intensively studied. Some works succeeded in further enlarging the kernel sizes [48], generalizing the idea to 3D scenarios [9] and many downstream tasks, e.g., image dehazing [54] and super-resolution [90]. However, we note that the architectural design for ConvNets with very large kernels remains under-explored. For example, SLaK [48] followed the architecture developed by ConvNeXt, which is a powerful architecture of medium-sized ($7\times 7$) kernels.
3. Architectural Design of UniRepLKNet
We first summarize the architectural guidelines as follows. 1) Block design: use efficient structures that perform both inter-channel communications and spatial aggregations to increase the depth. 2) Re-parameterization: use dilated small kernels to re-parameterize a large kernel. 3) Kernel size: decide kernel size according to the downstream task and usually use large kernels in middle- and high-level layers. 4) Scaling rule: while scaling up the depth, the added blocks should use small kernels. We describe the proposed Dilated Reparam Block in Sec. 3.1 and the other details in Sec. 3.2.
3.1. Dilated Reparam Block
It is reported that a large-kernel conv should be used with a parallel small-kernel one because the latter helps capture the small-scale patterns during training [19]. Their outputs are added up after two respective Batch Normalization (BN) [37] layers. After training, with the Structural Reparameterization [13-18] methodology, we merge the BN layers into the conv layers so that the small-kernel conv can be equivalently merged into the large-kernel one for inference. In this work, we note that except for small-scale patterns, enhancing the large kernel's capability to capture sparse patterns (i.e., a pixel on a feature map may be more related to some distant pixels than to its neighbors) may yield features of higher quality. The need to capture such patterns exactly matches the mechanism of dilated convolution - from a sliding-window perspective, a dilated conv layer with a dilation rate of $r$ scans the input channel to capture spatial patterns where each pixel of interest is $r-1$ pixels away
from its neighbor. Therefore, we use dilated conv layers parallel to the large kernel and add up their outputs.
To eliminate the inference costs of the extra dilated conv layers, we propose to equivalently transform the whole block into a single non-dilated conv layer for inference. Since ignoring pixels of the input is equivalent to inserting extra zero entries into the conv kernel, a dilated conv layer with a small kernel can be equivalently converted into a non-dilated (i.e., $r=1$) layer with a sparse larger kernel. Let $k$ be the kernel size of the dilated layer; by inserting zero entries, the kernel size of the corresponding non-dilated layer will be $(k-1)r+1$, which is referred to as the equivalent kernel size for brevity. We further note that such a transformation from the former kernel $\mathrm{W} \in \mathcal{R}^{k \times k}$ to the latter $\mathrm{W}^{\prime} \in \mathcal{R}^{((k-1)r+1) \times ((k-1)r+1)}$ can be elegantly realized by a transpose convolution with a stride of $r$ and an identity kernel $\mathrm{I} \in \mathcal{R}^{1 \times 1}$, which is scalar 1 but viewed as a kernel tensor.$^{1}$ With pytorch-style pseudo code, that is
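(a minimal single-channel sketch of this transformation; the helper name is illustrative, and footnote 1 notes the multi-channel generalization)

```python
import torch
import torch.nn.functional as F

def convert_dilated_to_nondilated(W, r):
    """Turn a (1, 1, k, k) kernel W used at dilation rate r into the equivalent
    non-dilated kernel of size (k-1)*r + 1 by inserting zeros between its entries."""
    I = torch.ones(1, 1, 1, 1, dtype=W.dtype)  # identity kernel: scalar 1 viewed as a tensor
    # A stride-r transpose convolution with I places each entry of W on a grid with
    # spacing r, i.e., it inserts r-1 zeros between neighboring entries.
    return F.conv_transpose2d(W, I, stride=r)
```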
The equivalency can be easily verified - given an arbitrary $\mathrm{W} \in \mathcal{R}^{k \times k}$ and an arbitrary input channel, a convolution with W and a dilation rate $r$ always yields identical results to a non-dilated convolution with $\mathrm{W}^{\prime}$.$^{2}$
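For instance, a quick numerical check along these lines, reusing the convert_dilated_to_nondilated sketch above (paddings chosen as in footnote 2):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
k, r = 3, 3
W = torch.randn(1, 1, k, k)                    # arbitrary single-channel kernel
x = torch.randn(1, 1, 32, 32)                  # arbitrary input channel

y_dilated = F.conv2d(x, W, dilation=r, padding=(k - 1) * r // 2)
W_eq = convert_dilated_to_nondilated(W, r)     # sparse ((k-1)r+1) x ((k-1)r+1) kernel
y_nondilated = F.conv2d(x, W_eq, padding=(k - 1) * r // 2)

print(torch.allclose(y_dilated, y_nondilated, atol=1e-6))  # True
```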
Based on such equivalent transformations, we propose a Dilated Reparam Block, which uses a non-dilated small-kernel layer and multiple dilated small-kernel layers to enhance a non-dilated large-kernel conv layer. Its hyper-parameters include the size of the large kernel $K$, the sizes of the parallel conv layers $\boldsymbol{k}$, and the dilation rates $\boldsymbol{r}$. The shown case (Fig. 2) with four parallel layers is denoted by $K=9$, $\boldsymbol{r}=(1,2,3,4)$, $\boldsymbol{k}=(5,5,3,3)$. For a larger $K$, we may use more dilated layers with larger kernel sizes or dilation rates. The kernel sizes and dilation rates of the parallel branches are flexible and the only constraint is $(k-1)r+1 \leq K$. For example, with $K=13$ (the default setting in our experiments), we use five layers with $\boldsymbol{k}=(5,7,3,3,3)$, $\boldsymbol{r}=(1,2,3,4,5)$, so the equivalent kernel sizes will be (5, 13, 7, 9, 11), respectively. To convert a Dilated Reparam Block into a large-kernel conv layer for inference, we first merge every BN into the preceding conv layer, convert every layer with dilation $r>1$ with the function above, and add up all the resultant kernels with appropriate zero-paddings. For example, the layer in Fig. 2 with $k=3$, $r=3$ is converted into a sparse $7\times 7$ kernel and added to the $9\times 9$ kernel with one-pixel zero paddings on each side.
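A minimal sketch of this merging step, assuming single-channel branches whose BN layers have already been folded into their conv kernels (the function name and argument layout are illustrative, not the released implementation):

```python
import torch
import torch.nn.functional as F

def merge_dilated_branches(W_large, branches):
    """W_large: (1, 1, K, K) non-dilated large kernel (BN already folded in).
    branches: list of (W, r) pairs, each W of shape (1, 1, k, k) with (k-1)*r + 1 <= K.
    Returns the single (1, 1, K, K) kernel used at inference time."""
    K = W_large.shape[-1]
    merged = W_large.clone()
    for W, r in branches:
        I = torch.ones(1, 1, 1, 1, dtype=W.dtype)
        W_eq = F.conv_transpose2d(W, I, stride=r)    # equivalent sparse kernel, size (k-1)*r + 1
        pad = (K - W_eq.shape[-1]) // 2
        merged += F.pad(W_eq, [pad, pad, pad, pad])  # zero-pad to K x K, then add
    return merged

# e.g., the K = 9 case of Fig. 2: branch kernels of sizes (5, 5, 3, 3) with dilations (1, 2, 3, 4)
```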
*Equal contributions.
$^{1}$ We showcase a single-channel conv and it is easy to generalize the transformation to multi-channel cases. See the Appendix for details.
$^{2}$ In common cases where the shape of the output equals that of the input, i.e., the padding of the former is $\frac{k-1}{2}$, note the padding of the latter should be $\frac{(k-1)r}{2}$ since the size of the equivalent sparse kernel is $(k-1)r+1$.