License: arXiv.org perpetual non-exclusive license
arXiv:2403.02308v2 [cs.CV] 07 Mar 2024

Affiliations: ^1 OpenGVLab, Shanghai AI Laboratory   ^2 The Chinese University of Hong Kong   ^3 Fudan University
^4 Nanjing University   ^5 Tsinghua University   ^6 SenseTime Research

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

Yuchen Duan*^{2,1}, Weiyun Wang*^{3,1}, Zhe Chen*^{4,1}, Xizhou Zhu^{5,1,6}, Lewei Lu^{6}, Tong Lu^{4}, Yu Qiao^{1}, Hongsheng Li^{2}, Jifeng Dai^{5,1}, Wenhai Wang^{🖂 2,1}
Abstract
* Equal contribution;   🖂 Corresponding author (wangwenhai362@gmail.com)

Transformers have revolutionized computer vision and natural language processing, but their high computational complexity limits their application in high-resolution image processing and long-context analysis. This paper introduces Vision-RWKV (VRWKV), a model adapted from the RWKV model used in the NLP field with necessary modifications for vision tasks. Similar to the Vision Transformer (ViT), our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities, while also scaling up effectively, accommodating both large-scale parameters and extensive datasets. Its distinctive advantage lies in its reduced spatial aggregation complexity, which renders it exceptionally adept at processing high-resolution images seamlessly, eliminating the necessity for windowing operations. Our evaluations demonstrate that VRWKV surpasses ViT’s performance in image classification and has significantly faster speeds and lower memory usage when processing high-resolution inputs. In dense prediction tasks, it outperforms window-based models, maintaining comparable speeds. These results highlight VRWKV’s potential as a more efficient alternative for visual perception tasks. Code is released at https://github.com/OpenGVLab/Vision-RWKV.

Keywords: RWKV, Visual Perception, Linear Attention

1 Introduction

Vision Transformers (ViTs) [12, 49, 52, 44, 18], renowned for their flexibility and global information processing capabilities, have established new benchmarks in a variety of vision tasks in the past few years. However, the quadratic computational complexity associated with ViTs limits their ability to efficiently process high-resolution images and lengthy sequences, posing a significant barrier to their broader application. As a result, the exploration of a vision architecture that integrates the versatility and comprehensive processing strengths of ViTs, while reducing computational demands, has emerged as a crucial area of research.

Figure 1: Performance and efficiency comparison of Vision-RWKV (VRWKV) and ViT. (a) $\rm AP^{b}$ comparison of VRWKV and ViT [49] with window attention and global attention on the COCO [27] dataset. (b) Inference speed comparison of VRWKV-T and ViT-T across input resolutions ranging from 224 to 2048. (c) GPU memory comparison of VRWKV-T and ViT-T across input resolutions from 224 to 2048.

In recent developments within natural language processing (NLP), models like RWKV [35] and Mamba [15] have emerged as popular solutions for achieving heightened efficiency and processing lengthy texts. These innovative models have demonstrated attributes similar to transformers [9, 40, 43, 28, 38, 39, 3, 25, 45] in NLP tasks, including the ability to handle long-range dependencies and parallel processing. Furthermore, they have also proven to be scalable, performing well with large-scale NLP datasets. Considering the significant differences between image and text modalities, it remains challenging to envision these methods entirely supplanting ViTs for vision-related tasks. It is imperative to conduct an in-depth analysis of how these models are applied to vision tasks, examining their scalability concerning data and parameters, their efficiency in handling sparse visual data, such as masked image modeling, and the necessary techniques to ensure model stability during scaling up.

In this work, we introduce Vision-RWKV (VRWKV), which is designed to adapt the RWKV architecture for vision tasks. This adaptation preserves the core structure and benefits of RWKV [35] while integrating critical modifications to tailor it for processing visual data. Specifically, we introduce a quad-directional shift (Q-Shift) tailored for vision tasks and modify the original causal RWKV attention mechanism to a bidirectional global attention mechanism. The Q-Shift operation expands the semantic range of individual tokens, while the bidirectional attention enables the calculation of global attention within linear computational complexity, in an RNN form run forward and backward. We primarily make modifications to the exponent in the RWKV attention mechanism, releasing the limitations of the decay vector and transforming the absolute positional bias into a relative bias. These changes enhance the model’s capability while ensuring scalability and stability. In this way, our model inherits the efficiency of RWKV in handling global information and sparse inputs, while also being able to model the local concepts of vision tasks. We implement layer scale [50] and layer normalization [2] where needed to stabilize the model’s output across different scales. These adjustments significantly improve the model’s stability when scaling up to a larger size.

Building on the aforementioned design, we develop a range of VRWKV models with different model scales, spanning from the VRWKV-Tiny (6M) to the VRWKV-Large (335M). These models are trained using large-scale datasets such as ImageNet-1K [8] and ImageNet-22K [8]. We train them using both common supervised classification and sparse input MAE methods [18] and evaluate their performance on visual perception tasks, including classification, detection, and segmentation. Under the same settings, VRWKV has comparable performance to ViT in these tasks with lower computational costs while maintaining stable scalability. This achievement enables VRWKV training parallelism, high flexibility, excellent performance, and low inference cost simultaneously, making it a promising alternative to ViT in a wide range of vision tasks, particularly in high-resolution scenarios.

In this paper, our main contributions are:

(1) We propose VRWKV as a low-cost alternative to ViT, achieving comprehensive substitution with lower computational costs. Our model not only retains the advantages of ViT, including the capability to capture long-range dependencies and flexibility in handling sparse inputs but also reduces complexity to a linear level. This significant reduction eliminates the need for window-based attention in processing high-resolution images, making VRWKV a more efficient and scalable solution for vision tasks.

(2) To adapt to vision tasks, we introduce bidirectional global attention and a novel token shift method called Q-Shift, enabling the achievement of linear complexity in global attention. To ensure stable scalability, we make several efforts, including using a relative positional bias in the attention mechanism to avoid overflow, adopting layer scale in our model, and adding extra layer normalization in the calculation of key matrices.

(3) Our model surpasses window-based ViTs and is comparable to global attention ViTs, demonstrating lower FLOPs and faster processing speeds as resolution increases. Notably, VRWKV-T achieves 75.1% top-1 accuracy trained only on ImageNet-1K [8], outperforming DeiT-T [49] by 2.9 points. With large-scale parameters (i.e., 335M) and training data (i.e., ImageNet-22K), the top-1 accuracy of VRWKV-L is further boosted to 86.0%, which is higher than that of ViT-L [12] (86.04 vs. 85.15). In addition, on COCO [27], a challenging downstream benchmark, our best model VRWKV-L achieves 50.6% box mAP, 1.9 points better than ViT-L (50.6 vs. 48.7).

2 Related Works

2.1 Vision Encoder

Recent advances in vision encoders have significantly pushed the boundaries of computer vision, demonstrating remarkable performance across a range of tasks. Convolutional neural networks (CNNs) served as the foundational model in computer vision. The advancement of computational resources, such as GPUs, has enabled the successful training of stacked convolutional blocks like AlexNet [23] and VGG [41] on large-scale image classification datasets (e.g., ImageNet [8]). This development paved the way for deeper and more sophisticated convolutional neural architectures, including GoogleNet [48], ResNet [20], and DenseNet [22].

In addition to these innovations, significant advancements have also been made with architectures like SENet [21], which introduced a channel attention mechanism to enhance model sensitivity to informative features. Similarly, SKNet [26] merged multiple kernel sizes to adjust the receptive field adaptively. Further extending the CNN paradigm, recent models such as RepLKNet [11] and ConvNeXt [31] have refined the convolutional layers to improve efficiency and accuracy, while InternImage [55] explored the strategies to scale up the convolution-based vision model.

Drawing inspiration from the effectiveness of self-attention layers and transformer architectures in the NLP field, the Vision Transformer (ViT) [12] applied a transformer framework on image patches, offering a global receptive field and dynamic spatial aggregation. Due to the quadratically increasing computational complexity of the vanilla attention mechanism, approaches like PVT [56, 57] and Linformer [54] implemented global attention on down-sampled feature maps, whereas other approaches like Swin [58] and HaloNet [51, 6] introduced sampling techniques to enlarge the receptive field.

Another research direction involved replacing self-attention layers in models with linear complexity layers. Representative works in this domain include LongNet [10], RWKV [35], RetNet [47], and Mamba [15], though few have concentrated on vision-related applications. Concurrently, attempts like Vim [63] and VMamba [29] have sought to integrate these linear attention layers into the vision domain. However, these endeavors have only been experimented with on small-scale models, and it remains uncertain whether their efficiency can scale up to larger models.

2.2 Feature Aggregation Mechanism

The research on feature aggregation has received significant attention in the field of artificial intelligence. For visual data processing, convolutional operators [24], known for their parameter sharing and local perception, enabled efficient handling of large-scale data through sliding computation. Despite their advantages, traditional CNN operators faced challenges in modeling long-range dependencies. To overcome this issue, advanced convolutional operators, such as the deformable convolution [5, 64, 60], have improved the flexibility of CNN operators, enhancing their long-range modeling capability.

As for the field of NLP, RNN-based operators [13, 34, 37] have historically dominated because of their effectiveness in sequence modeling. RNNs and LSTMs excel in capturing temporal dependencies, making them suitable for tasks requiring an understanding of sequence dynamics. Subsequently, a significant shift occurred. The introduction of the transformer architecture [52] marked a turning point, with both NLP and computer vision fields shifting focus toward attention-based feature aggregation. The global attention mechanism overcomes the limitations of CNNs in modeling long-range dependencies and the shortcomings of RNNs in parallel computation while coming at a high computational cost.

To address the high computational cost of attention operators while modeling long sequences, researchers have introduced innovations such as window attention and spatial reduction attention. Window attention [30, 51, 6] restricts the self-attention computation within local windows, drastically reducing the computational complexity while preserving the receptive field through window-level interaction. Spatial reduction attention [56, 57], on the other hand, reduces the dimensionality of the feature space before applying the attention mechanism, effectively decreasing the computational requirements without significantly degrading the model’s performance.

In addition to the efforts to optimize the global attention mechanism, various operators with linear complexity [35, 47, 15, 36, 46] have also been explored. For instance, RWKV [35] and RetNet [47] employed exponential decay to model global information efficiently, while SSMs [16, 17, 42, 53] also exhibit linear complexity with respect to sequence length, and modifications in Mamba [15] enable them to be input-dependent. Besides, XCA [1] achieved linear complexity by calculating the cross-variance between input tokens. However, the low efficiency of information interaction between tokens makes additional modules necessary to complete comprehensive feature aggregation. Despite some concurrent efforts [29, 63, 14], adapting these NLP-derived techniques to vision tasks remains a challenge in maintaining stable training across larger and more complex vision models.

Figure 2: Overall architecture of Vision-RWKV (VRWKV). (a) The VRWKV architecture includes $L$ identical VRWKV encoder layers, an average pooling layer, and a linear prediction head. (b) The details of the VRWKV encoder layer. Q-Shift denotes the quad-directional shift method tailored for vision tasks. The “Bi-WKV” module serves as a bidirectional RNN cell or a global attention mechanism.

3 Vision-RWKV

3.1 Overall Architecture

In this section, we propose Vision-RWKV (VRWKV), an efficient vision encoder with a linear complexity attention mechanism. Our principle is to preserve the advantages of the original RWKV architecture [35], making only necessary modifications to enable its flexible application in vision tasks, supporting sparse input, and ensuring the stability of the training process after scaling up. An overview of our VRWKV is depicted in Fig. 2.

VRWKV adopts a block-stacked image encoder design like ViT, where each block consists of a spatial-mix module and a channel-mix module. The spatial-mix module functions as an attention mechanism, performing linear complexity global attention computation, while the channel-mix module serves as a feed-forward network (FFN), performing feature fusion in the channel dimension. The entire VRWKV includes a patch embedding layer and a stack of $L$ identical VRWKV encoder layers, where each layer maintains the input resolution.

Data Flow. First, we transform the $H\times W\times 3$ image into $HW/p^{2}$ patches, where $p$ denotes the patch size. The patches after a linear projection add the position embedding to obtain image tokens of shape $T\times C$, where $T=HW/p^{2}$ denotes the total number of tokens. These tokens are then input into the VRWKV encoder with $L$ layers.
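As a concrete illustration of this data flow, the following is a minimal PyTorch sketch of the patch embedding stage under the notation above; the module and variable names (e.g., `PatchEmbed`) are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Sketch: split an H x W x 3 image into p x p patches, project to C dims, add position embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=192):
        super().__init__()
        # a strided convolution is equivalent to patchify + linear projection
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2          # T = HW / p^2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):                  # x: (B, 3, H, W)
        x = self.proj(x)                   # (B, C, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)   # (B, T, C)
        return x + self.pos_embed          # image tokens fed to the L encoder layers
```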

In each layer, tokens are first fed into the spatial-mix module, which plays the role of a global attention mechanism. Specifically, as shown in Fig. 2(b), the input tokens are first shifted and fed into three parallel linear layers to obtain the matrices $R_{\text{s}},K_{\text{s}},V_{\text{s}}\in\mathbb{R}^{T\times C}$:

$$R_{\text{s}}={\rm Q\text{-}Shift}_{R}(X)W_{R},\qquad K_{\text{s}}={\rm Q\text{-}Shift}_{K}(X)W_{K},\qquad V_{\text{s}}={\rm Q\text{-}Shift}_{V}(X)W_{V}. \tag{1}$$

Here, $K_{\text{s}}$ and $V_{\text{s}}$ are passed to a linear complexity bidirectional attention mechanism to calculate the global attention result $wkv\in\mathbb{R}^{T\times C}$, which is then multiplied with $\sigma(R_{\text{s}})$ to control the probability of the output $O_{\text{s}}$:

$$O_{\text{s}}=(\sigma(R_{\text{s}})\odot wkv)W_{O},\qquad \text{where}\ wkv=\mathrm{Bi\text{-}WKV}(K_{\text{s}},V_{\text{s}}). \tag{2}$$

Operator $\sigma$ denotes the sigmoid function, and $\odot$ means an element-wise multiplication is applied. The $\mathrm{Q\text{-}Shift}$ is a token shift function specially designed for the adaptation to vision tasks. After an output linear projection, features are then stabilized using layer normalization [2].
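The spatial-mix computation of Eqs. (1) and (2) can be sketched as follows; `q_shift` and `bi_wkv` refer to the helper sketches given in Secs. 3.3 and 3.2 below, and all shapes, names, initial values, and the exact placement of the layer normalization are assumptions for illustration rather than the released implementation.

```python
import torch
import torch.nn as nn

class SpatialMix(nn.Module):
    """Sketch of the spatial-mix module: Q-Shift -> R/K/V projections -> Bi-WKV -> gated output."""
    def __init__(self, dim):
        super().__init__()
        self.w_r = nn.Linear(dim, dim, bias=False)            # W_R
        self.w_k = nn.Linear(dim, dim, bias=False)            # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)            # W_V
        self.w_o = nn.Linear(dim, dim, bias=False)            # W_O
        self.ln = nn.LayerNorm(dim)                           # stabilizes the projected output
        self.w = nn.Parameter(torch.zeros(dim))               # channel-wise spatial decay (Eq. 5), illustrative init
        self.u = nn.Parameter(torch.zeros(dim))               # current-token bonus (Eq. 5), illustrative init
        self.mu_r = nn.Parameter(torch.full((dim,), 0.5))     # Q-Shift interpolation vectors (Eq. 9), illustrative init
        self.mu_k = nn.Parameter(torch.full((dim,), 0.5))
        self.mu_v = nn.Parameter(torch.full((dim,), 0.5))

    def forward(self, x, hw):                                 # x: (B, T, C), hw: patch-grid size (H/p, W/p)
        r = self.w_r(q_shift(x, self.mu_r, hw))               # R_s, Eq. (1)
        k = self.w_k(q_shift(x, self.mu_k, hw))               # K_s
        v = self.w_v(q_shift(x, self.mu_v, hw))               # V_s
        wkv = bi_wkv(k, v, self.w, self.u)                    # linear-complexity global attention, Eq. (5)
        return self.ln(self.w_o(torch.sigmoid(r) * wkv))      # Eq. (2) + output LayerNorm
```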

Subsequently, the tokens are passed into the channel-mix module for a channel-wise fusion. $R_{\text{c}}$ and $K_{\text{c}}$ are obtained in a similar manner as in spatial-mix:

$$R_{\text{c}}={\rm Q\text{-}Shift}_{R}(X)W_{R},\qquad K_{\text{c}}={\rm Q\text{-}Shift}_{K}(X)W_{K}. \tag{3}$$

Here, $V_{\text{c}}$ is a linear projection of $K_{\text{c}}$ after the activation function, and the output $O_{\text{c}}$ is also controlled by a gate mechanism $\sigma(R_{\text{c}})$ before the output projection:

$$O_{\text{c}}=(\sigma(R_{\text{c}})\odot V_{\text{c}})W_{O},\qquad \text{where}\ V_{\text{c}}=\mathrm{SquaredReLU}(K_{\text{c}})W_{V}. \tag{4}$$

Simultaneously, residual connections [20] are established from the tokens to each normalization layer to ensure that training gradients do not vanish in deep networks.
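Continuing the sketch above, a hypothetical channel-mix module following Eqs. (3) and (4), and an encoder layer wiring both modules together with the residual connections, might look like the following; the hidden dimension and the pre-norm placement of the residuals are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class ChannelMix(nn.Module):
    """Sketch of the channel-mix (FFN-like) module, Eqs. (3) and (4)."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_r = nn.Linear(dim, dim, bias=False)            # W_R
        self.w_k = nn.Linear(dim, hidden_dim, bias=False)     # W_K (channel expansion)
        self.w_v = nn.Linear(hidden_dim, dim, bias=False)     # W_V (back to dim)
        self.w_o = nn.Linear(dim, dim, bias=False)            # W_O
        self.mu_r = nn.Parameter(torch.full((dim,), 0.5))     # Q-Shift interpolation vectors, illustrative init
        self.mu_k = nn.Parameter(torch.full((dim,), 0.5))

    def forward(self, x, hw):                                 # x: (B, T, C)
        r = self.w_r(q_shift(x, self.mu_r, hw))               # R_c, Eq. (3)
        k = self.w_k(q_shift(x, self.mu_k, hw))               # K_c
        v = self.w_v(torch.relu(k) ** 2)                      # V_c = SquaredReLU(K_c) W_V, Eq. (4)
        return self.w_o(torch.sigmoid(r) * v)                 # O_c, Eq. (4)

class VRWKVBlock(nn.Module):
    """One VRWKV encoder layer: residual spatial-mix followed by residual channel-mix."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.spatial_mix = SpatialMix(dim)
        self.channel_mix = ChannelMix(dim, hidden_dim)

    def forward(self, x, hw):
        x = x + self.spatial_mix(self.ln1(x), hw)             # residual connection
        x = x + self.channel_mix(self.ln2(x), hw)             # residual connection
        return x
```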

3.2 Linear Complexity Bidirectional Attention

Different from the vanilla RWKV [35], we make the following modifications to its original attention mechanism to adapt it to vision tasks: (1) Bidirectional attention: We extend the upper limit of the original RWKV attention from $t$ (the current token) to $T-1$ (the last token) in the summation formula to ensure that all tokens are mutually visible in the calculation of each result. Thus, the original causal attention transforms into bidirectional global attention. (2) Relative bias: We compute the absolute value of the time difference $t-i$ and divide it by the total number of tokens (denoted as $T$) to represent the relative bias of tokens in images of different sizes. (3) Flexible decay: We no longer restrict the learnable decay parameter $w$ to be positive in the exponential term, allowing the exponential decay attention to focus on tokens further away from the current token in different channels. These simple yet necessary modifications achieve global attention calculation and maximize the preservation of RWKV’s low complexity and adaptability to vision tasks.

Similar to the attention in RWKV, our bidirectional attention can also be equivalently expressed in a summation form (for the sake of clarity) and an RNN form (in practical implementation).

Summation Form. The attention calculation result for the $t$-th token is given by the following formula:

$$wkv_{t}=\mathrm{Bi\text{-}WKV}(K,V)_{t}=\frac{\sum^{T-1}_{i=0,i\neq t}e^{-(|t-i|-1)/T\cdot w+k_{i}}v_{i}+e^{u+k_{t}}v_{t}}{\sum^{T-1}_{i=0,i\neq t}e^{-(|t-i|-1)/T\cdot w+k_{i}}+e^{u+k_{t}}}. \tag{5}$$

Here, $T$ represents the total number of tokens, equal to $HW/p^{2}$; $w$ and $u$ are two $C$-dimensional learnable vectors that represent the channel-wise spatial decay and the bonus indicating the current token, respectively. $k_{t}$ and $v_{t}$ denote the $t$-th features of $K$ and $V$.

The summation formula indicates that the output $wkv_{t}$ is a weighted sum of $V$ along the token dimension from $0$ to $T-1$, resulting in a $C$-dimensional vector. It represents the result obtained by applying the attention operation to the $t$-th token. The weight is determined collectively by the spatial decay vector $w$, the relative bias between tokens $(|t-i|-1)/T$, and $k_{i}$.
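For reference, Eq. (5) can be evaluated directly as a dense weighted average. The following $O(T^{2}C)$ PyTorch sketch (batch dimension omitted, function name hypothetical) is useful only as a check against the linear-time recurrence derived next; it is not the practical implementation.

```python
import torch

def bi_wkv_naive(k, v, w, u):
    """Direct O(T^2 C) evaluation of Eq. (5) for a single token sequence.
    k, v: (T, C) keys/values; w, u: (C,) spatial decay and current-token bonus."""
    T, C = k.shape
    idx = torch.arange(T, dtype=k.dtype, device=k.device)
    # relative bias (|t - i| - 1) / T for every token pair, shape (T, T)
    bias = (torch.abs(idx[:, None] - idx[None, :]) - 1.0) / T
    # exponent for i != t: -(|t-i|-1)/T * w + k_i, shape (T, T, C)
    logits = -bias[:, :, None] * w + k[None, :, :]
    # exponent for i == t: u + k_t, written onto the diagonal
    diag = torch.eye(T, dtype=torch.bool, device=k.device)
    logits[diag] = k + u
    weights = torch.exp(logits)                       # (T, T, C)
    num = (weights * v[None, :, :]).sum(dim=1)        # numerator of Eq. (5), (T, C)
    den = weights.sum(dim=1)                          # denominator of Eq. (5), (T, C)
    return num / den
```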

RNN Form. In the practical implementation, the above Eq. 5 can be transformed into a recursive formula in the form of an RNN, such that the result of each token can be obtained with a fixed number of FLOPs. By splitting the summation terms of the numerator and denominator in Eq. 5 with $t$ as the boundary, we obtain 4 hidden states:

$$a_{t-1}=\sum^{t-1}_{i=0}e^{-(|t-i|-1)/T\cdot w+k_{i}}v_{i},\qquad b_{t-1}=\sum^{T-1}_{i=t+1}e^{-(|t-i|-1)/T\cdot w+k_{i}}v_{i}, \tag{6}$$
$$c_{t-1}=\sum^{t-1}_{i=0}e^{-(|t-i|-1)/T\cdot w+k_{i}},\qquad d_{t-1}=\sum^{T-1}_{i=t+1}e^{-(|t-i|-1)/T\cdot w+k_{i}},$$

which can be recursively computed. The update of the hidden states only requires adding or subtracting one summation term and multiplying or dividing by $e^{-w/T}$. Then the $t$-th result can be given as:

$$wkv_{t}=\frac{a_{t-1}+b_{t-1}+e^{k_{t}+u}v_{t}}{c_{t-1}+d_{t-1}+e^{k_{t}+u}}. \tag{7}$$

Each update step yields an attention result (i.e., $wkv_{t}$) for a token, so the entire $wkv$ matrix requires $T$ steps.
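A sketch of this recurrence is given below: one reverse pass precomputes the "future" states $(b, d)$ for every position, and a forward pass updates the "past" states $(a, c)$ while emitting Eq. (7), so each token costs a constant number of operations per channel. This is a plain PyTorch loop for illustration under the notation above; the practical implementation is a fused kernel with the overflow safeguards discussed in Sec. 3.4.

```python
import torch

def bi_wkv(k, v, w, u):
    """Recurrent form of Bi-WKV (Eqs. 6-7). k, v: (B, T, C); w, u: (C,).
    Plain sketch without the numerical safeguards of a fused kernel."""
    B, T, C = k.shape
    decay = torch.exp(-w / T)                                   # e^{-w/T}, shape (C,)
    ek = torch.exp(k)                                           # e^{k_t}, shape (B, T, C)
    # reverse pass: b_{t-1}, d_{t-1} accumulate the "future" tokens i > t
    b = torch.zeros(B, C, dtype=k.dtype, device=k.device)
    d = torch.zeros(B, C, dtype=k.dtype, device=k.device)
    b_states, d_states = torch.empty_like(k), torch.empty_like(k)
    for t in range(T - 1, -1, -1):
        b_states[:, t] = b
        d_states[:, t] = d
        b = decay * b + ek[:, t] * v[:, t]
        d = decay * d + ek[:, t]
    # forward pass: a_{t-1}, c_{t-1} accumulate the "past" tokens i < t; u is the current-token bonus
    a = torch.zeros(B, C, dtype=k.dtype, device=k.device)
    c = torch.zeros(B, C, dtype=k.dtype, device=k.device)
    out = torch.empty_like(v)
    for t in range(T):
        cur = torch.exp(u + k[:, t])                            # e^{u + k_t}
        out[:, t] = (a + b_states[:, t] + cur * v[:, t]) / (c + d_states[:, t] + cur)  # Eq. (7)
        a = decay * a + ek[:, t] * v[:, t]
        c = decay * c + ek[:, t]
    return out
```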

When the inputs $K$ and $V$ are matrices with the shape of $T\times C$, the computational cost of calculating the $wkv$ matrix is given by:

$$\text{FLOPs}(\mathrm{Bi\text{-}WKV}(K,V))=13\times T\times C. \tag{8}$$

Here, the number 13 comes approximately from the updates of $(a,b,c,d)$, the computation of the exponentials, and the calculation of $wkv_{t}$. $T$ is the total number of update steps and is equal to the number of image tokens. The above approximation shows that the complexity of the forward process is $O(TC)$. The backward propagation of the operator can still be represented as a more complex RNN form, with a computational complexity of $O(TC)$. The specific formula for backward propagation is provided in the Appendix.

3.3 Quad-Directional Token Shift

By introducing an exponential decay mechanism, the complexity of global attention can be reduced from quadratic to linear, greatly enhancing the computational efficiency of the model on high-resolution images. However, the one-dimensional decay does not align with the neighboring relationships in two-dimensional images. Therefore, we introduce a quad-directional token shift (Q-Shift) in the first step of each spatial-mix and channel-mix module. The Q-Shift operation shifts all tokens and linearly interpolates them with their neighboring tokens as follows:

$$\mathrm{Q\text{-}Shift}_{(*)}(X)=X+(1-\mu_{(*)})X^{\dagger}, \tag{9}$$
$$\text{where}\ X^{\dagger}[h,w]=\mathrm{Concat}\big(X[h-1,w,0:C/4],\ X[h+1,w,C/4:C/2],\ X[h,w-1,C/2:3C/4],\ X[h,w+1,3C/4:C]\big).$$

Subscript $(*)\in\{R,K,V\}$ denotes the 3 interpolations of $X$ and $X^{\dagger}$, controlled by the learnable vectors $\mu_{(*)}$, for the later calculation of $R,K,V$, respectively. $h$ and $w$ denote the row and column index of token $X$, and “:” is a slicing operation excluding the end index. The Q-Shift gives the attention mechanisms of different channels a prior of focusing on neighboring tokens internally, without introducing many additional FLOPs. The Q-Shift operation also increases the receptive field of each token, which greatly enhances the coverage of the token in the posterior layers.
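A minimal PyTorch sketch of Eq. (9) is shown below, assuming tokens are stored as a (B, T, C) tensor over an H×W patch grid in row-major order, `mu` is the per-channel interpolation vector $\mu_{(*)}$, and zero padding is used at the image borders; names and the padding choice are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def q_shift(x, mu, hw):
    """Quad-directional token shift, Eq. (9). x: (B, T, C); mu: (C,); hw = (H, W) patch grid."""
    B, T, C = x.shape
    H, W = hw
    feat = x.transpose(1, 2).reshape(B, C, H, W)              # back to the 2D token grid
    shifted = torch.zeros_like(feat)
    q = C // 4
    # each quarter of the channels takes its neighbor from one direction (zeros at the borders)
    shifted[:, 0 * q:1 * q] = F.pad(feat[:, 0 * q:1 * q], (0, 0, 1, 0))[:, :, :H, :]   # X[h-1, w]
    shifted[:, 1 * q:2 * q] = F.pad(feat[:, 1 * q:2 * q], (0, 0, 0, 1))[:, :, 1:, :]   # X[h+1, w]
    shifted[:, 2 * q:3 * q] = F.pad(feat[:, 2 * q:3 * q], (1, 0, 0, 0))[:, :, :, :W]   # X[h, w-1]
    shifted[:, 3 * q:]      = F.pad(feat[:, 3 * q:],      (0, 1, 0, 0))[:, :, :, 1:]   # X[h, w+1]
    x_dagger = shifted.reshape(B, C, T).transpose(1, 2)       # (B, T, C)
    return x + (1.0 - mu) * x_dagger                          # Eq. (9)
```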

3.4 Scale Up Stability

Both the increasing number of model layers and the accumulation in the exponent term during the recursion can lead to instability in the model output and affect the stability of the training process. To mitigate the instability, we employ two simple but effective modifications to stabilize the scale-up of the model. (1) Bounded exponential: As the input resolution increases, both exponential decay and growth quickly exceed the range of floating-point numbers. Therefore, we divide the exponential term by the number of tokens (such as $\mathrm{exp}(-(|t-i|-1)/T\cdot w)$), making the maximum decay and growth bounded. (2) Extra layer normalization: When the model gets deeper, we directly add layer normalization [2] after the attention mechanism and the Squared ReLU operation to prevent the model’s output from overflowing. The two modifications enable stable scaling of both input resolution and model depth, allowing large models to train and converge stably. We also introduce layer scale [50], which contributes to the stability of the models as they scale up.
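As an illustration of the layer-scale modification, a hypothetical variant of the `VRWKVBlock` sketched in Sec. 3.1 could scale each residual branch with a learnable per-channel factor; the initial value and placement below are assumptions for the sketch, not the released configuration.

```python
import torch
import torch.nn as nn

class VRWKVBlockWithLayerScale(VRWKVBlock):
    """Sketch: per-channel layer scale applied to each residual branch of a VRWKV block."""
    def __init__(self, dim, hidden_dim, init_value=1e-5):
        super().__init__(dim, hidden_dim)
        self.gamma1 = nn.Parameter(init_value * torch.ones(dim))   # layer scale for spatial-mix branch
        self.gamma2 = nn.Parameter(init_value * torch.ones(dim))   # layer scale for channel-mix branch

    def forward(self, x, hw):
        x = x + self.gamma1 * self.spatial_mix(self.ln1(x), hw)
        x = x + self.gamma2 * self.channel_mix(self.ln2(x), hw)
        return x
```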

Model   | Emb Dim | Hidden Dim | Depth | Extra Norm | #Param
--------|---------|------------|-------|------------|-------
VRWKV-T | 192     | 768        | 12    | –          | 6.2M
VRWKV-S | 384     | 1536       | 12    | –          | 23.8M
VRWKV-B | 768     | 3072       | 12    | –          | 93.7M
VRWKV-L | 1024    | 4096       | 24    | ✓          | 334.9M

Table 1: Default settings for Vision-RWKV of different scales. We report the embedding dimension, hidden dimension, and model depth for VRWKV-T/S/B/L. “Extra Norm” means additional layer normalization layers are used to stabilize the model’s outputs. “#Param” denotes the number of parameters.

3.5 Model Details

Following ViT, the hyper-parameters for variants of VRWKV, including the embedding dimension, hidden dimension in linear projections, and depth, are specified in Tab. 1. Due to the increased depth of the VRWKV-L model, additional layer normalizations, as discussed in Sec. 3.4, are incorporated at appropriate positions to ensure output stability.

4 Experiments

We comprehensively evaluate the substitutability of our VRWKV method for ViT in performance, scalability, flexibility, and efficiency. We validate the effectiveness of our model on the widely-used image classification dataset ImageNet [8]. For downstream dense prediction tasks, we select detection tasks on the COCO [27] dataset and semantic segmentation on the ADE20K [62] dataset.

Method          | Size  | #Param | FLOPs  | Top-1 Acc
----------------|-------|--------|--------|----------
hierarchical:
ResNet-18 [20]  | 224^2 | 11.7M  | 1.8G   | 69.9
PVT-T [56]      | 224^2 | 13.2M  | 1.9G   | 75.1
ResNet-50 [20]  | 224^2 | 25.6M  | 4.1G   | 76.6
Swin-T [30]     | 224^2 | 28.3M  | 4.4G   | 81.2
PVT-M [56]      | 224^2 | 44.2M  | 6.7G   | 81.2
ResNet-101 [20] | 224^2 | 44.6M  | 7.9G   | 78.0
Swin-S [30]     | 224^2 | 49.6M  | 8.7G   | 83.0
PVT-L [56]      | 224^2 | 61.4M  | 9.8G   | 81.7
Swin-B [30]     | 224^2 | 87.8M  | 15.1G  | 83.4
non-hierarchical:
DeiT-T [49]     | 224^2 | 5.7M   | 1.3G   | 72.2
DeiT-S [49]     | 224^2 | 22.1M  | 4.6G   | 79.9
XCiT-S12 [1]    | 224^2 | 26.0M  | 4.8G   | 82.0
DeiT-B [49]     | 224^2 | 86.6M  | 17.6G  | 81.8
XCiT-L24 [1]    | 224^2 | 189.0M | 36.1G  | 82.9
ViT-L [12]      | 384^2 | 309.5M | 191.1G | 85.2
VRWKV-T         | 224^2 | 6.2M   | 1.2G   | 75.1
VRWKV-S         | 224^2 | 23.8M  | 4.6G   | 80.1
VRWKV-B         | 224^2 | 93.7M  | 18.2G  | 82.0
VRWKV-L         | 384^2 | 334.9M | 189.5G | 86.0
VRWKV-L^⋆       | 384^2 | 334.9M | 189.5G | 86.5

Table 2: Validation results on ImageNet-1K. VRWKV-T/S/B are trained from scratch using ImageNet-1K, while VRWKV-L is pre-trained on ImageNet-22K and fine-tuned on ImageNet-1K. “#Param” denotes the number of parameters, and “FLOPs” means the computational workload for processing an image at the resolution specified in the “Size” column. “⋆” denotes that Bamboo-47K [61] is used in the pre-training.

4.1 Image Classification

Settings. For the -Tiny/Small/Base models, we conduct supervised training from scratch on ImageNet-1K [8]. Following the training strategy and data augmentation of DeiT [49], we use a batch size of 1024, AdamW [33] with a base learning rate of 5e-4, a weight decay of 0.05, and a cosine annealing schedule [32]. Images are cropped to the resolution of $224\times 224$ for training and validation. For the -Large models, we first pre-train them for 90 epochs on ImageNet-22K with a batch size of 4096 and a resolution of $192\times 192$, and then fine-tune them for 20 epochs on ImageNet-1K at a higher resolution of $384\times 384$.

Results. We compare the results of our VRWKV with other hierarchical and non-hierarchical backbones on the ImageNet-1K validation dataset. As shown in Tab. 2, with the same number of parameters, computational complexity, and training/testing resolutions, VRWKV achieves better results than ViT. For example, while VRWKV-T has slightly lower FLOPs than DeiT-T (1.2G vs. 1.3G), it achieves a top-1 accuracy 2.9 points higher than DeiT-T. When the model size scales up, VRWKV still demonstrates higher baseline performance. In the case of large models, VRWKV-L achieves a top-1 accuracy of 86.0% at the resolution of $384\times 384$, 0.8 points higher than ViT-L, with a slightly reduced computational cost. The superior performance from tiny to large-size models demonstrates that the VRWKV model possesses the same scalability as ViT. Additionally, after using Bamboo-47K [61] in the pre-training process, the performance of VRWKV-L can be further boosted to 86.5%, indicating that our VRWKV, like ViT, can benefit from pre-training on large-scale datasets. The exploration of VRWKV in classification tasks demonstrates its potential to be a viable alternative to traditional ViT models.

4.2 Object Detection

Method          | #Param | FLOPs   | AP^b | AP^m
----------------|--------|---------|------|-----
ViT-T† [49]     | 8.0M   | 95.4G   | 41.1 | 37.5
ViT-T [49]      | 8.0M   | 147.1G  | 41.6 | 37.9
VRWKV-T (ours)  | 8.4M   | 67.9G   | 41.7 | 38.0
ViT-S† [49]     | 27.5M  | 241.2G  | 44.6 | 39.7
ViT-S [49]      | 27.5M  | 344.5G  | 44.9 | 40.1
VRWKV-S (ours)  | 29.3M  | 189.9G  | 44.8 | 40.2
ViT-B† [49]     | 99.5M  | 686.7G  | 46.2 | 41.5
ViT-B [49]      | 99.5M  | 893.3G  | 46.8 | 41.8
VRWKV-B (ours)  | 106.6M | 599.0G  | 46.8 | 41.7
ViT-L† [44]     | 327.0M | 1799.3G | 48.7 | 43.3
VRWKV-L (ours)  | 351.9M | 1730.6G | 50.6 | 44.9

Table 3: Object detection and instance segmentation on COCO val2017. All models adopt the ViT-Adapter [4] to generate multi-scale features for detection heads. -T/S/B models are initialized with ImageNet-1K pre-trained weights, and all -L models with ImageNet-22K weights. “#Param” denotes the number of backbone parameters. “FLOPs” denotes the computational workload of the backbone with an input image of $1333\times 800$. “†” means window attention is adopted in ViT layers.

Settings. In the detection tasks, we adopt Mask R-CNN [19] as the detection head. For the -Tiny/Small/Base models, the backbones use weights pre-trained on ImageNet-1K for 300 epochs. For the -Large model, weights pre-trained on ImageNet-22K are used. All models use a 1$\times$ training schedule (i.e., 12 epochs) with a batch size of 16, and the AdamW [33] optimizer with an initial learning rate of 1e-4 and a weight decay of 0.05.

Results. In Tab. 3, we report the detection results on the COCO val [27] dataset using VRWKV and ViT as backbones. As the results in Fig. 1(a) and Tab. 3 show, due to the use of window attention by ViT in dense prediction tasks, VRWKV with global attention can achieve better performance than ViT with lower FLOPs. For example, VRWKV-T has approximately 30% lower backbone FLOPs compared to ViT-T†, with an improvement of $\rm AP^{b}$ by 0.6 points. Similarly, VRWKV-L achieves a 1.9-point increase in $\rm AP^{b}$ with lower FLOPs compared to ViT-L†. Additionally, we compare the performance of VRWKV and ViT using global attention. For instance, VRWKV-S achieves similar performance to ViT-S with 45% lower FLOPs. This demonstrates the effectiveness of VRWKV’s global attention mechanism in dense prediction tasks and the advantage of lower computational complexity compared to the original attention mechanism.

4.3 Semantic Segmentation

Method            #Param    FLOPs     mIoU
ViT-T [49]        8.0M      20.9G     42.6
VRWKV-T (ours)    8.4M      16.6G     43.3
ViT-S [49]        27.5M     54.0G     46.2
VRWKV-S (ours)    29.3M     46.3G     47.2
ViT-B [49]        99.5M     157.9G    48.8
VRWKV-B (ours)    106.6M    146.0G    49.2
ViT-L [44]        327.0M    446.8G    53.4
VRWKV-L (ours)    351.9M    421.9G    53.5
Table 4: Semantic segmentation on the ADE20K val set. All models use the ViT-Adapter [4] for multi-scale feature generation and are trained with UperNet [59] as the segmentation head. For consistency in comparison, all -T/S/B models are initialized with ImageNet-1K pre-training, whereas -L models use ImageNet-22K pre-training. “#Param” refers to the number of backbone parameters. We report the FLOPs of backbones with an input size of 512×512.

Settings. In the semantic segmentation task, we use UperNet [59] as the segmentation head. All ViT models use global attention in this task. For the -Tiny/Small/Base models, the backbones are initialized with weights pre-trained on ImageNet-1K; for the -Large model, ImageNet-22K pre-trained weights are used. We employ the AdamW optimizer with an initial learning rate of 6e-5 for the -Small/Base/Large models and 12e-5 for the -Tiny model, a batch size of 16, and a weight decay of 0.01. All models are trained for 160k iterations on the training set of the ADE20K dataset [62].
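For clarity, the sketch below shows one way to wire up the size-dependent learning rates; the `backbone` placeholder and the bare iteration loop are illustrative assumptions, and only the AdamW values above come from the text.

    import torch

    # Initial learning rates stated in the text, keyed by model size.
    LR = {"tiny": 12e-5, "small": 6e-5, "base": 6e-5, "large": 6e-5}

    size = "small"
    backbone = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)  # placeholder for VRWKV + UperNet
    optimizer = torch.optim.AdamW(backbone.parameters(), lr=LR[size], weight_decay=0.01)

    for step in range(160_000):   # 160k iterations on ADE20K, total batch size 16
        pass                      # forward pass, loss, backward, optimizer.step()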

Results. As shown in Tab. 4, when using UperNet for semantic segmentation, models based on VRWKV consistently outperform those based on ViT with global attention, while also being more efficient. For example, VRWKV-S achieves 1.0 point higher mIoU than ViT-S with a 14% decrease in FLOPs. VRWKV-L reaches 53.5 mIoU, on par with ViT-L, while its backbone requires about 25G fewer FLOPs. These results demonstrate that our VRWKV backbones extract better features for semantic segmentation than ViT backbones while also being more efficient, benefiting from the linear-complexity attention mechanism.

4.4 Ablation Study

Settings. We conduct ablation studies on the tiny-size VRWKV on ImageNet-1K [8] to validate the effectiveness of key components such as Q-Shift and bidirectional attention. The experimental settings are consistent with Sec. 4.1.

Token Shift. We compare the performance of using no token shift, the original shift method in RWKV [35], and our Q-Shift. As shown in Tab. 5, the choice of shift method leads to clear performance differences. Variant 1, which uses no token shift, reaches only 71.5 top-1 accuracy, 3.6 points lower than our model. Even with our global attention, the model using the original token shift (Variant 2) still trails our model by 0.7 points.
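To make the mechanism being ablated concrete, below is a minimal PyTorch sketch of a quadrant-wise, four-direction token shift in the spirit of Q-Shift: each quarter of the channels borrows features from one spatial neighbour, with zero padding at the borders. The equal channel split and the omission of the learnable interpolation with the unshifted features are assumptions of this sketch, not a copy of the released implementation.

    import torch

    def q_shift(x: torch.Tensor) -> torch.Tensor:
        """Shift one quarter of the channels from each of the four neighbours.

        x: (B, C, H, W) feature map; border positions are zero-padded.
        """
        B, C, H, W = x.shape
        q = C // 4
        out = torch.zeros_like(x)
        out[:, 0 * q:1 * q, :, 1:] = x[:, 0 * q:1 * q, :, :-1]   # from the left neighbour
        out[:, 1 * q:2 * q, :, :-1] = x[:, 1 * q:2 * q, :, 1:]   # from the right neighbour
        out[:, 2 * q:3 * q, 1:, :] = x[:, 2 * q:3 * q, :-1, :]   # from the upper neighbour
        out[:, 3 * q:, :-1, :] = x[:, 3 * q:, 1:, :]             # from the lower neighbour
        return out

    y = q_shift(torch.randn(1, 8, 4, 4))  # toy usage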

Bidirectional Attention. The bidirectional attention mechanism enables the model to attend globally, whereas the original RWKV attention applies a causal mask internally. Comparing Variant 3 (Q-Shift with causal attention) against our full model shows that the bidirectional global attention brings a 2.3-point increase in top-1 accuracy.
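As an illustration of what “bidirectional” means here, the following is a naive O(T²) PyTorch reference of a bidirectional, distance-decayed WKV-style aggregation. The exact decay normalization (dividing the distance term by the token count T) and the bonus term u for the current token follow our reading of the operator and should be treated as assumptions; the actual Bi-WKV is evaluated with a linear-time recurrence in a fused kernel, not this quadratic form.

    import torch

    def bi_wkv_naive(k: torch.Tensor, v: torch.Tensor, w: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
        """Quadratic-time reference of a bidirectional WKV aggregation.

        k, v: (T, C) keys/values; w: (C,) non-negative per-channel decay;
        u: (C,) per-channel bonus applied to the current token.
        """
        T, C = k.shape
        out = torch.empty_like(v)
        idx = torch.arange(T, device=k.device)
        for t in range(T):
            dist = (idx - t).abs().float().unsqueeze(-1)        # (T, 1) token distances
            weight = torch.exp(-(dist - 1) / T * w + k)          # decayed contribution of every token
            weight[t] = torch.exp(u + k[t])                      # current token uses the bonus instead
            out[t] = (weight * v).sum(0) / weight.sum(0)
        return out

    T, C = 6, 4
    out = bi_wkv_naive(torch.randn(T, C), torch.randn(T, C), torch.rand(C), torch.zeros(C))  # toy usage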

Method            Token Shift   Bidirectional Attention   Top-1 Acc
RWKV [35]         original      ✗                         71.1 (-4.0)
Variant 1         none          ✓                         71.5 (-3.6)
Variant 2         original      ✓                         74.4 (-0.7)
Variant 3         Q-Shift       ✗                         72.8 (-2.3)
VRWKV-T (ours)    Q-Shift       ✓                         75.1
Table 5: Ablation on key components of the proposed VRWKV. All models are trained from scratch on ImageNet-1K. “original” denotes the original token shift in RWKV [35], which mixes tokens in a single direction.

Effective Receptive Field (ERF). Following [11], we analyze the impact of different design choices on the ERF and visualize the results in Fig. 3(a). We visualize the ERF of the central pixel with an input size of 1024×1024. In Fig. 3(a), “No Shift” denotes the absence of the token shift method (Q-Shift), and “RWKV Attn” denotes the original RWKV attention mechanism without our modifications for vision tasks. As the comparison shows, all models except “RWKV Attn” achieve a global receptive field, and the global capacity of VRWKV-T is stronger than that of ViT-T. Despite the assistance of Q-Shift, the central pixel in “RWKV Attn” still cannot attend to pixels at the bottom of the image because of the large input resolution. The comparison between “No Shift” and Q-Shift shows that Q-Shift expands the core range of the receptive field, enhancing the inductive bias of the global attention.
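ERF maps of this kind are typically obtained by back-propagating the response of the central output position to the input and plotting the gradient magnitude. The sketch below follows that common recipe; `backbone` is a toy placeholder, and the exact normalization used for Fig. 3(a) is not specified here, so treat the details as assumptions.

    import torch

    def erf_map(model: torch.nn.Module, img: torch.Tensor) -> torch.Tensor:
        """Gradient of the central output position w.r.t. the input image.

        model: any backbone returning a (B, C, H, W) feature map.
        Returns a (B, H_in, W_in) map of absolute input-gradient magnitudes.
        """
        img = img.clone().requires_grad_(True)
        feat = model(img)
        h, w = feat.shape[-2:]
        feat[..., h // 2, w // 2].sum().backward()   # scalar response of the central pixel
        return img.grad.abs().sum(dim=1)

    backbone = torch.nn.Conv2d(3, 8, kernel_size=7, padding=3)   # toy stand-in
    heat = erf_map(backbone, torch.randn(1, 3, 64, 64))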

Efficiency Analysis. We gradually increase the input resolution from 224×224 to 2048×2048 and compare the inference and memory efficiency of VRWKV-T and ViT-T. The results are measured on an Nvidia A100 GPU, as shown in Fig. 1. The curves in Fig. 1(b) show that at lower resolutions, such as 224×224 with around 200 image tokens, VRWKV-T and ViT-T exhibit comparable memory usage, though VRWKV-T has a slightly lower FPS than ViT-T. However, as the resolution increases, VRWKV-T’s FPS rapidly exceeds that of ViT-T, thanks to its linear attention mechanism. Additionally, VRWKV-T’s RNN-like computational framework keeps the growth of memory usage slow. By the time the resolution reaches 2048×2048 (equivalent to 16384 tokens), VRWKV-T’s inference speed is 10 times faster than ViT-T’s, and its memory consumption is 80% lower.
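Measurements of this kind can be reproduced with a simple probe like the one below, which times repeated forward passes and reads the peak CUDA memory. The placeholder `backbone`, the iteration count, and the batch size of 1 are assumptions of this sketch rather than the exact protocol behind Fig. 1(b).

    import time
    import torch

    @torch.no_grad()
    def probe(model: torch.nn.Module, resolution: int, iters: int = 10, device: str = "cuda"):
        """Return (frames per second, peak memory in MiB) for one backbone at one resolution."""
        x = torch.randn(1, 3, resolution, resolution, device=device)
        model = model.to(device).eval()
        torch.cuda.reset_peak_memory_stats(device)
        torch.cuda.synchronize(device)
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize(device)
        fps = iters / (time.time() - start)
        mem = torch.cuda.max_memory_allocated(device) / 2**20
        return fps, mem

    if torch.cuda.is_available():
        backbone = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)   # toy stand-in
        for res in (224, 512, 1024, 2048):
            print(res, probe(backbone, res))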

We also compare the speed of our Bi-WKV operator with flash attention [7], as reported in Fig. 3(b). Flash attention is highly efficient at low resolutions, but because of its quadratic complexity its speed drops rapidly as the resolution increases. In high-resolution scenarios, our linear operator Bi-WKV demonstrates a significant speed advantage. For instance, when the input is 2048×2048 (i.e., 16384 tokens), with channel and head settings matching ViT-B and VRWKV-B, our Bi-WKV operator is 2.8× faster than flash attention in inference runtime and 2.7× faster in the combined forward and backward pass.

MAE Pre-training. Similar to ViT, our VRWKV model can handle sparse inputs and benefits from MAE pre-training [18]. By simply modifying the Q-Shift to perform a bidirectional shift operation, VRWKV can be pre-trained with MAE. The pre-trained weights can then be directly fine-tuned for other tasks in the Q-Shift manner. Following the same MAE pre-training setting as ViT and the subsequent classification training described in Sec. 4.1, our VRWKV-L improves its top-1 accuracy on ImageNet-1K val from 86.0% to 86.2%, showing its ability to acquire visual priors from masked image modeling.
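For context, MAE pre-training feeds the encoder only a random subset of patch tokens. Below is a minimal sketch of the standard MAE-style random masking; the 0.75 mask ratio is the common MAE default and is an assumption here, not a number taken from this paper.

    import torch

    def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
        """Keep a random subset of patch tokens per sample, as in MAE.

        tokens: (B, N, C) patch embeddings. Returns the visible tokens
        (B, N_keep, C) and the per-sample shuffle order used to restore them.
        """
        B, N, C = tokens.shape
        n_keep = int(N * (1 - mask_ratio))
        noise = torch.rand(B, N, device=tokens.device)
        ids_shuffle = noise.argsort(dim=1)                 # random permutation per sample
        ids_keep = ids_shuffle[:, :n_keep]
        visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, C))
        return visible, ids_shuffle

    visible, order = random_masking(torch.randn(2, 196, 192))   # toy usage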

Figure 3: Comparison of effective receptive field (ERF) and attention runtime. (a) ERF of ViT and VRWKV in different settings. “No Shift” means no shift is used in the spatial-mix and channel-mix modules. “RWKV Attn” means the original RWKV attention [35] without our modifications. Our VRWKV with Q-Shift and Bi-WKV has a more comprehensive ERF than the other counterparts. (b) Attention runtime for inference (left) and forward + backward (right), tested on an Nvidia A100 GPU.

5 Conclusion

We propose Vision-RWKV (VRWKV), a vision encoder with a linear-complexity attention mechanism. We demonstrate that it can serve as an alternative backbone to ViT across comprehensive vision tasks, including classification, dense prediction, and masked image modeling pre-training. With comparable performance and scalability, VRWKV exhibits lower computational complexity and memory consumption. Benefiting from its low complexity, VRWKV achieves better performance in tasks where ViT struggles to afford the high computational overhead of global attention. We hope VRWKV will serve as an efficient and low-cost alternative to ViT, showcasing the powerful potential of linear-complexity transformers in the vision field.

References

  • [1] Ali, A., Touvron, H., Caron, M., Bojanowski, P., Douze, M., Joulin, A., Laptev, I., Neverova, N., Synnaeve, G., Verbeek, J., et al.: Xcit: Cross-covariance image transformers. NeurIPS 34 (2021)
  • [2] Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
  • [3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. NeurIPS 33, 1877–1901 (2020)
  • [4] Chen, Z., Duan, Y., Wang, W., He, J., Lu, T., Dai, J., Qiao, Y.: Vision transformer adapter for dense predictions (2023)
  • [5] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: ICCV. pp. 764–773 (2017)
  • [6] Dai, J., Shi, M., Wang, W., Wu, S., Xing, L., Wang, W., Zhu, X., Lu, L., Zhou, J., Wang, X., et al.: Demystify transformers & convolutions in modern image deep networks. arXiv preprint arXiv:2211.05781 (2022)
  • [7] Dao, T., Fu, D., Ermon, S., Rudra, A., Ré, C.: Flashattention: Fast and memory-efficient exact attention with io-awareness. NeurIPS 35, 16344–16359 (2022)
  • [8] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR. pp. 248–255 (2009)
  • [9] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • [10] Ding, J., Ma, S., Dong, L., Zhang, X., Huang, S., Wang, W., Zheng, N., Wei, F.: Longnet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486 (2023)
  • [11] Ding, X., Zhang, X., Han, J., Ding, G.: Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In: CVPR. pp. 11963–11975 (2022)
  • [12] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2020)
  • [13] Elman, J.L.: Finding structure in time. Cognitive Science 14(2), 179–211 (1990)
  • [14] Fan, Q., Huang, H., Chen, M., Liu, H., He, R.: Rmt: Retentive networks meet vision transformers. arXiv preprint arXiv:2309.11523 (2023)
  • [15] Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023)
  • [16] Gu, A., Goel, K., Ré, C.: Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396 (2021)
  • [17] Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., Ré, C.: Combining recurrent, convolutional, and continuous-time models with linear state space layers. NeurIPS 34, 572–585 (2021)
  • [18] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021)
  • [19] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: ICCV. pp. 2961–2969 (2017)
  • [20] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
  • [21] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: CVPR. pp. 7132–7141 (2018)
  • [22] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR. pp. 4700–4708 (2017)
  • [23] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. NeurIPS 25 (2012)
  • [24] LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks 3361(10),  1995 (1995)
  • [25] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., Zettlemoyer, L.: Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019)
  • [26] Li, X., Wang, W., Hu, X., Yang, J.: Selective kernel networks. In: CVPR. pp. 510–519 (2019)
  • [27] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV. pp. 740–755 (2014)
  • [28] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
  • [29] Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Liu, Y.: Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166 (2024)
  • [30] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV. pp. 10012–10022 (2021)
  • [31] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR. pp. 11976–11986 (2022)
  • [32] Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
  • [33] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  • [34] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
  • [35] Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Cao, H., Cheng, X., Chung, M., Grella, M., GV, K.K., et al.: Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048 (2023)
  • [36] Qin, Z., Han, X., Sun, W., He, B., Li, D., Li, D., Dai, Y., Kong, L., Zhong, Y.: Toeplitz neural network for sequence modeling. arXiv preprint arXiv:2305.04749 (2023)
  • [37] Qin, Z., Yang, S., Zhong, Y.: Hierarchically gated recurrent neural network for sequence modeling. Advances in Neural Information Processing Systems 36 (2024)
  • [38] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
  • [39] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8),  9 (2019)
  • [40] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019)
  • [41] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [42] Smith, J.T., Warrington, A., Linderman, S.W.: Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933 (2022)
  • [43] Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022)
  • [44] Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your vit? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
  • [45] Stickland, A.C., Murray, I.: Bert and pals: Projected attention layers for efficient adaptation in multi-task learning. In: ICML. pp. 5986–5995 (2019)
  • [46] Sun, W., Qin, Z., Deng, H., Wang, J., Zhang, Y., Zhang, K., Barnes, N., Birchfield, S., Kong, L., Zhong, Y.: Vicinity vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
  • [47] Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., Wei, F.: Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621 (2023)
  • [48] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR. pp. 1–9 (2015)
  • [49] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML. pp. 10347–10357 (2021)
  • [50] Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. In: ICCV. pp. 32–42 (2021)
  • [51] Vaswani, A., Ramachandran, P., Srinivas, A., Parmar, N., Hechtman, B., Shlens, J.: Scaling local self-attention for parameter efficient visual backbones. In: CVPR. pp. 12894–12904 (2021)
  • [52] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. NeurIPS 30 (2017)
  • [53] Wang, J., Zhu, W., Wang, P., Yu, X., Liu, L., Omar, M., Hamid, R.: Selective structured state-spaces for long-form video understanding. In: CVPR. pp. 6387–6397 (2023)
  • [54] Wang, S., Li, B.Z., Khabsa, M., Fang, H., Ma, H.: Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768 (2020)
  • [55] Wang, W., Dai, J., Chen, Z., Huang, Z., Li, Z., Zhu, X., Hu, X., Lu, T., Lu, L., Li, H., et al.: Internimage: Exploring large-scale vision foundation models with deformable convolutions. In: CVPR. pp. 14408–14419 (2023)
  • [56] Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: ICCV. pp. 568–578 (2021)
  • [57] Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pvtv2: Improved baselines with pyramid vision transformer. CVMJ pp. 1–10 (2022)
  • [58] Wu, S., Wu, T., Tan, H., Guo, G.: Pale transformer: A general vision transformer backbone with pale-shaped attention. In: AAAI. vol. 36, pp. 2731–2739 (2022)
  • [59] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV. pp. 418–434 (2018)
  • [60] Xiong, Y., Li, Z., Chen, Y., Wang, F., Zhu, X., Luo, J., Wang, W., Lu, T., Li, H., Qiao, Y., et al.: Efficient deformable convnets: Rethinking dynamic and sparse operator for vision applications. arXiv preprint arXiv:2401.06197 (2024)
  • [61] Zhang, Y., Sun, Q., Zhou, Y., He, Z., Yin, Z., Wang, K., Sheng, L., Qiao, Y., Shao, J., Liu, Z.: Bamboo: Building mega-scale vision dataset continually with human-machine synergy. arXiv preprint arXiv:2203.07845 (2022)
  • [62] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: CVPR. pp. 633–641 (2017)
  • [63] Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., Wang, X.: Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417 (2024)
  • [64] Zhu, X., Hu, H., Lin, S., Dai, J.: Deformable convnets v2: More deformable, better results. In: CVPR. pp. 9308–9316 (2019)