
A Survey on Visual Transformer

Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, and Dacheng Tao
Kai Han, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, and Yunhe Wang are with Huawei Noah’s Ark Lab. E-mail: {kai.han, yunhe.wang}@huawei.com. Hanting Chen, Zhenhua Liu, Yehui Tang, and Zhaohui Yang are also with School of EECS, Peking University. Dacheng Tao is with the School of Computer Science, in the Faculty of Engineering, at The University of Sydney, 6 Cleveland St, Darlington, NSW 2008, Australia. E-mail: dacheng.tao@sydney.edu.au. Corresponding to Yunhe Wang and Dacheng Tao. All authors are listed in alphabetical order of last name (except the primary and corresponding authors).
Abstract

Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to apply transformer to computer vision tasks. In a variety of visual benchmarks, transformer-based models perform similarly to or better than other types of networks such as convolutional and recurrent neural networks. Given its high performance and reduced need for vision-specific inductive bias, transformer is receiving more and more attention from the computer vision community. In this paper, we review these vision transformer models by categorizing them by task and analyzing their advantages and disadvantages. The main categories we explore include the backbone network, high/mid-level vision, low-level vision, and video processing. We also include efficient transformer methods for pushing transformer into real device-based applications. Furthermore, we take a brief look at the self-attention mechanism in computer vision, as it is the base component in transformer. Toward the end of this paper, we discuss the challenges and provide several further research directions for vision transformers.

Index Terms:
Transformer, Self-attention, Computer Vision, High-level vision, Low-level vision, Video.

1 Introduction

Deep neural networks (DNNs) have become the fundamental infrastructure in today’s artificial intelligence (AI) systems. Different types of tasks have typically involved different types of networks. For example, multi-layer perceptron (MLP) or the fully connected (FC) network is the classical type of neural network, which is composed of multiple linear layers and nonlinear activations stacked together [1, 2]. Convolutional neural networks (CNNs) introduce convolutional layers and pooling layers for processing shift-invariant data such as images [3, 4]. And recurrent neural networks (RNNs) utilize recurrent cells to process sequential data or time series data [5, 6]. Transformer is a new type of neural network. It mainly utilizes the self-attention mechanism [7, 8] to extract intrinsic features [9] and shows great potential for extensive use in AI applications.

Figure 1: Key milestones in the development of transformer. The vision transformer models are marked in red.

Transformer was first applied to natural language processing (NLP) tasks where it achieved significant improvements [9, 10, 11]. For example, Vaswani et al. [9] first proposed transformer based on attention mechanism for machine translation and English constituency parsing tasks. Devlin et al. [10] introduced a new language representation model called BERT (short for Bidirectional Encoder Representations from Transformers), which pre-trains a transformer on unlabeled text taking into account the context of each word as it is bidirectional. When BERT was published, it obtained state-of-the-art performance on 11 NLP tasks. Brown et al. [11] pre-trained a massive transformer-based model called GPT-3 (short for Generative Pre-trained Transformer 3) on 45 TB of compressed plaintext data using 175 billion parameters. It achieved strong performance on different types of downstream natural language tasks without requiring any fine-tuning. These transformer-based models, with their strong representation capacity, have achieved significant breakthroughs in NLP.

Inspired by the major success of transformer architectures in the field of NLP, researchers have recently applied transformer to computer vision (CV) tasks. In vision applications, CNNs are considered the fundamental component [12, 13], but nowadays transformer is showing it is a potential alternative to CNN. Chen et al. [14] trained a sequence transformer to auto-regressively predict pixels, achieving results comparable to CNNs on image classification tasks. Another vision transformer model is ViT, which applies a pure transformer directly to sequences of image patches to classify the full image. Recently proposed by Dosovitskiy et al. [15], it has achieved state-of-the-art performance on multiple image recognition benchmarks. In addition to image classification, transformer has been utilized to address a variety of other vision problems, including object detection [16, 17], semantic segmentation [18], image processing [19], and video understanding [20]. Thanks to its exceptional performance, more and more researchers are proposing transformer-based models for improving a wide range of visual tasks.

Due to the rapid increase in the number of transformer-based vision models, keeping pace with the rate of new progress is becoming increasingly difficult. As such, a survey of the existing works is urgent and would be beneficial for the community. In this paper, we focus on providing a comprehensive overview of the recent advances in vision transformers and discuss the potential directions for further improvement. To facilitate future research on different topics, we categorize the transformer models by their application scenarios, as listed in Table I. The main categories include backbone network, high/mid-level vision, low-level vision, and video processing. High-level vision deals with the interpretation and use of what is seen in the image [21], whereas mid-level vision deals with how this information is organized into what we experience as objects and surfaces [22]. Given the gap between high- and mid-level vision is becoming more obscure in DNN-based vision systems [23, 24], we treat them as a single category here. A few examples of transformer models that address these high/mid-level vision tasks include DETR [16], deformable DETR [17] for object detection, and Max-DeepLab [25] for segmentation. Low-level image processing mainly deals with extracting descriptions from images (such descriptions are usually represented as images themselves) [26]. Typical applications of low-level image processing include super-resolution, image denoising, and style transfer. At present, only a few works [19, 27] in low-level vision use transformers, creating the need for further investigation. Another category is video processing, which is an important part in both computer vision and image-based tasks. Due to the sequential property of video, transformer is inherently well suited for use on video tasks [20, 28], in which it is beginning to perform on par with conventional CNNs and RNNs. Here, we survey the works associated with transformer-based visual models in order to track the progress in this field. Figure 1 shows the development timeline of vision transformer — undoubtedly, there will be many more milestones in the future.

The rest of the paper is organized as follows. Section 2 discusses the formulation of the standard transformer and the self-attention mechanism. Section 3 is the main part of the paper, in which we summarize the vision transformer models on backbone, high/mid-level vision, low-level vision, and video tasks. We also briefly describe efficient transformer methods, as they are closely related to our main topic. In the final section, we give our conclusion and discuss several research directions and challenges. Due to the page limit, we describe the methods of transformer in NLP in the supplemental material, as the research experience may be beneficial for vision tasks. In the supplemental material, we also review the self-attention mechanism for CV as a supplement to vision transformer models. In this survey, we mainly include the representative works (early, pioneering, novel, or inspiring works), since there are many preprint works on arXiv and we cannot include them all within the limited pages.

TABLE I: Representative works of vision transformers.
| Category | Sub-category | Method | Highlights | Publication |
| --- | --- | --- | --- | --- |
| Backbone | Supervised pretraining | ViT [15] | Image patches, standard transformer | ICLR 2021 |
| Backbone | Supervised pretraining | TNT [29] | Transformer in transformer, local attention | NeurIPS 2021 |
| Backbone | Supervised pretraining | Swin [30] | Shifted window, window-based self-attention | ICCV 2021 |
| Backbone | Self-supervised pretraining | iGPT [14] | Pixel prediction self-supervised learning, GPT model | ICML 2020 |
| Backbone | Self-supervised pretraining | MoCo v3 [31] | Contrastive self-supervised learning, ViT | ICCV 2021 |
| Backbone | Self-supervised pretraining | MAE [32] | Masked image modeling, ViT | CVPR 2022 |
| High/Mid-level vision | Object detection | DETR [16] | Set-based prediction, bipartite matching, transformer | ECCV 2020 |
| High/Mid-level vision | Object detection | Deformable DETR [17] | DETR, deformable attention module | ICLR 2021 |
| High/Mid-level vision | Object detection | UP-DETR [33] | Unsupervised pre-training, random query patch detection | CVPR 2021 |
| High/Mid-level vision | Segmentation | Max-DeepLab [25] | PQ-style bipartite matching, dual-path transformer | CVPR 2021 |
| High/Mid-level vision | Segmentation | VisTR [34] | Instance sequence matching and segmentation | CVPR 2021 |
| High/Mid-level vision | Segmentation | SETR [18] | Sequence-to-sequence prediction, standard transformer | CVPR 2021 |
| High/Mid-level vision | Pose estimation | Hand-Transformer [35] | Non-autoregressive transformer, 3D point set | ECCV 2020 |
| High/Mid-level vision | Pose estimation | HOT-Net [36] | Structured-reference extractor | MM 2020 |
| High/Mid-level vision | Pose estimation | METRO [37] | Progressive dimensionality reduction | CVPR 2021 |
| Low-level vision | Image generation | Image Transformer [27] | Pixel generation using transformer | ICML 2018 |
| Low-level vision | Image generation | Taming transformer [38] | VQ-GAN, auto-regressive transformer | CVPR 2021 |
| Low-level vision | Image generation | TransGAN [39] | GAN using pure transformer architecture | NeurIPS 2021 |
| Low-level vision | Image enhancement | IPT [19] | Multi-task, ImageNet pre-training, transformer model | CVPR 2021 |
| Low-level vision | Image enhancement | TTSR [40] | Texture transformer, RefSR | CVPR 2020 |
| Video processing | Video inpainting | STTN [28] | Spatial-temporal adversarial loss | ECCV 2020 |
| Video processing | Video captioning | Masked Transformer [20] | Masking network, event proposal | CVPR 2018 |
| Multimodality | Classification | CLIP [41] | NLP supervision for images, zero-shot transfer | arXiv 2021 |
| Multimodality | Image generation | DALL-E [42] | Zero-shot text-to-image generation | ICML 2021 |
| Multimodality | Image generation | Cogview [43] | VQ-VAE, Chinese input | NeurIPS 2021 |
| Multimodality | Multi-task | GPT-4 [44] | Large multi-modal model for NLP & CV tasks | arXiv 2023 |
| Efficient transformer | Decomposition | ASH [45] | Number of heads, importance estimation | NeurIPS 2019 |
| Efficient transformer | Distillation | TinyBert [46] | Various losses for different modules | EMNLP Findings 2020 |
| Efficient transformer | Quantization | FullyQT [47] | Fully quantized transformer | EMNLP Findings 2020 |
| Efficient transformer | Architecture design | ConvBert [48] | Local dependence, dynamic convolution | NeurIPS 2020 |

2 Formulation of Transformer

Transformer [9] was first used in the field of natural language processing (NLP) on machine translation tasks. As shown in Figure 2, it consists of an encoder and a decoder with several transformer blocks of the same architecture. The encoder generates encodings of the inputs, while the decoder takes all the encodings and uses their incorporated contextual information to generate the output sequence. Each transformer block is composed of a multi-head attention layer, a feed-forward neural network, shortcut connection and layer normalization. In the following, we describe each component of the transformer in detail.

Figure 2: Structure of the original transformer (image from [9]).

2.1 Self-Attention

In the self-attention layer, the input vector is first transformed into three different vectors: the query vector $\mathbf{q}$, the key vector $\mathbf{k}$ and the value vector $\mathbf{v}$ with dimension $d_q = d_k = d_v = d_{model} = 512$. Vectors derived from different inputs are then packed together into three different matrices, namely, $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$. Subsequently, the attention function between different input vectors is calculated as follows (and shown in Figure 3 left):

  • Step 1: Compute scores between different input vectors with $\mathbf{S}=\mathbf{Q}\cdot\mathbf{K}^{\top}$;
  • Step 2: Normalize the scores for the stability of gradient with $\mathbf{S}_n=\mathbf{S}/\sqrt{d_k}$;
  • Step 3: Translate the scores into probabilities with the softmax function $\mathbf{P}=\mathrm{softmax}(\mathbf{S}_n)$;
  • Step 4: Obtain the weighted value matrix with $\mathbf{Z}=\mathbf{P}\cdot\mathbf{V}$.

The process can be unified into a single function:

$$\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathrm{softmax}\left(\frac{\mathbf{Q}\cdot\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\cdot\mathbf{V}. \qquad (1)$$

The logic behind Eq. 1 is simple. Step 1 computes scores between each pair of different vectors, and these scores determine the degree of attention that we give other words when encoding the word at the current position. Step 2 normalizes the scores to enhance gradient stability for improved training, and step 3 translates the scores into probabilities. Finally, the value vectors are weighted by these probabilities and summed. Vectors with larger probabilities receive additional focus from the following layers.
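To make the four steps concrete, here is a minimal PyTorch sketch of scaled dot-product self-attention (the function name, the batch-first tensor layout and the toy usage are illustrative assumptions, not code from the surveyed papers):

```python
import torch
import torch.nn.functional as F

def self_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k) tensors produced by learned linear projections
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1))   # Step 1: S = Q K^T
    scores = scores / (d_k ** 0.5)                  # Step 2: scale by sqrt(d_k)
    probs = F.softmax(scores, dim=-1)               # Step 3: row-wise softmax
    return torch.matmul(probs, v)                   # Step 4: probability-weighted sum of values

# toy usage: a sequence of 5 tokens with d_model = d_k = 512
x = torch.randn(1, 5, 512)
w_q, w_k, w_v = (torch.nn.Linear(512, 512) for _ in range(3))
out = self_attention(w_q(x), w_k(x), w_v(x))
print(out.shape)  # torch.Size([1, 5, 512])
```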

The encoder-decoder attention layer in the decoder module is similar to the self-attention layer in the encoder module with the following exceptions: the key matrix $K$ and value matrix $V$ are derived from the encoder module, and the query matrix $Q$ is derived from the previous layer.

Note that the preceding process is invariant to the position of each word, meaning that the self-attention layer lacks the ability to capture the positional information of words in a sentence. However, the sequential nature of sentences in a language requires us to incorporate the positional information within our encoding. To address this issue and allow the final input vector of the word to be obtained, a positional encoding with dimension $d_{model}$ is added to the original input embedding. Specifically, the position is encoded with the following equations:

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right); \qquad (2)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad (3)$$

in which $pos$ denotes the position of the word in a sentence, and $i$ represents the current dimension of the positional encoding. In this way, each element of the positional encoding corresponds to a sinusoid, and it allows the transformer model to learn to attend by relative positions and extrapolate to longer sequence lengths during inference. Apart from the fixed positional encoding in the vanilla transformer, learned positional encoding [49] and relative positional encoding [50] are also utilized in various models [10, 15].
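As a concrete illustration of Eqs. 2 and 3, the snippet below builds the fixed sinusoidal positional encoding table (a minimal sketch assuming an even $d_{model}$; the function name is illustrative):

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int = 512) -> torch.Tensor:
    # pe[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)     # (max_len, 1)
    i = torch.arange(d_model // 2, dtype=torch.float32).unsqueeze(0)  # (1, d_model/2)
    angles = pos / torch.pow(torch.tensor(10000.0), 2 * i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe  # added element-wise to the word/patch embeddings

print(sinusoidal_positional_encoding(128).shape)  # torch.Size([128, 512])
```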

Figure 3: (Left) Self-attention process. (Right) Multi-head attention. The image is from [9].

Multi-Head Attention. Multi-head attention is a mechanism that can be used to boost the performance of the vanilla self-attention layer. Note that for a given reference word, we often want to focus on several other words when going through the sentence. A single-head self-attention layer limits our ability to focus on one or more specific positions without influencing the attention on other equally important positions at the same time. Multi-head attention addresses this by giving the attention layers different representation subspaces. Specifically, different query, key and value matrices are used for different heads, and these matrices can project the input vectors into different representation subspaces after training due to random initialization.

To elaborate on this in greater detail, given an input vector and the number of heads $h$, the input vector is first transformed into three different groups of vectors: the query group, the key group and the value group. In each group, there are $h$ vectors with dimension $d_{q'} = d_{k'} = d_{v'} = d_{model}/h = 64$. The vectors derived from different inputs are then packed together into three different groups of matrices: $\{\mathbf{Q}_i\}_{i=1}^{h}$, $\{\mathbf{K}_i\}_{i=1}^{h}$ and $\{\mathbf{V}_i\}_{i=1}^{h}$. The multi-head attention process is shown as follows:

$$\mathrm{MultiHead}(\mathbf{Q}^{\prime},\mathbf{K}^{\prime},\mathbf{V}^{\prime}) = \mathrm{Concat}(\mathrm{head}_{1},\cdots,\mathrm{head}_{h})\,\mathbf{W}^{o}, \quad \text{where } \mathrm{head}_{i} = \mathrm{Attention}(\mathbf{Q}_{i},\mathbf{K}_{i},\mathbf{V}_{i}). \qquad (4)$$

Here, $\mathbf{Q}^{\prime}$ (and similarly $\mathbf{K}^{\prime}$ and $\mathbf{V}^{\prime}$) is the concatenation of $\{\mathbf{Q}_i\}_{i=1}^{h}$, and $\mathbf{W}^{o}\in\mathbb{R}^{d_{model}\times d_{model}}$ is the projection weight.
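A minimal sketch of multi-head attention following Eq. 4 is given below (the class name and the fused per-head projections are implementation assumptions; dropout and attention masks are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads  # e.g. 512 / 8 = 64 per head
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)                 # projection weight W^o

    def forward(self, x):
        b, n, _ = x.shape
        # project and split into h heads: (b, h, n, d_head)
        q, k, v = [w(x).view(b, n, self.h, self.d_head).transpose(1, 2)
                   for w in (self.w_q, self.w_k, self.w_v)]
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # per-head scaled dot products
        out = F.softmax(scores, dim=-1) @ v                    # head_i = Attention(Q_i, K_i, V_i)
        out = out.transpose(1, 2).reshape(b, n, self.h * self.d_head)
        return self.w_o(out)                                   # Concat(head_1..h) W^o

x = torch.randn(2, 16, 512)
print(MultiHeadAttention()(x).shape)  # torch.Size([2, 16, 512])
```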

2.2 Other Key Concepts in Transformer

Feed-Forward Network. A feed-forward network (FFN) is applied after the self-attention layers in each encoder and decoder. It consists of two linear transformation layers with a nonlinear activation function in between, and can be denoted as the following function:

$$\mathrm{FFN}(\mathbf{X})=\mathbf{W}_{2}\,\sigma(\mathbf{W}_{1}\mathbf{X}), \qquad (5)$$

where $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$ are the two parameter matrices of the two linear transformation layers, and $\sigma$ represents the nonlinear activation function, such as GELU [51]. The dimensionality of the hidden layer is $d_{h}=2048$.

Residual Connection in the Encoder and Decoder. As shown in Figure 2, a residual connection is added to each sub-layer in the encoder and decoder. This strengthens the flow of information in order to achieve higher performance. Layer normalization [52] follows the residual connection. The output of these operations can be described as:

$$\mathrm{LayerNorm}(\mathbf{X}+\mathrm{Attention}(\mathbf{X})). \qquad (6)$$

Here, $\mathbf{X}$ is used as the input of the self-attention layer, and the query, key and value matrices $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$ are all derived from the same input matrix $\mathbf{X}$. A variant, pre-layer normalization (Pre-LN), is also widely used [53, 54, 15]. Pre-LN inserts the layer normalization inside the residual connection, before the multi-head attention or FFN. For the normalization layer, there are several alternatives such as batch normalization [55]. Batch normalization usually performs worse when applied to transformer because the feature values change acutely [56]. Some other normalization algorithms [57, 56, 58] have been proposed to improve the training of transformer.
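Putting Eq. 5 and Eq. 6 together, a post-LN transformer encoder block can be sketched as follows (a minimal sketch assuming PyTorch's built-in nn.MultiheadAttention and the hidden size $d_h=2048$ mentioned above; dropout and masking are omitted):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Post-LN encoder block: multi-head self-attention and FFN, each wrapped in a
    residual connection followed by layer normalization (Eq. 6)."""
    def __init__(self, d_model=512, num_heads=8, d_h=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(                   # FFN(X) = W2 * sigma(W1 X), sigma = GELU
            nn.Linear(d_model, d_h), nn.GELU(), nn.Linear(d_h, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                           # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)            # self-attention: Q = K = V = X
        x = self.norm1(x + attn_out)                # LayerNorm(X + Attention(X))
        x = self.norm2(x + self.ffn(x))             # the same pattern around the FFN
        return x

x = torch.randn(2, 16, 512)
print(EncoderBlock()(x).shape)  # torch.Size([2, 16, 512])
```

The Pre-LN variant described above would instead apply the layer normalization to the input of the attention and FFN sub-layers, inside the residual branch.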

Final Layer in the Decoder. The final layer in the decoder is used to turn the stack of vectors back into a word. This is achieved by a linear layer followed by a softmax layer. The linear layer projects the vector into a logits vector with $d_{word}$ dimensions, in which $d_{word}$ is the number of words in the vocabulary. The softmax layer is then used to transform the logits vector into probabilities.

When used for CV tasks, most transformers adopt the original transformer’s encoder module. Such transformers can be treated as a new type of feature extractor. Compared with CNNs which focus only on local characteristics, transformer can capture long-distance characteristics, meaning that it can easily derive global information. And in contrast to RNNs, whose hidden state must be computed sequentially, transformer is more efficient because the output of the self-attention layer and the fully connected layers can be computed in parallel and easily accelerated. From this, we can conclude that further study into using transformer in computer vision as well as NLP would yield beneficial results.

3 Vision Transformer

In this section, we review the applications of transformer-based models in computer vision, including image classification, high/mid-level vision, low-level vision and video processing. We also briefly summarize the applications of the self-attention mechanism and model compression methods for efficient transformer.

Figure 4: A taxonomy of backbone using convolution and attention.

3.1 Backbone for Representation Learning

Inspired by the success that transformer has achieved in the field of NLP, some researchers have explored whether similar models can learn useful representations for images. Given that images involve more dimensions, noise and redundant modality compared to text, they are believed to be more difficult for generative modeling.

In addition to CNNs, the transformer can be used as a backbone network for image classification. Wu et al. [59] adopted ResNet as a convenient baseline and used vision transformers to replace the last stage of convolutions. Specifically, they apply convolutional layers to extract low-level features that are then fed into the vision transformer. For the vision transformer, they use a tokenizer to group pixels into a small number of visual tokens, each representing a semantic concept in the image. These visual tokens are used directly for image classification, with the transformers being used to model the relationships between tokens. As shown in Figure 4, the works can be divided into purely using transformer for vision and combining CNN and transformer. We summarize the results of these models in Table II and Figure 6 to demonstrate the development of the backbones. In addition to supervised learning, self-supervised learning is also explored in vision transformer.

3.1.1 Pure Transformer

ViT. Vision Transformer (ViT) [15] is a pure transformer directly applied to sequences of image patches for the image classification task. It follows the transformer's original design as closely as possible. Figure 5 shows the framework of ViT.

To handle 2D images, the image $X\in\mathbb{R}^{h\times w\times c}$ is reshaped into a sequence of flattened 2D patches $X_p\in\mathbb{R}^{n\times(p^2\cdot c)}$, where $c$ is the number of channels, $(h,w)$ is the resolution of the original image, and $(p,p)$ is the resolution of each image patch. The effective sequence length for the transformer is therefore $n = hw/p^2$. Because the transformer uses constant widths in all of its layers, a trainable linear projection maps each vectorized patch to the model dimension $d$, the output of which is referred to as patch embeddings.

Similar to BERT's $[\mathrm{class}]$ token, a learnable embedding is prepended to the sequence of patch embeddings, and the state of this embedding serves as the image representation. During both the pre-training and fine-tuning stages, a classification head is attached to this embedding. In addition, 1D position embeddings are added to the patch embeddings in order to retain positional information. It is worth noting that ViT utilizes only the standard transformer's encoder (except for the place of the layer normalization), whose output precedes an MLP head. In most cases, ViT is pre-trained on large datasets, and then fine-tuned for downstream tasks with smaller data.
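The patch embedding step described above can be sketched as follows (a minimal sketch for a 224×224 input with 16×16 patches and $d=768$, matching ViT-B; the class name and the unfold-based reshape are illustrative assumptions):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into p x p patches, flatten and project them to dimension d,
    then prepend a learnable [class] token and add 1D position embeddings."""
    def __init__(self, h=224, w=224, c=3, p=16, d=768):
        super().__init__()
        self.p = p
        self.n = (h // p) * (w // p)                    # n = hw / p^2 patches
        self.proj = nn.Linear(p * p * c, d)             # trainable linear projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.n + 1, d))

    def forward(self, x):                               # x: (b, c, h, w)
        b, c, h, w = x.shape
        p = self.p
        # reshape into a sequence of flattened 2D patches: (b, n, p*p*c)
        patches = (x.unfold(2, p, p).unfold(3, p, p)    # (b, c, h/p, w/p, p, p)
                     .permute(0, 2, 3, 1, 4, 5)
                     .reshape(b, self.n, p * p * c))
        tokens = self.proj(patches)                     # patch embeddings: (b, n, d)
        cls = self.cls_token.expand(b, -1, -1)          # learnable [class] token
        return torch.cat([cls, tokens], dim=1) + self.pos_embed   # (b, n+1, d)

x = torch.randn(2, 3, 224, 224)
print(PatchEmbedding()(x).shape)  # torch.Size([2, 197, 768])
```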

ViT yields modest results when trained on mid-sized datasets such as ImageNet, achieving accuracies a few percentage points below ResNets of comparable size. Because transformers lack some inductive biases inherent to CNNs, such as translation equivariance and locality, they do not generalize well when trained on insufficient amounts of data. However, the authors found that training the models on large datasets (14 million to 300 million images) outweighs the missing inductive bias. When pre-trained at sufficient scale, transformers achieve excellent results on tasks with fewer datapoints. For example, when pre-trained on the JFT-300M dataset, ViT approached or even exceeded state-of-the-art performance on multiple image recognition benchmarks. Specifically, it reached an accuracy of 88.36% on ImageNet, and 77.16% on the VTAB suite of 19 tasks.

Touvron et al. [60] proposed a competitive convolution-free transformer, called Data-efficient image transformer (DeiT), by training on only the ImageNet database. DeiT-B, the reference vision transformer, has the same architecture as ViT-B and employs 86 million parameters. With a strong data augmentation, DeiT-B achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. In addition, the authors observe that using a CNN teacher gives better performance than using a transformer. Specifically, DeiT-B can achieve top-1 accuracy 84.40% with the help of a token-based distillation.

Figure 5: The framework of ViT (image from [15]).

Variants of ViT. Following the paradigm of ViT, a series of variants of ViT have been proposed to improve the performance on vision tasks. The main approaches include enhancing locality, self-attention improvement and architecture design.

The original vision transformer is good at capturing long-range dependencies between patches, but disregards local feature extraction, since each 2D patch is projected to a vector with a simple linear layer. Recently, researchers have begun to pay attention to improving the modeling capacity for local information [29, 61, 62]. TNT [29] further divides the patch into a number of sub-patches and introduces a novel transformer-in-transformer architecture which utilizes an inner transformer block to model the relationship between sub-patches and an outer transformer block for patch-level information exchange. Twins [63] and CAT [64] alternately perform local and global attention layer-by-layer. Swin Transformers [61, 65] perform local attention within a window and introduce a shifted window partitioning approach for cross-window connections. Shuffle Transformer [66, 67] further utilizes the spatial shuffle operation instead of shifted window partitioning to allow cross-window connections. RegionViT [62] generates regional tokens and local tokens from an image, and local tokens receive global information via attention with regional tokens. In addition to local attention, some other works propose to boost local information through local feature aggregation, e.g., T2T [68]. These works demonstrate the benefit of local information exchange and global information exchange in vision transformer.

As a key component of transformer, the self-attention layer provides the ability for global interaction between image patches. Improving the calculation of the self-attention layer has attracted many researchers. DeepViT [69] proposes to establish cross-head communication to re-generate the attention maps to increase the diversity at different layers. KVT [70] introduces $k$-NN attention to utilize the locality of image patches and ignore noisy tokens by computing attention only with the top-$k$ similar tokens. Refiner [71] explores attention expansion in a higher-dimensional space and applies convolution to augment local patterns of the attention maps. XCiT [72] performs the self-attention calculation across feature channels rather than tokens, which allows efficient processing of high-resolution images. The computation complexity and attention precision of the self-attention mechanism are two key points for future optimization.

The network architecture is an important factor, as demonstrated in the field of CNNs. The original architecture of ViT is a simple stack of same-shape transformer blocks. New architecture designs for vision transformer have been an interesting topic. The pyramid-like architecture is utilized by many vision transformer models [73, 61, 74, 75, 76, 77] including PVT [73], HVT [78], Swin Transformer [61] and PiT [79]. There are also other types of architectures, such as the two-stream architecture [80] and the U-net architecture [81, 30]. Neural architecture search (NAS) has also been investigated to search for better transformer architectures, e.g., Scaling-ViT [82], ViTAS [83], AutoFormer [84] and GLiT [85]. Currently, both network design and NAS for vision transformer mainly draw on the experience of CNNs. In the future, we expect specific and novel architectures to appear in the field of vision transformer.

In addition to the aforementioned approaches, there are some other directions to further improve vision transformer, e.g., positional encoding [86, 87], normalization strategy [88], shortcut connection [89] and removing attention [90, 91, 92, 93].

TABLE II: ImageNet result comparison of representative CNN and vision transformer models. Pure transformer means only using a few convolutions in the stem stage. CNN + Transformer means using convolutions in the intermediate layers. Following [60, 61], the throughput is measured on an NVIDIA V100 GPU with PyTorch, with a 224×224 input size.
| Model | Params (M) | FLOPs (B) | Throughput (image/s) | Top-1 (%) |
| --- | --- | --- | --- | --- |
| CNN | | | | |
| ResNet-50 [12, 68] | 25.6 | 4.1 | 1226 | 79.1 |
| ResNet-101 [12, 68] | 44.7 | 7.9 | 753 | 79.9 |
| ResNet-152 [12, 68] | 60.2 | 11.5 | 526 | 80.8 |
| EfficientNet-B0 [94] | 5.3 | 0.39 | 2694 | 77.1 |
| EfficientNet-B1 [94] | 7.8 | 0.70 | 1662 | 79.1 |
| EfficientNet-B2 [94] | 9.2 | 1.0 | 1255 | 80.1 |
| EfficientNet-B3 [94] | 12 | 1.8 | 732 | 81.6 |
| EfficientNet-B4 [94] | 19 | 4.2 | 349 | 82.9 |
| Pure Transformer | | | | |
| DeiT-Ti [15, 60] | 5 | 1.3 | 2536 | 72.2 |
| DeiT-S [15, 60] | 22 | 4.6 | 940 | 79.8 |
| DeiT-B [15, 60] | 86 | 17.6 | 292 | 81.8 |
| T2T-ViT-14 [68] | 21.5 | 5.2 | 764 | 81.5 |
| T2T-ViT-19 [68] | 39.2 | 8.9 | 464 | 81.9 |
| T2T-ViT-24 [68] | 64.1 | 14.1 | 312 | 82.3 |
| PVT-Small [73] | 24.5 | 3.8 | 820 | 79.8 |
| PVT-Medium [73] | 44.2 | 6.7 | 526 | 81.2 |
| PVT-Large [73] | 61.4 | 9.8 | 367 | 81.7 |
| TNT-S [29] | 23.8 | 5.2 | 428 | 81.5 |
| TNT-B [29] | 65.6 | 14.1 | 246 | 82.9 |
| CPVT-S [86] | 23 | 4.6 | 930 | 80.5 |
| CPVT-B [86] | 88 | 17.6 | 285 | 82.3 |
| Swin-T [61] | 29 | 4.5 | 755 | 81.3 |
| Swin-S [61] | 50 | 8.7 | 437 | 83.0 |
| Swin-B [61] | 88 | 15.4 | 278 | 83.3 |
| CNN + Transformer | | | | |
| Twins-SVT-S [63] | 24 | 2.9 | 1059 | 81.7 |
| Twins-SVT-B [63] | 56 | 8.6 | 469 | 83.2 |
| Twins-SVT-L [63] | 99.2 | 15.1 | 288 | 83.7 |
| Shuffle-T [66] | 29 | 4.6 | 791 | 82.5 |
| Shuffle-S [66] | 50 | 8.9 | 450 | 83.5 |
| Shuffle-B [66] | 88 | 15.6 | 279 | 84.0 |
| CMT-S [95] | 25.1 | 4.0 | 563 | 83.5 |
| CMT-B [95] | 45.7 | 9.3 | 285 | 84.5 |
| VOLO-D1 [96] | 27 | 6.8 | 481 | 84.2 |
| VOLO-D2 [96] | 59 | 14.1 | 244 | 85.2 |
| VOLO-D3 [96] | 86 | 20.6 | 168 | 85.4 |
| VOLO-D4 [96] | 193 | 43.8 | 100 | 85.7 |
| VOLO-D5 [96] | 296 | 69.0 | 64 | 86.1 |

3.1.2 Transformer with Convolution

Although vision transformers have been successfully applied to various visual tasks due to their ability to capture long-range dependencies within the input, there are still gaps in performance between transformers and existing CNNs. One main reason may be the lack of ability to extract local information. Apart from the above-mentioned variants of ViT that enhance locality, combining the transformer with convolution can be a more straightforward way to introduce locality into the conventional transformer.

There are plenty of works trying to augment a conventional transformer block or self-attention layer with convolution. For example, CPVT [86] proposed a conditional positional encoding (CPE) scheme, which is conditioned on the local neighborhood of input tokens and adaptable to arbitrary input sizes, to leverage convolutions for fine-level feature encoding. CvT [97], CeiT [98], LocalViT [99] and CMT [95] analyzed the potential drawbacks when directly borrowing Transformer architectures from NLP and combined the convolutions with transformers together. Specifically, the feed-forward network (FFN) in each transformer block is combined with a convolutional layer that promotes the correlation among neighboring tokens. LeViT [100] revisited principles from extensive literature on CNNs and applied them to transformers, proposing a hybrid neural network for fast inference image classification. BoTNet [101] replaced the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet, and improved upon the baselines significantly on both instance segmentation and object detection tasks with minimal overhead in latency.

Besides, some researchers have demonstrated that transformer-based models can be more difficult to train to fit the data well [15, 102, 103]; in other words, they are sensitive to the choice of optimizer, hyper-parameters, and the training schedule. Visformer [102] revealed the gap between transformers and CNNs with two different training settings. The first one is the standard setting for CNNs, i.e., the training schedule is shorter and the data augmentation only contains random cropping and horizontal flipping. The other one is the training setting used in [60], i.e., the training schedule is longer and the data augmentation is stronger. [103] changed the early visual processing of ViT by replacing its embedding stem with a standard convolutional stem, and found that this change allows ViT to converge faster and enables the use of either AdamW or SGD without a significant drop in accuracy. In addition to these two works, [100, 95] also choose to add a convolutional stem on top of the transformer.

Figure 6: FLOPs and throughput comparison of representative CNN and vision transformer models. (a) Accuracy vs. FLOPs. (b) Accuracy vs. throughput.

3.1.3 Self-supervised Representation Learning

Generative Based Approach. Generative pre-training methods for images have existed for a long time [104, 105, 106, 107]. Chen et al. [14] re-examined this class of methods and combined it with self-supervised methods. After that, several works [108, 109] were proposed to extend generative based self-supervised learning for vision transformer.

We briefly introduce iGPT [14] to demonstrate its mechanism. This approach consists of a pre-training stage followed by a fine-tuning stage. During the pre-training stage, auto-regressive and BERT objectives are explored. To implement pixel prediction, a sequence transformer architecture is adopted instead of language tokens (as used in NLP). Pre-training can be thought of as a favorable initialization or regularizer when used in combination with early stopping. During the fine-tuning stage, they add a small classification head to the model. This helps optimize a classification objective and adapts all weights.

The image pixels are transformed into sequential data by $k$-means clustering. Given an unlabeled dataset $X$ consisting of high dimensional data $\mathbf{x}=(x_1,\cdots,x_n)$, they train the model by minimizing the negative log-likelihood of the data:

$$L_{AR}=\underset{\mathbf{x}\sim X}{\mathbb{E}}\left[-\log p(\mathbf{x})\right], \qquad (7)$$

where $p(\mathbf{x})$ is the probability density of the data of images, which can be modeled as:

$$p(\mathbf{x})=\prod_{i=1}^{n}p(x_{\pi_i}\,|\,x_{\pi_1},\cdots,x_{\pi_{i-1}},\theta). \qquad (8)$$

Here, the identity permutation $\pi_i = i$ is adopted for $1 \leqslant i \leqslant n$, which is also known as raster order. Chen et al. also considered the BERT objective, which samples a sub-sequence $M\subset[1,n]$ such that each index $i$ independently has probability 0.15 of appearing in $M$. $M$ is called the BERT mask, and the model is trained by minimizing the negative log-likelihood of the "masked" elements $x_M$ conditioned on the "unmasked" ones $x_{[1,n]\backslash M}$:

$$L_{BERT}=\underset{\mathbf{x}\sim X}{\mathbb{E}}\,\underset{M}{\mathbb{E}}\sum_{i\in M}\left[-\log p(x_i\,|\,x_{[1,n]\backslash M})\right]. \qquad (9)$$

During the pre-training stage, they pick either $L_{AR}$ or $L_{BERT}$ and minimize the loss over the pre-training dataset.
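The two pre-training objectives can be sketched as follows (a minimal sketch over a toy sequence of pixel-cluster tokens; the model stub, the placeholder mask id and the shapes are illustrative assumptions rather than the actual iGPT implementation):

```python
import torch
import torch.nn.functional as F

vocab_size = 512          # number of k-means color clusters (iGPT uses a 9-bit color palette)
seq_len, batch = 64, 2
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # clustered pixel sequence

def model(inp):
    # stand-in for a transformer decoder mapping token sequences to per-position logits
    return torch.randn(inp.size(0), inp.size(1), vocab_size, requires_grad=True)

# Auto-regressive objective (Eqs. 7-8): predict token i from tokens < i in raster order.
logits = model(tokens[:, :-1])                             # (batch, seq_len-1, vocab)
L_AR = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))

# BERT objective (Eq. 9): mask each index with probability 0.15 and predict it
# from the unmasked tokens; the loss is taken only at the masked positions.
mask = torch.rand(batch, seq_len) < 0.15                   # the BERT mask M
masked = tokens.masked_fill(mask, vocab_size - 1)          # placeholder mask id (an assumption)
L_BERT = F.cross_entropy(model(masked)[mask], tokens[mask])

print(L_AR.item(), L_BERT.item())
```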

The GPT-2 [110] formulation of the transformer decoder block is used. To ensure proper conditioning when training the AR objective, Chen et al. apply the standard upper triangular mask to the $n\times n$ matrix of attention logits. No attention logit masking is required when the BERT objective is used: Chen et al. zero out the positions after the content embeddings are applied to the input sequence. Following the final transformer layer, they apply a layer norm and learn a projection from the output to logits parameterizing the conditional distributions at each sequence element. When training BERT, they simply ignore the logits at unmasked positions.

During the fine-tuning stage, they average-pool the output of the final layer normalization layer across the sequence dimension to extract a $d$-dimensional vector of features per example. They learn a projection from the pooled feature to class logits and use this projection to minimize a cross entropy loss. Practical applications offer empirical evidence that the joint objective of cross entropy loss and pretraining loss ($L_{AR}$ or $L_{BERT}$) works even better. After iGPT, masked image modeling methods such as MAE [32] and SimMIM [111] were proposed, which achieve competitive performance on downstream tasks.

iGPT and ViT are two pioneering works applying transformer to visual tasks. The differences between iGPT and ViT-like models mainly lie in three aspects: 1) the input of iGPT is a sequence of color-palette tokens obtained by clustering pixels, while ViT uniformly divides the image into a number of local patches; 2) the architecture of iGPT is an encoder-decoder framework, while ViT only has a transformer encoder; 3) iGPT utilizes an auto-regressive self-supervised loss for training, while ViT is trained by a supervised image classification task.

Contrastive Learning Based Approach. Currently, contrastive learning is the most popular manner of self-supervised learning for computer vision. Contrastive learning has been applied on vision transformer for unsupervised pretraining [31, 112, 113].

Chen et al. [31] investigate the effects of several fundamental components for training self-supervised ViT. The authors observe that instability is a major issue that degrades accuracy; the degraded results are in fact partial failures, and they can be improved when training is made more stable.

They introduce a "MoCo v3" framework, which is an incremental improvement of MoCo [114]. Specifically, the authors take two crops of each image under random data augmentation. These are encoded by two encoders, $f_q$ and $f_k$, with output vectors $\mathbf{q}$ and $\mathbf{k}$. Intuitively, $\mathbf{q}$ behaves like a "query" and the goal of learning is to retrieve the corresponding "key". This is formulated as minimizing a contrastive loss function, which can be written as:

$$\mathcal{L}_q = -\log\frac{\exp(\mathbf{q}\cdot\mathbf{k}^{+}/\tau)}{\exp(\mathbf{q}\cdot\mathbf{k}^{+}/\tau)+\sum_{\mathbf{k}^{-}}\exp(\mathbf{q}\cdot\mathbf{k}^{-}/\tau)}. \qquad (10)$$

Here $\mathbf{k}^{+}$ is $f_k$'s output on the same image as $\mathbf{q}$, known as $\mathbf{q}$'s positive sample. The set $\{\mathbf{k}^{-}\}$ consists of $f_k$'s outputs on other images, known as $\mathbf{q}$'s negative samples. $\tau$ is a temperature hyper-parameter for the $l_2$-normalized $\mathbf{q}$ and $\mathbf{k}$. MoCo v3 uses the keys that naturally co-exist in the same batch and abandons the memory queue, which they find has diminishing gain if the batch is sufficiently large (e.g., 4096). With this simplification, the contrastive loss can be implemented in a simple way. The encoder $f_q$ consists of a backbone (e.g., ViT), a projection head and an extra prediction head, while the encoder $f_k$ has the backbone and projection head but not the prediction head. $f_k$ is updated by the moving average of $f_q$, excluding the prediction head.
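The loss in Eq. 10 with in-batch keys can be sketched as follows (a minimal sketch assuming $l_2$-normalized projections from the two encoders; the symmetrized loss, the prediction head and the momentum update of $f_k$ are omitted):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, k, tau=0.2):
    """InfoNCE with in-batch keys: for each query q_i the positive key is k_i
    (the other crop of the same image); all other keys in the batch are negatives."""
    q = F.normalize(q, dim=1)            # l2-normalize the output vectors
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau             # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))     # positive pairs sit on the diagonal
    return F.cross_entropy(logits, labels)

q = torch.randn(256, 256)   # outputs of encoder f_q on one crop
k = torch.randn(256, 256)   # outputs of momentum encoder f_k on the other crop
print(contrastive_loss(q, k).item())
```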

MoCo v3 shows that instability is a major issue in training self-supervised ViT, so the authors describe a simple trick that improves stability in various cases of the experiments. They observe that it is not necessary to train the patch projection layer. For the standard ViT patch size, the patch projection matrix is complete or over-complete, and in this case random projection should be sufficient to preserve the information of the original patches. However, the trick alleviates the issue but does not solve it: the model can still be unstable if the learning rate is too large, and the first layer is unlikely to be the essential reason for the instability.

3.1.4 Discussions

All of the components of vision transformer, including multi-head self-attention, multi-layer perceptron, shortcut connection, layer normalization, positional encoding and network topology, play key roles in visual recognition. As stated above, a number of works have been proposed to improve the effectiveness and efficiency of vision transformer. From the results in Figure 6, we can see that combining CNN and transformer achieves better performance, indicating that they complement each other through local connections and global connections. Further investigation on backbone networks can lead to improvement for the whole vision community. As for self-supervised representation learning for vision transformer, we still need to make an effort to match the success of large-scale pretraining in the field of NLP.

3.2 High/Mid-level Vision

Recently there has been growing interest in using transformer for high/mid-level computer vision tasks, such as object detection [16, 17, 115, 116, 117], lane detection [118], segmentation [34, 25, 18] and pose estimation [35, 36, 37, 119]. We review these methods in this section.

3.2.1 Generic Object Detection

Traditional object detectors are mainly built upon CNNs, but transformer-based object detection has gained significant interest recently owing to its strong capability of modeling global context.

Some object detection methods have attempted to use transformer's self-attention mechanism to enhance specific modules of modern detectors, such as the feature fusion module [120] and the prediction head [121]. We discuss this in the supplemental material. Transformer-based object detection methods are broadly categorized into two groups: transformer-based set prediction methods [16, 17, 122, 123, 124] and transformer-based backbone methods [115, 117], as shown in Fig. 7. Transformer-based methods have shown strong performance compared with CNN-based detectors, in terms of both accuracy and running speed. Table III shows the detection results of the transformer-based object detectors mentioned in this section on the COCO 2017 val set.

Figure 7: General framework of transformer-based object detection.

Transformer-based Set Prediction for Detection. As a pioneer for transformer-based detection method, the detection transformer (DETR) proposed by Carion et al. [16] redesigns the framework of object detection. DETR, a simple and fully end-to-end object detector, treats the object detection task as an intuitive set prediction problem, eliminating traditional hand-crafted components such as anchor generation and non-maximum suppression (NMS) post-processing. As shown in Fig. 8, DETR starts with a CNN backbone to extract features from the input image. To supplement the image features with position information, fixed positional encodings are added to the flattened features before the features are fed into the encoder-decoder transformer. The decoder consumes the embeddings from the encoder along with $N$ learned positional encodings (object queries), and produces $N$ output embeddings. Here $N$ is a predefined parameter and typically larger than the number of objects in an image. Simple feed-forward networks (FFNs) are used to compute the final predictions, which include the bounding box coordinates and class labels to indicate the specific class of object (or to indicate that no object exists). Unlike the original transformer, which computes predictions sequentially, DETR decodes $N$ objects in parallel. DETR employs a bipartite matching algorithm to assign the predicted and ground-truth objects. As shown in Eq. 11, the Hungarian loss is exploited to compute the loss function for all matched pairs of objects.

\mathcal{L}_{\rm Hungarian}(y,\hat{y}) = \sum_{i=1}^{N}\left[-\log\hat{p}_{\hat{\sigma}(i)}(c_{i})+\mathds{1}_{\{c_{i}\neq\varnothing\}}\,\mathcal{L}_{\rm box}(b_{i},\hat{b}_{\hat{\sigma}}(i))\right],   (11)

where $\hat{\sigma}$ is the optimal assignment, $c_{i}$ and $\hat{p}_{\hat{\sigma}(i)}(c_{i})$ are the target class label and predicted label, respectively, $b_{i}$ and $\hat{b}_{\hat{\sigma}}(i)$ are the ground-truth and predicted bounding boxes, and $y=\{(c_{i},b_{i})\}$ and $\hat{y}$ are the ground truth and prediction of objects, respectively. DETR shows impressive performance on object detection, delivering accuracy and speed comparable to the popular and well-established Faster R-CNN [13] baseline on the COCO benchmark.
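To make the matching step concrete, the sketch below pairs predictions with ground-truth objects using SciPy's Hungarian solver under a simplified cost (negative class probability plus an L1 box term; DETR's actual matcher also includes a generalized IoU term). Function and variable names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes):
    # pred_probs: (N, C) class probabilities, pred_boxes: (N, 4),
    # gt_labels: (M,) integer labels, gt_boxes: (M, 4), with M <= N.
    cls_cost = -pred_probs[:, gt_labels]                                      # (N, M)
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # (N, M) L1 distance
    cost = cls_cost + box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost)   # optimal assignment, i.e., sigma-hat in Eq. (11)
    return pred_idx, gt_idx                          # unmatched predictions are supervised as "no object"

# toy usage: 5 predictions over 3 classes matched against 2 ground-truth objects
probs = np.full((5, 3), 1.0 / 3)
boxes = np.random.rand(5, 4)
matches = hungarian_match(probs, boxes, np.array([0, 2]), np.random.rand(2, 4))
```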

Figure 8: The overall architecture of DETR (image from  [16]).

DETR is a new design for the object detection framework based on transformer and empowers the community to develop fully end-to-end detectors. However, the vanilla DETR poses several challenges, specifically a long training schedule and poor performance on small objects. To address these challenges, Zhu et al. [17] proposed Deformable DETR, which has become a popular method that significantly improves the detection performance. The deformable attention module attends to a small set of key positions around a reference point rather than looking at all spatial locations on image feature maps as performed by the original multi-head attention mechanism in transformer. This approach significantly reduces the computational complexity and brings benefits in terms of fast convergence. More importantly, the deformable attention module can be easily applied for fusing multi-scale features. Deformable DETR achieves better performance than DETR with 10× less training cost and 1.6× faster inference speed. Moreover, by using an iterative bounding box refinement method and a two-stage scheme, Deformable DETR can further improve the detection performance.
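The core of the deformable attention idea can be sketched as follows: each query predicts a few sampling offsets around its reference point together with attention weights, and the value map is sampled only at those locations. This single-scale, single-head sketch uses illustrative names and shapes rather than Deformable DETR's exact multi-scale, multi-head implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Each query attends to K sampled points around its reference point
    instead of all spatial locations of the feature map."""
    def __init__(self, dim, n_points=4):
        super().__init__()
        self.offsets = nn.Linear(dim, n_points * 2)   # predicted sampling offsets per query
        self.weights = nn.Linear(dim, n_points)       # attention weights over the sampled points
        self.value_proj = nn.Linear(dim, dim)
        self.n_points = n_points

    def forward(self, query, ref_points, value, h, w):
        # query: (B, Nq, C); ref_points: (B, Nq, 2) in [-1, 1]; value: (B, H*W, C)
        B, Nq, C = query.shape
        v = self.value_proj(value).transpose(1, 2).reshape(B, C, h, w)
        offsets = self.offsets(query).reshape(B, Nq, self.n_points, 2)
        weights = self.weights(query).softmax(-1)                    # (B, Nq, K)
        locs = (ref_points[:, :, None, :] + offsets).clamp(-1, 1)    # (B, Nq, K, 2) sampling locations
        sampled = F.grid_sample(v, locs, align_corners=False)        # (B, C, Nq, K) bilinear sampling
        return (sampled * weights[:, None]).sum(-1).transpose(1, 2)  # (B, Nq, C) weighted aggregation
```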

There are also several methods to deal with the slow convergence problem of the original DETR. For example, Sun et al. [122] investigated why the DETR model has slow convergence and discovered that this is mainly due to the cross-attention module in the transformer decoder. To address this issue, an encoder-only version of DETR is proposed, achieving considerable improvement in terms of detection accuracy and training convergence. In addition, a new bipartite matching scheme is designed for greater training stability and faster convergence, and two transformer-based set prediction models, i.e., TSP-FCOS and TSP-RCNN, are proposed to improve encoder-only DETR with feature pyramids. These new models achieve better performance compared with the original DETR model. Gao et al. [125] proposed the Spatially Modulated Co-Attention (SMCA) mechanism to accelerate convergence by constraining co-attention responses to be high near the initially estimated bounding box locations. By integrating the proposed SMCA module into DETR, similar mAP can be obtained with about 10× fewer training epochs under comparable inference cost.

Given the high computation complexity associated with DETR, Zheng et al. [123] proposed an Adaptive Clustering Transformer (ACT) to reduce the computation cost of a pre-trained DETR. ACT adaptively clusters the query features using a locality sensitivity hashing (LSH) method and broadcasts the attention output to the queries represented by the selected prototypes. ACT is used to replace the self-attention module of the pre-trained DETR model without requiring any re-training. This approach significantly reduces the computational cost while the accuracy drops only slightly. The performance drop can be further reduced by utilizing a multi-task knowledge distillation (MTKD) method, which exploits the original transformer to distill the ACT module with a few epochs of fine-tuning. Yao et al. [126] pointed out that the random initialization in DETR is the main reason for the requirement of multiple decoder layers and the slow convergence. To this end, they proposed Efficient DETR, which incorporates the dense prior into the detection pipeline via an additional region proposal network. The better initialization enables them to use only one decoder layer instead of six to achieve competitive performance with a more compact network.

TABLE III: Comparison of different transformer-based object detectors on the COCO 2017 val set. Running speed (FPS) is evaluated on an NVIDIA Tesla V100 GPU as reported in [17]; some speeds are estimated from the numbers reported in the corresponding papers. One ViT backbone is pre-trained on ImageNet-21k and the other on a private dataset with 1.3 billion images.
| Method | Epochs | AP | AP50 | AP75 | APS | APM | APL | #Params (M) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| CNN based | | | | | | | | | | |
| FCOS [127] | 36 | 41.0 | 59.8 | 44.1 | 26.2 | 44.6 | 52.2 | - | 177 | 23 |
| Faster R-CNN + FPN [13] | 109 | 42.0 | 62.1 | 45.5 | 26.6 | 45.4 | 53.4 | 42 | 180 | 26 |
| CNN Backbone + Transformer Head | | | | | | | | | | |
| DETR [16] | 500 | 42.0 | 62.4 | 44.2 | 20.5 | 45.8 | 61.1 | 41 | 86 | 28 |
| DETR-DC5 [16] | 500 | 43.3 | 63.1 | 45.9 | 22.5 | 47.3 | 61.1 | 41 | 187 | 12 |
| Deformable DETR [17] | 50 | 46.2 | 65.2 | 50.0 | 28.8 | 49.2 | 61.7 | 40 | 173 | 19 |
| TSP-FCOS [122] | 36 | 43.1 | 62.3 | 47.0 | 26.6 | 46.8 | 55.9 | - | 189 | 20 |
| TSP-RCNN [122] | 96 | 45.0 | 64.5 | 49.6 | 29.7 | 47.7 | 58.0 | - | 188 | 15 |
| ACT+MKKD (L=32) [123] | - | 43.1 | - | - | 22.2 | 47.1 | 61.4 | - | 169 | 14 |
| SMCA [125] | 108 | 45.6 | 65.5 | 49.1 | 25.9 | 49.3 | 62.6 | - | - | - |
| Efficient DETR [126] | 36 | 45.1 | 63.1 | 49.1 | 28.3 | 48.4 | 59.0 | 35 | 210 | - |
| UP-DETR [33] | 150 | 40.5 | 60.8 | 42.6 | 19.0 | 44.4 | 60.0 | 41 | - | - |
| UP-DETR [33] | 300 | 42.8 | 63.0 | 45.3 | 20.8 | 47.1 | 61.7 | 41 | - | - |
| Transformer Backbone + CNN Head | | | | | | | | | | |
| ViT-B/16-FRCNN [115] | 21 | 36.6 | 56.3 | 39.3 | 17.4 | 40.0 | 55.5 | - | - | - |
| ViT-B/16-FRCNN [115] | 21 | 37.8 | 57.4 | 40.1 | 17.8 | 41.4 | 57.3 | - | - | - |
| PVT-Small+RetinaNet [73] | 12 | 40.4 | 61.3 | 43.0 | 25.0 | 42.9 | 55.7 | 34.2 | 118 | - |
| Twins-SVT-S+RetinaNet [63] | 12 | 43.0 | 64.2 | 46.3 | 28.0 | 46.4 | 57.5 | 34.3 | 104 | - |
| Swin-T+RetinaNet [61] | 12 | 41.5 | 62.1 | 44.2 | 25.1 | 44.9 | 55.5 | 38.5 | 118 | - |
| Swin-T+ATSS [61] | 36 | 47.2 | 66.5 | 51.3 | - | - | - | 36 | 215 | - |
| Pure Transformer based | | | | | | | | | | |
| PVT-Small+DETR [73] | 50 | 34.7 | 55.7 | 35.4 | 12.0 | 36.4 | 56.7 | 40 | - | - |
| TNT-S+DETR [29] | 50 | 38.2 | 58.9 | 39.4 | 15.5 | 41.1 | 58.8 | 39 | - | - |
| YOLOS-Ti [128] | 300 | 30.0 | - | - | - | - | - | 6.5 | 21 | - |
| YOLOS-S [128] | 150 | 37.6 | 57.6 | 39.2 | 15.9 | 40.2 | 57.3 | 28 | 179 | - |
| YOLOS-B [128] | 150 | 42.0 | 62.2 | 44.5 | 19.5 | 45.3 | 62.1 | 127 | 537 | - |

Transformer-based Backbone for Detection. Unlike DETR, which redesigns object detection as a set prediction task via transformer, Beal et al. [115] proposed to utilize transformer as a backbone for common detection frameworks such as Faster R-CNN [13]. The input image is divided into several patches and fed into a vision transformer, whose output embedding features are reorganized according to spatial information before passing through a detection head for the final results. A massively pre-trained transformer backbone could bring benefits to the proposed ViT-FRCNN. There are also quite a few methods that explore versatile vision transformer backbone designs [29, 73, 61, 63] and transfer these backbones to traditional detection frameworks like RetinaNet [129] and Cascade R-CNN [130]. For example, Swin Transformer [61] obtains about 4 box AP gain over a ResNet-50 backbone with similar FLOPs across various detection frameworks.
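The reorganization step can be illustrated with a short sketch: the ViT patch embeddings (minus the class token) are simply reshaped back onto the patch grid so that a standard convolutional detection head can consume them. Shapes and names are illustrative, not ViT-FRCNN's exact code.

```python
import torch

def tokens_to_feature_map(tokens, h, w):
    # tokens: (B, 1 + h*w, C) ViT output with a leading class token
    patch_tokens = tokens[:, 1:, :]                    # drop the class token
    B, N, C = patch_tokens.shape
    assert N == h * w
    return patch_tokens.transpose(1, 2).reshape(B, C, h, w)  # (B, C, h, w) for a CNN head

# e.g., a 224x224 image with 16x16 patches gives h = w = 14
feat = tokens_to_feature_map(torch.randn(2, 1 + 14 * 14, 768), 14, 14)  # (2, 768, 14, 14)
```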

Pre-training for Transformer-based Object Detection. Inspired by the pre-training schemes of transformers in NLP, several methods have been proposed to explore different pre-training schemes for transformer-based object detection [33, 128, 131]. Dai et al. [33] proposed unsupervised pre-training for object detection (UP-DETR). Specifically, a novel unsupervised pretext task named random query patch detection is proposed to pre-train the DETR model. With this unsupervised pre-training scheme, UP-DETR significantly improves the detection accuracy on a relatively small dataset (PASCAL VOC). On the COCO benchmark with sufficient training data, UP-DETR still outperforms DETR, demonstrating the effectiveness of the unsupervised pre-training scheme.

Fang et al. [128] explored how to transfer the pure ViT structure pre-trained on ImageNet to the more challenging object detection task and proposed the YOLOS detector. To cope with the object detection task, YOLOS first drops the classification token in ViT and appends learnable detection tokens. In addition, a bipartite matching loss is utilized to perform set prediction for objects. With this simple pre-training scheme on the ImageNet dataset, YOLOS shows competitive performance for object detection on the COCO benchmark.
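A minimal sketch of this input construction is given below (the token count and dimensions are illustrative; the hypothetical module only shows how learnable detection tokens are appended to the patch tokens before the transformer encoder).

```python
import torch
import torch.nn as nn

class DetectionTokens(nn.Module):
    """Replace the class token with a set of learnable detection tokens,
    in the spirit of the YOLOS input described above."""
    def __init__(self, dim=768, num_det_tokens=100):
        super().__init__()
        self.det_tokens = nn.Parameter(torch.zeros(1, num_det_tokens, dim))

    def forward(self, patch_tokens):                  # (B, N, C) patch embeddings from the ViT stem
        B = patch_tokens.size(0)
        det = self.det_tokens.expand(B, -1, -1)       # (B, num_det_tokens, C)
        # the concatenated sequence goes through the transformer encoder; the outputs at the
        # detection-token positions are decoded into boxes/classes under a bipartite matching loss
        return torch.cat([patch_tokens, det], dim=1)
```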

3.2.2 Segmentation

Segmentation is an important topic in the computer vision community, broadly including panoptic segmentation, instance segmentation, semantic segmentation, etc. Vision transformer has also shown impressive potential in the field of segmentation.

Transformer for Panoptic Segmentation. DETR [16] can be naturally extended to panoptic segmentation tasks and achieves competitive results by appending a mask head to the decoder. Wang et al. [25] proposed Max-DeepLab to directly predict panoptic segmentation results with a mask transformer, without involving surrogate sub-tasks such as box detection. Similar to DETR, Max-DeepLab streamlines the panoptic segmentation task in an end-to-end fashion and directly predicts a set of non-overlapping masks and corresponding labels. Model training is performed using a panoptic quality (PQ) style loss, but unlike prior methods that stack a transformer on top of a CNN backbone, Max-DeepLab adopts a dual-path framework that facilitates combining the CNN and the transformer.

Transformer for Instance Segmentation. VisTR, a transformer-based video instance segmentation model, was proposed by Wang et al. [34] to produce instance prediction results from a sequence of input images. A strategy for matching instance sequences is proposed to assign the predictions to the ground truths. In order to obtain the mask sequence for each instance, VisTR utilizes the instance sequence segmentation module to accumulate mask features from multiple frames and segments the mask sequence with a 3D CNN. Hu et al. [132] proposed an instance segmentation Transformer (ISTR) to predict low-dimensional mask embeddings and match them with the ground truth for the set loss. ISTR conducts detection and segmentation with a recurrent refinement strategy, which is different from the existing top-down and bottom-up frameworks. Yang et al. [133] investigated how to realize better and more efficient embedding learning to tackle semi-supervised video object segmentation under challenging multi-object scenarios. Some papers such as [134, 135] also discuss using transformer to deal with segmentation tasks.

Transformer for Semantic Segmentation. Zheng et al. [18] proposed a transformer-based semantic segmentation network (SETR). SETR utilizes an encoder similar to ViT [15] to extract features from an input image, and a multi-level feature aggregation module is adopted for performing pixel-wise segmentation. Strudel et al. [136] introduced Segmenter, which relies on the output embeddings corresponding to image patches and obtains class labels with a point-wise linear decoder or a mask transformer decoder. Xie et al. [137] proposed a simple, efficient yet powerful semantic segmentation framework which unifies transformers with lightweight multilayer perceptron (MLP) decoders, outputting multiscale features and avoiding complex decoders.

Transformer for Medical Image Segmentation. Cao et al. [30] proposed a U-Net-like pure transformer for medical image segmentation, feeding the tokenized image patches into a transformer-based U-shaped encoder-decoder architecture with skip-connections for local-global semantic feature learning. Valanarasu et al. [138] explored transformer-based solutions, studied the feasibility of using transformer-based network architectures for medical image segmentation tasks, and proposed a Gated Axial-Attention model which extends the existing architectures by introducing an additional control mechanism in the self-attention module. Cell-DETR [139], based on the DETR panoptic segmentation model, is an attempt to use transformer for cell instance segmentation. It adds skip connections that bridge features between the backbone CNN and the CNN decoder in the segmentation head in order to enhance feature fusion. Cell-DETR achieves state-of-the-art performance for cell instance segmentation from microscopy imagery.

3.2.3 Pose Estimation

Human pose and hand pose estimation are foundational topics that have attracted significant interest from the research community. Articulated pose estimation is akin to a structured prediction task, aiming to predict the joint coordinates or mesh vertices from input RGB/D images. Here we discuss some methods [35, 36, 37, 119] that explore how to utilize transformer for modeling the global structure information of human poses and hand poses.

Transformer for Hand Pose Estimation. Huang et al. [35] proposed a transformer-based network for 3D hand pose estimation from point sets. The encoder first utilizes a PointNet [140] to extract point-wise features from input point clouds and then adopts a standard multi-head self-attention module to produce embeddings. In order to expose more global pose-related information to the decoder, a feature extractor such as PointNet++ [141] is used to extract hand joint-wise features, which are then fed into the decoder as positional encodings. Similarly, Huang et al. [36] proposed HOT-Net (short for hand-object transformer network) for 3D hand-object pose estimation. Unlike the preceding method, which employs transformer to directly predict 3D hand pose from input point clouds, HOT-Net uses a ResNet to generate an initial 2D hand-object pose and then feeds it into a transformer to predict the 3D hand-object pose. A spectral graph convolution network is therefore used to extract input embeddings for the encoder. Hampali et al. [142] proposed to estimate the 3D poses of two hands given a single color image. Specifically, appearance and spatial encodings of a set of potential 2D locations for the joints of both hands are fed into a transformer, and attention is used to sort out the correct configuration of the joints and output the 3D poses of both hands.

Transformer for Human Pose Estimation. Lin et al. [37] proposed a mesh transformer (METRO) for predicting 3D human pose and mesh from a single RGB image. METRO extracts image features via a CNN and then performs position encoding by concatenating a template human mesh to the image features. A multi-layer transformer encoder with progressive dimensionality reduction is proposed to gradually reduce the embedding dimensions and finally produce the 3D coordinates of human joints and mesh vertices. To encourage the learning of non-local relationships between human joints, METRO randomly masks some input queries during training. Yang et al. [119] constructed an explainable model named TransPose based on the transformer architecture and low-level convolutional blocks. The attention layers built in the transformer can capture long-range spatial relationships between keypoints and explain what dependencies the predicted keypoint locations highly rely on. Li et al. [143] proposed a novel approach based on token representation for human pose estimation (TokenPose). Each keypoint is explicitly embedded as a token to simultaneously learn constraint relationships and appearance cues from images. Mao et al. [144] proposed a human pose estimation framework that solves the task in a regression-based fashion. They formulated the pose estimation task as a sequence prediction problem and solved it with transformers, bypassing the drawbacks of heatmap-based pose estimators. Jiang et al. [145] proposed a novel transformer-based network that can learn a distribution over both pose and motion in an unsupervised fashion, rather than tracking body parts and trying to temporally smooth them. The method overcomes inaccuracies in detection and corrects partial or entire skeleton corruption. Hao et al. [146] proposed to personalize a human pose estimator given a set of test images of a person, without using any manual annotations. The method adapts the pose estimator during test time to exploit person-specific information, and uses a transformer model to build a transformation between the self-supervised keypoints and the supervised keypoints.

3.2.4 Other Tasks

There are also quite a few other high/mid-level vision tasks that have explored the use of vision transformer for better performance. We briefly review several of them below.

Pedestrian Detection. Because the distribution of objects is very dense in occluded and crowded scenes, additional analysis and adaptation are often required when common detection networks are applied to pedestrian detection tasks. Lin et al. [147] revealed that sparse uniform queries and a weak attention field in the decoder result in performance degradation when directly applying DETR or Deformable DETR to pedestrian detection tasks. To alleviate these drawbacks, the authors proposed the Pedestrian End-to-end Detector (PED), which employs a new decoder called Dense Queries and Rectified Attention field (DQRF) to support dense queries and alleviate the noisy or narrow attention field of the queries. They also proposed V-Match, which achieves additional performance improvements by fully leveraging visible annotations.

Lane Detection. Based on PolyLaneNet [148], Liu et al. [118] proposed a method called LSTR, which improves the performance of curve lane detection by learning the global context with a transformer network. Similar to PolyLaneNet, LSTR regards lane detection as a task of fitting lanes with polynomials and uses neural networks to predict the parameters of the polynomials. To capture the slender structures of lanes and the global context, LSTR introduces a transformer network into the architecture. This enables processing of low-level features extracted by CNNs. In addition, LSTR uses the Hungarian loss to optimize network parameters. As demonstrated in [118], LSTR outperforms PolyLaneNet, with 2.82% higher accuracy and 3.65× higher FPS using 5 times fewer parameters. The combination of a transformer network, CNN and Hungarian loss culminates in a lane detection framework that is precise, fast, and tiny. Considering that lane lines are generally elongated and span a long range, Liu et al. [149] utilized a transformer encoder structure for more efficient context feature extraction. The transformer encoder considerably improves the detection of proposal points, which rely on contextual features and global information, especially when the backbone network is a small model.

Scene Graph. A scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene [150]. To generate scene graphs, most existing methods first extract image-based object representations and then perform message propagation between them. Graph R-CNN [151] utilizes self-attention to integrate contextual information from neighboring nodes in the graph. Recently, Sharifzadeh et al. [152] employed transformers over the extracted object embeddings. Sharifzadeh et al. [153] proposed a new pipeline called Texema and employed a pre-trained Text-to-Text Transfer Transformer (T5) [154] to create structured graphs from textual input and utilized them to improve the relational reasoning module. The T5 model enables Texema to utilize the knowledge in texts.

Tracking. Some researchers have also explored using the transformer encoder-decoder architecture in template-based discriminative trackers, such as TMT [155], TrTr [156] and TransT [157]. All these works use a Siamese-like tracking pipeline for video object tracking and utilize the encoder-decoder network to replace the explicit cross-correlation operation for global and rich contextual inter-dependencies. Specifically, the transformer encoder and decoder are assigned to the template branch and the searching branch, respectively. In addition, Sun et al. proposed TransTrack [158], which is an online joint-detection-and-tracking pipeline. It utilizes the query-key mechanism to track pre-existing objects and introduces a set of learned object queries into the pipeline to detect newly appearing objects. The proposed TransTrack achieves 74.5% and 64.5% MOTA on the MOT17 and MOT20 benchmarks.

Re-Identification. He et al. [159] proposed TransReID to investigate the application of pure transformers in the field of object re-identification (ReID). When introducing the transformer network into object ReID, TransReID slices images into patches with overlap to preserve local neighboring structures around the patches and introduces 2D bilinear interpolation to help handle any given input resolution. With the transformer module and the loss function, a strong baseline is proposed that achieves performance comparable to CNN-based frameworks. Moreover, the jigsaw patch module (JPM) is designed to facilitate perturbation-invariant and robust feature representation of objects, and side information embeddings (SIE) are introduced to encode side information. The final framework, TransReID, achieves state-of-the-art performance on both person and vehicle ReID benchmarks. Both Liu et al. [160] and Zhang et al. [161] provided solutions for introducing the transformer network into video-based person ReID. Similarly, both of them utilized separate transformer networks to refine spatial and temporal features, and then utilized a cross-view transformer to aggregate multi-view features.

Point Cloud Learning. A number of other works exploring transformer architectures for point cloud learning [162, 163, 164] have also emerged recently. For example, Guo et al. [163] proposed a novel framework that replaces the original self-attention module with a more suitable offset-attention module, which includes an implicit Laplace operator and normalization refinement. In addition, Zhao et al. [164] designed a novel transformer architecture called Point Transformer. The proposed self-attention layer is invariant to the permutation of the point set, making it suitable for point set processing tasks. Point Transformer shows strong performance on the semantic segmentation task from 3D point clouds.

3.2.5 Discussions

As discussed in the preceding sections, transformers have shown strong performance on several high-level tasks, including detection, segmentation and pose estimation. The key issues that need to be resolved before transformer can be adopted for high-level tasks relate to input embedding, position encoding, and prediction loss. Some methods propose improving the self-attention module from different perspectives, for example, deformable attention [17], adaptive clustering [123] and point transformer [164]. Nevertheless, exploration into the use of transformers for high-level vision tasks is still in the preliminary stages, so further research may prove beneficial. For example, is it necessary to use feature extraction modules such as CNN and PointNet before the transformer for potentially better performance? How can vision transformer be fully leveraged using large-scale pre-training datasets, as BERT and GPT-3 do in the NLP field? Is it possible to pre-train a single transformer model and fine-tune it for different downstream tasks with only a few epochs of fine-tuning? How can more powerful architectures be designed by incorporating prior knowledge of specific tasks? Several prior works have provided preliminary discussions on the aforementioned topics, and we hope more research effort is devoted to exploring more powerful transformers for high-level vision.

3.3 Low-level Vision

Few works apply transformers to low-level vision fields, such as image super-resolution and image generation. These tasks often take images as outputs (e.g., high-resolution or denoised images), which is more challenging than high-level vision tasks such as classification, segmentation, and detection, whose outputs are labels or boxes.

Figure 9: A generic framework for transformer in image generation.

3.3.1 Image Generation

A simple yet effective way to apply the transformer model to image generation tasks is to directly change the architecture from CNNs to transformers, as shown in Figure 9 (a). Jiang et al. [39] proposed TransGAN, which builds a GAN using the transformer architecture. Since it is difficult to generate high-resolution images pixel-wise, a memory-friendly generator is utilized by gradually increasing the feature map resolution at different stages. Correspondingly, a multi-scale discriminator is designed to handle the varying size of inputs in different stages. Various training recipes are introduced, including grid self-attention, data augmentation, relative position encoding and modified normalization, to stabilize the training and improve its performance. Experiments on various benchmark datasets demonstrate the effectiveness and potential of the transformer-based GAN model in image generation tasks. Lee et al. [165] proposed ViTGAN, which introduces several techniques to both the generator and the discriminator to stabilize the training procedure and convergence. Euclidean distance is introduced into the self-attention module to enforce the Lipschitzness of the transformer discriminator. Self-modulated layernorm and implicit neural representation are proposed to enhance the training of the generator. As a result, ViTGAN is the first work to demonstrate that transformer-based GANs can achieve performance comparable to state-of-the-art CNN-based GANs.

Parmar et al. [27] proposed Image Transformer, taking the first step toward generalizing the transformer model to formulate image translation and generation tasks in an auto-regressive manner. Image Transformer consists of two parts: an encoder for extracting image representation and a decoder to generate pixels. For each pixel with value in $0$-$255$, a $256\times d$ dimensional embedding is learned for encoding each value into a $d$ dimensional vector, which is fed into the encoder as input. The encoder and decoder adopt the same architecture as that in [9]. Each output pixel $q^{\prime}$ is generated by calculating self-attention between the input pixel $q$ and previously generated pixels $m_{1},m_{2},\ldots$ with position embeddings $p_{1},p_{2},\ldots$. For image-conditioned generation, such as super-resolution and inpainting, an encoder-decoder architecture is used, where the encoder's input is the low-resolution or corrupted images. For unconditional and class-conditional generation (i.e., noise to image), only the decoder is used for inputting noise vectors. Because the decoder's input is the previously generated pixels (involving high computation cost when producing high-resolution images), a local self-attention scheme is proposed. This scheme uses only the closest generated pixels as input for the decoder, enabling Image Transformer to achieve performance on par with CNN-based models for image generation and translation tasks, demonstrating the effectiveness of transformer-based models on low-level vision tasks.

Since it is difficult to directly generate high-resolution images with transformer models, Esser et al. [38] proposed Taming Transformer. Taming Transformer consists of two parts: a VQGAN and a transformer. VQGAN is a variant of VQVAE [166], which uses a discriminator and a perceptual loss to improve the visual quality. Through VQGAN, the image can be represented by a series of context-rich discrete vectors, and therefore these vectors can be easily predicted by a transformer model in an auto-regressive way. The transformer model can learn the long-range interactions for generating high-resolution images. As a result, the proposed Taming Transformer achieves state-of-the-art results on a wide variety of image synthesis tasks.
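The second stage, predicting the discrete codebook indices autoregressively, can be sketched as follows. The codebook size, sequence length and model sizes are illustrative, and the stage-1 encoder that produces the indices is assumed given; this is a sketch of the general idea rather than the Taming Transformer implementation.

```python
import torch
import torch.nn as nn

codebook_size, seq_len, dim = 1024, 16 * 16, 512   # illustrative sizes

class TokenPrior(nn.Module):
    """Stage-2 prior over discrete image tokens: a causal transformer trained
    to predict each codebook index from the indices that precede it."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, dim)
        self.pos = nn.Parameter(torch.zeros(1, seq_len, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, indices):                       # (B, L) codebook indices from the stage-1 encoder
        x = self.embed(indices) + self.pos[:, :indices.size(1)]
        causal = nn.Transformer.generate_square_subsequent_mask(indices.size(1))
        h = self.blocks(x, mask=causal)               # position t only attends to positions <= t
        return self.head(h)                           # (B, L, codebook_size) logits for next-index prediction
```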

Besides image generation, DALL·E [42] proposed a transformer model for text-to-image generation, which synthesizes images according to given captions. The whole framework consists of two stages. In the first stage, a discrete VAE is utilized to learn the visual codebook. In the second stage, the text is encoded by BPE and the corresponding image is encoded by the dVAE learned in the first stage. An auto-regressive transformer is then used to learn the prior over the encoded text and image tokens. During inference, image tokens are predicted by the transformer and decoded by the learned decoder. The CLIP model [41] is introduced to rank the generated samples. Experiments on the text-to-image generation task demonstrate the powerful ability of the proposed model. Note that since our survey mainly focuses on pure vision tasks, we do not include the framework of DALL·E in Figure 9. Image generation has been pushed to a higher level with the introduction of diffusion models [167], such as DALL·E 2 [168] and Stable Diffusion [169].

3.3.2 Image Processing

A number of recent works eschew using each pixel as the input for transformer models and instead use patches (sets of pixels) as input. For example, Yang et al. [40] proposed the Texture Transformer Network for Image Super-Resolution (TTSR), using the transformer architecture in the reference-based image super-resolution problem. It aims to transfer relevant textures from reference images to low-resolution images. Taking a low-resolution image and a reference image as the query $\mathbf{Q}$ and key $\mathbf{K}$, respectively, the relevance $r_{i,j}$ between each patch $\mathbf{q}_{i}$ in $\mathbf{Q}$ and $\mathbf{k}_{j}$ in $\mathbf{K}$ is calculated as:

r_{i,j} = \left\langle\frac{\mathbf{q}_{i}}{\|\mathbf{q}_{i}\|},\frac{\mathbf{k}_{j}}{\|\mathbf{k}_{j}\|}\right\rangle.   (12)

A hard-attention module is proposed to select high-resolution features $\mathbf{V}$ according to the reference image, so that the low-resolution image can be matched by using the relevance. The hard-attention map is calculated as:

h_{i} = \arg\max_{j} r_{i,j}   (13)

The most relevant reference patch is $\mathbf{t}_{i}=\mathbf{v}_{h_{i}}$, where $\mathbf{t}_{i}$ in $\mathbf{T}$ is the transferred feature. A soft-attention module is then used to transfer $\mathbf{V}$ to the low-resolution features. The transferred features from the high-resolution texture image and the low-resolution features are used to generate the output features of the low-resolution image. By leveraging the transformer-based architecture, TTSR can successfully transfer texture information from high-resolution reference images to low-resolution images in super-resolution tasks.
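Eqs. (12) and (13) amount to a normalized inner product followed by an argmax over the reference patches; a short sketch is given below (names are illustrative and the inputs are assumed to be already unfolded patch features).

```python
import torch
import torch.nn.functional as F

def hard_attention_transfer(q_patches, k_patches, v_patches):
    # q_patches: (Nq, D) low-resolution query patches,
    # k_patches, v_patches: (Nk, D) reference key/value patches.
    q = F.normalize(q_patches, dim=1)
    k = F.normalize(k_patches, dim=1)
    relevance = q @ k.t()                     # r_{i,j}, Eq. (12), shape (Nq, Nk)
    h = relevance.argmax(dim=1)               # hard-attention map h_i, Eq. (13)
    transferred = v_patches[h]                # t_i = v_{h_i}
    confidence = relevance.max(dim=1).values  # relevance score used by the soft-attention step
    return transferred, confidence
```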

Figure 10: Diagram of IPT architecture (image from [19]).

Different from the preceding methods that use transformer models on single tasks, Chen et al. [19] proposed Image Processing Transformer (IPT), which fully utilizes the advantages of transformers by using large pre-training datasets. It achieves state-of-the-art performance in several image processing tasks, including super-resolution, denoising, and deraining. As shown in Figure 10, IPT consists of multiple heads, an encoder, a decoder, and multiple tails. The multi-head, multi-tail structure and task embeddings are introduced for different image processing tasks. The features are divided into patches, which are fed into the encoder-decoder architecture. Following this, the outputs are reshaped to features with the same size. Given the advantages of pre-training transformer models on large datasets, IPT uses the ImageNet dataset for pre-training. Specifically, images from this dataset are degraded by manually adding noise, rain streaks, or downsampling in order to generate corrupted images. The degraded images are used as inputs for IPT, while the original images are used as the optimization targets of the outputs. A self-supervised method is also introduced to enhance the generalization ability of the IPT model. Once the model is trained, it is fine-tuned on each task by using the corresponding head, tail, and task embedding. IPT achieves significant performance improvements on image processing tasks (e.g., 2 dB in image denoising tasks), demonstrating the huge potential of applying transformer-based models to the field of low-level vision.
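The multi-head / shared-body / multi-tail organization can be sketched structurally as follows. The module names, sizes and the simplified encoder-only body are illustrative rather than IPT's actual implementation; for example, a real super-resolution tail would also upsample.

```python
import torch.nn as nn

class MultiTaskImageTransformer(nn.Module):
    """Task-specific heads and tails around a shared transformer body."""
    def __init__(self, tasks=("denoise", "derain", "sr_x2"), dim=64):
        super().__init__()
        self.heads = nn.ModuleDict({t: nn.Conv2d(3, dim, 3, padding=1) for t in tasks})
        self.tails = nn.ModuleDict({t: nn.Conv2d(dim, 3, 3, padding=1) for t in tasks})
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)   # shared body over flattened features

    def forward(self, x, task):
        f = self.heads[task](x)                    # task-specific head
        B, C, H, W = f.shape
        tokens = f.flatten(2).transpose(1, 2)      # (B, H*W, C) tokens
        tokens = self.body(tokens)                 # shared transformer body
        f = tokens.transpose(1, 2).reshape(B, C, H, W)
        return self.tails[task](f)                 # task-specific tail
```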

Besides single-image generation, Wang et al. [170] proposed SceneFormer to utilize transformer in 3D indoor scene generation. By treating a scene as a sequence of objects, the transformer decoder can be used to predict a series of objects and their locations, categories, and sizes. This has enabled SceneFormer to outperform conventional CNN-based methods in user studies.

Figure 11: A generic framework for transformer in image processing.

It should be noted that iGPT [14] is pre-trained on an inpainting-like task. Since iGPT mainly focuses on the fine-tuning performance on image classification tasks, we treat this work more as an attempt at image classification with transformer than as a low-level vision work.

In conclusion, different from classification and detection tasks, the outputs of image generation and processing are images. Figure 11 illustrates the use of transformers in low-level vision. In image processing tasks, the images are first encoded into a sequence of tokens or patches, which the transformer encoder takes as input, allowing the transformer decoder to successfully produce the desired images. In image generation tasks, the GAN-based models directly learn a decoder that generates patches and outputs images through linear projection, while the transformer-based models train an auto-encoder to learn a codebook for images and use an auto-regressive transformer model to predict the encoded tokens. A meaningful direction for future research would be designing a suitable architecture for different image processing tasks.

3.4 Video Processing

Transformer performs surprisingly well on sequence-based tasks, especially on NLP tasks. In computer vision (specifically, video tasks), both spatial and temporal information need to be modeled, giving rise to the application of transformer in a number of video tasks, such as frame synthesis [171], action recognition [172], and video retrieval [173].

3.4.1 High-level Video Processing

Video Action Recognition. Video human action tasks, as the name suggests, involve identifying and localizing human actions in videos. Context (such as other people and objects) plays a critical role in recognizing human actions. Rohit et al. proposed the action transformer [172] to model the underlying relationship between the human of interest and the surrounding context. Specifically, the I3D [174] is used as the backbone to extract high-level feature maps. The features extracted (using RoI pooling) from intermediate feature maps are viewed as the query (Q), while the key (K) and values (V) are calculated from the intermediate features. A self-attention mechanism is applied to the three components, and it outputs the classification and regression predictions. Lohit et al. [175] proposed an interpretable differentiable module, named temporal transformer network, to reduce the intra-class variance and increase the inter-class variance. In addition, Fayyaz and Gall proposed a temporal transformer [176] to perform action recognition tasks under weakly supervised settings. Besides human action recognition, transformer has been utilized for group activity recognition [177]. Gavrilyuk et al. proposed an actor-transformer [178] architecture to learn the representation, using the static and dynamic representations generated by the 2D and 3D networks as input. The output of the transformer is the predicted activity.

Video Retrieval. The key to content-based video retrieval is to find the similarity between videos. To overcome the challenges of leveraging only image-level or video-level features, Shao et al. [179] suggested using the transformer to model the long-range semantic dependency. They also introduced a supervised contrastive learning strategy to perform hard negative mining. The results of using this approach on benchmark datasets demonstrate its performance and speed advantages. In addition, Gabeur et al. [180] presented a multi-modal transformer to learn different cross-modal cues in order to represent videos.

Video Object Detection. To detect objects in a video, both global and local information is required. Chen et al. introduced the memory enhanced global-local aggregation (MEGA) [181] to capture more content. The representative features enhance the overall performance and alleviate the problems of ineffective and insufficient aggregation. Furthermore, Yin et al. [182] proposed a spatiotemporal transformer to aggregate spatial and temporal information. Together with another spatial feature encoding component, these two components perform well on 3D video object detection tasks.

Multi-task Learning. Untrimmed video usually contains many frames that are irrelevant to the target tasks. It is therefore crucial to mine the relevant information and discard the redundant information. To extract such information, Seong et al. proposed the video multi-task transformer network [183], which handles multi-task learning on untrimmed videos. For the CoVieW dataset, the tasks are scene recognition, action recognition and importance score prediction. Two networks pre-trained on ImageNet and Places365 extract the scene features and object features. The multi-task transformers are stacked to implement feature fusion, leveraging a class conversion matrix (CCM).

3.4.2 Low-level Video Processing

Frame/Video Synthesis. Frame synthesis tasks involve synthesizing the frames between two consecutive frames or after a frame sequence, while video synthesis tasks involve synthesizing a whole video. Liu et al. proposed ConvTransformer [171], which is comprised of five components: feature embedding, position encoding, encoder, query decoder, and the synthesis feed-forward network. Compared with LSTM-based works, ConvTransformer achieves superior results with a more parallelizable architecture. Another transformer-based approach was proposed by Schatz et al. [184], which uses a recurrent transformer network to synthesize human actions from novel views.

Video Inpainting. Video inpainting tasks involve completing any missing regions within a frame. This is challenging, as it requires information along the spatial and temporal dimensions to be merged. Zeng et al. proposed a spatial-temporal transformer network [28], which uses all the input frames as input and fills them in parallel. The spatial-temporal adversarial loss is used to optimize the transformer network.

3.4.3 Discussions

Compared to images, video has an extra dimension that encodes temporal information. Exploiting both spatial and temporal information helps achieve a better understanding of a video. Thanks to the relationship modeling capability of transformer, video processing tasks have been improved by mining spatial and temporal information simultaneously. Nevertheless, due to the high complexity and heavy redundancy of video data, how to efficiently and accurately model both spatial and temporal relationships is still an open problem.

3.5 Multi-Modal Tasks

Owing to the success of transformer across text-based NLP tasks, many researches are keen to exploit its potential for processing multi-modal tasks (e.g., video-text, image-text and audio-text). One example of this is VideoBERT [185], which uses a CNN-based module to pre-process videos in order to obtain representation tokens. A transformer encoder is then trained on these tokens to learn the video-text representations for downstream tasks, such as video caption. Some other examples include VisualBERT [186] and VL-BERT [187], which adopt a single-stream unified transformer to capture visual elements and image-text relationship for downstream tasks such as visual question answering (VQA) and visual commonsense reasoning (VCR). In addition, several studies such as SpeechBERT [188] explore the possibility of encoding audio and text pairs with a transformer encoder to process auto-text tasks such as speech question answering (SQA).

Figure 12: The framework of CLIP (image from [41]).

Apart from the aforementioned pioneering multi-modal transformers, Contrastive Language-Image Pre-training (CLIP) [41] takes natural language as supervision to learn more efficient image representation. CLIP jointly trains a text encoder and an image encoder to predict the corresponding training text-image pairs. The text encoder of CLIP is a standard transformer with masked self-attention, used to preserve the initialization ability of pre-trained language models. For the image encoder, CLIP considers two types of architecture, ResNet and Vision Transformer. CLIP is trained on a new dataset containing 400 million (image, text) pairs collected from the Internet. More specifically, given a batch of $N$ (image, text) pairs, CLIP learns text and image embeddings jointly to maximize the cosine similarity of the $N$ matched embeddings while minimizing that of the $N^2-N$ incorrectly matched embeddings. On zero-shot transfer, CLIP demonstrates astonishing zero-shot classification performance, achieving 76.2% top-1 accuracy on the ImageNet-1K dataset without using any ImageNet training labels. Concretely, at inference, the text encoder of CLIP first computes the feature embeddings of all ImageNet labels, and the image encoder then computes the embeddings of all images. By calculating the cosine similarity between text and image embeddings, the text-image pair with the highest score gives the predicted label for each image. Further experiments on 30 various CV benchmarks show the zero-shot transfer ability of CLIP and the feature diversity learned by CLIP.
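To make the training and inference recipe above concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss and a zero-shot classification step, written in PyTorch; the encoders are assumed to be given, and the temperature value and function names are illustrative placeholders rather than the actual CLIP implementation.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of N matched (image, text) pairs.

    image_emb, text_emb: (N, D) embeddings from the image and text encoders.
    The N diagonal pairs are positives; the remaining N^2 - N pairs are negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (N, N) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)              # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)          # text -> image direction
    return (loss_i + loss_t) / 2

def zero_shot_classify(image_emb, class_text_emb):
    """Zero-shot inference: pick the class whose text embedding is most similar."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    similarity = image_emb @ class_text_emb.t()            # (N, num_classes)
    return similarity.argmax(dim=-1)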

While CLIP maps images according to the description in text, another work, DALL-E [42], synthesizes new images of categories described in an input text. Similar to GPT-3, DALL-E is a multi-modal transformer with 12 billion model parameters autoregressively trained on a dataset of 3.3 million text-image pairs. More specifically, DALL-E uses a two-stage training procedure: in stage 1, a discrete variational autoencoder is used to compress 256×256 RGB images into 32×32 grids of image tokens, and in stage 2, an autoregressive transformer is trained to model the joint distribution over the image and text tokens. Experimental results show that DALL-E can generate images of various styles from scratch, including photorealistic imagery, cartoons and emoji, or extend an existing image while still matching the description in the text. Subsequently, Ding et al. proposed CogView [43], a transformer with a VQ-VAE tokenizer similar to DALL-E, but supporting Chinese text input. They claim CogView outperforms DALL-E and previous GAN-based methods, and, unlike DALL-E, CogView does not need an additional CLIP model to rerank the samples drawn from the transformer.
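The second training stage can be illustrated with a rough sketch, assuming a frozen discrete VAE has already mapped each image to a short grid of discrete tokens; the tiny causal transformer, vocabulary sizes and sequence length below are illustrative placeholders and not the actual DALL-E or CogView architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToImageAR(nn.Module):
    """Stage-2 sketch: a causal transformer over concatenated text + image tokens."""
    def __init__(self, text_vocab=1000, image_vocab=8192, max_len=96, dim=256):
        super().__init__()
        self.image_offset = text_vocab          # image ids are shifted into a joint vocabulary
        self.embed = nn.Embedding(text_vocab + image_vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, text_vocab + image_vocab)

    def forward(self, text_tokens, image_tokens):
        # image_tokens are assumed to come from a frozen discrete VAE tokenizer (e.g. a 32x32 grid).
        seq = torch.cat([text_tokens, image_tokens + self.image_offset], dim=1)
        n = seq.size(1)
        h = self.embed(seq) + self.pos(torch.arange(n, device=seq.device))
        causal = torch.triu(torch.full((n, n), float("-inf"), device=seq.device), diagonal=1)
        h = self.blocks(h, mask=causal)
        logits = self.head(h)
        # Maximize the likelihood of each token given all previous (text and image) tokens.
        return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                               seq[:, 1:].reshape(-1))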

Recently, a Unified Transformer (UniT) [189] model was proposed to cope with multi-modal multi-task learning; it can simultaneously handle multiple tasks across different domains, including object detection, natural language understanding and vision-language reasoning. Specifically, UniT has two transformer encoders to handle image and text inputs, respectively, and the transformer decoder takes the single or concatenated encoder outputs according to the task modality. Finally, a task-specific prediction head is applied to the decoder outputs for different tasks. In the training stage, all tasks are jointly trained by randomly selecting a specific task within each iteration. The experiments show that UniT achieves satisfactory performance on every task with a compact set of model parameters.
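A schematic sketch of this encoder-decoder arrangement is given below; the module sizes, task names and per-task heads are hypothetical placeholders meant only to show how modality-specific encoders, a shared decoder and task-specific heads fit together, not the actual UniT implementation.

import torch
import torch.nn as nn

class UniTSketch(nn.Module):
    """Rough sketch of a UniT-style multi-modal multi-task model; all sizes are illustrative."""
    def __init__(self, dim=256, num_queries=16, task_classes=None):
        super().__init__()
        task_classes = task_classes or {"vqa": 100, "detection": 81}   # hypothetical task heads
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.image_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.queries = nn.Embedding(num_queries, dim)
        self.heads = nn.ModuleDict({t: nn.Linear(dim, c) for t, c in task_classes.items()})

    def forward(self, task, image_feats=None, text_feats=None):
        # Encode whichever modalities the task provides, then concatenate the memories.
        memories = []
        if image_feats is not None:
            memories.append(self.image_encoder(image_feats))
        if text_feats is not None:
            memories.append(self.text_encoder(text_feats))
        memory = torch.cat(memories, dim=1)
        queries = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        decoded = self.decoder(queries, memory)          # shared decoder over task queries
        return self.heads[task](decoded)                 # task-specific predictions per query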

In conclusion, current transformer-based multi-modal models demonstrate their architectural superiority for unifying data and tasks of various modalities, which shows the potential of transformer to build general-purpose intelligent agents able to cope with a vast number of applications. Future research can explore the effective training and extensibility of multi-modal transformers (e.g., GPT-4 [44]).

3.6 Efficient Transformer

Although transformer models have achieved success in various tasks, their high demands for memory and computing resources block their implementation on resource-limited devices such as mobile phones. In this section, we review the research carried out into compressing and accelerating transformer models for efficient implementation. This includes network pruning, low-rank decomposition, knowledge distillation, network quantization, and compact architecture design. Table IV lists some representative works for compressing transformer-based models.

3.6.1 Pruning and Decomposition

In transformer-based pre-trained models (e.g., BERT), multiple attention operations are performed in parallel to independently model the relationship between different tokens [9, 10]. However, specific tasks do not require all heads to be used. For example, Michel et al. [45] presented empirical evidence that a large percentage of attention heads can be removed at test time without impacting performance significantly. The number of heads required varies across different layers; some layers may even require only one head. Considering the redundancy of attention heads, importance scores are defined to estimate the influence of each head on the final output in [45], and unimportant heads can be removed for efficient deployment. Dalvi et al. [190] analyzed the redundancy in pre-trained transformer models from two perspectives: general redundancy and task-specific redundancy. Following the lottery ticket hypothesis [191], Prasanna et al. [190] analyzed the lotteries in BERT and showed that good sub-networks also exist in transformer-based models, reducing both the FFN layers and attention heads in order to achieve high compression rates. For the vision transformer [15], which splits an image into multiple patches, Tang et al. [192] proposed to reduce patch calculation to accelerate the inference, and the redundant patches can be automatically discovered by considering their contributions to the effective output features. Zhu et al. [193] extended the network slimming approach [194] to vision transformers for reducing the dimensions of linear projections in both FFN and attention modules.
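A minimal sketch of the head-importance idea is given below: each attention head is multiplied by a gate, the gradient of the loss with respect to the gate serves as a rough importance proxy in the spirit of [45], and low-scoring heads can then be zeroed out; the module and scoring rule are illustrative rather than the exact procedure of any cited work.

import torch
import torch.nn as nn

class GatedMultiHeadAttention(nn.Module):
    """Multi-head self-attention with per-head gates used to score and prune heads."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Gates start at 1 (all heads active); gradients w.r.t. them act as importance proxies.
        self.gate = nn.Parameter(torch.ones(num_heads))

    def forward(self, x):
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                      # each (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1) / self.head_dim ** 0.5).softmax(dim=-1)
        out = attn @ v                                        # (b, heads, n, head_dim)
        out = out * self.gate.view(1, -1, 1, 1)               # per-head gating
        return self.proj(out.transpose(1, 2).reshape(b, n, d))

def head_importance(module, loss):
    """|gate * d(loss)/d(gate)| as a rough importance score for each head."""
    grads = torch.autograd.grad(loss, module.gate, retain_graph=True)[0]
    return (module.gate * grads).abs().detach()

# Usage sketch: score heads on a calibration batch, then zero out the least important gates:
# scores = head_importance(attn_layer, loss); attn_layer.gate.data[scores.argsort()[:2]] = 0.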

Models | Compress Type | #Layer | #Params | Speed Up
BERT-BASE [10] | Baseline | 12 | 110M | ×1
ALBERT [195] | Decomposition | 12 | 12M | ×5.6
BERT-of-Theseus [196] | Architecture design | 6 | 66M | ×1.94
Q-BERT [197] | Quantization | 12 | - | -
Q8BERT [198] | Quantization | 12 | - | -
TinyBERT [46] | Distillation | 4 | 14.5M | ×9.4
DistilBERT [199] | Distillation | 6 | 66M | ×1.63
BERT-PKD [200] | Distillation | 3~6 | 45.7~67M | ×3.73~1.64
MobileBERT [201] | Distillation | 24 | 25.3M | ×4.0
PD [202] | Distillation | 6 | 67.5M | ×2.0

TABLE IV: List of representative compressed transformer-based models. The data of the Table is from [203].

In addition to the width of transformer models, the depth (i.e., the number of layers) can also be reduced to accelerate the inference process [204, 205]. Differing from the concept that different attention heads in transformer models can be computed in parallel, different layers have to be calculated sequentially because the input of the next layer depends on the output of previous layers. Fan et al. [204] proposed a layer-wise dropping strategy to regularize the training of models, and then entire layers are removed together at the test phase.

Beyond the pruning methods that directly discard modules in transformer models, matrix decomposition aims to approximate the large matrices with multiple small matrices based on the low-rank assumption. For example, Wang et al. [206] decomposed the standard matrix multiplication in transformer models, improving the inference efficiency.
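As a simple illustration of the low-rank idea, the sketch below factorizes a fully-connected layer's weight matrix with a truncated SVD into two smaller linear layers; this is a generic low-rank approximation under an assumed rank, not the specific decomposition used in [206].

import torch
import torch.nn as nn

def low_rank_factorize(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace an (out x in) linear layer with two linear layers of rank `rank` via truncated SVD.

    The parameter count drops from out*in to rank*(out + in) whenever rank << min(out, in).
    """
    W = linear.weight.data                     # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]               # (out, rank), singular values absorbed
    V_r = Vh[:rank, :]                         # (rank, in)

    first = nn.Linear(W.shape[1], rank, bias=False)
    second = nn.Linear(rank, W.shape[0], bias=linear.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if linear.bias is not None:
        second.bias.data.copy_(linear.bias.data)
    return nn.Sequential(first, second)        # y ≈ U_r (V_r x) + b

# Usage sketch: ffn_proj = low_rank_factorize(nn.Linear(768, 3072), rank=128)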

3.6.2 Knowledge Distillation

Knowledge distillation aims to train student networks by transferring knowledge from large teacher networks [207, 208, 209]. Compared with teacher networks, student networks usually have thinner and shallower architectures, which are easier to deploy on resource-limited devices. Both the output and intermediate features of neural networks can be used to transfer effective information from teachers to students. Focusing on transformer models, Mukherjee et al. [210] used the pre-trained BERT [10] as a teacher to guide the training of small models, leveraging large amounts of unlabeled data. Wang et al. [211] trained the student networks to mimic the output of self-attention layers in the pre-trained teacher models. The dot-product between values is introduced as a new form of knowledge for guiding students. A teacher's assistant [212] is also introduced in [211], reducing the gap between large pre-trained transformer models and compact student networks, thereby facilitating the mimicking process. Due to the various types of layers in the transformer model (i.e., self-attention layer, embedding layer, and prediction layers), Jiao et al. [46] designed different objective functions to transfer knowledge from teachers to students. For example, the outputs of student models' embedding layers imitate those of teachers via MSE losses. For the vision transformer, Jia et al. [213] proposed a fine-grained manifold distillation method, which excavates effective knowledge through the relationship between images and the divided patches.
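A hedged sketch of such a multi-term distillation objective is shown below, combining a softened-logit term with MSE terms on hidden states and attention maps; the temperature and loss weights are illustrative, and the exact objectives of the cited works differ in detail.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      student_attn, teacher_attn,
                      temperature=4.0, alpha=1.0, beta=1.0):
    """Combined distillation objective: soft logits + hidden-state MSE + attention-map MSE.

    If the student and teacher widths differ, `student_hidden` is assumed to have already
    been projected to the teacher dimension by a small learnable linear layer.
    """
    # Soften both distributions and match them with KL divergence.
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    hidden = F.mse_loss(student_hidden, teacher_hidden)   # e.g., embedding-layer outputs
    attn = F.mse_loss(student_attn, teacher_attn)         # self-attention maps / scores
    return kd + alpha * hidden + beta * attn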

3.6.3 Quantization

Quantization aims to reduce the number of bits needed to represent network weights or intermediate features [214, 215]. Quantization methods for general neural networks have been discussed at length and achieve performance on par with the original networks [216, 217, 218]. Recently, there has been growing interest in how to specially quantize transformer models [219, 220]. For example, Shridhar et al. [221] suggested embedding the input into binary high-dimensional vectors, and then using the binary input representation to train binary neural networks. Cheong et al. [222] represented the weights in transformer models by low-bit (e.g., 4-bit) representations. Zhao et al. [223] empirically investigated various quantization methods and showed that k-means quantization has huge development potential. Aimed at machine translation tasks, Prato et al. [47] proposed a fully quantized transformer, which, as the paper claims, is the first 8-bit model not to suffer any loss in translation quality. Besides, Liu et al. [224] explored a post-training quantization scheme to reduce the memory storage and computational costs of vision transformers.
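To illustrate the flavor of post-training weight quantization, the sketch below applies a simple k-means quantizer to a single weight tensor, in the spirit of the k-means scheme investigated in [223]; the number of clusters and iterations are arbitrary choices, not the settings of any cited work.

import torch

def kmeans_quantize(weight: torch.Tensor, n_clusters: int = 16, n_iters: int = 20):
    """Post-training k-means quantization of a single weight tensor.

    Storing 4-bit cluster indices (16 clusters) plus a tiny codebook instead of
    32-bit floats gives roughly an 8x memory reduction for this tensor.
    """
    flat = weight.detach().reshape(-1, 1)
    # Initialize centroids uniformly over the weight range.
    centroids = torch.linspace(flat.min().item(), flat.max().item(), n_clusters).reshape(-1, 1)
    for _ in range(n_iters):
        codes = torch.cdist(flat, centroids).argmin(dim=1)     # assignment step
        for c in range(n_clusters):                            # update step
            members = flat[codes == c]
            if members.numel() > 0:
                centroids[c] = members.mean()
    codes = torch.cdist(flat, centroids).argmin(dim=1)
    return codes.reshape(weight.shape), centroids.squeeze(1)

def dequantize(codes: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Reconstruct an approximate weight tensor for inference."""
    return codebook[codes]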

3.6.4 Compact Architecture Design

Beyond compressing pre-defined transformer models into smaller ones, some works attempt to design compact models directly [225, 48]. Jiang et al. [48] simplified the calculation of self-attention by proposing a new module, called span-based dynamic convolution, that combines the fully-connected layers and the convolutional layers. Interesting "hamburger" layers are proposed in [226], using matrix decomposition to substitute the original self-attention layers. Compared with standard self-attention operations, matrix decomposition can be calculated more efficiently while clearly reflecting the dependence between different tokens. The design of efficient transformer architectures can also be automated with neural architecture search (NAS) [227, 228], which automatically searches how to combine different components. For example, Su et al. [83] searched the patch size, the dimensions of linear projections, and the number of heads in attention modules to obtain an efficient vision transformer. Li et al. [229] explored a self-supervised search strategy to obtain a hybrid architecture composed of both convolutional modules and self-attention modules.

Figure 13: Different methods for compressing transformers.

The self-attention operation in transformer models calculates the dot product between representations of different input tokens in a given sequence (patches in image recognition tasks [15]), whose complexity is $O(N^2)$, where $N$ is the length of the sequence. Recently, there has been a targeted focus on reducing the complexity to $O(N)$ so that transformer models can scale to long sequences [230, 231, 232]. For example, Katharopoulos et al. [230] approximated self-attention as a linear dot-product of kernel feature maps and revealed the relationship between tokens via RNNs. Zaheer et al. [232] considered each token as a vertex in a graph and defined the inner product calculation between two tokens as an edge. Inspired by graph theories [233, 234], various sparse graphs are combined to approximate the dense graph in transformer models, achieving $O(N)$ complexity.
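A compact sketch of kernelized linear attention in the spirit of [230] is given below: the softmax is replaced by a positive feature map and the matrix products are reordered so that the cost grows linearly with the sequence length; the elu-based feature map and tensor shapes are illustrative.

import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention with O(N) cost in the sequence length N.

    q, k: (batch, heads, N, d_k), v: (batch, heads, N, d_v).
    softmax(QK^T)V is replaced by phi(Q) (phi(K)^T V), computed right-to-left.
    """
    phi_q = F.elu(q) + 1                                 # positive feature map
    phi_k = F.elu(k) + 1
    kv = torch.einsum("bhnd,bhnv->bhdv", phi_k, v)       # (b, h, d_k, d_v), cost O(N d_k d_v)
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", phi_q, phi_k.sum(dim=2)) + eps)  # normalizer
    return torch.einsum("bhnd,bhdv,bhn->bhnv", phi_q, kv, z)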

Discussion. The preceding methods take different approaches in how they attempt to identify redundancy in transformer models (see Figure 13). Pruning and decomposition methods usually require pre-defined models with redundancy. Specifically, pruning focuses on reducing the number of components (e.g., layers, heads) in transformer models while decomposition represents an original matrix with multiple small matrices. Compact models also can be directly designed either manually (requiring sufficient expertise) or automatically (e.g., via NAS). The obtained compact models can be further represented with low-bits via quantization methods for efficient deployment on resource-limited devices.

4 Conclusions and Discussions

Transformer is becoming a hot topic in the field of computer vision due to its competitive performance and tremendous potential compared with CNNs. To discover and utilize the power of transformer, as summarized in this survey, a number of methods have been proposed in recent years. These methods show excellent performance on a wide range of visual tasks, including backbone, high/mid-level vision, low-level vision, and video processing. Nevertheless, the potential of transformer for computer vision has not yet been fully explored, meaning that several challenges still need to be resolved. In this section, we discuss these challenges and provide insights on the future prospects.

4.1 Challenges

Although researchers have proposed many transformer-based models to tackle computer vision tasks, these works are only the first steps in this field and still have much room for improvement. For example, the transformer architecture in ViT [15] follows the standard transformer for NLP [9], but an improved version specifically designed for CV remains to be explored. Moreover, it is necessary to apply transformer to more tasks other than those mentioned earlier.

The generalization and robustness of transformers for computer vision are also challenging. Compared with CNNs, pure transformers lack some inductive biases and rely heavily on massive datasets for large-scale training [15]. Consequently, the quality of data has a significant influence on the generalization and robustness of transformers. Although ViT shows exceptional performance on downstream image classification tasks such as CIFAR [235] and VTAB [236], directly applying the ViT backbone on object detection has failed to achieve better results than CNNs [115]. There is still a long way to go in order to better generalize pre-trained transformers on more generalized visual tasks. Practitioners are also concerned about the robustness of transformers (e.g., the vulnerability issue [237]). Although the robustness has been investigated in [238, 239, 240], it is still an open problem waiting to be solved.

Although numerous works have explained the use of transformers in NLP [241, 242], it remains challenging to clearly explain why transformer works well on visual tasks. The inductive biases, including translation equivariance and locality, are credited for CNN's success, but transformer lacks any such inductive bias. The current literature usually analyzes the effect in an intuitive way [15, 243]. For example, Dosovitskiy et al. [15] claim that large-scale training can surpass inductive bias. Position embeddings are added into image patches to retain positional information, which is important in computer vision tasks. Inspired by the heavy parameter usage in transformers, over-parameterization [244, 245] may be a potential entry point for the interpretability of vision transformers.

Last but not least, developing efficient transformer models for CV remains an open problem. Transformer models are usually huge and computationally expensive. For example, the base ViT model [15] requires 18 billion FLOPs to process an image. In contrast, the lightweight CNN model GhostNet [246, 247] can achieve similar performance with only about 600 million FLOPs. Although several methods have been proposed to compress transformer, they remain highly complex. And these methods, which were originally designed for NLP, may not be suitable for CV. Consequently, efficient transformer models are urgently needed so that vision transformer can be deployed on resource-limited devices.

4.2 Future Prospects

In order to drive the development of vision transformers, we provide several potential directions for future study.

One direction is the effectiveness and the efficiency of transformers in computer vision. The goal is to develop highly effective and efficient vision transformers; specifically, transformers with high performance and low resource cost. The performance determines whether the model can be applied in real-world applications, while the resource cost influences the deployment on devices [248, 249]. The effectiveness is usually correlated with the efficiency, so determining how to achieve a better balance between them is a meaningful topic for future study.

Most of the existing vision transformer models are designed to handle only a single task. Many NLP models such as GPT-3 [11] have demonstrated how transformer can deal with multiple tasks in one model. IPT [19] in the CV field is also able to process multiple low-level vision tasks, such as super-resolution, image denoising, and deraining. Perceiver [250] and Perceiver IO [251] are pioneering models that can work on several domains including images, audio, multimodal data, and point clouds. We believe that more tasks can be handled within a single model. Unifying all visual tasks and even other tasks in one transformer (i.e., a grand unified model) is an exciting topic.

There have been various types of neural networks, such as CNN, RNN, and transformer. In the CV field, CNNs used to be the mainstream choice [12, 94], but now transformer is becoming popular. CNNs can capture inductive biases such as translation equivariance and locality, whereas ViT uses large-scale training to surpass inductive bias [15]. From the evidence currently available [15], CNNs perform well on small datasets, whereas transformers perform better on large datasets. The question for the future is whether to use CNN or transformer.

By training with large datasets, transformers can achieve state-of-the-art performance on both NLP [11, 10] and CV benchmarks [15]. It is possible that neural networks need big data rather than inductive bias. In closing, we leave you with a question: Can transformer obtain satisfactory results with a very simple computational paradigm (e.g., with only fully connected layers) and massive data training?

Acknowledgement

This research is partially supported by MindSpore (https://mindspore.cn/) and CANN (Compute Architecture for Neural Networks).

A1. General Formulation of Self-attention

The self-attention module [9] for machine translation computes the responses at each position in a sequence by estimating attention scores to all positions and gathering the corresponding embeddings based on the scores accordingly. This can be viewed as a form of non-local filtering operations [252, 253]. We follow the convention [252] to formulate the self-attention module. Given an input signal (e.g., image, sequence, video or feature) $\mathbf{X}\in\mathbb{R}^{n\times d}$, where $n=h\times w$ (indicating the number of pixels in the feature) and $d$ is the number of channels, the output signal is generated as:

$$\mathbf{y}_{i}=\frac{1}{C(\mathbf{x}_{i})}\sum_{\forall j}{f(\mathbf{x}_{i},\mathbf{x}_{j})g(\mathbf{x}_{j})}, \quad (14)$$

where $\mathbf{x}_{i}\in\mathbb{R}^{1\times d}$ and $\mathbf{y}_{i}\in\mathbb{R}^{1\times d}$ indicate the $i^{th}$ position (e.g., space, time and spacetime) of the input signal $\mathbf{X}$ and output signal $\mathbf{Y}$, respectively. Subscript $j$ is the index that enumerates all positions, and a pairwise function $f(\cdot)$ computes a representing relationship (such as affinity) between $i$ and all $j$. The function $g(\cdot)$ computes a representation of the input signal at position $j$, and the response is normalized by a factor $C(\mathbf{x}_{i})$.

Note that there are many choices for the pairwise function $f(\cdot)$. For example, a simple extension of the Gaussian function could be used to compute the similarity in an embedding space. As such, the function $f(\cdot)$ can be formulated as:

$$f(\mathbf{x}_{i},\mathbf{x}_{j})=e^{\theta(\mathbf{x}_{i})\phi(\mathbf{x}_{j})^{T}} \quad (15)$$

where $\theta(\cdot)$ and $\phi(\cdot)$ can be any embedding layers. If we consider $\theta(\cdot),\phi(\cdot),g(\cdot)$ in the form of linear embeddings: $\theta(\mathbf{X})=\mathbf{X}\mathbf{W}_{\theta}$, $\phi(\mathbf{X})=\mathbf{X}\mathbf{W}_{\phi}$, $g(\mathbf{X})=\mathbf{X}\mathbf{W}_{g}$, where $\mathbf{W}_{\theta}\in\mathbb{R}^{d\times d_{k}}$, $\mathbf{W}_{\phi}\in\mathbb{R}^{d\times d_{k}}$, $\mathbf{W}_{g}\in\mathbb{R}^{d\times d_{v}}$, and set the normalization factor as $C(\mathbf{x}_{i})=\sum_{\forall j}{f(\mathbf{x}_{i},\mathbf{x}_{j})}$, Eq. 14 can be rewritten as:

$$\mathbf{y}_{i}=\frac{e^{\mathbf{x}_{i}\mathbf{w}_{\theta,i}\mathbf{w}_{\phi,j}^{T}\mathbf{x}_{j}^{T}}}{\sum_{j}{e^{\mathbf{x}_{i}\mathbf{w}_{\theta,i}\mathbf{w}_{\phi,j}^{T}\mathbf{x}_{j}^{T}}}}\mathbf{x}_{j}\mathbf{w}_{g,j}, \quad (16)$$

where $\mathbf{w}_{\theta,i}\in\mathbb{R}^{d\times 1}$ is the $i^{th}$ row of the weight matrix $\mathbf{W}_{\theta}$. For a given index $i$, $\frac{1}{C(\mathbf{x}_{i})}f(\mathbf{x}_{i},\mathbf{x}_{j})$ becomes the softmax output along the dimension $j$. The formulation can be further rewritten as:

$$\mathbf{Y}=\mathrm{softmax}(\mathbf{X}\mathbf{W}_{\theta}\mathbf{W}_{\phi}^{T}\mathbf{X}^{T})g(\mathbf{X}), \quad (17)$$

where $\mathbf{Y}\in\mathbb{R}^{n\times c}$ is the output signal of the same size as $\mathbf{X}$. Compared with the query, key and value representations $\mathbf{Q}=\mathbf{X}\mathbf{W}_{q}$, $\mathbf{K}=\mathbf{X}\mathbf{W}_{k}$, $\mathbf{V}=\mathbf{X}\mathbf{W}_{v}$ from the translation module, once $\mathbf{W}_{q}=\mathbf{W}_{\theta}$, $\mathbf{W}_{k}=\mathbf{W}_{\phi}$, $\mathbf{W}_{v}=\mathbf{W}_{g}$, Eq. 17 can be formulated as:

$$\mathbf{Y}=\mathrm{softmax}(\mathbf{Q}\mathbf{K}^{T})\mathbf{V}=\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}), \quad (18)$$

The self-attention module [9] proposed for machine translation is, to some extent, the same as the preceding non-local filtering operations proposed for computer vision.

Generally, the final output signal of the self-attention module for computer vision will be wrapped as:

$$\mathbf{Z}=\mathbf{Y}\mathbf{W}^{o}+\mathbf{X} \quad (19)$$

where $\mathbf{Y}$ is generated through Eq. 17. If $\mathbf{W}^{o}$ is initialized as zero, this self-attention module can be inserted into any existing model without breaking its initial behavior.
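For reference, Eqs. 17-19 can be written out directly in a few lines of code; the sketch below follows the linear-embedding instantiation above, with arbitrarily chosen dimensions.

import torch

def self_attention_block(X, W_theta, W_phi, W_g, W_o):
    """Eqs. 17-19: Y = softmax(X W_theta W_phi^T X^T) (X W_g);  Z = Y W_o + X."""
    Q = X @ W_theta                                # (n, d_k), the query representation
    K = X @ W_phi                                  # (n, d_k), the key representation
    V = X @ W_g                                    # (n, d_v), the value representation
    Y = torch.softmax(Q @ K.T, dim=-1) @ V         # Eq. 18
    return Y @ W_o + X                             # Eq. 19, residual form

# Illustrative shapes: n tokens, d channels, with d_k = d_v = d.
n, d = 16, 64
X = torch.randn(n, d)
W_theta, W_phi, W_g, W_o = (torch.randn(d, d) for _ in range(4))
Z = self_attention_block(X, W_theta, W_phi, W_g, W_o)   # (n, d), same size as X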

A2. Revisiting Transformers for NLP

Before transformer was developed, RNNs ( e.g., GRU [254] and LSTM [6]) with added attention [7] empowered most of the state-of-the-art language models. However, RNNs require the information flow to be processed sequentially from the previous hidden states to the next one. This rules out the possibility of using acceleration and parallelization during training, and consequently hinders the potential of RNNs to process longer sequences or build larger models. In 2017, Vaswani et al. [9] proposed transformer, a novel encoder-decoder architecture built solely on multi-head self-attention mechanisms and feed-forward neural networks. Its purpose was to solve seq-to-seq natural language tasks (e.g., machine translation) easily by acquiring global dependencies. The subsequent success of transformer demonstrates that leveraging attention mechanisms alone can achieve performance comparable with attentive RNNs. Furthermore, the architecture of transformer lends itself to massively parallel computing, which enables training on larger datasets. This has given rise to the surge of large pre-trained models (PTMs) for natural language processing.

BERT [10] and its variants (e.g., SpanBERT [255], RoBERTa [256]) are a series of PTMs built on the multi-layer transformer encoder architecture. Two tasks are conducted on BookCorpus [257] and English Wikipedia datasets at the pre-training stage of BERT: 1) Masked language modeling (MLM), which involves first randomly masking out some tokens in the input and then training the model to predict; 2) Next sentence prediction, which uses paired sentences as input and predicts whether the second sentence is the original one in the document. After pre-training, BERT can be fine-tuned by adding an output layer on a wide range of downstream tasks. More specifically, when performing sequence-level tasks (e.g., sentiment analysis), BERT uses the representation of the first token for classification; for token-level tasks (e.g., name entity recognition), all tokens are fed into the softmax layer for classification. At the time of its release, BERT achieved the state-of-the-art performance on 11 NLP tasks, setting a milestone in pre-trained language models. Generative Pre-trained Transformer models (e.g., GPT [258], GPT-2 [110]) are another type of PTMs based on the transformer decoder architecture, which uses masked self-attention mechanisms. The main difference between the GPT series and BERT is the way in which pre-training is performed. Unlike BERT, GPT models are unidirectional language models pre-trained using Left-to-Right (LTR) language modeling. Furthermore, BERT learns the sentence separator ([SEP]) and classifier token ([CLS]) embeddings during pre-training, whereas these embeddings are involved in only the fine-tuning stage of GPT. Due to its unidirectional pre-training strategy, GPT achieves superior performance in many natural language generation tasks. More recently, a massive transformer-based model called GPT-3, which has an astonishing 175 billion parameters, was developed [11]. By pre-training on 45 TB of compressed plaintext data, GPT-3 can directly process different types of downstream natural language tasks without fine-tuning. As a result, it achieves strong performance on many NLP datasets, including both natural language understanding and generation. Since the introduction of transformer, many other models have been proposed in addition to the transformer-based PTMs mentioned earlier. We list a few representative models in Table V for interested readers, but this is not the focus of our study.
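The masked language modeling corruption described above can be sketched in a few lines; the masking ratio and the 80/10/10 replacement split follow the common BERT recipe, while the token ids and helper name are hypothetical.

import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style MLM corruption; labels are -100 (ignored) everywhere except masked positions."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob        # ~15% of positions are predicted
    labels[~selected] = -100                                 # ignored by cross-entropy

    corrupted = input_ids.clone()
    # Of the selected positions: 80% -> [MASK], 10% -> random token, 10% left unchanged.
    replace = selected & (torch.rand(input_ids.shape) < 0.8)
    corrupted[replace] = mask_token_id
    randomize = selected & ~replace & (torch.rand(input_ids.shape) < 0.5)
    corrupted[randomize] = torch.randint(vocab_size, (int(randomize.sum()),))
    return corrupted, labels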

Models | Architecture | # of Params | Fine-tuning
GPT [258] | Transformer Dec. | 117M | Yes
GPT-2 [110] | Transformer Dec. | 117M-1542M | No
GPT-3 [11] | Transformer Dec. | 125M-175B | No
BERT [10] | Transformer Enc. | 110M-340M | Yes
RoBERTa [256] | Transformer Enc. | 355M | Yes
XLNet [259] | Two-Stream Transformer Enc. | ≈ BERT | Yes
ELECTRA [260] | Transformer Enc. | 335M | Yes
UniLM [261] | Transformer Enc. | 340M | Yes
BART [262] | Transformer | 110% of BERT | Yes
T5 [154] | Transformer | 220M-11B | Yes
ERNIE (THU) [263] | Transformer Enc. | 114M | Yes
KnowBERT [264] | Transformer Enc. | 253M-523M | Yes

TABLE V: List of representative language models built on transformer. Transformer is the standard encoder-decoder architecture. Transformer Enc. and Dec. represent the encoder and decoder, respectively. Decoder uses mask self-attention to prevent attending to the future tokens. The data of the Table is from  [203].

Apart from the PTMs trained on large corpora for general NLP tasks, transformer-based models have also been applied in many other NLP-related domains and to multi-modal tasks.

BioNLP Domain. Transformer-based models have outperformed many traditional biomedical methods. Some examples of such models include BioBERT [265], which uses a transformer architecture for biomedical text mining tasks, and SciBERT [266], which is developed by training transformer on 114M scientific articles (covering biomedical and computer science fields) with the aim of executing NLP tasks in the scientific domain more precisely. Another example is ClinicalBERT, proposed by Huang  et al. [267]. It utilizes transformer to develop and evaluate continuous representations of clinical notes. One of the side effects of this is that the attention map of ClinicalBERT can be used to explain predictions, thereby allowing high-quality connections between different medical contents to be discovered.

The rapid development of transformer-based models on a variety of NLP-related tasks demonstrates its structural superiority and versatility, opening up the possibility that it will become a universal module applied in many AI fields other than just NLP. The following part of this survey focuses on the applications of transformer in a wide range of computer vision tasks that have emerged over the past two years.

A3. Self-attention for Computer Vision

The preceding sections reviewed methods that use a transformer architecture for vision tasks. We can conclude that self-attention plays a pivotal role in transformer. The self-attention module can also be considered a building block of CNN architectures, which scale poorly when large receptive fields are required. This building block is widely used on top of the networks to capture long-range interactions and enhance high-level semantic features for vision tasks. In this section, we delve deeply into the models based on self-attention designed for challenging tasks in computer vision. Such tasks include semantic segmentation, instance segmentation, object detection, keypoint detection, and depth estimation. Here we briefly summarize the existing applications of self-attention in computer vision.

Image Classification. Trainable attention for classification consists of two main streams: hard attention [268, 269, 270] regarding the use of an image region, and soft attention [271, 272, 273, 274] generating non-rigid feature maps. Ba et al. [268] first proposed the term “visual attention” for image classification tasks, and used attention to select relevant regions and locations within the input image. This can also reduce the computational complexity of the proposed model regarding the size of the input image. For medical image classification, AG-CNN [275] was proposed to crop a sub-region from a global image by the attention heat map. And instead of using hard attention and recalibrating the crop of feature maps, SENet [276] was proposed to reweight the channel-wise responses of the convolutional features using soft self-attention. Jetley et al. [272] used attention maps generated by corresponding estimators to reweight intermediate features in DNNs. In addition, Han et al. [273] utilized the attribute-aware attention to enhance the representation of CNNs.

Semantic Segmentation. PSANet [277], OCNet [278], DANet [279] and CFNet [280] are the pioneering works to propose using the self-attention module in semantic segmentation tasks. These works consider and augment the relationship and similarity [281, 282, 283, 284, 285, 286] between the contextual pixels. DANet [279] simultaneously leverages the self-attention module on spatial and channel dimensions, whereas $A^2$Net [287] groups the pixels into a set of regions, and then augments the pixel representations by aggregating the region representations with the generated attention weights. DGCNet [288] employs a dual graph CNN to model coordinate space similarity and feature space similarity in a single framework. To improve the efficiency of the self-attention module for semantic segmentation, several works [289, 290, 291, 292, 293] have been proposed, aiming to alleviate the huge amount of parameters brought by calculating pixel similarities. For example, CGNL [289] applies the Taylor series of the RBF kernel function to approximate the pixel similarities. CCNet [290] approximates the original self-attention scheme via two consecutive criss-cross attention modules. In addition, ISSA [291] factorizes the dense affinity matrix as the product of two sparse affinity matrices. There are other related works using attention-based graph reasoning modules [294, 295, 292] to enhance both the local and global representations.

Object Detection. Ramachandran et al. [274] propose an attention-based layer and swap the conventional convolution layers to build a fully attentional detector that outperforms the typical RetinaNet [129] on the COCO benchmark [296]. GCNet [297] assumes that the global contexts modeled by non-local operations are almost the same for different query positions within an image, and unifies the simplified formulation and SENet [276] into a general framework for global context modeling [298, 299, 300, 301]. Vo et al. [302] design a bidirectional operation to gather and distribute information from a query position to all possible positions. Zhang et al. [120] suggest that previous methods fail to interact with cross-scale features, and propose the Feature Pyramid Transformer, based on the self-attention module, to fully exploit interactions across both space and scales.

Conventional detection methods usually exploit a single visual representation (e.g., bounding box and corner point) for predicting the final results. Hu et al. [303] proposes a relation module based on self-attention to process a set of objects simultaneously through interaction between their appearance features. Cheng et al. [121] proposes RelationNet++ with the bridging visual representations (BVR) module to combine different heterogeneous representations into a single one similar to that in the self-attention module. Specifically, the master representation is treated as the query input and the auxiliary representations are regarded as the key input. The enhanced feature can therefore bridge the information from auxiliary representations and benefit final detection results.

Other Vision Tasks. Zhang et al. [304] propose a resolution-wise attention module to learn enhanced feature maps when training multi-resolution networks, in order to obtain accurate human keypoint locations for the pose estimation task. Furthermore, Chang et al. [305] use an attention-mechanism based feature fusion block to improve the accuracy of the human keypoint detection model.

To explore more generalized contextual information for improving self-supervised monocular-trained depth estimation, Johnston et al. [306] directly leverage the self-attention module. Chen et al. [307] also propose an attention-based aggregation network to capture context information that differs across diverse scenes for depth estimation. And Aich et al. [308] propose bidirectional attention modules that utilize the forward and backward attention operations for better monocular depth estimation results.

References

  • [1] F. Rosenblatt. The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory, 1957.
  • [2] F. Rosenblatt. Principles of neurodynamics. perceptrons and the theory of brain mechanisms. Technical report, 1961.
  • [3] Y. LeCun et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [4] A. Krizhevsky et al. Imagenet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105, 2012.
  • [5] D. E. Rumelhart et al. Learning internal representations by error propagation. Technical report, 1985.
  • [6] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [7] D. Bahdanau et al. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  • [8] A. Parikh et al. A decomposable attention model for natural language inference. In EMNLP, 2016.
  • [9] A. Vaswani et al. Attention is all you need. In NeurIPS, 2017.
  • [10] J. Devlin et al. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
  • [11] T. B. Brown et al. Language models are few-shot learners. In NeurIPS, 2020.
  • [12] K. He et al. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.
  • [13] S. Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
  • [14] M. Chen et al. Generative pretraining from pixels. In ICML, 2020.
  • [15] A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • [16] N. Carion et al. End-to-end object detection with transformers. In ECCV, 2020.
  • [17] X. Zhu et al. Deformable detr: Deformable transformers for end-to-end object detection. In ICLR, 2021.
  • [18] S. Zheng et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
  • [19] H. Chen et al. Pre-trained image processing transformer. In CVPR, 2021.
  • [20] L. Zhou et al. End-to-end dense video captioning with masked transformer. In CVPR, pp. 8739–8748, 2018.
  • [21] S. Ullman et al. High-level vision: Object recognition and visual cognition, volume 2. MIT press Cambridge, MA, 1996.
  • [22] R. Kimchi et al. Perceptual organization in vision: Behavioral and neural perspectives. Psychology Press, 2003.
  • [23] J. Zhu et al. Top-down saliency detection via contextual pooling. Journal of Signal Processing Systems, 74(1):33–46, 2014.
  • [24] J. Long et al. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [25] H. Wang et al. Max-deeplab: End-to-end panoptic segmentation with mask transformers. In CVPR, pp. 5463–5474, 2021.
  • [26] R. B. Fisher. Cvonline: The evolving, distributed, non-proprietary, on-line compendium of computer vision. Retrieved January 28, 2006 from http://homepages. inf. ed. ac. uk/rbf/CVonline, 2008.
  • [27] N. Parmar et al. Image transformer. In ICML, 2018.
  • [28] Y. Zeng et al. Learning joint spatial-temporal transformations for video inpainting. In ECCV, pp. 528–543. Springer, 2020.
  • [29] K. Han et al. Transformer in transformer. In NeurIPS, 2021.
  • [30] H. Cao et al. Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv:2105.05537 , 2021.
  • [31] X. Chen et al. An empirical study of training self-supervised vision transformers. In ICCV, 2021.
  • [32] K. He et al. Masked autoencoders are scalable vision learners. In CVPR, pp. 16000–16009, 2022.
  • [33] Z. Dai et al. UP-DETR: unsupervised pre-training for object detection with transformers. In CVPR, 2021.
  • [34] Y. Wang et al. End-to-end video instance segmentation with transformers. In CVPR, 2021.
  • [35] L. Huang et al. Hand-transformer: Non-autoregressive structured modeling for 3d hand pose estimation. In ECCV, pp. 17–33, 2020.
  • [36] L. Huang et al. Hot-net: Non-autoregressive transformer for 3d hand-object pose estimation. In ACM MM, pp. 3136–3145, 2020.
  • [37] K. Lin et al. End-to-end human pose and mesh reconstruction with transformers. In CVPR, 2021.
  • [38] P. Esser et al. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
  • [39] Y. Jiang et al. Transgan: Two transformers can make one strong gan. In NeurIPS, 2021.
  • [40] F. Yang et al. Learning texture transformer network for image super-resolution. In CVPR, pp. 5791–5800, 2020.
  • [41] A. Radford et al. Learning transferable visual models from natural language supervision. arXiv:2103.00020 , 2021.
  • [42] A. Ramesh et al. Zero-shot text-to-image generation. In ICML, 2021.
  • [43] M. Ding et al. Cogview: Mastering text-to-image generation via transformers. In NeurIPS, 2021.
  • [44] OpenAI. Gpt-4 technical report, 2023.
  • [45] P. Michel et al. Are sixteen heads really better than one? In NeurIPS, pp. 14014–14024, 2019.
  • [46] X. Jiao et al. TinyBERT: Distilling BERT for natural language understanding. In Findings of EMNLP, pp. 4163–4174, 2020.
  • [47] G. Prato et al. Fully quantized transformer for machine translation. In Findings of EMNLP, 2020.
  • [48] Z.-H. Jiang et al. Convbert: Improving bert with span-based dynamic convolution. NeurIPS, 33, 2020.
  • [49] J. Gehring et al. Convolutional sequence to sequence learning. In ICML, pp. 1243–1252. PMLR, 2017.
  • [50] P. Shaw et al. Self-attention with relative position representations. In NAACL, pp. 464–468, 2018.
  • [51] D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus). arXiv:1606.08415 , 2016.
  • [52] J. L. Ba et al. Layer normalization. arXiv:1607.06450 , 2016.
  • [53] A. Baevski and M. Auli. Adaptive input representations for neural language modeling. In ICLR, 2019.
  • [54] Q. Wang et al. Learning deep transformer models for machine translation. In ACL, pp. 1810–1822, 2019.
  • [55] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [56] S. Shen et al. Powernorm: Rethinking batch normalization in transformers. In ICML, 2020.
  • [57] J. Xu et al. Understanding and improving layer normalization. In NeurIPS, 2019.
  • [58] T. Bachlechner et al. Rezero is all you need: Fast convergence at large depth. In Uncertainty in Artificial Intelligence, pp. 1352–1361. PMLR, 2021.
  • [59] B. Wu et al. Visual transformers: Token-based image representation and processing for computer vision. arXiv:2006.03677 , 2020.
  • [60] H. Touvron et al. Training data-efficient image transformers & distillation through attention. In ICML, 2020.
  • [61] Z. Liu et al. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
  • [62] C.-F. Chen et al. Regionvit: Regional-to-local attention for vision transformers. arXiv:2106.02689 , 2021.
  • [63] X. Chu et al. Twins: Revisiting the design of spatial attention in vision transformers. arXiv:2104.13840 , 2021.
  • [64] H. Lin et al. Cat: Cross attention in vision transformer. arXiv, 2021.
  • [65] X. Dong et al. Cswin transformer: A general vision transformer backbone with cross-shaped windows. arXiv:2107.00652 , 2021.
  • [66] Z. Huang et al. Shuffle transformer: Rethinking spatial shuffle for vision transformer. arXiv:2106.03650 , 2021.
  • [67] J. Fang et al. Msg-transformer: Exchanging local spatial information by manipulating messenger tokens. arXiv:2105.15168 , 2021.
  • [68] L. Yuan et al. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In ICCV, 2021.
  • [69] D. Zhou et al. Deepvit: Towards deeper vision transformer. arXiv, 2021.
  • [70] P. Wang et al. Kvt: k-nn attention for boosting vision transformers. arXiv:2106.00515 , 2021.
  • [71] D. Zhou et al. Refiner: Refining self-attention for vision transformers. arXiv:2106.03714 , 2021.
  • [72] A. El-Nouby et al. Xcit: Cross-covariance image transformers. arXiv:2106.09681 , 2021.
  • [73] W. Wang et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.
  • [74] S. Sun et al. Visual parser: Representing part-whole hierarchies with transformers. arXiv:2107.05790, 2021.
  • [75] H. Fan et al. Multiscale vision transformers. arXiv:2104.11227 , 2021.
  • [76] Z. Zhang et al. Nested hierarchical transformer: Towards accurate, data-efficient and interpretable visual understanding. In AAAI, 2022.
  • [77] Z. Pan et al. Less is more: Pay less attention in vision transformers. In AAAI, 2022.
  • [78] Z. Pan et al. Scalable visual transformers with hierarchical pooling. In ICCV, 2021.
  • [79] B. Heo et al. Rethinking spatial dimensions of vision transformers. In ICCV, 2021.
  • [80] C.-F. Chen et al. Crossvit: Cross-attention multi-scale vision transformer for image classification. In ICCV, 2021.
  • [81] Z. Wang et al. Uformer: A general u-shaped transformer for image restoration. arXiv:2106.03106 , 2021.
  • [82] X. Zhai et al. Scaling vision transformers. arXiv:2106.04560 , 2021.
  • [83] X. Su et al. Vision transformer architecture search. arXiv, 2021.
  • [84] M. Chen et al. Autoformer: Searching transformers for visual recognition. In ICCV, pp. 12270–12280, 2021.
  • [85] B. Chen et al. Glit: Neural architecture search for global and local image transformer. In ICCV, pp. 12–21, 2021.
  • [86] X. Chu et al. Conditional positional encodings for vision transformers. arXiv:2102.10882 , 2021.
  • [87] K. Wu et al. Rethinking and improving relative position encoding for vision transformer. In ICCV, 2021.
  • [88] H. Touvron et al. Going deeper with image transformers. arXiv:2103.17239 , 2021.
  • [89] Y. Tang et al. Augmented shortcuts for vision transformers. In NeurIPS, 2021.
  • [90] I. Tolstikhin et al. Mlp-mixer: An all-mlp architecture for vision. arXiv:2105.01601 , 2021.
  • [91] L. Melas-Kyriazi. Do you even need attention? a stack of feed-forward layers does surprisingly well on imagenet. arXiv:2105.02723 , 2021.
  • [92] M.-H. Guo et al. Beyond self-attention: External attention using two linear layers for visual tasks. arXiv:2105.02358 , 2021.
  • [93] H. Touvron et al. Resmlp: Feedforward networks for image classification with data-efficient training. arXiv:2105.03404 , 2021.
  • [94] M. Tan and Q. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
  • [95] J. Guo et al. Cmt: Convolutional neural networks meet vision transformers. arXiv:2107.06263 , 2021.
  • [96] L. Yuan et al. Volo: Vision outlooker for visual recognition. arXiv:2106.13112 , 2021.
  • [97] H. Wu et al. Cvt: Introducing convolutions to vision transformers. arXiv:2103.15808 , 2021.
  • [98] K. Yuan et al. Incorporating convolution designs into visual transformers. arXiv:2103.11816 , 2021.
  • [99] Y. Li et al. Localvit: Bringing locality to vision transformers. arXiv:2104.05707 , 2021.
  • [100] B. Graham et al. Levit: a vision transformer in convnet’s clothing for faster inference. In ICCV, 2021.
  • [101] A. Srinivas et al. Bottleneck transformers for visual recognition. In CVPR, 2021.
  • [102] Z. Chen et al. Visformer: The vision-friendly transformer. arXiv, 2021.
  • [103] T. Xiao et al. Early convolutions help transformers see better. In NeurIPS, volume 34, 2021.
  • [104] G. E. Hinton and R. S. Zemel. Autoencoders, minimum description length, and helmholtz free energy. NIPS, 6:3–10, 1994.
  • [105] P. Vincent et al. Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096–1103, 2008.
  • [106] A. v. d. Oord et al. Conditional image generation with pixelcnn decoders. arXiv:1606.05328, 2016.
  • [107] D. Pathak et al. Context encoders: Feature learning by inpainting. In CVPR, pp. 2536–2544, 2016.
  • [108] Z. Li et al. Mst: Masked self-supervised transformer for visual representation. In NeurIPS, 2021.
  • [109] H. Bao et al. Beit: Bert pre-training of image transformers. arXiv:2106.08254 , 2021.
  • [110] A. Radford et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • [111] Z. Xie et al. Simmim: A simple framework for masked image modeling. In CVPR, pp. 9653–9663, 2022.
  • [112] Z. Xie et al. Self-supervised learning with swin transformers. arXiv:2105.04553 , 2021.
  • [113] C. Li et al. Efficient self-supervised vision transformers for representation learning. arXiv:2106.09785 , 2021.
  • [114] K. He et al. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
  • [115] J. Beal et al. Toward transformer-based object detection. arXiv:2012.09958 , 2020.
  • [116] Z. Yuan et al. Temporal-channel transformer for 3d lidar-based video object detection for autonomous driving. IEEE TCSVT, 2021.
  • [117] X. Pan et al. 3d object detection with pointformer. In CVPR, 2021.
  • [118] R. Liu et al. End-to-end lane shape prediction with transformers. In WACV, 2021.
  • [119] S. Yang et al. Transpose: Keypoint localization via transformer. In ICCV, 2021.
  • [120] D. Zhang et al. Feature pyramid transformer. In ECCV, 2020.
  • [121] C. Chi et al. Relationnet++: Bridging visual representations for object detection via transformer decoder. NeurIPS, 2020.
  • [122] Z. Sun et al. Rethinking transformer-based set prediction for object detection. In ICCV, pp. 3611–3620, 2021.
  • [123] M. Zheng et al. End-to-end object detection with adaptive clustering transformer. In BMVC, 2021.
  • [124] T. Ma et al. Oriented object detection with transformer. arXiv:2106.03146 , 2021.
  • [125] P. Gao et al. Fast convergence of detr with spatially modulated co-attention. In ICCV, 2021.
  • [126] Z. Yao et al. Efficient detr: Improving end-to-end object detector with dense prior. arXiv:2104.01318 , 2021.
  • [127] Z. Tian et al. Fcos: Fully convolutional one-stage object detection. In ICCV, pp. 9627–9636, 2019.
  • [128] Y. Fang et al. You only look at one sequence: Rethinking transformer in vision through object detection. In NeurIPS, 2021.
  • [129] T.-Y. Lin et al. Focal loss for dense object detection. In ICCV, 2017.
  • [130] Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In CVPR, 2018.
  • [131] A. Bar et al. Detreg: Unsupervised pretraining with region priors for object detection. arXiv:2106.04550 , 2021.
  • [132] J. Hu et al. Istr: End-to-end instance segmentation with transformers. arXiv:2105.00637 , 2021.
  • [133] Z. Yang et al. Associating objects with transformers for video object segmentation. In NeurIPS, 2021.
  • [134] S. Wu et al. Fully transformer networks for semantic image segmentation. arXiv:2106.04108 , 2021.
  • [135] B. Dong et al. Solq: Segmenting objects by learning queries. In NeurIPS, 2021.
  • [136] R. Strudel et al. Segmenter: Transformer for semantic segmentation. In ICCV, 2021.
  • [137] E. Xie et al. Segformer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021.
  • [138] J. M. J. Valanarasu et al. Medical transformer: Gated axial-attention for medical image segmentation. In MICCAI, 2021.
  • [139] T. Prangemeier et al. Attention-based transformers for instance segmentation of cells in microstructures. In International Conference on Bioinformatics and Biomedicine, pp. 700–707. IEEE, 2020.
  • [140] C. R. Qi et al. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, pp. 652–660, 2017.
  • [141] C. R. Qi et al. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. NeurIPS, 30:5099–5108, 2017.
  • [142] S. Hampali et al. Handsformer: Keypoint transformer for monocular 3d pose estimation of hands and object in interaction. arXiv, 2021.
  • [143] Y. Li et al. Tokenpose: Learning keypoint tokens for human pose estimation. In ICCV, 2021.
  • [144] W. Mao et al. Tfpose: Direct human pose estimation with transformers. arXiv:2103.15320 , 2021.
  • [145] T. Jiang et al. Skeletor: Skeletal transformers for robust body-pose estimation. In CVPR, 2021.
  • [146] Y. Li et al. Test-time personalization with a transformer for human pose estimation. NeurIPS, 34, 2021.
  • [147] M. Lin et al. Detr for pedestrian detection. arXiv:2012.06785 , 2020.
  • [148] L. Tabelini et al. Polylanenet: Lane estimation via deep polynomial regression. In ICPR, pp. 6150–6156, 2021.
  • [149] L. Liu et al. Condlanenet: a top-to-down lane detection framework based on conditional convolution. arXiv:2105.05003 , 2021.
  • [150] P. Xu et al. A survey of scene graph: Generation and application. IEEE Trans. Neural Netw. Learn. Syst., 2020.
  • [151] J. Yang et al. Graph r-cnn for scene graph generation. In ECCV, 2018.
  • [152] S. Sharifzadeh et al. Classification by attention: Scene graph classification with prior knowledge. In AAAI, 2021.
  • [153] S. Sharifzadeh et al. Improving Visual Reasoning by Exploiting The Knowledge in Texts. arXiv:2102.04760 , 2021.
  • [154] C. Raffel et al. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21(140):1–67, 2020.
  • [155] N. Wang et al. Transformer meets tracker: Exploiting temporal context for robust visual tracking. In CVPR, 2021.
  • [156] M. Zhao et al. TrTr: Visual tracking with transformer. arXiv:2105.03817, 2021.
  • [157] X. Chen et al. Transformer tracking. In CVPR, 2021.
  • [158] P. Sun et al. TransTrack: Multiple object tracking with transformer. arXiv:2012.15460, 2021.
  • [159] S. He et al. TransReID: Transformer-based object re-identification. In ICCV, 2021.
  • [160] X. Liu et al. A video is worth three views: Trigeminal transformers for video-based person re-identification. arXiv:2104.01745 , 2021.
  • [161] T. Zhang et al. Spatiotemporal transformer for video-based person re-identification. arXiv:2103.16469 , 2021.
  • [162] N. Engel et al. Point transformer. IEEE Access, 9:134826–134840, 2021.
  • [163] M.-H. Guo et al. Pct: Point cloud transformer. Computational Visual Media, 7(2):187–199, 2021.
  • [164] H. Zhao et al. Point transformer. In ICCV, pp. 16259–16268, 2021.
  • [165] K. Lee et al. Vitgan: Training gans with vision transformers. arXiv:2107.04589, 2021.
  • [166] A. v. d. Oord et al. Neural discrete representation learning. arXiv, 2017.
  • [167] J. Ho et al. Denoising diffusion probabilistic models. In NeurIPS, pp. 6840–6851, 2020.
  • [168] A. Ramesh et al. Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125, 2022.
  • [169] R. Rombach et al. High-resolution image synthesis with latent diffusion models. In CVPR, pp. 10684–10695, 2022.
  • [170] X. Wang et al. Sceneformer: Indoor scene generation with transformers. In 3DV, pp. 106–115. IEEE, 2021.
  • [171] Z. Liu et al. Convtransformer: A convolutional transformer network for video frame synthesis. arXiv:2011.10185 , 2020.
  • [172] R. Girdhar et al. Video action transformer network. In CVPR, 2019.
  • [173] H. Liu et al. Two-stream transformer networks for video-based face alignment. T-PAMI, 40(11):2546–2554, 2017.
  • [174] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
  • [175] S. Lohit et al. Temporal transformer networks: Joint learning of invariant and discriminative time warping. In CVPR, 2019.
  • [176] M. Fayyaz and J. Gall. Sct: Set constrained temporal transformer for set supervised action segmentation. In CVPR, pp. 501–510, 2020.
  • [177] W. Choi et al. What are they doing?: Collective activity classification using spatio-temporal relationship among people. In ICCVW, 2009.
  • [178] K. Gavrilyuk et al. Actor-transformers for group activity recognition. In CVPR, pp. 839–848, 2020.
  • [179] J. Shao et al. Temporal context aggregation for video retrieval with contrastive learning. In WACV, 2021.
  • [180] V. Gabeur et al. Multi-modal transformer for video retrieval. In ECCV, pp. 214–229, 2020.
  • [181] Y. Chen et al. Memory enhanced global-local aggregation for video object detection. In CVPR, pp. 10337–10346, 2020.
  • [182] J. Yin et al. Lidar-based online 3d video object detection with graph-based message passing and spatiotemporal transformer attention. In CVPR, pp. 11495–11504, 2020.
  • [183] H. Seong et al. Video multitask transformer network. In ICCVW, 2019.
  • [184] K. M. Schatz et al. A recurrent transformer network for novel view action synthesis. In ECCV (27), pp. 410–426, 2020.
  • [185] C. Sun et al. Videobert: A joint model for video and language representation learning. In ICCV, pp. 7464–7473, 2019.
  • [186] L. H. Li et al. Visualbert: A simple and performant baseline for vision and language. arXiv:1908.03557 , 2019.
  • [187] W. Su et al. Vl-bert: Pre-training of generic visual-linguistic representations. In ICLR, 2020.
  • [188] Y.-S. Chuang et al. Speechbert: Cross-modal pre-trained language model for end-to-end spoken question answering. In Interspeech, 2020.
  • [189] R. Hu and A. Singh. Unit: Multimodal multitask learning with a unified transformer. In ICCV, 2021.
  • [190] S. Prasanna et al. When bert plays the lottery, all tickets are winning. In EMNLP, 2020.
  • [191] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2018.
  • [192] Y. Tang et al. Patch slimming for efficient vision transformers. arXiv:2106.02852 , 2021.
  • [193] M. Zhu et al. Vision transformer pruning. arXiv:2104.08500 , 2021.
  • [194] Z. Liu et al. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
  • [195] Z. Lan et al. Albert: A lite bert for self-supervised learning of language representations. In ICLR, 2020.
  • [196] C. Xu et al. Bert-of-theseus: Compressing bert by progressive module replacing. In EMNLP, pp. 7859–7869, 2020.
  • [197] S. Shen et al. Q-bert: Hessian based ultra low precision quantization of bert. In AAAI, pp. 8815–8821, 2020.
  • [198] O. Zafrir et al. Q8bert: Quantized 8bit bert. arXiv:1910.06188 , 2019.
  • [199] V. Sanh et al. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv:1910.01108 , 2019.
  • [200] S. Sun et al. Patient knowledge distillation for bert model compression. In EMNLP-IJCNLP, pp. 4323–4332, 2019.
  • [201] Z. Sun et al. Mobilebert: a compact task-agnostic bert for resource-limited devices. In ACL, pp. 2158–2170, 2020.
  • [202] I. Turc et al. Well-read students learn better: The impact of student initialization on knowledge distillation. arXiv:1908.08962 , 2019.
  • [203] X. Qiu et al. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pp. 1–26, 2020.
  • [204] A. Fan et al. Reducing transformer depth on demand with structured dropout. In ICLR, 2020.
  • [205] L. Hou et al. Dynabert: Dynamic bert with adaptive width and depth. NeurIPS, 33, 2020.
  • [206] Z. Wang et al. Structured pruning of large language models. In EMNLP, pp. 6151–6162, 2020.
  • [207] G. Hinton et al. Distilling the knowledge in a neural network. arXiv:1503.02531 , 2015.
  • [208] C. Buciluǎ et al. Model compression. In SIGKDD, pp. 535–541, 2006.
  • [209] J. Ba and R. Caruana. Do deep nets really need to be deep? NIPS, 2014.
  • [210] S. Mukherjee and A. H. Awadallah. Xtremedistil: Multi-stage distillation for massive multilingual models. In ACL, pp. 2221–2234, 2020.
  • [211] W. Wang et al. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv:2002.10957 , 2020.
  • [212] S. I. Mirzadeh et al. Improved knowledge distillation via teacher assistant. In AAAI, 2020.
  • [213] D. Jia et al. Efficient vision transformers via fine-grained manifold distillation. arXiv:2107.01378 , 2021.
  • [214] V. Vanhoucke et al. Improving the speed of neural networks on cpus. In NIPS Workshop, 2011.
  • [215] Z. Yang et al. Searching for low-bit weights in quantized neural networks. In NeurIPS, 2020.
  • [216] E. Park and S. Yoo. Profit: A novel training method for sub-4-bit mobilenet models. In ECCV, pp. 430–446. Springer, 2020.
  • [217] J. Fromm et al. Riptide: Fast end-to-end binarized neural networks. Proceedings of Machine Learning and Systems, 2:379–389, 2020.
  • [218] Y. Bai et al. Proxquant: Quantized neural networks via proximal operators. In ICLR, 2019.
  • [219] A. Bhandare et al. Efficient 8-bit quantization of transformer neural machine language translation model. arXiv:1906.00532 , 2019.
  • [220] C. Fan. Quantized transformer. Technical report, Stanford Univ., 2019.
  • [221] K. Shridhar et al. End to end binarized neural networks for text classification. In SustaiNLP, 2020.
  • [222] R. Cheong and R. Daniel. transformers.zip: Compressing transformers with pruning and quantization. Technical report, 2019.
  • [223] Z. Zhao et al. An investigation on different underlying quantization schemes for pre-trained language models. In NLPCC, 2020.
  • [224] Z. Liu et al. Post-training quantization for vision transformer. In NeurIPS, 2021.
  • [225] Z. Wu et al. Lite transformer with long-short range attention. In ICLR, 2020.
  • [226] Z. Geng et al. Is attention better than matrix decomposition? In ICLR, 2020.
  • [227] Y. Guo et al. Nat: Neural architecture transformer for accurate and compact architectures. In NeurIPS, pp. 737–748, 2019.
  • [228] D. So et al. The evolved transformer. In ICML, pp. 5877–5886, 2019.
  • [229] C. Li et al. Bossnas: Exploring hybrid cnn-transformers with block-wisely self-supervised neural architecture search. In ICCV, 2021.
  • [230] A. Katharopoulos et al. Transformers are rnns: Fast autoregressive transformers with linear attention. In ICML, 2020.
  • [231] C. Yun et al. O(n) connections are expressive enough: Universal approximability of sparse transformers. In NeurIPS, 2020.
  • [232] M. Zaheer et al. Big bird: Transformers for longer sequences. In NeurIPS, 2020.
  • [233] D. A. Spielman and S.-H. Teng. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4), 2011.
  • [234] F. Chung and L. Lu. The average distances in random graphs with given expected degrees. PNAS, 99(25):15879–15882, 2002.
  • [235] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [236] X. Zhai et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv:1910.04867 , 2019.
  • [237] Y. Cheng et al. Robust neural machine translation with doubly adversarial inputs. In ACL, 2019.
  • [238] W. E. Zhang et al. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM TIST, 11(3):1–41, 2020.
  • [239] K. Mahmood et al. On the robustness of vision transformers to adversarial examples. arXiv:2104.02610 , 2021.
  • [240] X. Mao et al. Towards robust vision transformer. arXiv, 2021.
  • [241] S. Serrano and N. A. Smith. Is attention interpretable? In ACL, 2019.
  • [242] S. Wiegreffe and Y. Pinter. Attention is not not explanation. In EMNLP-IJCNLP, 2019.
  • [243] H. Chefer et al. Transformer interpretability beyond attention visualization. In CVPR, pp. 782–791, 2021.
  • [244] R. Livni et al. On the computational efficiency of training neural networks. In NeurIPS, 2014.
  • [245] B. Neyshabur et al. Towards understanding the role of over-parametrization in generalization of neural networks. In ICLR, 2019.
  • [246] K. Han et al. Ghostnet: More features from cheap operations. In CVPR, pp. 1580–1589, 2020.
  • [247] K. Han et al. Model rubik’s cube: Twisting resolution, depth and width for tinynets. NeurIPS, 33, 2020.
  • [248] T. Chen et al. Diannao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In ASPLOS, pp. 269–284, 2014.
  • [249] H. Liao et al. Davinci: A scalable architecture for neural network computing. In IEEE Hot Chips 31 Symposium (HCS), 2019.
  • [250] A. Jaegle et al. Perceiver: General perception with iterative attention. In ICML, pp. 4651–4664, 2021.
  • [251] A. Jaegle et al. Perceiver io: A general architecture for structured inputs & outputs. arXiv:2107.14795, 2021.
  • [252] X. Wang et al. Non-local neural networks. In CVPR, pp. 7794–7803, 2018.
  • [253] A. Buades et al. A non-local algorithm for image denoising. In CVPR, pp. 60–65, 2005.
  • [254] J. Chung et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555 , 2014.
  • [255] M. Joshi et al. Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77, 2020.
  • [256] Y. Liu et al. Roberta: A robustly optimized bert pretraining approach. arXiv:1907.11692 , 2019.
  • [257] Y. Zhu et al. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, pp. 19–27, 2015.
  • [258] A. Radford et al. Improving language understanding by generative pre-training, 2018.
  • [259] Z. Yang et al. Xlnet: Generalized autoregressive pretraining for language understanding. In NeurIPS, pp. 5753–5763, 2019.
  • [260] K. Clark et al. Electra: Pre-training text encoders as discriminators rather than generators. arXiv:2003.10555 , 2020.
  • [261] L. Dong et al. Unified language model pre-training for natural language understanding and generation. In NeurIPS, pp. 13063–13075, 2019.
  • [262] M. Lewis et al. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv:1910.13461 , 2019.
  • [263] Z. Zhang et al. Ernie: Enhanced language representation with informative entities. arXiv:1905.07129 , 2019.
  • [264] M. E. Peters et al. Knowledge enhanced contextual word representations. arXiv:1909.04164 , 2019.
  • [265] J. Lee et al. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020.
  • [266] I. Beltagy et al. Scibert: A pretrained language model for scientific text. arXiv:1903.10676 , 2019.
  • [267] K. Huang et al. Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv:1904.05342 , 2019.
  • [268] J. Ba et al. Multiple object recognition with visual attention. In ICLR, 2014.
  • [269] V. Mnih et al. Recurrent models of visual attention. NeurIPS, pp. 2204–2212, 2014.
  • [270] K. Xu et al. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pp. 2048–2057, 2015.
  • [271] F. Wang et al. Residual attention network for image classification. In CVPR, pp. 3156–3164, 2017.
  • [272] S. Jetley et al. Learn to pay attention. In ICLR, 2018.
  • [273] K. Han et al. Attribute-aware attention model for fine-grained representation learning. In ACM MM, pp. 2040–2048, 2018.
  • [274] P. Ramachandran et al. Stand-alone self-attention in vision models. In NeurIPS, 2019.
  • [275] Q. Guan et al. Diagnose like a radiologist: Attention guided convolutional neural network for thorax disease classification. arXiv:1801.09927, 2018.
  • [276] J. Hu et al. Squeeze-and-excitation networks. In CVPR, pp. 7132–7141, 2018.
  • [277] H. Zhao et al. Psanet: Point-wise spatial attention network for scene parsing. In ECCV, pp. 267–283, 2018.
  • [278] Y. Yuan et al. Ocnet: Object context for semantic segmentation. International Journal of Computer Vision, pp. 1–24, 2021.
  • [279] J. Fu et al. Dual attention network for scene segmentation. In CVPR, pp. 3146–3154, 2019.
  • [280] H. Zhang et al. Co-occurrent features in semantic segmentation. In CVPR, pp. 548–557, 2019.
  • [281] F. Zhang et al. Acfnet: Attentional class feature network for semantic segmentation. In ICCV, pp. 6798–6807, 2019.
  • [282] X. Li et al. Expectation-maximization attention networks for semantic segmentation. In ICCV, pp. 9167–9176, 2019.
  • [283] J. He et al. Adaptive pyramid context network for semantic segmentation. In CVPR, pp. 7519–7528, 2019.
  • [284] O. Oktay et al. Attention u-net: Learning where to look for the pancreas. 2018.
  • [285] Y. Wang et al. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In CVPR, pp. 12275–12284, 2020.
  • [286] X. Li et al. Global aggregation then local distribution in fully convolutional networks. In BMVC, 2019.
  • [287] Y. Chen et al. A^2-Nets: Double attention networks. NeurIPS, pp. 352–361, 2018.
  • [288] L. Zhang et al. Dual graph convolutional network for semantic segmentation. In BMVC, 2019.
  • [289] K. Yue et al. Compact generalized non-local network. In NeurIPS, pp. 6510–6519, 2018.
  • [290] Z. Huang et al. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, pp. 603–612, 2019.
  • [291] L. Huang et al. Interlaced sparse self-attention for semantic segmentation. arXiv:1907.12273 , 2019.
  • [292] Y. Li and A. Gupta. Beyond grids: Learning graph representations for visual recognition. NeurIPS, pp. 9225–9235, 2018.
  • [293] S. Kumaar et al. Cabinet: Efficient context aggregation network for low-latency semantic segmentation. arXiv:2011.00993 , 2020.
  • [294] X. Liang et al. Symbolic graph reasoning meets convolutions. NeurIPS, pp. 1853–1863, 2018.
  • [295] Y. Chen et al. Graph-based global reasoning networks. In CVPR, pp. 433–442, 2019.
  • [296] T.-Y. Lin et al. Microsoft coco: Common objects in context. In ECCV, pp. 740–755, 2014.
  • [297] Y. Cao et al. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In ICCV Workshops, 2019.
  • [298] W. Li et al. Object detection based on an adaptive attention mechanism. Scientific Reports, pp. 1–13, 2020.
  • [299] T.-I. Hsieh et al. One-shot object detection with co-attention and co-excitation. In NeurIPS, pp. 2725–2734, 2019.
  • [300] Q. Fan et al. Few-shot object detection with attention-rpn and multi-relation detector. In CVPR, pp. 4013–4022, 2020.
  • [301] H. Perreault et al. Spotnet: Self-attention multi-task network for object detection. In CRV, pp. 230–237, 2020.
  • [302] X.-T. Vo et al. Bidirectional non-local networks for object detection. In International Conference on Computational Collective Intelligence, pp. 491–501, 2020.
  • [303] H. Hu et al. Relation networks for object detection. In CVPR, pp. 3588–3597, 2018.
  • [304] K. Zhang et al. Learning enhanced resolution-wise features for human pose estimation. In ICIP, pp. 2256–2260, 2020.
  • [305] Y. Chang et al. The same size dilated attention network for keypoint detection. In International Conference on Artificial Neural Networks, pp. 471–483, 2019.
  • [306] A. Johnston and G. Carneiro. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In CVPR, pp. 4756–4765, 2020.
  • [307] Y. Chen et al. Attention-based context aggregation network for monocular depth estimation. International Journal of Machine Learning and Cybernetics, pp. 1583–1596, 2021.
  • [308] S. Aich et al. Bidirectional attention network for monocular depth estimation. In ICRA, 2021.