Local Distortion Aware Efficient Transformer Adaptation for Image Quality Assessment
Abstract
Image Quality Assessment (IQA) constitutes a fundamental task within the field of computer vision, yet it remains an unresolved challenge owing to intricate distortion conditions, diverse image contents, and the limited availability of data. Recently, the community has witnessed the emergence of numerous large-scale pretrained foundation models, which greatly benefit from dramatically increased data and parameter capacities. However, it remains an open problem whether the scaling law observed in high-level tasks is also applicable to IQA, a task closely related to low-level clues. In this paper, we demonstrate that, with proper injection of local distortion features, a larger pretrained and fixed foundation model performs better on IQA tasks. Specifically, to compensate for the lack of local distortion structure and inductive bias in the vision transformer (ViT), we use, alongside the large-scale pretrained ViT, another pretrained convolutional neural network (CNN), which is well known for capturing local structure, to extract multi-scale image features. Further, we propose a local distortion extractor to obtain local distortion features from the pretrained CNN and a local distortion injector to inject these features into the ViT. By training only the extractor and the injector, our method can benefit from the rich knowledge in powerful foundation models and achieve state-of-the-art performance on popular IQA datasets, indicating that IQA is not only a low-level problem but also benefits from stronger high-level features drawn from large-scale pretrained models.
Index Terms:
component, formatting, style, styling, insert
I Introduction
In the digital era, with millions of images being shared and distributed across various platforms daily, the internet has transformed into a vast repository of visual content. As users exchange and upload images for diverse purposes, spanning from social media interactions to professional applications, ensuring the highest quality and fidelity of these visuals has become highly desirable. Consequently, there has been a substantial increase in the demand for robust image quality assessment (IQA) methods [1, 2, 3]. The precise evaluation of image quality holds significant implications, particularly for social media platforms, as it enables them to determine optimal parameter settings for post-upload processing of images, such as resizing, compression, and enhancement, and further ensure a positive user experience, contributing to user satisfaction and engagement.

Figure 1: Comparison with SOTA IQA methods on the KonIQ-10k [4] dataset, where the size of each point indicates the model size of the whole network.
Amidst the vast volume of data shared on the internet, numerous pretrained large language models [5, 6], vision models [7, 8], and vision-language models [9, 10] have emerged. Nevertheless, the annotation process for IQA datasets necessitates multiple human annotations for each image, rendering the collection process extremely labor-intensive and financially burdensome. Consequently, the current discipline of IQA suffers from an insufficiency of labeled data, with existing IQA datasets proving inadequate to effectively train large-scale learning models. To address this challenge, a direct approach involves constructing models founded on pretrained Convolutional Neural Networks (CNNs) [11] or Vision Transformers (ViTs) [12, 13]. Additionally, some studies have proposed IQA-specific pretraining approaches [3, 14]. Nevertheless, pretraining large models on large datasets requires a considerable investment of time and resources, causing these methodologies to frequently rely on smaller models and datasets, such as ResNet-50 [15, 14] and ImageNet-1K [13, 14].
Vision models have witnessed a progression from EfficientNet-based [16] architectures (comprising 480M parameters) to Transformer-based counterparts [17] with an incredible parameter count of 2,100M, and more recently they have risen to unprecedented scales, encompassing 22B parameters [18] and 562B parameters [10], with expectations for further growth. Given the magnitude of such large models, traditional pretraining and full fine-tuning approaches prove exceptionally challenging, as they require repeating the entire training process for every specific task. In light of this, and drawing inspiration from efficient model adaptation techniques in natural language processing (NLP) [19], a variety of visual tuning methods [20, 21] have emerged, enabling the adaptation of pretrained vision or vision-language models to downstream tasks. This practice differs from the procedure of transfer learning, which either fully fine-tunes the whole model or fine-tunes only the task head [22]. As such, whether IQA models can leverage shared parameter weights (typically interpreted as the knowledge of pretrained models) from large-scale pretrained models to improve performance remains of the greatest significance and interest.
In this work, we make the first attempt to efficiently adapt large-scale pretrained models to IQA tasks, namely LOcal Distortion Aware efficient transformer adaptation (LoDa). The majority of large-scale pretrained models [8, 10] are grounded in the Transformer architecture [23], which is powerful for modeling non-local dependencies [24, 13] but weak at capturing local structure and lacks inductive bias [25]. However, IQA is highly reliant on both local and non-local features [24, 13]. In addition, as the human visual system captures an image in a multi-scale fashion, previous works [12, 24] have also shown the benefit of using multi-scale features extracted from CNN feature maps at different depths for IQA. With these insights, we propose to inject multi-scale features extracted by CNNs into the ViT, thereby enriching its representation with local distortion features and inductive bias.
Specifically, we feed input images into both a pretrained CNN and a large-scale pretrained ViT, yielding a set of multi-scale features. Then we employ convolution and average pooling processes to collect multi-scale distortion features while discarding redundant data from the multi-scale features. However, the process of infusing these multi-scale features into ViT is not straightforward. Indeed, although we can manipulate and reshape the multi-scale features to mirror the shape of ViT tokens and simply merge them, it is crucial to acknowledge that an image token within ViT corresponds to a patch extracted from the original image, which might not align with the scale of the multi-scale features. To this end, we introduce the cross-attention mechanism, allowing us to query features resembling the image token of ViT from the multi-scale features. These queried features are subsequently fused with the image tokens, ensuring a seamless and meaningful integration of the distortion-related data.
Furthermore, considering the substantial channel dimension of the large-scale pretrained vision transformer (768 for ViT-B), it is imperative to address potential issues stemming from employing this dimension directly in the context of cross-attention. It could lead to an overwhelming number of parameters and computational overhead, which is inconsistent with the principles of efficient model adaptation. Taking inspiration from the concept of adapters in the field of NLP [26], we propose to down-project ViT tokens and multi-scale distortion features to a smaller dimension, which serves to mitigate parameter increase and computational demands. In general, the contributions of this paper can be summarized as three-fold:
- We make the first attempt to efficiently adapt large-scale pretrained models to IQA tasks. We leverage the knowledge of large-scale pretrained models to develop an IQA model that introduces only a small number of trainable parameters to alleviate the scarcity of training data.
- We embed supplementary multi-scale features obtained from pretrained CNNs into large-scale pretrained ViTs. With proper local distortion injection, a larger pretrained backbone shows better IQA performance.
- Extensive experiments on seven IQA benchmarks show that our method significantly outperforms other counterparts with far fewer trainable parameters, indicating the effectiveness and generalization ability of our method.
II Related Work
II-A Deep Learning Based Image Quality Assessment

Figure 2: Overview of the proposed LoDa framework.
With the success of deep learning in many computer vision tasks, different approaches utilize deep learning for IQA: early CNN-based methods [11, 1, 2] and, more recently, transformer-based methods [24, 13]. Modern CNN-based models commonly posit that the initial stages of the network encapsulate low-level spatial characteristics, whereas subsequent stages capture higher-level semantic features [27]. Based on this, Su et al. [11] put forth a method in which multi-scale features and semantic features are extracted from images using the ResNet architecture [15]. They then capture local distortion information from the multi-scale features and use the semantic features to generate the weights of a quality prediction target network. Lastly, the target network takes the aggregated local distortion features as input to predict image quality.
Although CNNs capture the local structure of the image, they are known to miss non-local information and to exhibit a strong locality bias. On the contrary, the Vision Transformer (ViT) [28] has a strong capability of modeling the non-local dependencies among image features, and thus transformer-based methods demonstrate great potential for image quality assessment. Golestaneh, Dadsetan, and Kitani [24] proposed a method that utilizes CNNs to extract perceptual features as inputs to the Transformer encoder. Ke et al. [12] and Qin et al. [13] directly send image patches as inputs to the Transformer encoder.
II-B Large-scale Pretrained Models
Recently, the parameter capacities of vision models have been undergoing a rapid expansion, scaling from 480M parameters of EfficientNet-based models [16] to 22B parameters of Transformer-based counterparts [18]. As a consequence, their requirements for training data and training techniques are expanding accordingly. These models are commonly trained on large-scale labeled datasets [8, 29] in a supervised or self-supervised manner. Moreover, some works [10] adopt large-scale multi-modal data (e.g., image-text pairs) for training, which leads to even more powerful visual representations. In this work, we take advantage of these well-pretrained image models and adapt them efficiently to solve IQA tasks.
II-C Efficient Model Adaptation
In the field of NLP, efficient model adaptation techniques involve adding or modifying a limited number of parameters of the model, as limiting the dimension of the optimization problem can prevent catastrophic forgetting. Conventional approaches [11] typically adopt full fine-tuning in downstream tasks, and little attention has been paid to efficient adaptation, especially for vision Transformers. However, with the surge of large-scale pretrained models, the conventional paradigm is inevitably limited by the huge computational burden, and thus some works [20, 21] migrate the efficient model adaptation techniques developed in NLP to computer vision.
Due to the paucity of labeled data for training, IQA methods are unable to realize their full potential. Previous works [11, 24] commonly full-finetune a whole network pretrained on ImageNet-1K, but the model and data are insufficiently large. In this work, we propose employing efficient model adaptation techniques to adapt large-scale pretrained models to IQA tasks.
III The Proposed Method
III-A Overall Architecture
In an effort to further improve the efficiency of pretrained model adaptation and customize it for IQA tasks, we devise an efficient transformer-based adaptation framework, namely LOcal Distortion Aware efficient transformer adaptation (LoDa). As depicted in Figure 2, the framework of LoDa is composed of two components. The first component incorporates a large-scale pretrained Vision Transformer (ViT) [28]. The second component comprises a pretrained CNN tasked with extracting multi-scale features from the input image, together with a local distortion extractor responsible for capturing local distortion features from the extracted multi-scale features. Subsequently, a local distortion-aware injector is employed to procure the corresponding local distortion tokens, which are similar to the tokens of the ViT model, and to infuse them into the ViT at later stages.
Specifically, upon receiving an input image, our process begins by directing it to a pretrained CNN to extract multi-scale features. These multi-scale feature maps are individually routed into separate local distortion extractors, generating distinct local distortion features. These local distortion features are then reshaped and concatenated to create multi-scale distortion tokens for later interaction. Simultaneously, the input image is fed into the patch embedding layer of the pretrained ViT. Here, the image is divided into non-overlapping patches, which are flattened and projected into $D$-dimensional tokens by the patch embedding layer, and position embeddings are added to these tokens. After this process, the tokens, acting as queries, are coupled with the multi-scale distortion tokens and subjected to cross-attention. This results in the extraction of similar local distortion features from the multi-scale distortion tokens, which are subsequently injected into the tokens of the ViT, thereby enhancing the distortion-related information encompassed by these tokens. Following this, the tokens, along with the augmented distortion features, traverse the transformer encoder layers and cross-attention blocks. Finally, the CLS token from the ViT serves as the input to the quality regressor, enabling the derivation of the final quality score.
It is noteworthy that during adaptation, only the local distortion extractor modules, the local distortion aware injectors, and the head are trainable, while the weights of the pretrained ViT encoder and the pretrained CNN are frozen.
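To make the setup above concrete, the following PyTorch sketch shows one way the two frozen backbones could be instantiated: a pretrained ResNet-50 exposing its last four stages as multi-scale features, and an ImageNet-21k pretrained ViT-B providing patch tokens. The timm checkpoint identifier and the torchvision feature-extraction API are illustrative assumptions rather than the released implementation.

```python
import torch
import timm
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.feature_extraction import create_feature_extractor

# Frozen CNN branch: expose the outputs of the four residual stages as multi-scale features.
cnn = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
cnn_feats = create_feature_extractor(
    cnn, return_nodes={"layer1": "f1", "layer2": "f2", "layer3": "f3", "layer4": "f4"})
for p in cnn_feats.parameters():
    p.requires_grad = False

# Frozen ViT branch (an ImageNet-21k pretrained ViT-B from timm; the exact
# checkpoint identifier is an assumption).
vit = timm.create_model("vit_base_patch16_224.augreg_in21k", pretrained=True)
for p in vit.parameters():
    p.requires_grad = False

x = torch.randn(2, 3, 224, 224)        # a batch of cropped image patches
feats = cnn_feats(x)                   # dict of 4 feature maps at different scales
tokens = vit.patch_embed(x)            # (B, N, D) patch tokens before the encoder
print([f.shape for f in feats.values()], tokens.shape)
```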


Figure 3: Architecture of the proposed local distortion extractor and injector.
III-B Local Distortion Extractor
The majority of large-scale pretrained models [8, 7, 17, 10] are grounded in the Transformer architecture [23], renowned for its robust capacity to model non-local dependencies among perceptual features within an image. However, these models exhibit a weak inductive bias. Conversely, CNNs excel at capturing the local structure of an image, exhibiting a strong locality bias, but they falter in capturing non-local information [24, 13]. Nonetheless, IQA is highly reliant on both local and non-local features [24, 13]. In the absence of abundant labeled data [3, 14], the adaptation of large-scale pretrained models to IQA may suffer from a deficiency in local structure and inductive bias. This deficiency can, however, be mitigated by leveraging the capabilities of CNNs [30]. In light of these considerations, we propose the exploitation of the local structure and inductive bias derived from pretrained CNNs to strengthen the adaptation of large-scale pretrained models for IQA without altering their original architecture.
As shown in Figure 2, given an input image $x$, the pretrained CNN outputs a set of multi-scale features $\{f_i\}_{i=1}^{4}$, where $f_i \in \mathbb{R}^{B \times C_i \times H_i \times W_i}$ denotes the output of the $i$-th block of the CNN, $B$ denotes the batch size, and $C_i$, $W_i$, and $H_i$ denote the channel size, width, and height of the $i$-th feature map, respectively. The reason why we extract multi-scale features is that the semantic features extracted from the last layer merely represent holistic image content [11]. In order to capture local distortions in the real world, we propose to extract multi-scale features through a local distortion extractor. As illustrated in Figure 3(a), we use sequential trainable convolutional layers to project them into equal dimensions and further extract image quality-related local distortion features and inductive bias from the multi-scale features. Since the initial features of pretrained CNNs have relatively large dimensions, to keep adaptation efficient, we use average pooling to pool the extracted features into a smaller size. Let $f'_i$ denote the output feature after sending $f_i$ through the convolutions and pooling. Next, we flatten and concatenate $\{f'_i\}_{i=1}^{4}$ to obtain the multi-scale distortion tokens $F_{ms}$, which serve as the input to the local distortion injector.
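Below is a minimal sketch of one possible local distortion extractor. The exact kernel sizes of the trainable convolutions, the common channel dimension, and the pooled spatial size are not specified above, so a 1x1 followed by a 3x3 convolution, a 768-dimensional output, and a 7x7 pooled size are assumed here purely for illustration.

```python
import torch
import torch.nn as nn

class LocalDistortionExtractor(nn.Module):
    """Sketch of a per-scale extractor: project a CNN feature map to a common
    channel dimension, pool it to a fixed spatial size, and flatten it to tokens.
    The 1x1 + 3x3 convolution stack and the 7x7 pooled size are assumptions."""

    def __init__(self, in_channels: int, out_channels: int = 768, pooled_size: int = 7):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )
        self.pool = nn.AdaptiveAvgPool2d(pooled_size)

    def forward(self, f: torch.Tensor) -> torch.Tensor:   # f: (B, C_i, H_i, W_i)
        f = self.pool(self.proj(f))                        # (B, D, p, p)
        return f.flatten(2).transpose(1, 2)                # (B, p*p, D) distortion tokens

# Multi-scale distortion tokens F_ms: one extractor per CNN stage, outputs concatenated.
extractors = nn.ModuleList(
    [LocalDistortionExtractor(c) for c in (256, 512, 1024, 2048)])  # ResNet-50 stage widths
feats = [torch.randn(2, c, s, s) for c, s in ((256, 56), (512, 28), (1024, 14), (2048, 7))]
F_ms = torch.cat([ext(f) for ext, f in zip(extractors, feats)], dim=1)  # (B, 4*p*p, D)
print(F_ms.shape)  # torch.Size([2, 196, 768])
```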
III-C Local Distortion Injector
A direct approach to infusing the multi-scale distortion tokens into the tokens of large-scale pretrained ViT models is to simply add the features to the tokens. Nevertheless, it should be noted that an image token in ViT corresponds to a patch of the original image, which might not align with the scale of the multi-scale distortion features. To address this misalignment, we introduce a cross-attention mechanism, which enables us to query features akin to the image tokens of the ViT from the multi-scale distortion features. Subsequently, the queried features are combined with the image tokens, ensuring a coherent and effective integration of the distortion information.
As illustrated in Figure 3(b), after passing the input image $x$ to the large-scale pretrained ViT, let $X^i$ denote the tokens of the $i$-th block of the ViT (including the CLS token and image tokens). We take $X^i$ as the query $Q$ and the multi-scale distortion tokens $F_{ms}$ as the key $K$ and value $V$ of multi-head cross-attention (MHCA) to obtain multi-scale distortion tokens $\tilde{F}^i_{ms}$ that are similar to $X^i$ from $F_{ms}$:

$$\tilde{F}^i_{ms} = \mathrm{MHCA}(X^i, F_{ms}, F_{ms}). \tag{1}$$
Then, the queried multi-scale distortion tokens $\tilde{F}^i_{ms}$ are added to the ViT tokens $X^i$, which can be written as Eqn. 2:

$$X^i \leftarrow X^i + s \cdot \tilde{F}^i_{ms}, \tag{2}$$

where $s$ represents a trainable vector designed to strike a balance between the output of the attention layer and the input feature $X^i$. To facilitate this balance, $s$ is initialized to a value close to 0. This specific initialization strategy ensures that the feature distribution of $X^i$ remains unchanged despite the injection of queried multi-scale distortion features, thereby allowing more effective utilization of the pretrained weights of the ViT during adaptation.
Due to the channel dimension of the large-scale pretrained vision transformer being relatively large (768 for ViT-B), directly using this dimension for the extra MHCA would bring a tremendous number of parameters and computational overhead, which is not consistent with efficient model adaptation. Inspired by the adapter [26] in NLP, we propose to down-project the ViT tokens $X^i$ and the multi-scale distortion tokens $F_{ms}$ to a smaller dimension $d$,

$$\hat{X}^i = \mathrm{MLP}_{\downarrow}(X^i), \tag{3}$$

$$\hat{F}_{ms} = \mathrm{MLP}_{\downarrow}(F_{ms}), \tag{4}$$

where $\mathrm{MLP}_{\downarrow}$ denotes a trainable down-projection MLP layer that projects the ViT tokens $X^i$ and the multi-scale distortion tokens $F_{ms}$ into $\hat{X}^i$ and $\hat{F}_{ms}$, respectively. Notably, it is $\hat{X}^i$ and $\hat{F}_{ms}$ that take on the roles of query $Q$, key $K$, and value $V$ within MHCA, instead of $X^i$ and $F_{ms}$. Lastly, we up-project the result of the cross-attention back to the dimension of the ViT tokens by another trainable MLP layer.
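A minimal sketch of the injector described by Eqns. (1)-(4) is given below. Whether the two down-projections share weights, and other layer details beyond the equations, are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LocalDistortionInjector(nn.Module):
    """Sketch of the injector: down-project the ViT tokens and the multi-scale
    distortion tokens to a small latent dimension d, run multi-head cross-attention
    with the ViT tokens as queries, up-project, and add the result back through a
    gating vector s initialised near zero (Eqn. 2)."""

    def __init__(self, vit_dim: int = 768, latent_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.down_q = nn.Linear(vit_dim, latent_dim)    # down-projection of ViT tokens X^i
        self.down_kv = nn.Linear(vit_dim, latent_dim)   # down-projection of distortion tokens F_ms
        self.mhca = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.up = nn.Linear(latent_dim, vit_dim)        # up-projection back to the ViT dimension
        self.gate = nn.Parameter(torch.zeros(vit_dim))  # trainable vector s, initialised ~ 0

    def forward(self, x: torch.Tensor, f_ms: torch.Tensor) -> torch.Tensor:
        q = self.down_q(x)                              # (B, N+1, d), Eqn. (3)
        kv = self.down_kv(f_ms)                         # (B, M, d),   Eqn. (4)
        attn, _ = self.mhca(q, kv, kv)                  # queried distortion tokens, Eqn. (1)
        return x + self.gate * self.up(attn)            # gated injection, Eqn. (2)

# Example: inject distortion tokens into the ViT tokens before an encoder layer.
x = torch.randn(2, 197, 768)      # CLS + 196 image tokens of ViT-B
f_ms = torch.randn(2, 196, 768)   # multi-scale distortion tokens from the extractor
injector = LocalDistortionInjector()
print(injector(x, f_ms).shape)    # torch.Size([2, 197, 768])
```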
III-D IQA Regression
With the output CLS token of the ViT, we feed it into a single-layer regressor head to obtain the quality score. A PLCC-induced loss is employed for training. Assuming there are $N$ images in the training batch, with predicted quality scores $\hat{y} = \{\hat{y}_1, \dots, \hat{y}_N\}$ and corresponding labels $y = \{y_1, \dots, y_N\}$, the PLCC-induced loss is defined as:

$$\mathcal{L}_{\mathrm{PLCC}} = \frac{1}{2}\left(1 - \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}}\right), \tag{5}$$

where $\bar{\hat{y}}$ and $\bar{y}$ are the mean values of $\hat{y}$ and $y$, respectively.
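A minimal sketch of this loss, assuming the common $\frac{1}{2}(1-\mathrm{PLCC})$ form of Eqn. (5):

```python
import torch

def plcc_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """PLCC-induced loss of Eqn. (5): 0.5 * (1 - PLCC) over a training batch, so that
    maximising the linear correlation between predictions and labels minimises the loss."""
    pred_c = pred - pred.mean()
    target_c = target - target.mean()
    plcc = (pred_c * target_c).sum() / (
        torch.sqrt((pred_c ** 2).sum()) * torch.sqrt((target_c ** 2).sum()) + eps)
    return 0.5 * (1.0 - plcc)

pred = torch.tensor([0.2, 0.5, 0.9, 0.4])
mos = torch.tensor([0.1, 0.6, 0.8, 0.5])
print(plcc_loss(pred, mos))   # small value, since the two score lists are well correlated
```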
Method | LIVE SRCC | LIVE PLCC | TID2013 SRCC | TID2013 PLCC | KADID-10k SRCC | KADID-10k PLCC | LIVEC SRCC | LIVEC PLCC | KonIQ-10k SRCC | KonIQ-10k PLCC | SPAQ SRCC | SPAQ PLCC | FLIVE SRCC | FLIVE PLCC
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
ILNIQE | 0.902 | 0.906 | 0.521 | 0.648 | 0.534 | 0.558 | 0.508 | 0.508 | 0.523 | 0.537 | 0.713 | 0.712 | 0.294 | 0.332 |
BRISQUE | 0.929 | 0.944 | 0.626 | 0.571 | 0.528 | 0.567 | 0.629 | 0.629 | 0.681 | 0.685 | 0.809 | 0.817 | 0.303 | 0.341 |
WaDIQaM-NR | 0.960 | 0.955 | 0.835 | 0.855 | 0.739 | 0.752 | 0.682 | 0.671 | 0.804 | 0.807 | - | - | 0.455 | 0.467 |
DB-CNN | 0.968 | 0.971 | 0.816 | 0.865 | 0.851 | 0.856 | 0.851 | 0.869 | 0.875 | 0.884 | 0.911 | 0.915 | 0.545 | 0.551 |
TIQA | 0.949 | 0.965 | 0.846 | 0.858 | 0.850 | 0.855 | 0.845 | 0.861 | 0.892 | 0.903 | - | - | 0.541 | 0.581 |
MetaIQA | 0.960 | 0.959 | 0.856 | 0.868 | 0.762 | 0.775 | 0.835 | 0.802 | 0.887 | 0.856 | - | - | 0.540 | 0.507 |
P2P-BM | 0.959 | 0.958 | 0.862 | 0.856 | 0.840 | 0.849 | 0.844 | 0.842 | 0.872 | 0.885 | - | - | 0.526 | 0.598 |
HyperIQA (27M) | 0.962 | 0.966 | 0.840 | 0.858 | 0.852 | 0.845 | 0.859 | 0.882 | 0.906 | 0.917 | 0.911 | 0.915 | 0.544 | 0.602
MUSIQ (27M) | 0.940 | 0.911 | 0.773 | 0.815 | 0.875 | 0.872 | 0.702 | 0.746 | 0.916 | 0.928 | 0.918 | 0.921 | 0.566 | 0.661 |
TReS (152M) | 0.969 | 0.968 | 0.863 | 0.883 | 0.859 | 0.858 | 0.846 | 0.877 | 0.915 | 0.928 | - | - | 0.544 | 0.625
DEIQT (24M) | 0.980 | 0.982 | 0.892 | 0.908 | 0.889 | 0.887 | 0.875 | 0.894 | 0.921 | 0.934 | 0.919 | 0.923 | 0.571 | 0.663 |
LIQE (151M) | 0.970 | 0.951 | - | - | 0.930 | 0.931 | 0.904 | 0.910 | 0.919 | 0.908 | - | - | - | - |
Re-IQA (24M) | 0.970 | 0.971 | 0.804 | 0.861 | 0.872 | 0.885 | 0.840 | 0.854 | 0.914 | 0.923 | 0.918 | 0.925 | - | - |
QPT-ResNet50 (24M) | - | - | - | - | - | - | 0.895 | 0.914 | 0.927 | 0.941 | 0.925 | 0.928 | 0.575 | 0.675
LoDa (9M) | 0.975 | 0.979 | 0.869 | 0.901 | 0.931 | 0.936 | 0.876 | 0.899 | 0.932 | 0.944 | 0.925 | 0.928 | 0.578 | 0.679 |
Table I: Performance comparison measured by medians of SRCC and PLCC, where bold entries indicate the top two results.
IV Experiments
IV-A Experimental Setting
IV-A1 Datasets
Our method is evaluated on seven classical IQA datasets, including three synthetic datasets, LIVE [31], TID2013 [32], and KADID-10k [33], and four authentic datasets, LIVEC [34], KonIQ-10k [4], SPAQ [35], and FLIVE [1]. The synthetic datasets contain a small number of pristine images that are synthetically distorted by various distortion types, such as JPEG compression and Gaussian blurring. LIVE contains 799 synthetically distorted images with 5 distortion types. TID2013 and KADID-10k consist of 3,000 and 10,125 synthetically distorted images involving 24 and 25 distortion types, respectively. For the authentic datasets, LIVEC consists of 1,162 images with diverse authentic distortions captured by mobile devices. KonIQ-10k contains 10,073 images selected from YFCC-100M, covering a wide and uniform range of distortions such as brightness, colorfulness, contrast, noise, and sharpness. SPAQ consists of 11,125 images captured by different mobile devices, covering a large variety of scene categories. FLIVE is the largest in-the-wild IQA dataset so far, containing 39,810 real-world images with diverse contents, sizes, and aspect ratios.
IV-A2 Evaluation Criteria
Spearman’s rank order correlation coefficient (SRCC) and Pearson’s linear correlation coefficient (PLCC) are employed to measure prediction monotonicity and prediction accuracy, respectively. Higher values indicate better performance. For PLCC, a logistic regression correction is also applied according to VQEG [36].
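For reference, the two criteria can be computed as sketched below with SciPy; the four-parameter logistic used for the PLCC correction is a common VQEG-style choice and may differ in detail from the fitting used in the paper.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr
from scipy.optimize import curve_fit

def evaluate(pred: np.ndarray, mos: np.ndarray):
    """SRCC on raw predictions; PLCC after a four-parameter logistic remapping."""
    srcc = spearmanr(pred, mos)[0]

    def logistic(x, a, b, c, d):
        return (a - b) / (1.0 + np.exp(-(x - c) / np.abs(d))) + b

    try:
        popt, _ = curve_fit(logistic, pred, mos,
                            p0=[mos.max(), mos.min(), pred.mean(), 1.0], maxfev=10000)
        pred_mapped = logistic(pred, *popt)
    except RuntimeError:               # fall back to raw predictions if the fit fails
        pred_mapped = pred
    plcc = pearsonr(pred_mapped, mos)[0]
    return srcc, plcc

pred = np.array([0.2, 0.5, 0.9, 0.4, 0.7])
mos = np.array([0.1, 0.6, 0.8, 0.5, 0.65])
print(evaluate(pred, mos))
```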
Training | FLIVE | FLIVE | LIVEC | KonIQ
---|---|---|---|---
Testing | KonIQ | LIVEC | KonIQ | LIVEC
DBCNN | 0.716 | 0.724 | 0.754 | 0.755 |
P2P-BM | 0.755 | 0.738 | 0.740 | 0.770 |
HyperIQA | 0.758 | 0.735 | 0.772 | 0.785 |
TReS | 0.713 | 0.740 | 0.733 | 0.786 |
DEIQT | 0.733 | 0.781 | 0.744 | 0.794 |
LoDa | 0.763 | 0.805 | 0.745 | 0.811 |
Table II: SRCC in cross-dataset validation. The best performance is highlighted in bold.
IV-A3 Implementation Details
We implement the model in PyTorch and conduct training and testing on an NVIDIA RTX 4090 GPU. We resize the smaller edge of each image to 384, randomly crop the input image into multiple fixed-resolution image patches, and randomly apply horizontal and vertical flips to increase the amount of training data [2]. In particular, the number of patches for training depends on the size of the dataset, i.e., 1 for FLIVE, 3 for KonIQ-10k, and 5 for LIVEC; the number of patches for testing is 15 for all datasets; and patches inherit the quality scores of their source image. We build our model on a ViT-B pretrained on ImageNet-21k. We use a ResNet-50 [15] pretrained on ImageNet-1k as the CNN backbone and extract the feature maps of its last four blocks as multi-scale features, which are average-pooled to a small fixed spatial size. The dimension after the down projection is 64 and the number of heads used for cross-attention is 4.
We use the AdamW optimizer with a weight decay of 0.01 and a mini-batch size of 128. The learning rate is initialized to 0.0003 and decayed with a cosine annealing schedule. All experiments are trained for 10 epochs, and by default we report the evaluation of the last epoch. For each dataset, 80% of the images are used for training and the remaining 20% for testing. We repeat this process 10 times for all experiments to mitigate performance bias and report the medians of SRCC and PLCC.
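A sketch of this optimization setup is shown below; `model`, `train_loader`, the model output shape, and the per-epoch scheduler step are assumptions about details not fully specified above, and `plcc_loss` refers to the loss sketched in Section III-D.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, train_loader, epochs: int = 10, device: str = "cuda"):
    # Only the extractor, injector, and head keep requires_grad=True; the frozen
    # backbones are therefore excluded from the optimizer.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = AdamW(trainable, lr=3e-4, weight_decay=0.01)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    model.to(device).train()
    for _ in range(epochs):
        for patches, mos in train_loader:              # mini-batches of 128 cropped patches
            pred = model(patches.to(device)).squeeze(-1)
            loss = plcc_loss(pred, mos.to(device))     # PLCC-induced loss, Eqn. (5)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                               # cosine decay, stepped per epoch
```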
IV-B Comparisons with the State-of-the-art Methods
The performance comparison with the state-of-the-art (SOTA) BIQA methods is shown in Table I. Our model outperforms or is on par with the existing SOTA methods [37, 38, 39, 2, 40, 41, 1, 11, 12, 24, 13, 42, 3, 14] on these datasets of both synthetically and authentically distorted images. Since images in the various datasets span a wide variety of image contents and distortion types, it is still challenging to consistently achieve the leading performance on all of them.
Pre-train | KADID-10k SRCC | KADID-10k PLCC | KonIQ-10k SRCC | KonIQ-10k PLCC
---|---|---|---|---
MAE | 0.917 | 0.924 | 0.927 | 0.938 |
Multi-Modal 多模态 | 0.897 | 0.902 | 0.909 | 0.923 |
ImageNet-1K | 0.912 | 0.920 | 0.920 | 0.933 |
ImageNet-21K | 0.931 | 0.936 | 0.932 | 0.944 |
Table III: Effect of large-scale pretrained models obtained with different pretraining methods and datasets.
Specifically, ours surpasses traditional methods (e.g., ILNIQE [37] and BRISQUE [38]) and earlier learning-based methods (e.g., TIQA [40] and HyperIQA [11]) by a large margin. Compared with LIQE [42], which utilizes a large-scale pretrained vision-language model, multi-task labels, and full fine-tuning on multiple datasets simultaneously, LoDa still performs better on both the large synthetic and authentic datasets, i.e., KADID-10k and KonIQ-10k. Compared with current SOTA methods that require extra pretraining (e.g., DEIQT [13], Re-IQA [3], and QPT-ResNet50 [14]), LoDa obtains competitive or higher results, showing the powerful effectiveness of adapting large-scale pretrained models. Correspondingly, the top performance on the largest synthetic dataset, KADID-10k, confirms the superiority of fusing the multi-scale distortion features from the CNN into the ViT model.
IV-C Cross-Dataset Evaluation
We further compare the generalizability of LoDa against competitive BIQA models in a cross-dataset setting following [13]. Training is performed on one specific dataset, and testing is performed on a different dataset without any fine-tuning or parameter adaptation. The experimental results, in terms of the medians of SRCC on four cross-dataset settings, are reported in Table II. As observed, LoDa achieves the best performance in most settings. These results manifest the strong generalization capability of LoDa.
Backbone | KADID-10k SRCC | KADID-10k PLCC | KonIQ-10k SRCC | KonIQ-10k PLCC
---|---|---|---|---
ViT-T | 0.892 | 0.900 | 0.914 | 0.926 |
ViT-S | 0.915 | 0.922 | 0.928 | 0.939 |
ViT-B | 0.931 | 0.936 | 0.932 | 0.944 |
Table IV: Effect of the size of the large-scale pretrained model.
Fine-tuning Methods | KADID-10K SRCC | KADID-10K PLCC | KonIQ-10K SRCC | KonIQ-10K PLCC
---|---|---|---|---
ViT (Linear Probe) | 0.676 | 0.701 | 0.796 | 0.833
ViT (Full fine-tune) | 0.889 | 0.899 | 0.874 | 0.891
Adapter-ViT | 0.914 | 0.920 | 0.926 | 0.939 |
LoRA-ViT | 0.913 | 0.921 | 0.921 | 0.934 |
VPT-ViT | 0.889 | 0.900 | 0.919 | 0.932 |
LoDa | 0.931 | 0.936 | 0.932 | 0.944 |
Table V: Comparison with different fine-tuning methods.


Figure 4: Fourier analysis of ViT and LoDa features. (a) Fourier spectrum of ViT and LoDa. (b) Relative log amplitudes of Fourier-transformed feature maps. Both (a) and (b) show that LoDa captures more high-frequency signals.
IV-D Effectiveness of Large-scale Pretrained Models
One of the key properties of our proposed method is the efficient adaptation of large-scale pretrained models, which allows our model to achieve competitive performance against SOTA BIQA methods. To demonstrate the effectiveness of using large-scale pretrained models in our proposed model, we employ different pretrained weights, including ImageNet-1K pretrained weights [7], ImageNet-21K pretrained weights [7], MAE pretrained weights [43], and multi-modal pretrained weights [44], and evaluate them on the relatively large synthetic and authentic datasets KADID-10k and KonIQ-10k. The experimental results are detailed in Table III. The transition from weights pretrained on ImageNet-1K to those pretrained on ImageNet-21K yields more benefits for our model, as the scale of the pretraining data increases substantially. Besides, while MAE also employs ImageNet-1K pretraining, it distinguishes itself from supervised ImageNet-1K pretraining by embracing a more potent self-supervised pretraining approach, which also confers substantial advantages upon our model. However, our model faces challenges in effectively leveraging multi-modal pretrained weights. One plausible explanation is that multi-modal pretrained models may prioritize the abstract concepts inherent within images, a focus that diverges from the demands of IQA tasks. Since multi-modal pretrained weights contain more information than single-modal ones, how to apply these models to IQA tasks remains an important topic, and we will conduct further research on this.
Moreover, the parameter capacity of large-scale pretrained models is another essential component of our method. To verify the effect of the pretrained model size, we evaluate LoDa with ViT-Tiny/Small/Base, all pretrained on ImageNet-21k. Quantitative results are shown in Table IV. We observe that as the pretrained backbone grows, our model benefits from it and achieves better performance. In particular, solely employing ViT-S as the backbone, our method can achieve performance on par with the SOTA shown in Table I, which further shows the effectiveness of our method.
Module | KADID-10k SRCC | KADID-10k PLCC | KonIQ-10k SRCC | KonIQ-10k PLCC
---|---|---|---|---
ViT | 0.889 | 0.899 | 0.874 | 0.891 |
ViT + Extractor | 0.915 | 0.921 | 0.925 | 0.936
LoDa | 0.931 | 0.936 | 0.932 | 0.944 |
Table VI: Ablation experiments on the KADID-10K and KonIQ-10K datasets. Bold entries indicate the best results.
IV-E Comparisons with Different Fine-tuning Methods
At present, numerous efficient model adaptation methods for large-scale pretrained vision models have emerged; Adapter [26], LoRA [45], and visual prompt tuning (VPT) [20] stand as exemplars. To demonstrate the effectiveness of our proposed method, we apply linear probing (fine-tuning only the head of the ViT), full fine-tuning, Adapter-ViT, LoRA-ViT, and VPT-ViT to the IQA task, and compare them with our method on the KADID-10K and KonIQ-10k datasets. The experimental results are detailed in Table V. Our model outperforms all of these fine-tuning methods on KADID-10K and KonIQ-10K, especially on KADID-10K, which shows the effectiveness and superiority of fusing CNN multi-scale distortion features into the ViT.
IV-F Ablation Study
IV-F1 Effect of CNN Features.
Recent research [30] highlights the distinct characteristics exhibited by ViT and CNN. Specifically, it demonstrates that ViT is adept at learning low-frequency global signals, whereas CNN exhibits a propensity for extracting high-frequency information. Following previous work [30], we visualize the Fourier analysis of features of ViT and our model (averaged over 128 images) in Figure 4. From the Fourier spectrum and the relative log amplitudes of Fourier-transformed feature maps, we can deduce that our model captures more high-frequency signals than the full-finetuned ViT baseline. From Table V, we can also observe that our model outperforms the full-finetuned ViT baseline by a large margin. This enhanced capability can be attributed to the incorporation of fused multi-scale distortion features extracted by the CNN.
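The analysis can be reproduced roughly as sketched below: the ViT image tokens are reshaped into spatial feature maps, Fourier-transformed, and summarized by their relative log amplitude. The normalization and averaging details are assumptions following the general methodology of [30].

```python
import torch

def fourier_log_amplitude(features: torch.Tensor) -> torch.Tensor:
    """Sketch of the Fourier analysis: 2D FFT of token feature maps, shifted so the
    zero frequency is centred, log amplitude averaged over channels and samples.
    `features` holds ViT image tokens reshaped to (B, D, H, W), e.g. (B, 768, 14, 14)."""
    spec = torch.fft.fft2(features.float())
    spec = torch.fft.fftshift(spec, dim=(-2, -1))
    log_amp = torch.log(spec.abs() + 1e-6)
    log_amp = log_amp.mean(dim=(0, 1))            # average over images and channels
    # Relative log amplitude: subtract the value at the zero-frequency centre,
    # so curves from different models are comparable (as in Figure 4(b)).
    h, w = log_amp.shape
    return log_amp - log_amp[h // 2, w // 2]

tokens = torch.randn(128, 197, 768)               # ViT tokens for 128 images (CLS + 14x14)
maps = tokens[:, 1:, :].transpose(1, 2).reshape(128, 768, 14, 14)
print(fourier_log_amplitude(maps).shape)          # torch.Size([14, 14])
```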
IV-F2 Ablation for Components.
Our model is composed of three essential components: the large-scale pretrained ViT, the local distortion extractor, and the local distortion injector. To examine the individual contribution of each component, we report ablation experiments in Table VI. From this table, we observe that both the local distortion extractor and the local distortion injector are highly effective in characterizing image quality and thus contribute to the overall performance of LoDa. In particular, even without the local distortion injector, simply adding the multi-scale distortion tokens to the ViT tokens still outperforms the full-finetuned ViT, demonstrating the effectiveness of adapting large-scale pretrained models and of the extracted multi-scale distortion features.
V Conclusion
In this paper, we present LOcal Distortion Aware efficient transformer adaptation (LoDa) for image quality assessment, which utilizes large-scale pretrained models. Given that IQA is highly reliant on both local and non-local dependencies, while ViT primarily captures the non-local aspects of images and overlooks local details, we propose to integrate a CNN for extracting multi-scale distortion features and inject them into the ViT. However, since ViT operates on patches of images, directly adding these multi-scale distortion features to the ViT tokens may suffer from scale misalignment. Thus, we propose to utilize a cross-attention mechanism to let the ViT tokens query related features from the multi-scale distortion features and then combine them. Experiments on seven standard datasets demonstrate the superiority of LoDa in terms of prediction accuracy, training efficiency, and generalization capability.
VI Appendix
VI-A Introduction
This supplementary material presents: (1) additional experimental analysis and quantitative results of the ablation study of LoDa; (2) more visualization of Fourier analysis of vision transformer (ViT) [28] and LoDa.
VI-B More Ablation Study and Discussion
VI-B1 Evaluation Metrics
The detailed definitions of the two performance metrics (i.e., SRCC, PLCC) we use in this paper are as follows:
$$\mathrm{SRCC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}, \tag{6}$$

where $N$ is the number of distorted images, and $d_i$ is the rank difference between the ground-truth quality score and the predicted score of image $i$.
$$\mathrm{PLCC} = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}\sqrt{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}}, \tag{7}$$

where $\bar{y}$ and $\bar{\hat{y}}$ denote the means of the ground-truth and predicted scores, respectively.
VI-B2 Latent Dimension
Due to the potentially overwhelming number of parameters and computational overhead caused by the large channel dimension of ViT [28] (768 for ViT-B), and inspired by the concept of adapters in the field of NLP [26], we propose to down-project the ViT tokens and multi-scale distortion tokens to a smaller latent dimension $d$. We study the effect of the latent dimension $d$ on the KonIQ-10k [4] and KADID-10k [33] datasets. Results are shown in Table VII. From the table, we can observe that on the KonIQ-10k dataset our model is only slightly affected by the latent dimension $d$, and on the KADID-10k dataset our model performs best when $d$ is 64. Therefore, we empirically set $d$ to 64 by default.
Latent dimension $d$ | KADID-10k SRCC | KADID-10k PLCC | KonIQ-10k SRCC | KonIQ-10k PLCC
---|---|---|---|---
48 | 0.929 | 0.934 | 0.934 | 0.945 |
64 | 0.931 | 0.936 | 0.932 | 0.944 |
80 | 0.923 | 0.928 | 0.933 | 0.945 |
VI-B3 Number of Heads in Cross-attention
We run an ablation study on different numbers of heads in cross-attention when the latent dimension $d$ is set to 64. As shown in Table VIII, when the latent dimension is fixed, the number of heads in cross-attention has little effect on our model. Thus, we set the number of heads to four, with which our model performs slightly better on the KADID-10k dataset.
Number of heads | KADID-10k SRCC | KADID-10k PLCC | KonIQ-10k SRCC | KonIQ-10k PLCC
---|---|---|---|---
2 | 0.929 | 0.934 | 0.932 | 0.944 |
4 | 0.931 | 0.936 | 0.932 | 0.944 |
8 | 0.929 | 0.935 | 0.933 | 0.944 |


VI-B4 Number of Interactions
In the paper, we empirically fuse the ViT tokens with the multi-scale distortion tokens in all of the ViT encoder layers. However, this choice is the result of our empirical decision-making process; in fact, we can fuse them in only part of the ViT encoder layers. Thus, we run an ablation study with different numbers of interactions on the KonIQ-10k and KADID-10k datasets. In this ablation study, we distribute the $L$ ViT encoder layers into $n$ blocks, with each block containing $L/n$ encoder layers, where $L$ denotes the total number of encoder layers. Then, we fuse the ViT tokens with the multi-scale distortion tokens only once in each block instead of in each layer (a short sketch of this block-wise layer selection is given after the table below). Results are shown in Table IX. It can be observed that our model's performance improves with an increased number of interactions. Notably, even with just half of the interactions, our model yields excellent performance.
Number of interactions | KADID-10k SRCC | KADID-10k PLCC | KonIQ-10k SRCC | KonIQ-10k PLCC
---|---|---|---|---
3 | 0.923 | 0.929 | 0.929 | 0.941 |
6 | 0.927 | 0.933 | 0.932 | 0.943 |
12 | 0.931 | 0.936 | 0.932 | 0.944 |
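As a sketch of the block-wise layer selection used in this ablation (the exact indexing scheme is an assumption), the injector can be applied only at the first layer of each block:

```python
def injection_layers(num_layers: int = 12, num_blocks: int = 12):
    """Indices of the encoder layers that receive an injection when the L layers
    are split into n blocks (one interaction per block)."""
    block_size = num_layers // num_blocks
    return [i for i in range(num_layers) if i % block_size == 0]

print(injection_layers(12, 3))   # [0, 4, 8]        -> 3 interactions
print(injection_layers(12, 12))  # [0, 1, ..., 11]  -> 12 interactions (default)
```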
VI-C Visualization of Fourier Analysis of Vision Transformer and LoDa
In the paper, we show the Fourier analysis of features of ViT and LoDa on the KADID-10k dataset; here we additionally show the Fourier analysis of the full-finetuned ViT and LoDa on the KonIQ-10k dataset (averaged over 128 images) in Figure 5. We observe the same trend on the KonIQ-10k dataset, which further demonstrates that LoDa captures more high-frequency signals and shows the effectiveness of our proposed method.
References
- [1] Z. Ying, H. Niu, P. Gupta, D. Mahajan, D. Ghadiyaram, and A. Bovik, “From patches to pictures (paq-2-piq): Mapping the perceptual space of picture quality,” in CVPR, 2020, pp. 3575–3585.
- [2] W. Zhang, K. Ma, J. Yan, D. Deng, and Z. Wang, “Blind image quality assessment using a deep bilinear convolutional neural network,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 1, pp. 36–47, 2020.
- [3] A. Saha, S. Mishra, and A. C. Bovik, “Re-iqa: Unsupervised learning for image quality assessment in the wild,” CoRR, vol. abs/2304.00451, 2023.
- [4] V. Hosu, H. Lin, T. Sziranyi, and D. Saupe, “Koniq-10k: An ecologically valid database for deep learning of blind image quality assessment,” IEEE Transactions on Image Processing, vol. 29, pp. 4041–4056, 2020.
- [5] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in ACL, 2020, pp. 7871–7880.
- [6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, and et al., “Language models are few-shot learners,” in NeurIPS, 2020.
- [7] A. Steiner, A. Kolesnikov, X. Zhai, R. Wightman, J. Uszkoreit, and L. Beyer, “How to train your vit? data, augmentation, and regularization in vision transformers,” Trans. Mach. Learn. Res., vol. 2022, 2022.
- [8] T. Ridnik, E. B. Baruch, A. Noy, and L. Zelnik, “Imagenet-21k pretraining for the masses,” in NeurIPS, J. Vanschoren and S.-K. Yeung, Eds., 2021.
- [9] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, and et al., “Learning transferable visual models from natural language supervision,” in ICML, vol. 139, 2021, pp. 8748–8763.
- [10] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, and et al., “Palm-e: An embodied multimodal language model,” CoRR, vol. abs/2303.03378, 2023.
- [11] S. Su, Q. Yan, Y. Zhu, C. Zhang, X. Ge, J. Sun, and Y. Zhang, “Blindly assess image quality in the wild guided by a self-adaptive hyper network,” in CVPR, 2020, pp. 3664–3673.
- [12] J. Ke, Q. Wang, Y. Wang, P. Milanfar, and F. Yang, “Musiq: Multi-scale image quality transformer,” in ICCV, 2021, pp. 5128–5137.
- [13] G. Qin, R. Hu, Y. Liu, X. Zheng, H. Liu, X. Li, and Y. Zhang, “Data-efficient image quality assessment with attention-panel decoder,” in AAAI, 2023, pp. 2091–2100.
- [14] K. Zhao, K. Yuan, M. Sun, M. Li, and X. Wen, “Quality-aware pre-trained models for blind image quality assessment,” CoRR, vol. abs/2303.00521, 2023.
- [15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
- [16] H. Pham, Z. Dai, Q. Xie, and Q. V. Le, “Meta pseudo labels,” in CVPR, 2021, pp. 11 557–11 568.
- [17] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, “Coca: Contrastive captioners are image-text foundation models,” Trans. Mach. Learn. Res., vol. 2022, 2022.
- [18] M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, and et al., “Scaling vision transformers to 22 billion parameters,” CoRR, vol. abs/2302.05442, 2023.
- [19] C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, and et al., “A comprehensive survey on pretrained foundation models: A history from bert to chatgpt,” CoRR, vol. abs/2302.09419, 2023.
- [20] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. J. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” in ECCV, 2022, pp. 709–727.
- [21] Y. Zhang, K. Zhou, and Z. Liu, “Neural prompt search,” CoRR, vol. abs/2206.04673, 2022.
- [22] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proc. IEEE, vol. 109, no. 1, pp. 43–76, 2021.
- [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017, pp. 5998–6008.
- [24] S. A. Golestaneh, S. Dadsetan, and K. M. Kitani, “No-reference image quality assessment via transformers, relative ranking, and self-consistency,” in WACV, 2022, pp. 3989–3999.
- [25] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. H. Tay, J. Feng, and S. Yan, “Tokens-to-token vit: Training vision transformers from scratch on imagenet,” in ICCV, 2021, pp. 538–547.
- [26] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” in ICML, 2019, pp. 2790–2799.
- [27] R. Hu, Y. Liu, K. Gu, X. Min, and G. Zhai, “Toward a no-reference quality metric for camera-captured images,” IEEE Trans. Cybern., vol. 53, no. 6, pp. 3651–3664, 2023.
- [28] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, and et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021.
- [29] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, “Scaling vision transformers,” in CVPR, 2022, pp. 1204–1213.
- [30] C. Si, W. Yu, P. Zhou, Y. Zhou, X. Wang, and S. Yan, “Inception transformer,” in NeurIPS, 2022.
- [31] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE Transactions on image processing, vol. 15, no. 11, pp. 3440–3451, 2006.
- [32] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, and et al., “Image database tid2013: Peculiarities, results and perspectives,” Signal processing: Image communication, vol. 30, pp. 57–77, 2015.
- [33] H. Lin, V. Hosu, and D. Saupe, “Kadid-10k: A large-scale artificially distorted iqa database,” in 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), 2019.
- [34] D. Ghadiyaram and A. C. Bovik, “Massive online crowdsourced study of subjective and objective picture quality,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 372–387, 2015.
- [35] Y. Fang, H. Zhu, Y. Zeng, K. Ma, and Z. Wang, “Perceptual quality assessment of smartphone photography,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3677–3686.
- [36] VQEG, “Final report from the video quality experts group on the validation of objective models of video quality assessment,” Tech. Rep., 2000.
- [37] L. Zhang, L. Zhang, and A. C. Bovik, “A feature-enriched completely blind image quality evaluator,” IEEE Trans. Image Process., vol. 24, no. 8, pp. 2579–2591, 2015.
- [38] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-Reference Image Quality Assessment in the Spatial Domain,” IEEE Trans. Image Process., vol. 21, no. 12, pp. 4695–4708, 2012.
- [39] S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, “Deep neural networks for no-reference and full-reference image quality assessment,” IEEE Trans. Image Process., vol. 27, no. 1, pp. 206–219, 2018.
- [40] J. You and J. Korhonen, “Transformer for image quality assessment,” in ICIP, 2021, pp. 1389–1393.
- [41] H. Zhu, L. Li, J. Wu, W. Dong, and G. Shi, “Metaiqa: Deep meta-learning for no-reference image quality assessment,” in CVPR 2020, 2020, pp. 14 131–14 140.
- [42] W. Zhang, G. Zhai, Y. Wei, X. Yang, and K. Ma, “Blind image quality assessment via vision-language correspondence: A multitask learning perspective,” CoRR, vol. abs/2303.14968, 2023.
- [43] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. B. Girshick, “Masked autoencoders are scalable vision learners,” in CVPR, 2022, pp. 15 979–15 988.
- [44] M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Reproducible scaling laws for contrastive language-image learning,” CoRR, vol. abs/2212.07143, 2022.
- [45] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” in ICLR. OpenReview.net, 2022.