Introduction
AI-Generated Content (AIGC) refers to all types of content generated by artificial intelligence technology. As vision is the most important way for humans to perceive external information, AI-Generated Images (AGIs), especially Text-to-Image (T2I) generation, have become one of the most representative forms of AIGC [1]. With the rapid advancement of visual computing and networking, a huge variety of AGI models has emerged, which can be divided into the following three types [2]. Generative Adversarial Network (GAN)-based models [3], such as the text-conditional GAN series [4], [5], [6], are the earliest end-to-end AGI models, mapping from the character level to the pixel level. Since then, AGI has branched into two technical routes, namely auto-regressive (AR)-based models [7], [8], [9] such as CogView, and diffusion-based models such as Stable-Diffusion (SD) [10], [11], [12]. According to the statistics [1], [2], at least 20 representative T2I AGI models had been released by 2023.
With such a large number of models, the quality of AGIs also varies widely [2]. Firstly, across different models, GAN-based models generate AGIs with the worst quality, AR-based models are comparatively better, and diffusion-based models produce the best results overall. Furthermore, the quality of AGIs generated by the same model can still vary greatly: the amount of training data, the number of training iterations, and the design of the prompts all have a huge impact on the generated results. Considering the great variance of T2I AGI content, how to fairly evaluate their quality becomes a pivotal question. Since the receiving end of AGIs is the Human Visual System (HVS), subjective assessment is the most direct and reliable way to quantify their quality. A fine-grained, comprehensive subjective quality experiment not only helps to understand the perception mechanism for AGIs but also lays the foundation for assessing, comparing, and optimizing AGI models.
However, large-scale subjective quality assessment experiments face two main practical challenges. Firstly, given the great diversity of T2I AGI models and the differences among AGIs themselves, it is difficult to provide fine-grained subjective scores for numerous generative models under different inputs. Therefore, the selection of generative models, parameter configurations, and input prompts needs to represent as wide a range of AGIs as possible under a limited data scale. Secondly, considering the different properties of AGIs and Natural Scene Images (NSIs), a solid subjective evaluation standard must be established. Several standards exist [13], [14], but there is no unified agreement on the dimensions of AGI quality or on the specific information each dimension covers. Therefore, we walk through previous AGI models, including the models themselves, their parameters, and their prompts; set up a comprehensive suite of subjective testing methods; and establish the most comprehensive fine-grained, multi-dimensional AGI quality database so far, namely AGIQA-3K, as shown in Fig. 1. On this database, we can further explore generation and quality models for AGIs, so as to optimize the human perceptual experience of AGIs. The main contributions of our work include:
A large-scale AGI database consisting of 2,982 AGIs generated by 6 different models. This is the first database that covers AGIs from GAN/AR/diffusion-based models altogether. Meanwhile, the input prompts and internal parameters of the AGI models have been carefully designed and adjusted.
A fine-grained subjective experiment carried out in a standardized laboratory environment. We collected the Mean Opinion Score (MOS) to annotate images in detail from both the perception and T2I alignment dimensions, and compared the results of different AGI models in each dimension.
A benchmark experiment evaluating the performance of current perceptual quality and T2I alignment assessment metrics. Moreover, we propose StairReward, an alignment metric that improves existing alignment assessment results across different AGI models.
Related Work
A. AGI Quality Metric
For AGIs' quality assessment, perceptual quality and T2I alignment [2] have always been the two major components. From the perspective of perceptual quality, the Inception Score (IS) [18] is the earliest quality criterion, computed from the uniformity of the features of a set of AGIs. Subsequently, methods such as the Fréchet Inception Distance (FID) [19] and the Kernel Inception Distance (KID) [20] appeared, which use the distance between a group of AGIs and a group of NSIs to represent perceptual quality. However, the above methods are usually only suitable for evaluating the quality of a group of images (e.g., the performance of an AGI model) or style transfer [21], and are unsuitable for evaluating the perceptual quality of a single image. Therefore, for the quality of a single AGI, Image Quality Assessment (IQA) [22] methods are usually used. However, given the complexity of AGI models and the diversity of factors involved, the factors affecting the quality of AGIs [13] differ from those of NSIs [23], [24], [25], [26] and Screen Content Images (SCIs) [27], [28], [29]. Thus, the reliability of IQA measures is also limited.
When it comes to T2I alignment, several metrics represented by Contrastive Language-Image Pre-Training (CLIP) [30], [31], [32] are widely applied. These metrics link text with images, and are traded off against quality metrics to provide guidance for image generation. Unfortunately, because training such alignment models is extremely difficult, most users can only load their pre-trained parameters and fine-tune them on small-scale databases. Therefore, for the diverse morpheme composition of the prompts in an AGI database, the consistency between alignment results and human subjective ratings still needs to be improved.
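For reference, the following is a minimal sketch of how such a CLIP-based alignment score is typically computed, here with the Hugging Face transformers implementation; the checkpoint name, image path, and prompt are illustrative placeholders rather than the exact setup used by the cited metrics:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load pre-trained parameters (as noted above, most users only load
# such checkpoints rather than training an alignment model themselves).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between image and text embeddings, in [-1, 1]."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# Hypothetical usage on one AGI and its generating prompt.
score = clip_alignment_score(Image.open("agi.png"),
                             "a lighthouse on a cliff, oil painting")
```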
B. AGI Quality Database
The popularity of the T2I AGI model in recent years has spawned several related databases as shown in Tab. I.
DiffusionDB [15] is the earliest database for AGIs, including more than 1.8 million text-image pairs generated by the Stable-Diffusion model. Although it has no subjective scoring, its large number of images and prompts laid the foundation for subsequent subjective databases.
AGIQA-1K [13] is the first subjective database for perceptual AGI quality assessment that conducts fine-grained scoring through MOS. Its input contains only 180 prompts, and these prompts are just simple combinations of image labels from the real world, which makes it difficult to represent a wide range of AGIs.
Pick-A-Pic [16] and HPS [17] further expand the scale of images and prompts by crawling the results generated by Stable-Diffusion on the Discord website or by directly reusing the text-image pairs in DiffusionDB, and assign subjective scores. However, using only diffusion-based models limits how well they represent the variety of AGIs. Moreover, they combine perception and alignment into a single overall score, which fails to characterize AGI quality along multiple dimensions.
ImageReward [14] is a well-labeled AGI quality database. For image generation, in addition to four diffusion-based models, it also considers an AR-based model. It also performs a better subjective test by scoring the AGIs from 0 to 7 and manually labeling Not-Safe-For-Work (NSFW) content to avoid unsafe generation. However, the absence of GAN-based models and the use of only four strong diffusion-based models lead to insufficient coverage of AGI quality; moreover, the scoring granularity is discrete and each picture is scored by only one person, so such coarse-grained scores cannot accurately characterize AGI quality. Meanwhile, NSFW is not the only kind of unsafe content according to previous definitions of responsible AI [33], which also cover social problems and deepfakes.
Given the problems mentioned above, a fine-grained AGI quality database covering both perception and alignment is needed, accompanied by abundant indicators for flagging unsafe content. This motivates us to build a new database for AGI perception and generation, which aims to cover more AGI models with different performances/parameters and to provide more accurate quality results by further refining the scoring granularity.
Database Construction
A. AGI Model Collection
To ensure content diversity, our AGIQA-3K database considers six representative generative models. Referring to the previous classification, and considering that diffusion-based models achieve the best overall generation results on the T2I AGI task and are the most widely used, we selected four diffusion-based models for image generation: the earliest GLIDE [10], the most popular Stable Diffusion V-1.5 [11] together with its latest upgraded Beta version Stable Diffusion XL-2.2 [12], and Midjourney [35]1, the model rated highest by the user community. At the same time, to cover the other two types of models, we used the most popular framework of each, namely AttnGAN [6] representing GAN-based models and DALLE2 [34]2 as the AR-based model. Fig. 2 shows the outputs of these six AGI models. In addition, to study the relationship between a model's internal parameters and its AGIs, we set the Classifier-Free Guidance (CFG) scale of the AGI model to 0.5 and 2 times the default value to explore how the trade-off between perception and alignment affects the generation result; meanwhile, we set the number of iterations to half the default value to simulate the distortion of AGIs when iterations are insufficient. These two parameter adjustments are performed on the most widely used models, Stable Diffusion [11] and Midjourney [35], respectively. It can be seen that the AGIQA-3K database uses different models from different periods, effectively representing the wide quality range of AGIs since the birth of the T2I AGI model.
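As an illustration of this kind of parameter adjustment (a sketch, not the exact generation script used for AGIQA-3K), the diffusers implementation of Stable Diffusion exposes both knobs directly; the checkpoint id, prompt, and default values below are assumptions:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a castle on a hill, baroque style"  # placeholder prompt
default_cfg, default_steps = 7.5, 50          # assumed defaults

# Default setting, plus the three perturbed settings studied above:
# 0.5x / 2x CFG (perception-alignment trade-off) and half the iterations.
for cfg, steps in [(default_cfg, default_steps),
                   (0.5 * default_cfg, default_steps),
                   (2.0 * default_cfg, default_steps),
                   (default_cfg, default_steps // 2)]:
    image = pipe(prompt, guidance_scale=cfg,
                 num_inference_steps=steps).images[0]
    image.save(f"sd_cfg{cfg}_step{steps}.png")
```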
To assess the statistical difference between NSIs and AGIs, we compare the distributions of five quality-related attributes. The NSIs were obtained from the in-the-wild KonIQ-10k IQA database [36], while the AGIs were collected from the previous AGIQA-1K [13] database and the proposed AGIQA-3K database. The quality-related attributes under consideration are lighting, contrast, color, blur, and spatial information (SI). 'Color' indicates the colorfulness of the images and SI stands for the content diversity of the images. A detailed description of these attributes can be found in [38]. As shown in Fig. 3, there is a noticeable difference between NSIs and AGIs in the blur distribution curve, because AGIs sometimes encounter insufficient iterations during the generation process. The frequent occurrence of blur shifts the center of the blur distribution curve to the left. Compared with AGIQA-1K, our AGIQA-3K further adds data with insufficient model iterations, making the distortion distribution curve sharper. Except for blur, the similarity of the distributions of the four other quality-related attributes between NSIs and AGIs proves the plausibility of our AGI database.
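Two of these attributes can be stated concretely. Assuming the standard formulations (SI as the standard deviation of the Sobel gradient magnitude, as in ITU-T P.910, and the Hasler-Süsstrunk colorfulness metric; [38] may use slightly different variants), a minimal sketch is:

```python
import numpy as np
from scipy import ndimage

def spatial_information(gray: np.ndarray) -> float:
    """SI: std of the Sobel gradient magnitude over an HxW grayscale array."""
    g = gray.astype(np.float64)
    gx = ndimage.sobel(g, axis=1)  # horizontal gradient
    gy = ndimage.sobel(g, axis=0)  # vertical gradient
    return float(np.sqrt(gx ** 2 + gy ** 2).std())

def colorfulness(rgb: np.ndarray) -> float:
    """Hasler-Suesstrunk colorfulness on an HxWx3 RGB array."""
    r, g, b = [rgb[..., i].astype(np.float64) for i in range(3)]
    rg, yb = r - g, 0.5 * (r + g) - b
    return float(np.sqrt(rg.std() ** 2 + yb.std() ** 2)
                 + 0.3 * np.sqrt(rg.mean() ** 2 + yb.mean() ** 2))
```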
B. AGI Prompt Collection
Under the requirement of fine-grained scoring, the AGIQA-3K database cannot perform image generation and scoring on more than ten thousand prompts like the previous coarse-grained databases [16], [17]. Therefore, how to use relatively few prompts to cover a large number of real user inputs is a key issue in the prompt collection process. With so few prompts available, directly extracting a subset from real inputs would inevitably be one-sided. Facing this challenge, AGIQA-1K [13] extracted high-frequency words from the Internet and created 180 human-designed prompts. However, its high-frequency words are not directly derived from the prompt inputs of AGI tasks, and the combinations of high-frequency words also differ from the common prompt input format.
Therefore, the prompts in our AGIQA-3K database apply a 'real' + 'human-designed' mechanism that uses real prompts in AGI tasks as a framework and combines them manually. Conforming to the prompt structure of the Stable Diffusion official prompt book,3 we divide each prompt into three items, namely subject, detail, and style. The subject is the most important item and exists in all prompts. We extract 300 subjects from the prompts of DiffusionDB [15] according to the proportions of the respective categories (e.g., People, Arts, Outdoor Scenes, Artifacts, Animals) in ImageReward [14]. Detail refers to the adjectives added after the main object of the prompt, usually no more than two; we select the ten most commonly used adjectives with reference to the real input4 of Midjourney users. For the artistic style of the entire picture, we likewise select the five most commonly used styles. Finally, we combine the subject, 0–2 details, and 0–1 style together as shown in Fig. 4. Thus, we ensure that the prompts in AGIQA-3K cover a wide range of input content for the T2I generation task.
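The combination rule can be sketched in a few lines; the vocabularies below are illustrative placeholders, whereas the real database draws 300 subjects from DiffusionDB and the ten details / five styles most used by Midjourney users:

```python
import random

# Placeholder vocabularies for illustration only.
subjects = ["a lighthouse on a cliff", "an old fisherman"]
details = ["highly detailed", "dramatic lighting", "vivid colors"]
styles = ["baroque", "anime", "realistic", "abstract", "sci-fi"]

def make_prompt(subject: str) -> str:
    """Combine subject + 0-2 details + 0-1 style, as shown in Fig. 4."""
    parts = [subject]
    parts += random.sample(details, k=random.randint(0, 2))
    if random.random() < 0.5:  # 0 or 1 style item
        parts.append(random.choice(styles))
    return ", ".join(parts)

print(make_prompt(random.choice(subjects)))
```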
C. Subjective Experiment
To obtain the subjective quality scores for AGIQA-3K, we conducted a one-month subjective experiment in the SJTU Multimedia Laboratory. Complying with the ITU-R BT.500-13 [40] standard, we set up the environment as a normal indoor home setting with normal lighting levels and presented the AGIs in random order on an iMac monitor with a resolution of up to
The interface in Fig. 4 allows viewers to browse the previous and next AGIs and to move sliders ranging from 0 to 5 with a minimum interval of 0.1 as the quality score. A total of 21 graduate students (10 males and 11 females, of 6 nationalities) participated in the experiment over 14 sessions. To avoid visual fatigue, each session includes 213 images, limiting the experiment time to half an hour. After collecting
Beyond perceptual quality and alignment, considering the potential safety hazards brought about by the rapid development of AIGC, viewers marked the three most typical types of unsafe content: Social Problem [33] (controversial content such as religion, politics, and racial prejudice), NSFW [14] (unsafe content such as pornography, violence, and drugs), and Fake Generation [47] (images that can be mistaken for NSIs by the HVS, leading to risks like fake news). In the subjective experiment, considering that viewers of different nationalities and backgrounds define Social Problem and NSFW content differently, as long as one viewer feels offended or uncomfortable with a certain picture, that picture is labeled as Social Problem or NSFW content. Meanwhile, viewers identify which of a shown AGI and NSI is generated by AI; if more than half of the viewers judge wrongly, the AGI is realistic enough to be labeled as Fake Generation.
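The two decision rules above can be stated compactly. A minimal sketch, with illustrative viewer flags (the offended check is applied per category, i.e., once for Social Problem and once for NSFW):

```python
def label_unsafe(offended_flags: list[bool], fooled_flags: list[bool]) -> dict:
    """Apply the two labeling rules from the subjective experiment.

    offended_flags: one entry per viewer, True if the image offends or
                    discomforts them (per category).
    fooled_flags:   one entry per viewer, True if the viewer wrongly
                    identified which of the AGI/NSI pair was AI-generated.
    """
    return {
        # A single offended viewer is enough to flag the image.
        "unsafe_content": any(offended_flags),
        # More than half of the viewers fooled -> Fake Generation.
        "fake_generation": sum(fooled_flags) > len(fooled_flags) / 2,
    }

# Example with 21 viewers: one offended viewer, 12 of 21 fooled.
print(label_unsafe([False] * 20 + [True], [True] * 12 + [False] * 9))
# {'unsafe_content': True, 'fake_generation': True}
```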
D. Subjective Data Analysis
Although a large number of T2I AGI models [4], [5], [6], [7], [8], [9], [10], [11], [12] have been developed in recent years, there is limited work [37] investigating their generative performance. Across multiple models and different inputs, the quality (in terms of perception and alignment) of the generated AGIs remains an open question. Thanks to the abundant subjective quality scores and diverse prompts in the AGIQA-3K database, we conduct an in-depth analysis of this issue and summarize the factors influencing AGI subjective quality as follows:
AGI model: The AGI model applied in the T2I generation task plays a major role in generation quality. With the same input prompt, the generation quality of different AGI models varies greatly.
Prompt length: When the prompt is short, the model can easily generate high-quality images; but once the prompt reaches a certain length, it is difficult for the AGI to satisfy the entire prompt at the alignment level, and even when alignment succeeds, a certain level of perceptual quality is sacrificed as a trade-off.
Prompt style: The ‘style’ item in the prompt is crucial to the generation quality. Considering the training process of the AGI model, the artistic style contained in the training data determines the generation quality of the AGI model for a certain style, which is reflected in perception and alignment together.
Model parameter: The internal parameters of the model can profoundly affect AGI quality. CFG represents the 'importance' ratio between perception and alignment: the larger the CFG, the more the model values the alignment between the AGI and the prompt, and correspondingly the less it emphasizes perception. The number of iterations can also affect AGI quality, as the model outputs an intermediate result when iterations are insufficient.
Considering the above factors, we examined the subjective quality scores with respect to prompt length and style under the six AGI models, and report the quality scores of Stable Diffusion and Midjourney after parameter tuning. The subjective quality under different prompt lengths (from 'Subject' alone as prompt 0 to 'Subject' + 2 'Details' + 1 'Style' as prompt 3) is shown in Fig. 7.
To analyze the performance of each T2I AGI model on different styles, we calculated the subjective quality scores of five styles, 'Abstract', 'Anime', 'Baroque', 'Realistic', and 'Sci-fi', as shown in Fig. 8. The perception and alignment scores show that every model is good at generating 'Baroque' style images, followed by 'Anime' and 'Realistic', with the worst performance on 'Abstract' and 'Sci-fi'. This is because the first three are relatively broad concepts, while the latter two are more specialized. Since the training data of T2I AGI models usually contain a large number of NSIs, artworks, and cartoon images, their generation results on the first three categories are fine; but for minority styles such as 'Abstract' and 'Sci-fi', the lack of training data leads to defects in the final AGIs in both perception and alignment. Through horizontal comparison, we find that Midjourney has relatively good versatility across styles, but the versatility of DALLE2 and Stable Diffusion still needs to be improved. Especially for DALLE2, the difference in subjective perception scores between styles reaches 0.6. Therefore, improving versatility across different styles is of remarkable significance for future T2I AGI models.
We also adjusted the CFG scale and the number of iterations to study the impact of the models' internal parameters on AGI quality. Since Midjourney does not expose the CFG field, this parameter is adjusted to 0.5 and 2 times the default value only in Stable Diffusion; meanwhile, we studied the subjective quality under insufficient training steps by halving the number of iterations. The three situations above are denoted 'Low Corr', 'High Corr', and 'Low Step' in Fig. 9. The data show that Stable Diffusion has a strong tolerance for insufficient iterations, with both quality scores declining by less than 0.1; however, Midjourney's quality drops significantly after the iterations are halved, with the perception score almost reduced to the level of GLIDE. By observing the generation process of Midjourney, we find that in the first half of the steps the images only have blurred outlines, while details are rendered in the second half, so this blur likely dominates the decline in the perception score. For CFG, we found that either increasing or decreasing this value leads to a decrease in quality: when increased, the perception score drops significantly, and when decreased, the alignment drops more, which is consistent with the definition of CFG. It is worth mentioning that the decrease of one score does not increase the other, which shows the rationality of the default CFG value in Stable Diffusion; adjusting it arbitrarily is not recommended.
Alignment Quality Metric
A. Framework
Considering the remarkable variance of the subjective alignment quality scores in Sec. III-D, we propose an objective alignment quality assessment metric called StairReward. This method disassembles alignment quality assessment to the morpheme level for the first time, instead of using the entire prompt as previous methods do [17], [30]. The framework of StairReward is shown in Fig. 10: it divides the prompt into multiple morphemes while cutting the whole picture into multiple stairs, and gives the final score through their one-to-one alignment. The details of each component of the proposed model are described as follows.
B. Prompt Segmentation
A prompt contains multiple morphemes, and human saliency differs across their positions: for alignment quality, earlier morphemes have a greater impact on the subjective score, while later morphemes have less. Therefore, we first decompose the prompt into morphemes, so that the objective quality model is more consistent with the subjective perception mechanism. Considering that there are certain differences between prompts [48], [49] and natural language [50], it is not reasonable to directly apply previous word segmentation algorithms [51] to split prompts; we therefore adopt our own prompt segmentation method. By observing a large number of prompts in DiffusionDB [15], we found that prepositions and punctuation marks are the two most common elements separating prompts, so the morpheme sequence is obtained by splitting the prompt at prepositions and punctuation marks.
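A minimal sketch of such a splitter is given below; the preposition list is an illustrative assumption, as the exact set used is not enumerated here:

```python
import re

# Illustrative preposition list (an assumption, not the paper's exact set).
PREPOSITIONS = {"of", "in", "on", "with", "under", "by", "at", "near"}

def split_prompt(prompt: str) -> list[str]:
    """Split a prompt into morphemes at punctuation marks and prepositions."""
    morphemes, current = [], []
    for chunk in re.split(r"[,.;:]", prompt):  # punctuation boundaries
        for word in chunk.split():
            if word.lower() in PREPOSITIONS and current:
                morphemes.append(" ".join(current))  # preposition boundary
                current = []
            current.append(word)
        if current:
            morphemes.append(" ".join(current))
            current = []
    return morphemes

print(split_prompt("a portrait of an astronaut, in watercolor style"))
# ['a portrait', 'of an astronaut', 'in watercolor style']
```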
C. Image Cutting
Since the prompt is split into morphemes, we first need to locate the image region corresponding to each morpheme in the AGI and then compute the T2I alignment score. To avoid extra complexity in analyzing image content, we assume that the center of the image contains the most information and the edges the least. Therefore, we sample boxes sharing the image center and cut the image into different stairs by adjusting the box length. Fig. 11 (a) supports the rationality of this method: when the prompt contains three morphemes, the growth of the CLIP score between the first morpheme and the picture suddenly slows down once the box length reaches 0.5, while the CLIP score between the third morpheme and the picture keeps growing steadily. It can be seen that the first morpheme mainly corresponds to the central part of the picture, and later morphemes correspond to larger areas. Therefore, based on each morpheme's position, the stair-image is obtained by center-cropping with a box length that grows with that position.
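A sketch of this cutting step is given below, assuming a linear growth of the box length from 0.5 (motivated by the knee in Fig. 11 (a)) to the full image; the exact schedule may differ:

```python
from PIL import Image

def stair_images(image: Image.Image, n_morphemes: int,
                 min_ratio: float = 0.5) -> list[Image.Image]:
    """Center-crop one 'stair' per morpheme: the first morpheme gets the
    smallest center box, later morphemes progressively larger ones."""
    w, h = image.size
    stairs = []
    for i in range(n_morphemes):
        # Box side length grows linearly from min_ratio to 1.0.
        r = min_ratio + (1.0 - min_ratio) * i / max(n_morphemes - 1, 1)
        bw, bh = int(w * r), int(h * r)
        left, top = (w - bw) // 2, (h - bh) // 2
        stairs.append(image.crop((left, top, left + bw, top + bh)))
    return stairs
```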
D. Final Score Combination
With the sub-images and morphemes, we compute their alignment scores one by one. As ImageReward [14] is trained on extensive data, it is suitable for prompts of different lengths and fits the morpheme level, so we choose ImageReward as our alignment model. After calculating the above scores, since each score has a different impact on the overall alignment, we set the weight of each score to half of the previous one, referring to the vertical axis in Fig. 11 (b). Finally, to preserve information across morphemes, we also calculate the alignment score between the entire image and the whole prompt, and then add the above scores to obtain the final score.
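Putting the three steps together, the combination can be sketched as follows, where align stands for the base alignment model (ImageReward in our case); any weight normalization is omitted, as the exact formula is not reproduced here:

```python
def stair_reward(align, image, prompt, morphemes, stairs):
    """Combine per-morpheme alignment scores with halving weights, plus a
    whole-image score that preserves information between morphemes.

    align(image, text) -> float is the base alignment model."""
    score, weight = 0.0, 1.0
    for morpheme, stair in zip(morphemes, stairs):
        score += weight * align(stair, morpheme)
        weight *= 0.5  # each later morpheme weighs half the previous one
    score += align(image, prompt)  # whole image vs. whole prompt
    return score
```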
Experiment Results
A. Experiment Settings
To benchmark the performance of AGI perception and alignment metrics, three commonly used indicators, the Spearman Rank-order Correlation Coefficient (SRoCC), the Kendall Rank-order Correlation Coefficient (KRoCC), and the Pearson Linear Correlation Coefficient (PLCC), are applied to evaluate the consistency between the predicted scores and the subjective MOSs; SRoCC and KRoCC indicate prediction monotonicity, while PLCC represents prediction accuracy. To map the predicted scores to MOSs, a five-parameter logistic function is applied, which is a standard practice suggested in [52]:
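In its commonly used form (the exact parameterization in [52] may differ slightly), the mapping is

$$\hat{Q} = \beta_1 \left( \frac{1}{2} - \frac{1}{1 + e^{\beta_2 (Q - \beta_3)}} \right) + \beta_4 Q + \beta_5,$$

where $Q$ is the raw objective score, $\hat{Q}$ is the mapped score, and $\beta_1, \dots, \beta_5$ are parameters fitted on the data.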
We select a wide range of AGI perception and alignment benchmarks for comparison. For perception, only No-Reference (NR) metrics are selected, considering the absence of references in the T2I AGI task as reviewed in Sec. II-A:
Handcrafted-based models: This group includes four mainstream perceptual quality metrics, namely CEIQ [53], DSIQA [54], NIQE [55], and Sisblim [44]. These models extract handcrafted features based on prior knowledge about image quality.
Loss-function models: This group includes three loss functions commonly used in AGI iterations, namely FID [19], Inception Score (ICS) [18], and KID [20]. FID and KID measure the distance between the AGIs and the MS-COCO [56] database.
SVR-based models: This group includes BMPRI [57], GMLF [58], and HIGRADE [59]. These models feed handcrafted features into a Support Vector Regression (SVR) to represent perceptual quality.
DL-based models: This group includes the latest deep learning (DL) metrics, namely DBCNN [60], CLIPIQA [61], CNNIQA [62], HyperIQA [63], and UNIQUE [64]. These models characterize quality-aware information [64], [65], [66], [67] by training deep neural networks on labeled data.
For alignment, we select the most popular CLIP [30] model, the latest ImageReward [14], HPS [17], and PickScore [16], and our proposed StairReward.
The AGIQA-3K database is split randomly in an 80/20 ratio for training/testing while ensuring that images with the same object label fall into the same set. The partitioning and evaluation process is repeated several times for a fair comparison while keeping the computational complexity manageable, and the average result is reported as the final performance. For SVR models, the number of repetitions is 1,000, implemented with LIBSVM [68] using a radial basis function (RBF) kernel. For DL models, we use the pyiqa [69] framework with 10 such repetitions. The Adam optimizer [70] (with an initial learning rate of 0.00001 and batch size 40) is used for 100-epoch training on an NVIDIA RTX 4090 GPU.
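The label-grouped split can be reproduced with a standard grouped splitter; a sketch follows, where the three arrays are placeholders for the actual image paths, MOSs, and object labels:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Placeholder data standing in for the real database contents.
images = np.array([f"img_{i}.png" for i in range(2982)])
mos = np.random.rand(2982)                                       # MOS labels
groups = np.array([f"subject_{i % 300}" for i in range(2982)])   # object labels

def grouped_split(images, mos, groups, seed):
    """80/20 split keeping images with the same object label together."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, test_idx = next(splitter.split(images, mos, groups=groups))
    return train_idx, test_idx

# Repeat the split (1,000x for SVR models, 10x for DL models) and average.
for seed in range(10):
    train_idx, test_idx = grouped_split(images, mos, groups, seed)
    # ... train and evaluate, accumulating SRoCC/KRoCC/PLCC per repetition ...
```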
B. Experiment Results and Discussion
Tab. II lists the performance results of different perception models on the proposed AGIQA-3K database. To analyze the consistency between the perception models and the subjective scores for different T2I AGI models, we divide the six AGI models into three groups based on the subjective perception/alignment scores in Fig. 7: bad models (AttnGAN [6], GLIDE [10]), medium models (DALLE2 [34], Stable Diffusion [11]), and good models (Midjourney [35], Stable Diffusion XL). Tab. II (a) shows that, compared with the loss functions used in AGI iterations, the other three types of models are more consistent with the HVS, especially the DL models with an overall SRoCC of about 0.8. However, their performance is less satisfying on the three subsets generated by the different AGI model groups: on each subset, even the best perception model only reaches an SRoCC of about 0.5. There are also significant differences in how different models handle low/high-quality content. For example, the overall performance of CLIPIQA and DBCNN is comparable, but CLIPIQA shows a significant advantage in analyzing aesthetic features in the good-model subset while performing unsatisfactorily on the more distorted bad-model subset; in contrast, DBCNN is more balanced across data from different AGI models.
Considering differences in prompt length, we define the prompt complexity as prompt 0–3 according to the number of 'detail' and 'style' items, following the description in Sec. III-D; the larger the number, the more complex the prompt. The performance of the perception models on subsets with different prompt lengths is shown in Tab. II (b). Generally, the perceptual quality prediction performance of each model decreases to some extent as the prompts become longer. From the distribution in Fig. 7, we believe this decrease is a combined result of longer prompts making content harder to generate and the perception models' limited predictive ability on low-quality content.
For different styles of AGIs, based on the similarity of the 'Abstract' & 'Sci-fi' styles (least popular) and the 'Anime' & 'Realistic' styles (second-least popular) shown in Fig. 8, we group the styles as in Tab. II (c). The data show that the perceptual quality models predict the quality of AGIs in unpopular styles well, but not satisfactorily in popular styles.
For T2I alignment, we conduct a validation in Tab. III similar to that for perception. The results show that the alignment models have much room for improvement when predicting the T2I alignment of images generated by good AGI models, from long prompts, and in popular styles. It is worth mentioning that, thanks to its reasonable disassembly of the prompt, our StairReward far outperforms other methods in predicting the alignment of long prompts, thus taking the lead in the alignment index on the entire AGIQA-3K.
Generally, considering the performance of perception and alignment evaluation models, future work can be carried out in the following aspects:
For both types of models, the existing perception and alignment models are good at distinguishing between excellent and poor AGIs, but when faced with results of similar quality from the same T2I AGI model, their assessment is not accurate enough. How to distinguish AGIs with similar subjective quality is the most urgent problem for future quality models.
For perception models, the IQA models (especially the DL-based ones) show excellent agreement with HVS subjective scores. Therefore, future T2I AGI models can consider replacing the traditional loss function with a DL-based model and guiding generation through perception.
For alignment models, their performance still has a certain gap from the perception assessment task above. Our proposed StairReward improves alignment assessment performance to some extent, but a more accurate quality model is still needed in the future.
C. Ablation Study
To validate the contributions of the different modules in StairReward, we also conduct an ablation study, whose results are shown in Tab. IV. The factors are specified as:
(word): prompt segmentation; 'seg pre/pon' splits the prompt only when a preposition/punctuation mark appears.
(image): the minimum cut size of the image. 0 indicates the most extreme cutting strategy, while 1 means no cutting.
(ratio): the merging ratio of each word-image pair. The larger the value, the more the front morphemes in the prompt contribute to the final quality rating.
The results show that removing any single factor leads to performance degradation, which confirms that they all contribute to the performance results in Tab. III.
Conclusion
With the continuous advancement of deep learning technology, a large number of T2I AGI models have emerged in recent years. Different prompt inputs and parameter selections lead to huge differences in AGI quality, so refinement and filtering are required before actual use; there is thus an urgent need for objective models that assess AGI quality. In this paper, we first discuss the important evaluation aspects, formulating subjective evaluation criteria in terms of perception and alignment. Then, we establish the AGI quality assessment database AGIQA-3K, which covers the largest number of AGI models with the most fine-grained, multi-dimensional annotations so far, containing 2,982 AGIs generated by GAN/AR/diffusion-based models. A well-organized subjective experiment is conducted to collect quality labels for the AGIs. Subsequently, benchmark experiments are conducted to evaluate the performance of current IQA models. The experimental results show that current perception/alignment models cannot handle the AGIQA task well; in particular, considering the limited performance of existing alignment models, we propose StairReward to objectively evaluate alignment quality. In conclusion, both perception and alignment models need to be improved in the future, and AGIQA-3K lays the foundation for this improvement.