License: arXiv.org perpetual non-exclusive license
arXiv:2310.18961v3 [cs.CV] 03 Dec 2023

AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection

Qihang Zhou1, Guansong Pang2*, Yu Tian3, Shibo He1†, Jiming Chen1†
1Zhejiang University  2Singapore Management University  3Harvard University
1{zqhang, s18he, cjm}@zju.edu.cn  2gspang@smu.edu.sg  3ytian11@meei.harvard.edu
* Equal contribution. † Corresponding authors.
Abstract

Zero-shot anomaly detection (ZSAD) requires detection models trained using auxiliary data to detect anomalies without any training sample in a target dataset. It is a crucial task when training data is not accessible due to various concerns, e.g., data privacy, yet it is challenging since the models need to generalize to anomalies across different domains where the appearance of foreground objects, abnormal regions, and background features, such as defects/tumors on different products/organs, can vary significantly. Recently large pre-trained vision-language models (VLMs), such as CLIP, have demonstrated strong zero-shot recognition ability in various vision tasks, including anomaly detection. However, their ZSAD performance is weak since the VLMs focus more on modeling the class semantics of the foreground objects rather than the abnormality/normality in the images. In this paper we introduce a novel approach, namely AnomalyCLIP, to adapt CLIP for accurate ZSAD across different domains. The key insight of AnomalyCLIP is to learn object-agnostic text prompts that capture generic normality and abnormality in an image regardless of its foreground objects. This allows our model to focus on the abnormal image regions rather than the object semantics, enabling generalized normality and abnormality recognition on diverse types of objects. Large-scale experiments on 17 real-world anomaly detection datasets show that AnomalyCLIP achieves superior zero-shot performance of detecting and segmenting anomalies in datasets of highly diverse class semantics from various defect inspection and medical imaging domains. Code will be made available at https://github.com/zqhang/AnomalyCLIP.

1 Introduction

Anomaly detection (AD) has been widely applied in various applications, such as industrial defect inspection (Bergmann et al., 2019; Xie et al., 2023; Roth et al., 2022; Huang et al., 2022; Mou et al., 2022; Chen et al., 2022; Bergmann et al., 2020; Pang et al., 2021a; Reiss & Hoshen, 2023; You et al., 2022; Liznerski et al., 2020; Ding et al., 2022; Cao et al., 2023) and medical image analysis (Pang et al., 2021a; Qin et al., 2022; Liu et al., 2023; Ding et al., 2022; Tian et al., 2021; 2023; Fernando et al., 2021). Existing AD approaches typically assume that training examples in a target application domain are available for learning the detection models (Pang et al., 2021b; Ruff et al., 2021). However, this assumption may not hold in various scenarios, such as i) when accessing training data violates data privacy policies (e.g., to protect the sensitive information of patients), or ii) when the target domain does not have relevant training data (e.g., inspecting defects in a manufacturing line of new products).

Zero-shot anomaly detection (ZSAD) is an emerging task for AD in such scenarios, to which the aforementioned AD approaches are not viable, as it requires detection models to detect anomalies without any training sample in a target dataset. Since anomalies from different application scenarios typically have substantial variations in their visual appearance, foreground objects, and background features, e.g., defects on the surface of one product vs. that on the other products, lesions/tumors on different organs, or industrial defects vs. tumors/lesions in medical images, detection models with strong generalization ability w.r.t. such variations are needed for accurate ZSAD.

Recently large pre-trained vision-language models (VLMs) (Radford et al., 2021; Kirillov et al., 2023) have demonstrated strong zero-shot recognition ability in various vision tasks, including anomaly detection (Jeong et al., 2023). Particularly, being pre-trained using millions/billions of image-text pairs, CLIP (Radford et al., 2021) has been applied to empower various downstream tasks (Zhou et al., 2022b; Rao et al., 2022; Khattak et al., 2023; Sain et al., 2023) with its strong generalization capability. WinCLIP (Jeong et al., 2023) is a seminal work in the ZSAD line, which designs a large number of artificial text prompts to exploit CLIP's generalizability for ZSAD. However, the VLMs such as CLIP are primarily trained to align with the class semantics of foreground objects rather than the abnormality/normality in the images, and as a result, their generalization in understanding the visual abnormality/normality is restricted, leading to weak ZSAD performance. Further, the current prompting approaches, using either manually defined text prompts (Jeong et al., 2023) or learnable prompts (Sun et al., 2022; Zhou et al., 2022a), often result in prompt embeddings that opt for global features for effective object semantic alignment (Zhong et al., 2022; Wu et al., 2023), failing to capture the abnormality that often manifests in fine-grained, local features, as shown in Fig. 1d and Fig. 1e.

In this paper we introduce a novel approach, namely AnomalyCLIP, to adapt CLIP for accurate ZSAD across different domains. AnomalyCLIP aims to learn object-agnostic text prompts that capture generic normality and abnormality in an image regardless of its foreground objects.
It first devises a simple yet universally-effective learnable prompt template for the two general classes – normality and abnormality – and then utilizes both image-level and pixel-level loss functions to learn the generic normality and abnormality globally and locally in our prompt embeddings using auxiliary data. This allows our model to focus on the abnormal image regions rather than the object semantics, enabling remarkable zero-shot capability of recognizing the abnormality that has similar abnormal patterns to those in auxiliary data. As shown in Fig. 1a and Fig. 1b, the foreground object semantics can be completely different in the fine-tuning auxiliary data and target data, but the anomaly patterns remain similar, e.g., scratches on metal nuts and plates, the misplacement of transistors and PCB, tumors/lesions on various organ surfaces, etc. Text prompt embeddings in CLIP fail to generalize across different domains, as illustrated in Fig. 1c, but object-agnostic prompt embeddings learned by AnomalyCLIP can effectively generalize to recognize the abnormality across different domain images in Fig. 1f.

[Figure 1: panels (a)-(f); images omitted, see caption below.]
Figure 1: Comparison of ZSAD results on (b) test data using (c) original text prompts in CLIP (Radford et al., 2021), (d) tailored text prompts for AD in WinCLIP (Jeong et al., 2023), (e) learnable text prompts for general vision tasks in CoOp (Zhou et al., 2022a), and (f) object-agnostic text prompts in our AnomalyCLIP. (a) presents a set of auxiliary data we can use to learn the text prompts. The results are obtained by measuring the similarity between text prompt embeddings and image embeddings. The ground-truth anomaly regions are circled in red in (a) and (b). (c), (d), and (e) suffer from poor generalization across different domains, while our AnomalyCLIP in (f) can well generalize to anomalies in diverse types of objects from different domains.

In summary, this paper makes the following main contributions.

  • We reveal for the first time that learning object-agnostic text prompts of normality and abnormality is a simple yet effective approach for accurate ZSAD. Compared to current text prompting approaches that are primarily designed for object semantic alignment (Jeong et al., 2023; Zhou et al., 2022b), our text prompt embeddings model semantics of generic abnormality and normality, allowing object-agnostic, generalized ZSAD performance.


  • We then introduce a novel ZSAD approach, called AnomalyCLIP, in which we utilize an object-agnostic prompt template and a glocal abnormality loss function (i.e., a combination of global and local loss functions) to learn the generic abnormality and normality prompts using auxiliary data. In doing so, AnomalyCLIP largely simplifies the prompt design and can effectively apply to different domains without requiring any change to its two learned prompts, in contrast to existing methods like WinCLIP whose effectiveness relies heavily on extensive engineering of hundreds of manually defined prompts.


  • Comprehensive experiments on 17 datasets from various industrial and medical domains demonstrate that AnomalyCLIP achieves superior ZSAD performance of detecting and segmenting anomalies in datasets of highly diverse class semantics from defect inspection and medical imaging domains.



2 Preliminary

CLIP consists of a text encoder and a visual encoder, denoted as $T(\cdot)$ and $F(\cdot)$, respectively. Both encoders are mainstream multi-layer networks such as ViT (Dosovitskiy et al., 2020; Vaswani et al., 2017). Using text prompts is a typical way to obtain the embeddings of different classes for zero-shot recognition. Particularly, a text prompt template $\mathbb{G}$ with the class name $c$ can be passed through $T(\cdot)$ to obtain its corresponding textual embedding $g_c \in \mathbb{R}^D$. The text prompt template commonly used in CLIP looks like A photo of a [cls], where [cls] represents the target class name. Then $F(\cdot)$ encodes an image $x_i$ to derive visual representations, where the class token $f_i \in \mathbb{R}^D$ is treated as its visual embedding (global visual embedding), and the patch tokens $f_i^m \in \mathbb{R}^{H \times W \times D}$ are referred to as local visual embeddings. CLIP performs zero-shot recognition by measuring the similarity between textual and visual embeddings. Specifically, given a target class set $\mathcal{C}$ and an image $x_i$, CLIP predicts the probability of $x_i$ belonging to class $c$ as follows:

$$p(y=c\,|\,x_i) = P(g_c, f_i) = \frac{\exp(\langle g_c, f_i\rangle/\tau)}{\sum_{c\in\mathcal{C}}\exp(\langle g_c, f_i\rangle/\tau)}, \qquad (1)$$

where $\tau$ is a temperature hyperparameter, and the operator $\langle\cdot,\cdot\rangle$ denotes cosine similarity. Unlike many vision tasks that involve many objects and use the names of the objects as the class name [cls], we posit that performing ZSAD tasks using CLIP should be object-agnostic, so we propose to design two classes of text prompts (i.e., normality and abnormality) and compute the probability of these two classes according to Eq. 1. We denote the probability of being abnormal, $P(g_a, f_i)$, as the anomaly score. The computation is extended from global visual embeddings to local visual embeddings to derive the corresponding segmentation maps $S_n \in \mathbb{R}^{H\times W}$ and $S_a \in \mathbb{R}^{H\times W}$, where each entry $(j,k)$ is computed as $P(g_n, f_i^{m(j,k)})$ and $P(g_a, f_i^{m(j,k)})$, respectively.
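To make Eq. 1 and its patch-level extension concrete, here is a minimal PyTorch sketch, assuming the textual and visual embeddings have already been extracted by the frozen CLIP encoders; the tensor shapes, temperature value, and function names are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def zero_shot_probs(text_embs, visual_emb, tau=0.07):
    """Eq. 1: softmax over cosine similarities between class text embeddings
    (C x D) and a global visual embedding (D,)."""
    text_embs = F.normalize(text_embs, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)
    logits = text_embs @ visual_emb / tau          # (C,)
    return logits.softmax(dim=0)                   # p(y = c | x)

def segmentation_maps(g_n, g_a, patch_embs, tau=0.07):
    """Extend Eq. 1 to local (patch) embeddings to obtain S_n and S_a.
    patch_embs: (H, W, D) local visual embeddings."""
    H, W, D = patch_embs.shape
    text_embs = F.normalize(torch.stack([g_n, g_a]), dim=-1)   # (2, D)
    patches = F.normalize(patch_embs.reshape(-1, D), dim=-1)   # (H*W, D)
    probs = (patches @ text_embs.t() / tau).softmax(dim=-1)    # (H*W, 2)
    probs = probs.reshape(H, W, 2)
    return probs[..., 0], probs[..., 1]            # S_n, S_a

# Toy usage with random embeddings (D is a placeholder embedding size).
D, H, W = 768, 24, 24
g_n, g_a, f_i = torch.randn(D), torch.randn(D), torch.randn(D)
anomaly_score = zero_shot_probs(torch.stack([g_n, g_a]), f_i)[1]   # P(g_a, f_i)
S_n, S_a = segmentation_maps(g_n, g_a, torch.randn(H, W, D))
```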

Figure 2: Overview of AnomalyCLIP. To adapt CLIP to ZSAD, AnomalyCLIP introduces object-agnostic text prompt templates to capture generic normality and abnormality regardless of the object semantics. Then, we introduce glocal context optimization to incorporate global and fine-grained anomaly semantics into object-agnostic text prompt learning. Finally, textual prompt tuning and DPAM are used to enable the prompt learning in the textual and local visual spaces of CLIP.

3 AnomalyCLIP: object-agnostic prompt learning

3.1 Approach overview

In this paper, we propose AnomalyCLIP to adapt CLIP to ZSAD via object-agnostic prompt learning. As shown in Fig. 2, AnomalyCLIP first introduces object-agnostic text prompt templates, where we design two generic object-agnostic text prompt templates, $g_n$ and $g_a$, to learn generalized embeddings for the normality and abnormality classes, respectively (see Sec. 3.2). To learn such generic text prompt templates, we introduce global and local context optimization to incorporate global and fine-grained anomaly semantics into object-agnostic textual embedding learning. In addition, textual prompt tuning and DPAM are used to support the learning in the textual and local visual spaces of CLIP. Finally, we integrate multiple intermediate layers to provide more local visual details. During training, all modules are jointly optimized by the combination of global and local context optimization. During inference, we quantify the misalignment of textual and global/local visual embeddings to obtain the anomaly score and anomaly score map, respectively (see Sec. 3.3).

3.2 Object-agnostic text prompt design

Commonly used text prompt templates in CLIP, like A photo of a [cls], primarily focus on object semantics. Consequently, they fail to generate textual embeddings that capture anomaly and normal semantics to query corresponding visual embeddings. To support the learning of anomaly-discriminative textual embeddings, we aim to incorporate prior anomaly semantics into text prompt templates. A trivial solution is to design the templates with specific anomaly types, such as A photo of a [cls] with scratches. However, anomaly patterns are typically unknown and diverse, so it is practically difficult to list all possible anomaly types. Therefore, it is important to define text prompt templates with generic anomaly semantics. For this purpose, we can adopt the text damaged [cls] to cover comprehensive anomaly semantics, facilitating the detection of diverse defects such as scratches and holes. Nevertheless, utilizing such text prompt templates poses challenges in generating generic anomaly-discriminating textual embeddings. This is because CLIP's original pre-training focuses on aligning with object semantics instead of the abnormality and normality within images. To address this limitation, we can introduce learnable text prompt templates and tune the prompts using auxiliary AD-relevant data. During the fine-tuning process, these learnable templates can incorporate both broad and detailed anomaly semantics, resulting in textual embeddings that are more discriminative between normality and abnormality. This avoids the need for manually defined text prompt templates that require extensive engineering (Jeong et al., 2023). These text prompts are referred to as object-aware text prompt templates and are defined as follows:

$$g_n = [V_1][V_2]\dots[V_E][cls]$$
$$g_a = [W_1][W_2]\dots[W_E][damaged][cls],$$

where $[V]_i$ and $[W]_i$ ($i \in \{1,\dots,E\}$) are learnable word embeddings in the normality and abnormality text prompt templates, respectively.

ZSAD tasks require models to detect anomalies in previously unseen target datasets. These datasets often exhibit significant variations in object semantics among different objects, like various defects on one product vs. another, or discrepancies between industrial defects and medical imaging tumors. However, despite these substantial differences in object semantics, the underlying anomaly patterns can be similar. For instance, anomalies like scratches on metal nuts and plates, the misplacement of transistors and PCBs, as well as tumors on the surface of various organs, can share similar anomaly patterns. We hypothesize that the key to accurate ZSAD is to identify these generic anomaly patterns regardless of the varying semantics of different objects. Therefore, the inclusion of object semantics in object-aware text prompt templates is often unnecessary for ZSAD. It can even hinder the detection of anomalies in classes that have not been seen during the learning process. More importantly, excluding the object semantics from text prompt templates allows learnable text prompt templates to focus on capturing the characteristics of anomalies themselves, rather than the objects. Motivated by this, we introduce object-agnostic prompt learning, with the aim to capture generic normality and abnormality within images regardless of the object semantics. Different from the object-aware text prompt templates, as shown below, the object-agnostic text prompt templates replace the class name in $g_n$ and $g_a$ with object, blocking out the class semantics of objects:

$$g_n = [V_1][V_2]\dots[V_E][object]$$
$$g_a = [W_1][W_2]\dots[W_E][damaged][object].$$

This design empowers the object-agnostic text prompt template to learn the shared patterns of different anomalies. As a result, the generated textual embeddings are more generic and capable of identifying anomalies across diverse objects and different domains. Further, this prompt design is versatile and can be applied to different target domains without any modification, e.g., requiring no knowledge about the object name or anomaly types in a target dataset.
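As a rough illustration of how such object-agnostic templates could be realized, the sketch below prepends learnable context vectors to fixed token embeddings for object and damaged object; the concatenated sequences would then be fed through the frozen CLIP text encoder to produce $g_n$ and $g_a$. The module name, embedding dimension, and the way the suffix token embeddings are obtained are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ObjectAgnosticPrompts(nn.Module):
    """Learnable prefixes [V_1..V_E] and [W_1..W_E] for the normality and
    abnormality templates; the suffix embeddings for "object" and
    "damaged object" are assumed to come from the frozen CLIP token embedding."""
    def __init__(self, object_tokens, damaged_object_tokens,
                 embed_dim=768, num_learnable=12):
        super().__init__()
        self.normal_ctx = nn.Parameter(torch.randn(num_learnable, embed_dim) * 0.02)
        self.abnormal_ctx = nn.Parameter(torch.randn(num_learnable, embed_dim) * 0.02)
        # Frozen suffix embeddings (e.g., CLIP token embeddings of the words).
        self.register_buffer("object_tokens", object_tokens)
        self.register_buffer("damaged_object_tokens", damaged_object_tokens)

    def forward(self):
        g_n_tokens = torch.cat([self.normal_ctx, self.object_tokens], dim=0)
        g_a_tokens = torch.cat([self.abnormal_ctx, self.damaged_object_tokens], dim=0)
        return g_n_tokens, g_a_tokens   # to be fed into the frozen text encoder

# Toy usage with random tensors standing in for CLIP token embeddings.
obj = torch.randn(1, 768)
dmg_obj = torch.randn(2, 768)
prompts = ObjectAgnosticPrompts(obj, dmg_obj)
g_n_tokens, g_a_tokens = prompts()
```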

3.3 Learning generic abnormality and normality prompts

Glocal context optimization

To effectively learn the object-agnostic text prompts, we devise a joint optimization approach that enables the normality and abnormality prompt learning from both global and local perspectives, namely global and local context optimization. The global context optimization aims to enforce that our object-agnostic textual embeddings are matched with the global visual embeddings of images of diverse objects. This helps effectively capture the normal/abnormal semantics from a global feature perspective. The local context optimization is introduced to enable object-agnostic text prompts to concentrate on fine-grained, local abnormal regions from $M$ intermediate layers of the visual encoder, in addition to the global normal/abnormal features. Formally, let $\mathcal{M}$ be the set of intermediate layers used (i.e., $M = |\mathcal{M}|$); our text prompts are learned by minimizing the following glocal loss function:

$$L_{total} = L_{global} + \lambda \sum_{M_k \in \mathcal{M}} L_{local}^{M_k}, \qquad (2)$$

where $\lambda$ is a hyperparameter to balance the global and local losses, and $L_{global}$ is a cross-entropy loss that matches the cosine similarity between the object-agnostic textual embeddings and the visual embeddings of normal/abnormal images from the auxiliary data. Let $S \in \mathbb{R}^{H\times W}$ be the ground-truth segmentation mask, with $S_{ij} = 1$ if the pixel is an anomaly and $S_{ij} = 0$ otherwise; then we have

$$S_n^{(j,k)} = P(g_n, f_i^{m(j,k)}), \quad S_a^{(j,k)} = P(g_a, f_i^{m(j,k)}), \quad j \in [1,H],\; k \in [1,W],$$
$$L_{local} = Focal([S_n, S_a], S) + Dice(S_n, I-S) + Dice(S_a, S),$$

where $Focal(\cdot,\cdot)$ and $Dice(\cdot,\cdot)$ denote a focal loss (Lin et al., 2017) and a Dice loss (Li et al., 2019), respectively, the operator $[\cdot,\cdot]$ represents concatenation along the channel dimension, and $I$ denotes the all-ones matrix. Since the anomalous regions are typically smaller than the normal ones, we use the focal loss to address the imbalance problem. Furthermore, to ensure that the model establishes an accurate decision boundary, we employ the Dice loss to measure the overlap between the predicted segmentations $S_n$/$S_a$ and the ground-truth mask.
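The snippet below sketches Eq. 2 and the local loss under simplified focal and Dice formulations; the exact losses follow Lin et al. (2017) and Li et al. (2019), and the λ value, tensor shapes, and helper names here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def focal_loss(probs, target, gamma=2.0, eps=1e-6):
    # probs: (N, 2, H, W) softmax probabilities over [normal, abnormal];
    # target: (N, H, W) with 1 for anomalous pixels, 0 otherwise.
    pt = torch.where(target.bool(), probs[:, 1], probs[:, 0]).clamp(min=eps)
    return (-(1 - pt) ** gamma * pt.log()).mean()

def dice_loss(pred, target, eps=1e-6):
    # pred, target: (N, H, W) with values in [0, 1].
    inter = (pred * target).sum(dim=(1, 2))
    union = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def glocal_loss(image_logits, image_labels, S_n_list, S_a_list, mask, lam=1.0):
    """image_logits: (N, 2) similarities of (g_n, g_a) with global embeddings;
    S_n_list / S_a_list: per-intermediate-layer maps of shape (N, H, W);
    mask: (N, H, W) ground-truth segmentation S."""
    L_global = F.cross_entropy(image_logits, image_labels)
    L_local = 0.0
    for S_n, S_a in zip(S_n_list, S_a_list):
        probs = torch.stack([S_n, S_a], dim=1)            # [S_n, S_a]
        L_local = L_local + focal_loss(probs, mask) \
                  + dice_loss(S_n, 1.0 - mask) + dice_loss(S_a, mask)
    return L_global + lam * L_local

# Toy usage: one intermediate layer, batch of 2, 24x24 patch maps.
logits = torch.randn(2, 2)
labels = torch.tensor([0, 1])
S_n, S_a = torch.rand(2, 24, 24), torch.rand(2, 24, 24)
mask = (torch.rand(2, 24, 24) > 0.9).float()
loss = glocal_loss(logits, labels, [S_n], [S_a], mask)
```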

Refinement of the textual space

To facilitate the learning of a more discriminative textual space via Eq. 2, inspired by Jia et al. (2022) and Khattak et al. (2023), we introduce text prompt tuning to refine the original textual space of CLIP by adding additional learnable token embeddings into its text encoder. Specifically, we first attach randomly initialized learnable token embeddings $t'_m$ to $T_m$, the $m$-th layer of the frozen CLIP text encoder. Then, we concatenate $t'_m$ and the original token embeddings $t_m$ along the channel dimension and forward them through $T_m$ to obtain the corresponding $r'_{m+1}$ and $t_{m+1}$. To ensure proper calibration, we discard the obtained $r'_{m+1}$ and initialize new learnable token embeddings $t'_{m+1}$. Note that even though the output $r'_{m+1}$ is discarded, the gradients can still be backpropagated to optimize the learnable tokens $t'_m$ due to the self-attention mechanism. We repeat this operation until we reach the designated layer $M'$. During fine-tuning, these learnable token embeddings are optimized to refine the original textual space. More details are given in Appendix D.
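The following sketch illustrates this style of deep text prompt tuning: learnable tokens are concatenated with the token sequence at each of the first $M'$ frozen layers, and the outputs at those positions are discarded in favor of freshly initialized tokens. Using nn.TransformerEncoderLayer as a stand-in for CLIP's text transformer blocks, as well as the default prompt length and depth, are assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn

class DeepTextPromptTuning(nn.Module):
    """Frozen text-encoder layers T_1..T_L with learnable tokens t'_m inserted
    at the first `prompt_depth` layers (M' in the paper); outputs at the
    learnable positions are discarded and replaced by fresh tokens."""
    def __init__(self, layers, prompt_len=4, prompt_depth=9, dim=768):
        super().__init__()
        self.layers = layers                        # frozen transformer layers
        self.prompt_depth = prompt_depth
        self.deep_prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
             for _ in range(prompt_depth)])

    def forward(self, t):                           # t: (seq_len, batch, dim)
        p = self.deep_prompts[0].unsqueeze(1).expand(-1, t.size(1), -1)
        for m, layer in enumerate(self.layers):
            x = layer(torch.cat([p, t], dim=0))     # run [t'_m ; t_m] through T_m
            r_next, t = x[:p.size(0)], x[p.size(0):]
            if m + 1 < self.prompt_depth:
                # discard r'_{m+1}; insert freshly initialized t'_{m+1}
                p = self.deep_prompts[m + 1].unsqueeze(1).expand(-1, t.size(1), -1)
            else:
                p = r_next                          # beyond M': let tokens propagate
        return t                                    # refined token embeddings

# Toy usage: 12 frozen encoder layers, a 77-token text sequence, batch of 2.
layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model=768, nhead=8)
                        for _ in range(12)])
for prm in layers.parameters():
    prm.requires_grad_(False)
tuner = DeepTextPromptTuning(layers, prompt_len=4, prompt_depth=9)
refined = tuner(torch.randn(77, 2, 768))            # (77, 2, 768)
```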

Refinement of the local visual space
[Figure 3 panels: (a) Input, (b) Q-K attention, (c) Q-Q attention, (d) K-K attention, (e) V-V attention; images omitted.]
Figure 3: DPAM visualization.

Since the visual encoder of CLIP is originally pre-trained to align global object semantics, the contrastive loss used in CLIP makes the visual encoder produce a representative global embedding for class recognition. Through the self-attention mechanism, the attention map in the visual encoder focuses on the specific tokens highlighted within the red rectangle in Fig. 3b. Although these tokens may contribute to global object recognition, they disrupt the local visual semantics, which directly hinders the effective learning of the fine-grained abnormality in our object-agnostic text prompts. We found empirically that a diagonally prominent attention map helps reduce the disturbance from other tokens, leading to improved local visual semantics. Therefore, we propose a mechanism called Diagonally Prominent Attention Map (DPAM) to refine the local visual space, with the visual encoder kept frozen during training. To this end, we replace the original $Q$-$K$ attention in the visual encoder with diagonally prominent attention, such as $Q$-$Q$, $K$-$K$, and $V$-$V$ self-attention schemes. As demonstrated in Fig. 3c, Fig. 3d, and Fig. 3e, the refined DPAM attention maps are more diagonally prominent, resulting in substantially improved segmentation maps in both the original CLIP and our AnomalyCLIP. Compared to CLIP, which is based on global features and manually defined text prompts, the text prompts learned by AnomalyCLIP are more fine-grained, enabling substantially more accurate alignment between the normality/abnormality prompt embeddings and the local visual embeddings across the four different self-attention schemes. This, in turn, allows AnomalyCLIP to generate accurate $S_n$ and $S_a$ for the joint optimization in Eq. 2. Unless otherwise specified, AnomalyCLIP utilizes $V$-$V$ self-attention due to its superior overall performance. The performance of different self-attention mechanisms is analyzed in Sec. D. We also provide a detailed explanation of DPAM in Appendix C.
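A minimal sketch of the V-V variant of DPAM is given below: the attention weights of a frozen visual-encoder layer are computed from value-value similarity instead of query-key similarity, reusing the layer's own projection weights. The surrounding interface (weight tensors passed in explicitly, head count, token count) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def dpam_vv_attention(x, qkv_weight, qkv_bias, proj_weight, proj_bias, num_heads=16):
    """x: (B, N, D) token features of one visual-encoder layer.
    Computes attention weights from V-V similarity instead of Q-K, which yields
    a more diagonally prominent attention map."""
    B, N, D = x.shape
    qkv = F.linear(x, qkv_weight, qkv_bias)               # (B, N, 3D)
    q, k, v = qkv.chunk(3, dim=-1)
    head_dim = D // num_heads

    def split_heads(t):
        return t.reshape(B, N, num_heads, head_dim).transpose(1, 2)  # (B, h, N, d)

    v = split_heads(v)
    attn = (v @ v.transpose(-2, -1)) * head_dim ** -0.5   # V-V similarity
    attn = attn.softmax(dim=-1)
    out = (attn @ v).transpose(1, 2).reshape(B, N, D)
    return F.linear(out, proj_weight, proj_bias)

# Toy usage with random (frozen) projection weights; 577 tokens and D=1024
# roughly correspond to a ViT-L/14 visual encoder at 336px input.
B, N, D = 1, 577, 1024
x = torch.randn(B, N, D)
qkv_w, qkv_b = torch.randn(3 * D, D), torch.randn(3 * D)
proj_w, proj_b = torch.randn(D, D), torch.randn(D)
out = dpam_vv_attention(x, qkv_w, qkv_b, proj_w, proj_b)
```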

Training and Inference

During training, AnomalyCLIP minimizes the loss in Eq. 2 using an auxiliary AD-related dataset. For inference, given a test image $x_i$, we use the similarity score $P(g_a, f_i)$ as the image-level anomaly score, which leans toward one when the abnormality textual embedding $g_a$ is aligned with the global visual embedding $f_i$. For pixel-wise predictions, we merge the segmentation maps $S_n$ and $S_a$, followed by interpolation and smoothing operations. Formally, our anomaly score map $M \in \mathbb{R}^{H\times W}$ is computed as $M = G_\sigma\big(\mathrm{Bilinear\_interpolation}(\frac{1}{2}(I - S_n + S_a))\big)$, where $G_\sigma$ represents a Gaussian filter and $\sigma$ controls the extent of smoothing.
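A sketch of this inference procedure, assuming the embeddings and patch-level maps are already available: the image-level score is $P(g_a, f_i)$, and the score map averages $(1 - S_n)$ and $S_a$ before bilinear upsampling and Gaussian smoothing. torchvision's GaussianBlur and the kernel-size rule below stand in for $G_\sigma$ and are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import GaussianBlur

def anomaly_outputs(text_embs, global_emb, S_n, S_a, out_size=(336, 336),
                    sigma=4.0, tau=0.07):
    """text_embs: stacked (g_n, g_a) of shape (2, D); global_emb: (D,);
    S_n, S_a: (H, W) patch-level maps. Returns image score and score map M."""
    # Image-level anomaly score: P(g_a, f_i) from Eq. 1.
    probs = (F.normalize(text_embs, dim=-1) @ F.normalize(global_emb, dim=-1)
             / tau).softmax(dim=0)
    image_score = probs[1]

    # Pixel-level map: M = G_sigma(upsample(0.5 * (1 - S_n + S_a))).
    m = 0.5 * (1.0 - S_n + S_a)
    m = F.interpolate(m[None, None], size=out_size, mode="bilinear",
                      align_corners=False)
    kernel = int(2 * round(4 * sigma) + 1)        # odd kernel covering ~4 sigma
    m = GaussianBlur(kernel_size=kernel, sigma=sigma)(m)
    return image_score, m[0, 0]

# Toy usage with random embeddings and maps.
D, H, W = 768, 24, 24
score, score_map = anomaly_outputs(torch.randn(2, D), torch.randn(D),
                                   torch.rand(H, W), torch.rand(H, W))
```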

4 Experiments

4.1 Experiment setup

Datasets and Evaluation Metrics

We conducted extensive experiments on 17 publicly available datasets, covering various industrial inspection scenarios and medical imaging domains (including photography, endoscopy, and radiology) to evaluate the performance of AnomalyCLIP. In industrial inspection, we consider MVTec AD (Bergmann et al., 2019), VisA (Zou et al., 2022), MPDD (Jezek et al., 2021), BTAD (Mishra et al., 2021), SDD (Tabernik et al., 2020), DAGM (Wieler & Hahn, 2007), and DTD-Synthetic (Aota et al., 2023). In medical imaging, we consider skin cancer detection dataset ISIC (Gutman et al., 2016), colon polyp detection datasets CVC-ClinicDB (Bernal et al., 2015), CVC-ColonDB (Tajbakhsh et al., 2015), Kvasir (Jha et al., 2020), and Endo (Hicks et al., 2021), thyroid nodule detection dataset TN3k (Gong et al., 2021), brain tumor detection datasets HeadCT (Salehi et al., 2021), BrainMRI (Salehi et al., 2021), Br35H (Hamada., 2020), and COVID-19 detection dataset COVID-19 (Chowdhury et al., 2020; Rahman et al., 2021). The SOTA competing methods include CLIP (Radford et al., 2021), CLIP-AC (Radford et al., 2021), WinCLIP (Jeong et al., 2023), VAND (Chen et al., 2023), and CoOp (Zhou et al., 2022b). We provide more details about the methods and data pre-processing in Appendix A. The anomaly detection performance is evaluated using the Area Under the Receiver Operating Characteristic Curve (AUROC). Additionally, average precision (AP) for anomaly detection and AUPRO (Bergmann et al., 2020) for anomaly segmentation are also used to provide more in-depth analysis of the performance.
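For the image-level metrics, AUROC and AP can be computed directly with scikit-learn, as in the small sketch below (the scores and labels are toy values); AUPRO requires a per-region overlap computation and is not shown here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# y_true: 1 for anomalous images, 0 for normal; scores: image-level anomaly scores.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.8, 0.7, 0.2, 0.9])

auroc = roc_auc_score(y_true, scores)
ap = average_precision_score(y_true, scores)
print(f"AUROC: {auroc:.3f}, AP: {ap:.3f}")
```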

Implementation details

We use the publicly available CLIP model (VIT-L/14@336px, https://github.com/mlfoundations/open_clip) as our backbone. Model parameters of CLIP are all frozen. The length of the learnable word embeddings $E$ is set to 12. The learnable token embeddings are attached to the first 9 layers of the text encoder to refine the textual space, and their length in each layer is set to 4. We fine-tune AnomalyCLIP using the test data of MVTec AD and evaluate the ZSAD performance on the other datasets. As for MVTec AD, we fine-tune AnomalyCLIP on the test data of VisA. We report dataset-level results, which are averaged across their respective sub-datasets. All experiments are conducted in PyTorch 2.0.0 with a single NVIDIA RTX 3090 24GB GPU. More details can be found in Appendix A.

Table 1: ZSAD performance comparison on industrial domain. The best performance is highlighted in red, and the second-best is highlighted in blue. † denotes results taken from original papers.
Task                      Category       Dataset         |C|  CLIP          CLIP-AC       WinCLIP        VAND           CoOp          AnomalyCLIP
Image-level (AUROC, AP)   Obj & texture  MVTec AD        15   (74.1, 87.6)  (71.5, 86.4)  (91.8, 96.5)†  (86.1, 93.5)†  (88.8, 94.8)  (91.5, 96.2)
                          Obj            VisA            12   (66.4, 71.5)  (65.0, 70.1)  (78.1, 81.2)†  (78.0, 81.4)†  (62.8, 68.1)  (82.1, 85.4)
                                         MPDD            6    (54.3, 65.4)  (56.2, 66.0)  (63.6, 69.9)   (73.0, 80.2)   (55.1, 64.2)  (77.0, 82.0)
                                         BTAD            3    (34.5, 52.5)  (51.0, 62.1)  (68.2, 70.9)   (73.6, 68.6)   (66.8, 77.4)  (88.3, 87.3)
                                         SDD             1    (65.7, 45.2)  (65.2, 45.7)  (84.3, 77.4)   (79.8, 71.4)   (74.9, 65.1)  (84.7, 80.0)
                          Texture        DAGM            10   (79.6, 59.0)  (82.5, 63.7)  (91.8, 79.5)   (94.4, 83.8)   (87.5, 74.6)  (97.5, 92.3)
                                         DTD-Synthetic   12   (71.6, 85.7)  (66.8, 83.2)  (93.2, 92.6)   (86.4, 95.0)   (-, -)        (93.5, 97.0)
Pixel-level (AUROC, PRO)  Obj & texture  MVTec AD        15   (38.4, 11.3)  (38.2, 11.6)  (85.1, 64.6)†  (87.6, 44.0)†  (33.3, 6.7)   (91.1, 81.4)
                          Obj            VisA            12   (46.6, 14.8)  (47.8, 17.3)  (79.6, 56.8)†  (94.2, 86.8)†  (24.2, 3.8)   (95.5, 87.0)
                                         MPDD            6    (62.1, 33.0)  (58.7, 29.1)  (76.4, 48.9)   (94.1, 83.2)   (15.4, 2.3)   (96.5, 88.7)
                                         BTAD            3    (30.6, 4.4)   (32.8, 8.3)   (72.7, 27.3)   (60.8, 25.0)   (28.6, 3.8)   (94.2, 74.8)
                                         SDD             1    (39.0, 8.9)   (32.5, 5.8)   (68.8, 24.2)   (79.8, 65.1)   (28.9, 7.1)   (90.6, 67.8)
                          Texture        DAGM            10   (28.2, 2.9)   (32.7, 4.8)   (87.6, 65.7)   (82.4, 66.2)   (17.5, 2.1)   (95.6, 91.0)
                                         DTD-Synthetic   12   (33.9, 12.5)  (23.7, 5.5)   (83.9, 57.8)   (95.3, 86.9)   (-, -)        (97.9, 92.3)
Table 2: ZSAD performance comparison on medical domain. The best performance is highlighted in red, and the second-best is highlighted in blue. Note that the image-level medical AD datasets do not contain segmentation ground truth, so the pixel-level medical AD datasets are different from the image-level datasets.
Task                      Category  Dataset        |C|  CLIP          CLIP-AC       WinCLIP       VAND          CoOp          AnomalyCLIP
Image-level (AUROC, AP)   Brain     HeadCT         1    (56.5, 58.4)  (60.0, 60.7)  (81.8, 80.2)  (89.1, 89.4)  (78.4, 78.8)  (93.4, 91.6)
                                    BrainMRI       1    (73.9, 81.7)  (80.6, 86.4)  (86.6, 91.5)  (89.3, 90.9)  (61.3, 44.9)  (90.3, 92.2)
                                    Br35H          1    (78.4, 78.8)  (82.7, 81.3)  (80.5, 82.2)  (93.1, 92.9)  (86.0, 87.5)  (94.6, 94.7)
                          Chest     COVID-19       1    (73.7, 42.4)  (75.0, 45.9)  (66.4, 42.9)  (15.5, 8.5)   (25.3, 9.2)   (80.1, 58.7)
Pixel-level (AUROC, PRO)  Skin      ISIC           1    (33.1, 5.8)   (36.0, 7.7)   (83.3, 55.1)  (89.4, 77.2)  (51.7, 15.9)  (89.7, 78.4)
                          Colon     CVC-ColonDB    1    (49.5, 15.8)  (49.5, 11.5)  (70.3, 32.5)  (78.4, 64.6)  (40.5, 2.6)   (81.9, 71.3)
                                    CVC-ClinicDB   1    (47.5, 18.9)  (48.5, 12.6)  (51.2, 13.8)  (80.5, 60.7)  (34.8, 2.4)   (82.9, 67.8)
                                    Kvasir         1    (44.6, 17.7)  (45.0, 16.8)  (69.7, 24.5)  (75.0, 36.2)  (44.1, 3.5)   (78.9, 45.6)
                                    Endo           1    (45.2, 15.9)  (46.6, 12.6)  (68.2, 28.3)  (81.9, 54.9)  (40.6, 3.9)   (84.1, 63.6)
                          Thyroid   TN3K           1    (42.3, 7.3)   (35.6, 5.2)   (70.7, 39.8)  (73.6, 37.8)  (34.0, 9.5)   (81.5, 50.4)

4.2 Main results

ZSAD performance on diverse industrial inspection domains

Table 1 shows the ZSAD results of AnomalyCLIP with five competing methods over seven industrial defect datasets of very different foreground objects, background, and/or anomaly types. AnomalyCLIP achieves superior ZSAD performance across the datasets, substantially outperforming the other five methods in most datasets. The weak performance of CLIP and CLIP-AC can be attributed to CLIP’s original pre-training, which focuses on aligning object semantics rather than anomaly semantics. By using manually defined text prompts, WinCLIP and VAND achieve better results. Alternatively, CoOp adopts learnable prompts to learn the global anomaly semantics. However, those prompts focus on the global feature and ignore the fine-grained local anomaly semantics, leading to their poor performance on anomaly segmentation. To adapt CLIP to ZSAD, AnomalyCLIP learns object-agnostic text prompts to focus on learning the generic abnormality/normality using global and local context optimization, enabling the modeling of both global and local abnormality/normality. Our resulting prompts can also generalize to different datasets from various domains. To provide more intuitive results, we visualize the anomaly segmentation results of AnomalyCLIP, VAND, and WinCLIP across different datasets in Fig. 4. Compared to VAND and WinCLIP, AnomalyCLIP can perform much more accurate segmentation for the defects from different industrial inspection domains.

Generalization from defect datasets to diverse medical domain datasets

To evaluate the generalization ability of our model, we further examine the ZSAD performance of AnomalyCLIP on 10 medical image datasets of different organs across different imaging devices. Table 2 shows the results, where learning-based methods, including AnomalyCLIP, VAND and CoOp, are all tuned using MVTec AD data. It is remarkable that methods like AnomalyCLIP and VAND obtain promising ZSAD performance on various medical image datasets, even though they are tuned using a defect detection dataset. Among all these methods, AnomalyCLIP is the best performer due to its strong generalization brought by object-agnostic prompt learning. As illustrated in Fig. 4, AnomalyCLIP can accurately detect various types of anomalies in diverse medical images, such as skin cancer regions in photography images, colon polyps in endoscopy images, thyroid nodules in ultrasound images, and brain tumors in MRI images, having substantially better performance in locating the abnormal lesion/tumor regions than the other two methods WinCLIP and VAND. This again demonstrates the superior ZSAD performance of AnomalyCLIP in datasets of highly diverse object semantics from medical imaging domains.

Can we obtain better ZSAD performance if fine-tuned using medical image data?

Compared to its promising performance on industrial datasets, AnomalyCLIP achieves relatively low performance on medical datasets. This is partly due to the auxiliary data used in our prompt learning. We therefore examine whether the ZSAD performance on medical images can be improved if the prompt learning is performed on an auxiliary medical dataset. One challenge is that there are no large 2D medical datasets available that include both image-level and pixel-level annotations for such training. To address this issue, we create such a dataset based on ColonDB (see Appendix B for more details), then optimize the prompts in AnomalyCLIP and VAND using this dataset and evaluate their performance on the medical image datasets. The results are presented in Table 4. AnomalyCLIP and VAND largely improve their detection and segmentation performance compared to their counterparts fine-tuned on MVTec AD, especially on the colon polyp-related datasets CVC-ClinicDB, Kvasir, and Endo (note that these datasets are all from different domains than the fine-tuning ColonDB dataset). AnomalyCLIP also exhibits improved performance in detecting brain tumors on HeadCT, BrainMRI, and Br35H, which we attribute to the visual similarities between colon polyps and brain tumors. Conversely, the appearance of colon polyps differs significantly from that of diseased skin or chest, leading to performance degradation on ISIC and COVID-19. Overall, compared to VAND, AnomalyCLIP performs consistently better across all anomaly detection and segmentation datasets.

Figure 4: Segmentation visualization.

Table 4: ZSAD performance on medical images when fine-tuned on medical image datasets.

Task            Category   Datasets       VAND           AnomalyCLIP
Classification  Brain      HeadCT         (89.1, 89.4)   (93.5, 95.1)
                           BrainMRI       (89.3, 90.9)   (95.5, 97.2)
                           Br35H          (93.1, 92.9)   (97.9, 98.0)
                Chest      COVID-19       (15.5, 8.5)    (70.9, 33.7)
Segmentation    Skin       ISIC           (58.8, 31.2)   (83.0, 63.8)
                Colon      CVC-ClinicDB   (89.4, 82.3)   (92.4, 82.9)
                           Kvasir         (87.6, 39.3)   (92.5, 61.5)
                           Endo           (88.5, 81.9)   (93.2, 84.8)
                Thyroid    TN3K           (60.5, 16.8)   (79.2, 47.0)

Figure 5: Performance gain of using object-agnostic prompts compared to object-aware prompts.
Object-agnostic vs. object-aware prompt learning

To study the effectiveness of object-agnostic prompt learning in AnomalyCLIP, we compare AnomalyCLIP with a variant that uses an object-aware prompt template. The performance gain of AnomalyCLIP over this object-aware prompt learning variant is shown in Fig. 5, where positive values indicate that our object-agnostic prompt templates are better than the object-aware ones. Our object-agnostic prompt learning clearly performs much better than, or on par with, the object-aware version in both image-level and pixel-level anomaly detection. This indicates that object-agnostic prompts help better learn the generic abnormality and normality in images, as the object semantics are often not helpful, and can even become noisy features, for the ZSAD task.
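To make the distinction concrete, the following minimal sketch contrasts the two template styles; the number of learnable tokens and the placeholder words ("object" and "damaged") are illustrative assumptions rather than the exact templates and initialization used in our experiments.

```python
# Minimal sketch (illustrative placeholder wording, not the exact templates).
# Object-aware templates condition on the class name of the inspected object,
# while object-agnostic templates replace it with a generic word such as "object".
E = 12  # number of learnable context tokens (set to 12 in our implementation)

def object_aware_templates(cls_name):
    # e.g., cls_name = "carpet" or "transistor"
    normal = [f"[V{i}]" for i in range(1, E + 1)] + [cls_name]
    abnormal = [f"[W{i}]" for i in range(1, E + 1)] + ["damaged", cls_name]
    return normal, abnormal

def object_agnostic_templates():
    # The same two templates are shared by every dataset and object category.
    normal = [f"[V{i}]" for i in range(1, E + 1)] + ["object"]
    abnormal = [f"[W{i}]" for i in range(1, E + 1)] + ["damaged", "object"]
    return normal, abnormal
```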

4.3 Ablation study

Module ablation

We first validate the effectiveness of the different high-level modules of AnomalyCLIP, including DPAM ($T_1$), object-agnostic text prompts ($T_2$), added learnable tokens in the text encoder ($T_3$), and multi-layer visual encoder features ($T_4$). As shown in the module ablation table, each module contributes to the remarkable performance of AnomalyCLIP. DPAM ($T_1$) improves the segmentation performance by enhancing local visual semantics. Object-agnostic text prompts focus on the abnormality/normality within images instead of the object semantics, allowing AnomalyCLIP to detect anomalies in diverse unseen objects; introducing them ($T_2$) therefore significantly improves AnomalyCLIP. Furthermore, text prompt tuning ($T_3$) also brings performance improvement via the refinement of the original textual space. Finally, $T_4$ integrates multi-layer visual semantics to provide more visual details, which further promotes the ZSAD performance.

Table: Module ablation.

                   MVTec AD                        VisA
Module             Pixel-level    Image-level      Pixel-level    Image-level
Base               (46.8, 15.4)   (66.3, 83.3)     (47.9, 17.1)   (54.4, 61.7)
+$T_1$             (68.4, 47.4)   (66.3, 83.3)     (54.8, 32.7)   (54.4, 61.7)
+$T_2$             (89.5, 81.2)   (90.8, 96.0)     (95.0, 85.3)   (81.7, 85.2)
+$T_3$             (90.0, 81.1)   (91.0, 96.1)     (95.2, 86.0)   (81.9, 85.2)
+$T_4$             (91.1, 81.4)   (91.5, 96.2)     (95.5, 87.0)   (82.1, 85.4)
Table: Context optimization ablation.

                   MVTec AD                        VisA
Local   Global     Pixel-level    Image-level      Pixel-level    Image-level
  ✗       ✗        (71.7, 57.7)   (68.8, 85.8)     (74.7, 62.1)   (61.1, 69.1)
  ✗       ✓        (80.3, 77.8)   (89.9, 95.4)     (86.6, 78.1)   (82.2, 84.9)
  ✓       ✗        (91.0, 80.4)   (89.9, 96.0)     (95.2, 86.5)   (79.5, 83.2)
  ✓       ✓        (91.1, 81.4)   (91.5, 96.2)     (95.5, 87.0)   (82.1, 85.4)
Figure 6: DPAM component ablation.
Context optimization

Next we examine the key modules in more detail. The object-agnostic prompt learning is the most effective module, and it is driven by our glocal context optimization, so we consider the two optimization terms in Eq. 2, the local and global losses, separately. The results are shown in the context optimization ablation table. Both global and local context optimization contribute to the superiority of AnomalyCLIP. Global context optimization helps to capture global anomaly semantics, thus enabling more accurate image-level detection. Compared to global context optimization, local context optimization incorporates local anomaly semantics, which improves pixel-level performance and complements image-level performance. By combining these two optimization strategies, AnomalyCLIP generally achieves better performance than using either of them individually.
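As a rough illustration of how the two optimization terms can be combined, the sketch below pairs an image-level cross-entropy term (global) with pixel-level focal and Dice terms (local); the exact formulation follows Eq. 2, while the loss implementations and the weighting shown here are simplified assumptions.

```python
import torch
import torch.nn.functional as F

def focal_loss(prob, target, gamma=2.0):
    # prob: (B, H, W) predicted anomaly probability; target: (B, H, W) binary mask.
    pt = torch.where(target > 0.5, prob, 1.0 - prob)
    return (-(1.0 - pt) ** gamma * torch.log(pt.clamp(min=1e-6))).mean()

def dice_loss(prob, target, eps=1.0):
    inter = (prob * target).sum(dim=(1, 2))
    union = prob.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def glocal_loss(image_logits, image_labels, anomaly_map, pixel_masks, lam=1.0):
    # Global term: cross-entropy between image-level (normal, abnormal) logits and
    # integer image labels (0 = normal, 1 = abnormal).
    global_term = F.cross_entropy(image_logits, image_labels)
    # Local term: focal + Dice losses between the pixel-level anomaly map and masks.
    local_term = focal_loss(anomaly_map, pixel_masks) + dice_loss(anomaly_map, pixel_masks)
    return global_term + lam * local_term
```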

DPAM strategy ablation

AnomalyCLIP uses $V$-$V$ self-attention by default. Here we study the effectiveness of two other DPAM strategies, $Q$-$Q$ and $K$-$K$ self-attention, resulting in two AnomalyCLIP variants, namely AnomalyCLIP$_{qq}$ and AnomalyCLIP$_{kk}$. The comparison results are presented in Fig. 6. AnomalyCLIP$_{qq}$ achieves similar segmentation capabilities to AnomalyCLIP but suffers from degradation in detecting image-level anomalies. Conversely, while AnomalyCLIP$_{kk}$ performs well in anomaly classification, its segmentation performance is less effective than that of AnomalyCLIP and AnomalyCLIP$_{qq}$. The $V$-$V$ self-attention is therefore generally recommended in AnomalyCLIP. A detailed analysis of DPAM can be found in Appendix C.

5 Related Work

Zero-shot anomaly detection

ZSAD relies on the model's strong transferability to handle unseen anomalies (Aota et al., 2023). CLIP-AD (Liznerski et al., 2022) and ZOC (Esmaeilpour et al., 2022) are early studies in utilizing CLIP for ZSAD, but they mainly focus on the anomaly classification task. ACR (Li et al., 2023) requires tuning on target-domain-relevant auxiliary data for ZSAD on different target datasets, while AnomalyCLIP can be applied to different datasets after being trained on one general dataset. The very recent WinCLIP (Jeong et al., 2023) is a seminal work in this line that leverages CLIP for zero-shot anomaly classification and segmentation. It uses a large number of hand-crafted text prompts and requires multiple forward passes over image patches for anomaly segmentation. To tackle this inefficiency, VAND (Chen et al., 2023) introduces learnable linear projection techniques to enhance the modeling of local visual semantics. However, these approaches suffer from insufficiently generalized textual prompt embeddings, which degrades their performance in identifying anomalies associated with various unseen object semantics. AnomalyCLIP utilizes only two object-agnostic learnable text prompts to model the generic abnormality and normality, and it obtains segmentation results with just a single forward pass. AnomalyGPT (Gu et al., 2023) is a concurrent work utilizing foundation models for AD, but it is designed for unsupervised/few-shot AD with manually crafted prompts.

Prompt learning

Rather than resorting to full network fine-tuning, prompt learning has emerged as a parameter-efficient alternative for achieving satisfactory results (Sun et al., 2022; Khattak et al., 2023; Kim et al., 2023; Zhou et al., 2022a). CoOp (Zhou et al., 2022b) introduces learnable text prompts for few-shot classification. On this basis, DenseCLIP (Rao et al., 2022) extends prompt learning to dense prediction tasks with an extra image decoder. In contrast, AnomalyCLIP proposes object-agnostic prompt learning for anomaly detection, blocking out the potential adverse impact of diverse object semantics on the task. Benefiting from the glocal context optimization, AnomalyCLIP can capture local anomaly semantics, so we can simultaneously perform classification and segmentation without an additional decoder network as in Rao et al. (2022).

6 Conclusion

In this paper, we tackle a challenging yet significant area of anomaly detection, ZSAD, in which there is no available data in the target dataset for training. We propose AnomalyCLIP to improve the weak generalization performance of CLIP for ZSAD. We introduce object-agnostic prompt learning to learn generic abnormality/normality text prompts for generalized ZSAD on image datasets of diverse foreground objects. Further, to incorporate global and local anomaly semantics into AnomalyCLIP, we devise a joint global and local context optimization to optimize the object-agnostic text prompts. Extensive experimental results on 17 public datasets demonstrate that AnomalyCLIP achieves superior ZSAD performance.

Acknowledgments

This work was supported by NSFC U1909207, NSFC 62088101 Autonomous Intelligent Unmanned Systems, and the Singapore Ministry of Education Academic Research Fund Tier 1 grant (21SISSMU031).

Reproducibility Statement

To ensure the reproducibility and completeness of this paper, we have included an Appendix consisting of five main sections. In Appendix A, we provide more implementation details of AnomalyCLIP, as well as the reproduction details of the baseline methods. Appendix B provides key statistics about the datasets used in our experiments and the construction of the auxiliary medical dataset for prompt tuning. Appendix D supplements the main paper with additional results and ablations. Further visualizations of similarity scores and maps are detailed in Appendix E. Additionally, for datasets that contain a number of data subsets, the main paper reports only the average performance; their fine-grained detection results are presented in Appendix F. Our code will be made publicly accessible once the paper is accepted.

References

  • Aota et al. (2023) Toshimichi Aota, Lloyd Teh Tzer Tong, and Takayuki Okatani. Zero-shot versus many-shot: Unsupervised texture anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 5564–5572, 2023.
  • Bergmann et al. (2019) Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9592–9600, 2019.
  • Bergmann et al. (2020) Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4183–4192, 2020.
  • Bernal et al. (2015) Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño. Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized medical imaging and graphics, 43:99–111, 2015.
  • Cao et al. (2023) Tri Cao, Jiawen Zhu, and Guansong Pang. Anomaly detection under distribution shift. arXiv preprint arXiv:2303.13845, 2023.
  • Chen et al. (2023) Xuhai Chen, Yue Han, and Jiangning Zhang. A zero-/few-shot anomaly classification and segmentation method for cvpr 2023 vand workshop challenge tracks 1&2: 1st place on zero-shot ad and 4th place on few-shot ad. arXiv preprint arXiv:2305.17382, 2023.
  • Chen et al. (2022) Yuanhong Chen, Yu Tian, Guansong Pang, and Gustavo Carneiro. Deep one-class classification via interpolated gaussian descriptor. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 383–392, 2022.
  • Chowdhury et al. (2020) Muhammad E. H. Chowdhury, Tawsifur Rahman, Amith Khandakar, Rashid Mazhar, Muhammad Abdul Kadir, Zaid Bin Mahbub, Khandakar Reajul Islam, Muhammad Salman Khan, Atif Iqbal, Nasser Al Emadi, Mamun Bin Ibne Reaz, and Mohammad Tariqul Islam. Can ai help in screening viral and covid-19 pneumonia? IEEE Access, 8:132665–132676, 2020. doi: 10.1109/ACCESS.2020.3010287.
  • Deng & Li (2022) Hanqiu Deng and Xingyu Li. Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9737–9746, 2022.
  • Ding et al. (2022) Choubo Ding, Guansong Pang, and Chunhua Shen. Catching both gray and black swans: Open-set supervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7388–7398, 2022.
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Esmaeilpour et al. (2022) Sepideh Esmaeilpour, Bing Liu, Eric Robertson, and Lei Shu. Zero-shot out-of-distribution detection based on the pre-trained model clip. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pp. 6568–6576, 2022.
  • Fernando et al. (2021) Tharindu Fernando, Harshala Gammulle, Simon Denman, Sridha Sridharan, and Clinton Fookes. Deep learning for medical anomaly detection–a survey. ACM Computing Surveys (CSUR), 54(7):1–37, 2021.
  • Gong et al. (2021) Haifan Gong, Guanqi Chen, Ranran Wang, Xiang Xie, Mingzhi Mao, Yizhou Yu, Fei Chen, and Guanbin Li. Multi-task learning for thyroid nodule segmentation with thyroid region prior. In 2021 IEEE 18th international symposium on biomedical imaging (ISBI), pp. 257–261. IEEE, 2021.
  • Gu et al. (2023) Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. Anomalygpt: Detecting industrial anomalies using large vision-language models, 2023.
  • Gutman et al. (2016) David Gutman, Noel C. F. Codella, Emre Celebi, Brian Helba, Michael Marchetti, Nabin Mishra, and Allan Halpern. Skin lesion analysis toward melanoma detection: A challenge at the international symposium on biomedical imaging (isbi) 2016, hosted by the international skin imaging collaboration (isic), 2016.
  • Hamada (2020) A. Hamada. Br35h: Brain tumor detection 2020. Online. Available: https://www.kaggle.com/datasets/ahmedhamada0/brain-tumor-detection, 2020.
  • Hicks et al. (2021) Steven A Hicks, Debesh Jha, Vajira Thambawita, Pål Halvorsen, Hugo L Hammer, and Michael A Riegler. The endotect 2020 challenge: evaluation and comparison of classification, segmentation and inference time for endoscopy. In Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10-15, 2021, Proceedings, Part VIII, pp. 263–274. Springer, 2021.
  • Huang et al. (2022) Chaoqin Huang, Haoyan Guan, Aofan Jiang, Ya Zhang, Michael Spratling, and Yan-Feng Wang. Registration based few-shot anomaly detection. In European Conference on Computer Vision, pp. 303–319. Springer, 2022.
  • Jeong et al. (2023) Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. Winclip: Zero-/few-shot anomaly classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19606–19616, 2023.
  • Jezek et al. (2021) Stepan Jezek, Martin Jonak, Radim Burget, Pavel Dvorak, and Milos Skotak. Deep learning-based defect detection of metal parts: evaluating current methods in complex conditions. In 2021 13th International congress on ultra modern telecommunications and control systems and workshops (ICUMT), pp. 66–71. IEEE, 2021.
  • Jha et al. (2020) Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, and Håvard D Johansen. Kvasir-seg: A segmented polyp dataset. In MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II 26, pp. 451–462. Springer, 2020.
  • Jia et al. (2022) Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pp. 709–727. Springer, 2022.
  • Khattak et al. (2023) Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19113–19122, 2023.
  • Kim et al. (2023) Kwanyoung Kim, Yujin Oh, and Jong Chul Ye. Zegot: Zero-shot segmentation through optimal transport of text prompts. arXiv preprint arXiv:2301.12171, 2023.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kirillov et al. (2023) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  • Li et al. (2023) Aodong Li, Chen Qiu, Marius Kloft, Padhraic Smyth, Maja Rudolph, and Stephan Mandt. Zero-shot anomaly detection via batch normalization. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Li et al. (2019) Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei Wu, and Jiwei Li. Dice loss for data-imbalanced nlp tasks. arXiv preprint arXiv:1911.02855, 2019.
  • Lin et al. (2017) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988, 2017.
  • Liu et al. (2023) Jie Liu, Yixiao Zhang, Jie-Neng Chen, Junfei Xiao, Yongyi Lu, Bennett A Landman, Yixuan Yuan, Alan Yuille, Yucheng Tang, and Zongwei Zhou. Clip-driven universal model for organ segmentation and tumor detection. arXiv preprint arXiv:2301.00785, 2023.
  • Liznerski et al. (2020) Philipp Liznerski, Lukas Ruff, Robert A Vandermeulen, Billy Joe Franks, Marius Kloft, and Klaus-Robert Müller. Explainable deep one-class classification. arXiv preprint arXiv:2007.01760, 2020.
  • Liznerski et al. (2022) Philipp Liznerski, Lukas Ruff, Robert A Vandermeulen, Billy Joe Franks, Klaus-Robert Müller, and Marius Kloft. Exposing outlier exposure: What can be learned from few, one, and zero outlier images. arXiv preprint arXiv:2205.11474, 2022.
  • Mishra et al. (2021) Pankaj Mishra, Riccardo Verk, Daniele Fornasier, Claudio Piciarelli, and Gian Luca Foresti. Vt-adl: A vision transformer network for image anomaly detection and localization. In 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), pp. 01–06. IEEE, 2021.
  • Mou et al. (2022) Shancong Mou, Xiaoyi Gu, Meng Cao, Haoping Bai, Ping Huang, Jiulong Shan, and Jianjun Shi. Rgi: robust gan-inversion for mask-free image inpainting and unsupervised pixel-wise anomaly detection. In The Eleventh International Conference on Learning Representations, 2022.
  • Pang et al. (2021a) Guansong Pang, Choubo Ding, Chunhua Shen, and Anton van den Hengel. Explainable deep few-shot anomaly detection with deviation networks. arXiv preprint arXiv:2108.00462, 2021a.
  • Pang et al. (2021b) Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. Deep learning for anomaly detection: A review. ACM computing surveys (CSUR), 54(2):1–38, 2021b.
  • Qin et al. (2022) Ziyuan Qin, Huahui Yi, Qicheng Lao, and Kang Li. Medical image understanding with pretrained vision language models: A comprehensive study. arXiv preprint arXiv:2209.15517, 2022.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
  • Rahman et al. (2021) Tawsifur Rahman, Amith Khandakar, Yazan Qiblawey, Anas Tahir, Serkan Kiranyaz, Saad Bin Abul Kashem, Mohammad Tariqul Islam, Somaya Al Maadeed, Susu M Zughaier, Muhammad Salman Khan, et al. Exploring the effect of image enhancement techniques on covid-19 detection using chest x-ray images. Computers in biology and medicine, 132:104319, 2021.
  • Rao et al. (2022) Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18082–18091, 2022.
  • Reiss & Hoshen (2023) Tal Reiss and Yedid Hoshen. Mean-shifted contrastive loss for anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 2155–2162, 2023.
  • Roth et al. (2022) Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14318–14328, 2022.
  • Ruff et al. (2021) Lukas Ruff, Jacob R Kauffmann, Robert A Vandermeulen, Grégoire Montavon, Wojciech Samek, Marius Kloft, Thomas G Dietterich, and Klaus-Robert Müller. A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE, 109(5):756–795, 2021.
  • Sain et al. (2023) Aneeshan Sain, Ayan Kumar Bhunia, Pinaki Nath Chowdhury, Subhadeep Koley, Tao Xiang, and Yi-Zhe Song. Clip for all things zero-shot sketch-based image retrieval, fine-grained or not. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2765–2775, 2023.
  • Salehi et al. (2021) Mohammadreza Salehi, Niousha Sadjadi, Soroosh Baselizadeh, Mohammad H Rohban, and Hamid R Rabiee. Multiresolution knowledge distillation for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14902–14912, 2021.
  • Sun et al. (2022) Ximeng Sun, Ping Hu, and Kate Saenko. Dualcoop: Fast adaptation to multi-label recognition with limited annotations. Advances in Neural Information Processing Systems, 35:30569–30582, 2022.
  • Tabernik et al. (2020) Domen Tabernik, Samo Šela, Jure Skvarč, and Danijel Skočaj. Segmentation-based deep-learning approach for surface-defect detection. Journal of Intelligent Manufacturing, 31(3):759–776, 2020.
  • Tajbakhsh et al. (2015) Nima Tajbakhsh, Suryakanth R Gurudu, and Jianming Liang. Automated polyp detection in colonoscopy videos using shape and context information. IEEE transactions on medical imaging, 35(2):630–644, 2015.
  • Tian et al. (2021) Yu Tian, Guansong Pang, Fengbei Liu, Yuanhong Chen, Seon Ho Shin, Johan W Verjans, Rajvinder Singh, and Gustavo Carneiro. Constrained contrastive distribution learning for unsupervised anomaly detection and localisation in medical images. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part V 24, pp. 128–140. Springer, 2021.
  • Tian et al. (2023) Yu Tian, Fengbei Liu, Guansong Pang, Yuanhong Chen, Yuyuan Liu, Johan W Verjans, Rajvinder Singh, and Gustavo Carneiro. Self-supervised pseudo multi-class pre-training for unsupervised anomaly detection and segmentation in medical images. Medical Image Analysis, pp. 102930, 2023.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wieler & Hahn (2007) Matthias Wieler and Tobias Hahn. Weakly supervised learning for industrial optical inspection. In DAGM symposium, volume 6, 2007.
  • Wu et al. (2023) Size Wu, Wenwei Zhang, Sheng Jin, Wentao Liu, and Chen Change Loy. Aligning bag of regions for open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15254–15264, 2023.
  • Xie et al. (2023) Guoyang Xie, Jingbao Wang, Jiaqi Liu, Feng Zheng, and Yaochu Jin. Pushing the limits of fewshot anomaly detection in industry vision: Graphcore. arXiv preprint arXiv:2301.12082, 2023.
  • You et al. (2022) Zhiyuan You, Lei Cui, Yujun Shen, Kai Yang, Xin Lu, Yu Zheng, and Xinyi Le. A unified model for multi-class anomaly detection. Advances in Neural Information Processing Systems, 35:4571–4584, 2022.
  • Zhong et al. (2022) Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16793–16803, 2022.
  • Zhou et al. (2022a) Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16816–16825, 2022a.
  • Zhou et al. (2022b) Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022b.
  • Zou et al. (2022) Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In European Conference on Computer Vision, pp. 392–408. Springer, 2022.

Appendix A Implementation details and baselines

A.1 Implementation details

In this paper, we use the publicly available CLIP model (ViT-L/14@336px) as our backbone. All CLIP model parameters are frozen. The length of the learnable text prompts $M$ is set to 12. The trainable text tokens are attached to the first 9 layers of the text encoder, and each text token has a length of 4. We fine-tune AnomalyCLIP on the test data of MVTec AD and evaluate its performance on the other datasets; for MVTec AD itself, we fine-tune AnomalyCLIP on the test data of VisA. To provide adequate visual details, we extract local visual embeddings $\bm{v}_m^i$ from the 6-th, 12-th, 18-th, and 24-th layers of the visual encoder. Starting from the 6-th layer, we apply DPAM to the architecture of the visual encoder according to Sec. 3.3. Additionally, we set the balance weight $\lambda$ in our loss function to 1. The input images are resized to 518 with a batch size of 8, and we use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 to update the learnable parameters. During testing, we apply a Gaussian filter with $\sigma=4$ to smooth the anomaly score map. All experiments run for 15 epochs and are performed in PyTorch-2.0.0 with a single NVIDIA RTX 3090 24GB GPU.
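For reference, the settings above can be collected into a single configuration; the field names in the snippet are our own shorthand, and only the values restate what is described in this subsection.

```python
# Hypothetical configuration names; the values restate the settings described above.
ANOMALYCLIP_CONFIG = {
    "backbone": "ViT-L/14@336px",       # frozen CLIP model
    "prompt_length": 12,                 # learnable text prompt length M
    "text_tuning_layers": 9,             # learnable tokens attached to the first 9 text-encoder layers
    "text_token_length": 4,              # length of each added learnable token
    "feature_layers": [6, 12, 18, 24],   # layers providing local visual embeddings
    "dpam_start_layer": 6,               # DPAM applied from this visual-encoder layer onward
    "lambda_local": 1.0,                 # balance weight in the loss
    "image_size": 518,
    "batch_size": 8,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "epochs": 15,
    "gaussian_sigma": 4,                 # smoothing of the anomaly score map at test time
}
```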

A.2 Baselines

To demonstrate the superiority of AnomalyCLIP, we compare AnomalyCLIP with a broad set of SOTA baselines. Implementation and reproduction details are given as follows:

  • CLIP (Radford et al., 2021). CLIP is a powerful zero-shot classification method. To perform the anomaly detection task, we use two classes of text prompt templates, A photo of a normal [cls] and A photo of an anomalous [cls], where cls denotes the target class name. The anomaly score is computed according to Eq. 1 (see the sketch after this list). As for anomaly segmentation, we extend the above computation to the local visual embeddings to derive the segmentation maps.

  • CLIP-AC (Radford et al., 2021). Different from CLIP, CLIP-AC employs the ensemble of text prompt templates recommended for the ImageNet dataset (Radford et al., 2021). We average the generated textual embeddings of the normal and anomaly classes respectively, and compute the probability and segmentation in the same way as for CLIP.

  • WinCLIP (Jeong et al., 2023). WinCLIP is a SOTA ZSAD method. They design a large set of hand-crafted text prompt templates specific to anomaly detection and use a window scaling strategy to obtain anomaly segmentation. All parameters are kept the same as in their paper.


  • VAND (Chen et al., 2023). VAND is an improved version of WinCLIP. They first adjust the text prompt templates and then introduce learnable linear projections to improve local visual semantics to derive more accurate segmentation. All parameters are kept the same as in their paper.


  • CoOp (Zhou et al., 2022b). CoOp is a representative method for prompt learning. To adapt CoOp to ZSAD, we replace its learnable text prompt template $[V_1][V_2]\ldots[V_N][\texttt{cls}]$ with normality and abnormality text prompt templates, where $V_i$ denotes the learnable word embeddings. The normality text prompt template is defined as $[V_1][V_2]\ldots[V_N][\texttt{normal}][\texttt{cls}]$, and the abnormality one is defined as $[V_1][V_2]\ldots[V_N][\texttt{anomalous}][\texttt{cls}]$. Anomaly probabilities and segmentation are obtained in the same way as for AnomalyCLIP. All parameters are kept the same as in their paper.
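For the prompt-based baselines above, the sketch below illustrates the image-level anomaly scoring shared by CLIP and CLIP-AC, using the open-source clip package; it is a simplified example of the scoring in Eq. 1, not our full evaluation pipeline.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

def clip_anomaly_score(image_path, cls_name):
    """Image-level anomaly probability from the two hand-crafted prompts (cf. Eq. 1)."""
    prompts = [f"A photo of a normal {cls_name}",
               f"A photo of an anomalous {cls_name}"]
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_emb @ txt_emb.t()  # scaled cosine similarities
    probs = logits.softmax(dim=-1)          # (p_normal, p_anomalous)
    return probs[0, 1].item()               # anomaly score
# For segmentation, the same softmax is applied to each local (patch) embedding
# instead of the global image embedding.
```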

Appendix B Dataset

Table 3: Key statistics on the datasets used.
Dataset         Category        Modalities               |C|   Normal and anomalous samples   Usage
MVTec AD        Obj & texture   Photography              15    (467, 1258)                    Industrial defect detection
VisA            Obj             Photography              12    (962, 1200)                    Industrial defect detection
MPDD            Obj             Photography              6     (176, 282)                     Industrial defect detection
BTAD            Obj             Photography              3     (451, 290)                     Industrial defect detection
SDD             Obj             Photography              1     (181, 74)                      Industrial defect detection
DAGM            Texture         Photography              10    (6996, 1054)                   Industrial defect detection
DTD-Synthetic   Texture         Photography              12    (357, 947)                     Industrial defect detection
ISIC            Skin            Photography              1     (0, 379)                       Skin cancer detection
CVC-ClinicDB    Colon           Endoscopy                1     (0, 612)                       Colon polyp detection
CVC-ColonDB     Colon           Endoscopy                1     (0, 380)                       Colon polyp detection
Kvasir          Colon           Endoscopy                1     (0, 1000)                      Colon polyp detection
Endo            Colon           Endoscopy                1     (0, 200)                       Colon polyp detection
TN3K            Thyroid         Radiology (Ultrasound)   1     (0, 614)                       Thyroid nodule detection
HeadCT          Brain           Radiology (CT)           1     (100, 100)                     Brain tumor detection
BrainMRI        Brain           Radiology (MRI)          1     (98, 155)                      Brain tumor detection
Br35H           Brain           Radiology (MRI)          1     (1500, 1500)                   Brain tumor detection
COVID-19        Chest           Radiology (X-ray)        1     (1341, 219)                    COVID-19 detection
More dataset details

In this paper, we conduct extensive experiments on 17 public datasets spanning two domains and three modalities to validate the effectiveness of our method. Since we only use the test data of these datasets, we present the relevant information of their test sets in Table 3. We apply the default normalization of OpenCLIP to all datasets. After normalization, we resize the images to a resolution of (518, 518) to obtain an appropriate visual feature map resolution. It should be noted that the original images of SDD have a width of 500 and a height ranging from 1,240 to 1,270. Before processing, we vertically divide each original 500 × 1,250 image into two images and assign pixel-wise annotations to each image.
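A minimal sketch of this preprocessing is given below; the resizing and the vertical split of SDD images follow the description above, while the normalization constants are OpenCLIP's published defaults and the file handling is only illustrative.

```python
from PIL import Image
from torchvision import transforms

# OpenCLIP's default normalization constants (assumed from its published defaults).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

preprocess = transforms.Compose([
    transforms.Resize((518, 518)),
    transforms.ToTensor(),
    transforms.Normalize(CLIP_MEAN, CLIP_STD),
])

def split_sdd_image(path):
    """Vertically split a tall SDD image (about 500 x 1,250) into two halves."""
    img = Image.open(path)
    w, h = img.size
    top = img.crop((0, 0, w, h // 2))
    bottom = img.crop((0, h // 2, w, h))
    return top, bottom  # pixel-wise annotations are split the same way
```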

Fine-tuning medical dataset

We could not find publicly available 2D medical AD datasets that include both category labels and segmentation ground truths. To fill this gap, we create such a medical dataset by combining two existing 2D medical datasets. In particular, we use the colon polyp detection dataset ColonDB (Tajbakhsh et al., 2015) to provide pixel-level annotations. Meanwhile, to include normal samples from the same domain, we combine it with the test split of the Endo classification dataset (Hicks et al., 2021). As a result, the new medical dataset contains 163 normal samples and 380 anomaly samples, supporting both anomaly classification and segmentation tasks.
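The sketch below illustrates, under an assumed directory layout, how the anomalous images and masks from ColonDB can be combined with the normal images from the Endo test split into a single index for prompt tuning; the helper and path names are hypothetical.

```python
import os

def build_medical_tuning_set(colondb_dir, endo_normal_dir):
    # Hypothetical directory layout; only the sample counts and the two source datasets
    # (ColonDB for anomalous images/masks, the Endo test split for normal images) follow the text.
    samples = []
    for name in sorted(os.listdir(os.path.join(colondb_dir, "images"))):
        samples.append({
            "image": os.path.join(colondb_dir, "images", name),
            "mask": os.path.join(colondb_dir, "masks", name),  # pixel-level polyp annotation
            "label": 1,                                         # anomalous
        })
    for name in sorted(os.listdir(endo_normal_dir)):
        samples.append({
            "image": os.path.join(endo_normal_dir, name),
            "mask": None,                                       # treated as an all-normal mask
            "label": 0,
        })
    return samples  # 380 anomalous + 163 normal samples in our setup
```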

Appendix C Detailed Analysis of DPAM

Since the visual encoder of CLIP is originally pre-trained to align global object semantics, such as cat and dog, the contrastive loss used in CLIP makes the visual encoder produce a representative global embedding for recognizing semantic classes. Through the self-attention mechanism, the attention map in the visual encoder focuses on the specific tokens highlighted within the red rectangle in Fig. 3b. Although these tokens may contribute to global object recognition, they disrupt the local visual semantics, which directly hinders the effective learning of the fine-grained abnormality in our object-agnostic text prompts. For segmentation purposes, it’s crucial for the visual feature map to emphasize the surrounding context to capture more local visual semantics.

Formally, let $a_{ij}$ be an entry of the attention score matrix, where $i,j\in[1,h\times w]$. The $i$-th output of the $Q$-$K$ attention can then be written as:

$$\mathrm{Attention}(Q,K,V)_i=\mathrm{softmax}\left(\frac{q_iK^{\top}}{\sqrt{D}}\right)V=\frac{\sum_{j=1}^{n}a_{ij}v_j}{\sum_{j=1}^{n}a_{ij}},\qquad a_{ij}=e^{\frac{q_ik_j^{\top}}{\sqrt{D}}}.$$

Note that vectors (i.e., qiq_{i}italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, kik_{i}italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, viv_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) are represented as row vectors. Attention(Q,K,V)iAttention({Q},{K},{V})_{i}italic_A italic_t italic_t italic_e italic_n italic_t italic_i italic_o italic_n ( italic_Q , italic_K , italic_V ) start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT can be regarded as the weighted average of vjv_{j}italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT using aija_{ij}italic_a start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT as the weight. Assuming that the original attention map focuses on the specific tokens at index mmitalic_m, it is clear that qiq_{i}italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT only produces the large attention score with kmk_{m}italic_k start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT in all kjk_{j}italic_k start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. Therefore, aima_{im}italic_a start_POSTSUBSCRIPT italic_i italic_m end_POSTSUBSCRIPT is the largest score among other aija_{ij}italic_a start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT so Attention(Q,K,V)iAttention({Q},{K},{V})_{i}italic_A italic_t italic_t italic_e italic_n italic_t italic_i italic_o italic_n ( italic_Q , italic_K , italic_V ) start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is dominated by vmv_{m}italic_v start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, which causes the local visual embedding at index iiitalic_i to be disturbed by the local visual embedding at index mmitalic_m. In Figure 3(b), the attention score map presents vertical activation and suggests that every qiq_{i}italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT produces a large attention score with kmk_{m}italic_k start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. In such a case, several Attention(Q,K,V)iAttention({Q},{K},{V})_{i}italic_A italic_t italic_t italic_e italic_n italic_t italic_i italic_o italic_n ( italic_Q , italic_K , italic_V ) start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is dominated by vmv_{m}italic_v start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT and results in weak anomaly segmentation in Figure 3(b) even though vmv_{m}italic_v start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT may be important for original class recognition. Some prior studies (Rao et al., 2022; Gu et al., 2023) use an additional decoder to recover the local visual semantics. In this paper, we directly use local visual embeddings for segmentation and point out that an ideal attention map for local visual semantics should exhibit a more pronounced diagonal pattern. For this purpose, DPAM is proposed to replace the original QQitalic_Q-KKitalic_K attention with analogous components, including QQitalic_Q-QQitalic_Q, KKitalic_K-KKitalic_K, and VVitalic_V-VVitalic_V self-attention. Therefore, aija_{ij}italic_a start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT is changed into:

\[
a^{qq}_{ij}=e^{\frac{q_{i}q_{j}^{\top}}{\sqrt{D}}},\qquad a^{kk}_{ij}=e^{\frac{k_{i}k_{j}^{\top}}{\sqrt{D}}},\qquad a^{vv}_{ij}=e^{\frac{v_{i}v_{j}^{\top}}{\sqrt{D}}}.
\]

This modification ensures that $q_i$, $k_i$, and $v_i$ carry significant weight in forming $\mathrm{Attention}(Q,Q,V)_i$, $\mathrm{Attention}(K,K,V)_i$, and $\mathrm{Attention}(V,V,V)_i$, thereby preserving local visual semantics. As a result, the produced attention maps exhibit a more prominent diagonal than the original $Q$-$K$ attention, leading to improved anomaly segmentation, as shown in Fig. 3c, Fig. 3d, and Fig. 3e. However, since $Q$ and $K$ constitute the original attention map, other tokens at some index $n$ that are important for class recognition may also produce relatively large scores $a_{in}$ (e.g., $q_i$ is strongly related to $q_n$ in addition to itself), which disturbs $\mathrm{Attention}(Q,Q,V)_i$ and $\mathrm{Attention}(K,K,V)_i$ (Fig. 3c and Fig. 3d). In contrast to $Q$-$Q$ and $K$-$K$, $V$-$V$ does not participate in computing the original attention map, which reduces such unexpected bias toward other tokens in $V$ for the purpose of anomaly segmentation. Therefore, $v_i$ does not produce large weights $a_{ij}$ with the other $v_j$ but yields a larger weight $a_{ii}$ in forming $\mathrm{Attention}(V,V,V)_i$, preserving more information of $v_i$ and producing a diagonally prominent attention map with minimal disturbance, as depicted in Fig. 3e. This explains why $V$-$V$ achieves the best results.
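To make the mechanism concrete, the following PyTorch sketch (our own illustration under simplified assumptions, not code from the AnomalyCLIP repository; the function name dpam_attention and the single-head, unbatched shapes are ours) contrasts the original Q-K attention with the Q-Q, K-K, and V-V replacements described above.

```python
import torch


def dpam_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                   mode: str = "vv") -> torch.Tensor:
    """Single-head, unbatched attention over (n_tokens, D) projections.

    mode="qk" reproduces the original attention; "qq", "kk", and "vv"
    replace the similarity term as in the DPAM variants above.
    """
    d = q.shape[-1]
    if mode == "qq":
        scores = q @ q.T          # a_ij proportional to exp(q_i q_j^T / sqrt(D))
    elif mode == "kk":
        scores = k @ k.T          # a_ij proportional to exp(k_i k_j^T / sqrt(D))
    elif mode == "vv":
        scores = v @ v.T          # a_ij proportional to exp(v_i v_j^T / sqrt(D))
    else:
        scores = q @ k.T          # original Q-K attention for comparison
    attn = torch.softmax(scores / d ** 0.5, dim=-1)   # rows of normalized a_ij
    return attn @ v               # weighted average of the value vectors
```

Because the diagonal entries of $VV^{\top}$ are the squared norms $\lVert v_i\rVert^{2}$, the softmax mass of the V-V variant typically concentrates near the diagonal, so each output stays close to its own value vector; swapping the Q-K similarity of the selected CLIP layers for this operator is the essence of the diagonally prominent attention map discussed above.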

Appendix D Additional results and ablations

Table 4: Comparison of ZSAD performance between AnomalyCLIP and SOTA full-shot methods. The best performance is highlighted in red, and the second-best is highlighted in blue.
| Task | Category | Dataset | $\lvert\mathcal{C}\rvert$ | AnomalyCLIP | PatchCore | RD4AD |
| --- | --- | --- | --- | --- | --- | --- |
| Image-level (AUROC, AP) | Obj & texture | MVTec AD | 15 | (91.5, 96.2) | (99.0, 99.7) | (98.7, 99.4) |
| | Obj | VisA | 12 | (82.1, 85.4) | (94.6, 95.9) | (95.3, 95.7) |
| | | MPDD | 6 | (77.0, 82.0) | (94.1, 96.3) | (91.6, 93.8) |
| | | BTAD | 3 | (88.3, 87.3) | (93.2, 98.6) | (93.8, 96.8) |
| | | SDD | 1 | (84.7, 80.0) | (64.9, 48.3) | (86.8, 81.3) |
| | Texture | DAGM | 10 | (97.5, 92.3) | (92.7, 81.3) | (92.9, 79.1) |
| Pixel-level (AUROC, PRO) | Obj & texture | MVTec AD | 15 | (91.1, 81.4) | (98.1, 92.8) | (97.8, 93.6) |
| | Obj | VisA | 12 | (95.5, 87.0) | (98.5, 92.2) | (98.4, 91.2) |
| | | MPDD | 6 | (96.5, 88.7) | (98.8, 94.9) | (98.4, 95.2) |
| | | BTAD | 3 | (94.2, 74.8) | (97.4, 74.4) | (97.5, 75.1) |
| | | SDD | 1 | (90.6, 67.8) | (87.9, 46.3) | (92.2, 72.0) |
| | Texture | DAGM | 10 | (95.6, 91.0) | (95.9, 87.9) | (96.8, 91.9) |
Comparison with SOTA full-shot methods

In this section, we examine the performance gap between AnomalyCLIP and recently published SOTA full-shot methods, such as PatchCore (Roth et al., 2022) and RD4AD (Deng & Li, 2022). Since some datasets do not provide normal training data, we conduct experiments on six public datasets. AnomalyCLIP achieves anomaly detection and segmentation performance comparable to PatchCore and RD4AD, and even outperforms them on some datasets. This illustrates that the generic prompt embeddings enable AnomalyCLIP to effectively capture normality and abnormality, allowing it to surpass the performance boundary determined by the training data.

Refinement of the textual space

A representative embedding is determined not only by a well-designed text prompt but also by an appropriate textual space. During fine-tuning, randomly initialized learnable token embeddings are introduced into the text encoder to refine the textual space for adaptation to AD. To control the degree of refinement, we insert the learnable token embeddings into the text encoder from its bottom layer up to a designated layer. In particular, the trainable and original tokens are denoted as $t'_m$ and $t_m$, respectively, where $m$ indexes the layer of the text encoder. To integrate the original textual representations, for layer $m$ we concatenate $t'_m$ and $t_m$ along the channel dimension and forward them through $T_m$ to obtain $r'_{m+1}$ and $t_{m+1}$. Owing to the self-attention mechanism, the output $t_{m+1}$ contains the information of $t'_m$. To provide adequate calibration, we discard the obtained $r'_{m+1}$ and initialize new learnable token embeddings $t'_{m+1}$. Through this operation, $t'_{m+1}$ further refines the textual representations of layer $m+1$. We repeat this operation until we reach the designated layer $M'$. This procedure is given by:

\[
\begin{aligned}
[r'_{m+1}, t_{m+1}] &= T_{m}([t'_{m}, t_{m}])\\
[r'_{m+2}, t_{m+2}] &= T_{m+1}([t'_{m+1}, t_{m+1}])\\
&\;\;\vdots\\
t_{M'+1} &= T_{M'}(t_{M'}),
\end{aligned}
\tag{3}
\]

where the operator $[\cdot,\cdot]$ denotes concatenation along the channel dimension.
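The PyTorch sketch below (our own simplification, not the released implementation; the function name refine_text_tokens and the choice to attach the learnable tokens along the token axis are assumptions) mirrors Eq. (3): fresh learnable tokens are attached before each of the first $M'$ frozen text-encoder layers, and their transformed counterparts $r'_{m+1}$ are discarded after every layer.

```python
import torch
import torch.nn as nn


def refine_text_tokens(layers: nn.ModuleList, t: torch.Tensor,
                       learnable: nn.ParameterList, M_prime: int) -> torch.Tensor:
    """layers: frozen text-encoder blocks T_1..T_M; t: (n_tokens, C) prompt tokens t_1;
    learnable: one (n_learnable, C) parameter tensor t'_m per refined layer."""
    for m, layer in enumerate(layers):
        if m < M_prime:
            t_prime = learnable[m]                      # randomly initialized t'_m
            x = layer(torch.cat([t_prime, t], dim=0))   # T_m([t'_m, t_m])
            t = x[t_prime.shape[0]:]                    # keep t_{m+1}, discard r'_{m+1}
        else:
            t = layer(t)                                # plain T_m(t_m) for the remaining layers
    return t
```

Discarding $r'_{m+1}$ and re-initializing $t'_{m+1}$ at every refined layer is what lets each layer receive its own calibration signal instead of propagating a single set of learnable tokens through the whole encoder.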

Figure 7: Hyperparameter analysis. (a) $E$ ablation. (b) $M$ ablation. (c) $L$ ablation. (d) $N$ ablation. Pixel/image-level (AUPRO, AP) performances are shown on the left and right sides of each subplot, respectively.
Hyperparameter analysis

We study the length of the learnable text prompts $E$, the depth $M$ at which learnable token embeddings are attached, the length $L$ of the learnable token embeddings, and the number of layers $N$ used in the visual encoder. As shown in Fig. 7a, the detection and segmentation performance initially improves as $E$ increases. However, within the range of lengths from 12 to 16, we observe a decline in performance, which suggests that excessively long learnable text prompts introduce redundant information. Therefore, an appropriate value of $E$, such as $E=12$, is beneficial for accurately learning object-agnostic text prompts. Besides, we also investigate the depth of the attached learnable token embeddings in Fig. 7b. The refinement of the initial textual space becomes more pronounced as the depth increases, enabling more discriminative textual embeddings for normality and abnormality. However, performance drops when the refinement is excessive and impairs the generalization of AnomalyCLIP, as seen when $M$ equals 9. After selecting the depth, we investigate the influence of the length of the learnable token embeddings. As illustrated in Fig. 7c, the length of the token embeddings involves a similar tradeoff between model generalization and calibration of the textual space. Finally, as shown in Fig. 7d, AnomalyCLIP achieves an overall performance gain when we provide the most local visual semantics ($N=4$).

Prompt template ablation

Here, we study the robustness of AnomalyCLIP to the prior anomaly semantics in the object-agnostic text prompt template. We replace damaged in the object-agnostic text prompt with other words carrying similar anomaly semantics, such as anomalous, flawed, defective, and blemished. The results are presented in Table 5 and Table 6. The stable results indicate that AnomalyCLIP is not sensitive to the prior anomaly semantics introduced by the object-agnostic text prompt template.
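As a concrete illustration, the small sketch below (a hypothetical helper of ours, not the repository's prompt construction; the simplified templates omit the learnable prefix tokens) swaps the abnormality-related word while keeping the rest of the object-agnostic template fixed, producing the prompt pairs compared in Tables 5 and 6.

```python
# Simplified object-agnostic templates; the learnable prefix tokens of AnomalyCLIP
# are omitted here for readability.
NORMAL_TEMPLATE = "object"
ABNORMAL_TEMPLATE = "{state} object"

STATE_WORDS = ["damaged", "anomalous", "flawed", "defective", "blemished"]


def build_prompt_pairs(state_words=STATE_WORDS):
    """Return one (normal, abnormal) text-prompt pair per abnormality word."""
    return [(NORMAL_TEMPLATE, ABNORMAL_TEMPLATE.format(state=state))
            for state in state_words]


for normal_prompt, abnormal_prompt in build_prompt_pairs():
    print(normal_prompt, "|", abnormal_prompt)   # e.g. "object | damaged object"
```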

Table 5: Ablation on the robustness of the abnormality-related token in our prompt template on industrial defect datasets.
| Task | Category | Dataset | damaged | anomalous | flawed | defective | blemished |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Image-level (AUROC, AP) | Obj & texture | MVTec AD | (91.5, 96.2) | (91.4, 96.2) | (91.3, 96.2) | (91.4, 96.2) | (91.5, 96.2) |
| | Obj | VisA | (82.1, 85.4) | (80.7, 84.5) | (80.7, 84.5) | (80.9, 84.6) | (80.7, 84.5) |
| | | MPDD | (77.0, 82.0) | (78.0, 83.9) | (77.9, 83.6) | (77.8, 83.5) | (78.6, 84.1) |
| | | BTAD | (88.3, 87.3) | (84.8, 86.7) | (85.2, 87.4) | (84.8, 86.2) | (85.9, 67.1) |
| | | SDD | (84.7, 80.0) | (82.3, 76.3) | (82.6, 76.8) | (82.8, 77.2) | (82.7, 77.0) |
| | Texture | DAGM | (97.5, 92.3) | (97.7, 92.6) | (97.5, 92.4) | (97.5, 92.3) | (97.5, 92.4) |
| | | DTD-Synthetic | (93.5, 97.0) | (93.3, 96.9) | (93.2, 96.9) | (93.4, 97.0) | (93.5, 97.0) |
| Pixel-level (AUROC, PRO) | Obj & texture | MVTec AD | (91.1, 81.4) | (91.0, 81.4) | (90.7, 81.4) | (91.0, 81.7) | (90.9, 81.2) |
| | Obj | VisA | (95.5, 87.0) | (95.5, 86.5) | (95.5, 86.5) | (95.5, 86.2) | (95.6, 86.5) |
| | | MPDD | (96.5, 88.7) | (96.6, 88.7) | (96.7, 89.0) | (96.7, 89.2) | (96.6, 88.8) |
| | | BTAD | (94.2, 74.8) | (94.3, 74.3) | (94.4, 75.1) | (94.3, 75.2) | (94.3, 73.7) |
| | | SDD | (90.6, 67.8) | (89.6, 66.8) | (89.5, 66.5) | (89.5, 64.8) | (89.6, 64.6) |
| | Texture | DAGM | (95.6, 91.0) | (95.6, 91.2) | (95.6, 91.3) | (95.5, 90.9) | (95.6, 90.9) |
| | | DTD-Synthetic | (97.9, 92.3) | (97.9, 92.3) | (97.9, 92.1) | (97.9, 92.5) | (97.9, 92.2) |
Figure 8: Object ablation.
Object ablation

To investigate what the object-agnostic text prompts have learned, we replace object in the object-agnostic text prompts with the specific target [cls], resulting in AnomalyCLIP$_{re}$. In Fig. 8, AnomalyCLIP$_{re}$ still performs well in ZSAD, even though the object semantics were blocked out during fine-tuning. This suggests that the knowledge learned by the object-agnostic text prompts is the underlying anomaly patterns, allowing them to provide discriminative textual embeddings even when specific object semantics are incorporated. Furthermore, compared to AnomalyCLIP, AnomalyCLIP$_{re}$ shows a performance decay, which can be attributed to the inclusion of redundant/noisy object semantics. These results again demonstrate the generalization ability of object-agnostic prompt learning.
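A minimal sketch of the substitution described above (again a hypothetical helper of ours, not the repository code): the generic word object in the learned prompt is replaced by the target class name [cls] to obtain the prompts used by AnomalyCLIP$_{re}$.

```python
def insert_class_name(prompt_template: str, class_name: str) -> str:
    """Replace the generic 'object' token with a specific class name, e.g. 'bottle'."""
    return prompt_template.replace("object", class_name)


# Object-agnostic prompts (AnomalyCLIP) vs. class-specific prompts (AnomalyCLIP_re).
agnostic = ("object", "damaged object")
specific = tuple(insert_class_name(p, "bottle") for p in agnostic)
print(specific)   # ('bottle', 'damaged bottle')
```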

Table 6: Ablation on the robustness of the abnormality-related token in our prompt template on medical image datasets.
| Task | Category | Dataset | damaged | anomalous | flawed | defective | blemished |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Image-level (AUROC, AP) | Brain | HeadCT | (93.4, 91.6) | (93.1, 90.6) | (93.3, 90.8) | (93.5, 91.0) | (93.8, 91.5) |
| | | BrainMRI | (90.3, 92.2) | (87.8, 90.4) | (87.7, 90.0) | (88.3, 90.5) | (88.6, 90.7) |
| | | Br35H | (94.6, 94.7) | (93.1, 93.0) | (92.9, 92.8) | (93.1, 93.0) | (93.2, 93.1) |
| | Chest | COVID-19 | (80.1, 58.7) | (80.0, 58.5) | (80.2, 58.8) | (80.6, 59.0) | (82.1, 61.4) |
| Pixel-level (AUROC, PRO) | Skin | ISIC | (89.7, 78.4) | (90.1, 80.1) | (90.1, 80.1) | (90.4, 81.0) | (90.2, 80.6) |
| | Colon | CVC-ColonDB | (81.9, 71.3) | (82.2, 71.5) | (82.3, 71.6) | (82.1, 71.1) | (82.2, 71.5) |
| | | CVC-ClinicDB | (82.9, 67.8) | (83.0, 68.1) | (83.1, 68.4) | (82.9, 67.9) | (83.1, 68.2) |
| | | Kvasir | (78.9, 45.6) | (79.4, 45.1) | (79.4, 45.2) | (79.3, 44.9) | (79.5, 45.8) |
| | | Endo | (84.1, 63.6) | (84.3, 63.5) | (84.2, 63.5) | (84.2, 62.9) | (84.3, 63.4) |
| | Thyroid | TN3K | (81.5, 50.4) | (81.5, 51.7) | (81.3, 50.9) | (81.3, 50.3) | (81.6, 51.1) |

Appendix E Visualization

Similarity score between textual and visual embeddings.