Abstract
In this project, we introduce Augmented RealNet (ARNet), an enhanced version of the RealNet framework for feature reconstruction-based anomaly detection. ARNet integrates an improved training strategy and a foreground prediction module to achieve robust anomaly detection in real-world environments. By applying diverse augmentations on top of synthetic anomalies, ARNet simulates a broad spectrum of perturbations that might occur in practical scenarios. The foreground prediction module precisely isolates objects from the background and uses this information to refine reconstruction residuals, effectively mitigating false alarms caused by background variations and changes in object positioning within the image. Our empirical evaluations demonstrate the strong performance of ARNet across diverse categories, achieving mean image-level and pixel-level F1-max scores of 0.962 and 0.672, respectively.
Visual Anomaly Detection
Visual anomaly detection (VAD) [1] is essential in the industrial sector for identifying defects during the manufacturing process. Accurate anomaly detection not only guarantees product quality and reliability but also minimizes costs by reducing waste and recalls. Recent advancements in this field have been largely driven by machine learning, specifically through supervised [4], unsupervised [3, 6], and semi-supervised learning approaches [5]. Supervised methods, though powerful, often require extensive labeled datasets, which are impractical to obtain in many industrial situations. Therefore, unsupervised learning, particularly one-class classification models, has become popular. These models typically learn from data representing the system's normal operation and flag deviations from this norm as potential anomalies. However, they face significant challenges in terms of robustness, especially when exposed to real-world variations not present during training. Factors such as changes in camera specifications, lighting conditions, or gradual wear of mechanical components can cause shifts in data distribution, known as domain shifts. These shifts can drastically degrade the performance of anomaly detection systems, leading to false positives or missed detections. Developing robust VAD systems that can adapt over time, handle these real-world variations, and maintain consistent performance despite external changes is therefore a key requirement for successful deployment in industrial settings.
The Anomaly Detection Challenge
The Visual Anomaly and Novelty Detection (VAND) challenge addresses the need for anomaly detection systems that can handle real-world conditions not usually represented in training datasets. The main goal is to build models that can withstand unpredictable domain shifts [2], reflecting real-world situations where data capture conditions may change over time. The dataset used for the challenge is MVTec AD [12], comprising 15 categories of diverse industrial objects and material textures. Participants are encouraged to adopt a one-class training paradigm, building models from images of normal conditions only. This approach is crucial to ensure that the models can generalize to unseen anomalies without prior exposure to specific defects or abnormalities. The test set for robustness evaluation includes an undisclosed set of perturbations simulating real-world changes and noise, such as lighting variations, camera noise induced by equipment quality, and camera angle shifts. The challenge employs stringent evaluation criteria based on image-level and pixel-level F1-max scores to thoroughly assess the models' anomaly detection capabilities under various altered conditions. This ensures that the developed solutions are applicable in real-world industrial scenarios and can adapt to different environments, a critical aspect for industrial applications.
Model Selection Approach
To address robust anomaly detection, we start our model selection process by examining the most promising architectures, comparing their strengths and weaknesses in the context of robust anomaly detection to select our initial architecture for model development. Recently, various strategies for industrial anomaly detection have emerged, with three main approaches leading the way: student-teacher methods [7], patch-matching methods [8, 9], and reconstruction methods [10]. All three have shown impressive results, especially on the MVTec AD dataset [12]. For the student-teacher approach, we looked at EfficientAD [7], while PatchCore [8] represents the patch-matching method and RealNet exemplifies the reconstruction-based method [10]. We compared their performance on an augmented image set reflecting real-world scenarios, focusing on the more challenging categories where anomaly detection models usually underperform, such as cable, screw, pill, and capsule. The comparison of these three models' performance is shown in Figure 1.
Our preliminary findings show that RealNet [10], the reconstruction-based method, generally outperforms the others, except in the cable category. This isn't surprising, since the supervised training in reconstruction methods helps the model identify anomalies even under augmentations. Patch-based methods, on the other hand, require a larger memory bank to store diverse features, which can slow down training and inference. The student-teacher approach can be more complex, especially in the design of the auto-encoder to handle a variety of augmentations. Based on these findings, we've decided on the reconstruction-based approach, adopting RealNet as our base model to address this challenge. We've also added a foreground prediction block to this network and implemented an augmented training strategy. Therefore, we've named our model Augmented-RealNet (ARNet).
Model Architecture
The architecture of our proposed method, ARNet, is presented in Figure 2. The blue blocks depict the original RealNet blocks, while the orange ones represent our enhancements. The detection flow of the model is as follows. Pretrained features of the input image, generated by a WideResNet50 backbone, first pass through the Anomaly-aware Feature Selection (AFS) block. This block selects feature channels based on their ability to identify anomalous regions in an image: it chooses the top-K feature indexes that maximize the distance between normal and anomalous images within the anomaly regions, averaged over all training samples. This selection process reduces training and computation cost. The AFS block's output features then enter the feature reconstruction block, composed of several UNets, one per layer of the pretrained features. Each UNet's objective is to reconstruct the original features from the anomalous ones, so that subtracting the features before and after reconstruction yields residuals that reveal the anomalous regions.
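As a rough illustration of this selection step, the PyTorch sketch below scores each channel by the mean normal/anomalous feature distance inside the synthetic anomaly regions and keeps the top-K channels. The function name, tensor shapes, and squared-distance measure are illustrative assumptions, not RealNet's exact implementation.

```python
import torch

def afs_select(feat_normal, feat_anom, mask, k):
    # feat_normal, feat_anom: (N, C, H, W) pretrained features of normal images
    # and their synthetically corrupted counterparts.
    # mask: (N, 1, H, W) binary map of the synthetic anomaly regions.
    diff = (feat_normal - feat_anom).pow(2)              # per-channel squared distance
    area = mask.sum(dim=(2, 3)).clamp(min=1.0)           # anomalous pixels per sample
    per_sample = (diff * mask).sum(dim=(2, 3)) / area    # (N, C) mean distance in region
    score = per_sample.mean(dim=0)                       # average over training samples
    return score.topk(k).indices                         # indices of channels to keep
```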
We also included a foreground prediction block, which uses the foreground information to reweight the residuals of each layer, helping the model focus on meaningful foreground regions, thereby eliminating false triggers and improving detection accuracy for object categories. This block takes the first three layers of the pretrained features and predicts the foreground area, supervised by ground truth foreground maps. These ground truth maps are generated by an automated algorithm built on the Segment Anything Model (SAM) [11]. The foreground prediction block, designed like a UNet, upscales lower-resolution features, concatenates them with the next higher-resolution layer, and passes them through a convolutional block. This process repeats until all layers are combined, and the result is interpolated to recover the full-resolution foreground map. Foreground prediction is particularly useful when images may be taken from various camera angles and the object is not always in the image center.

The final block of the model is the Reconstruction Residuals Selection (RRS) module. This module upscales the lower-resolution residuals so that all residuals share the same resolution. It then performs global average and global max pooling on the features to select the top-K residuals, discarding residuals with insufficient anomaly information. Finally, the selected residuals go to the discriminator, which is trained with a cross-entropy loss to predict anomalous pixels against the synthetic anomalies' ground truth. Originally, RealNet [10] has two losses: the feature reconstruction loss ($L_{recon}$) and the discriminator's segmentation loss ($L_{seg}$). We also add a cross-entropy loss for foreground prediction ($L_{frg}$), giving the total loss:

$$L_{total} = L_{recon} + L_{seg} + L_{frg}$$
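A minimal sketch of how foreground reweighting and top-K residual selection could be combined is shown below, assuming per-layer residual tensors and a predicted foreground map. The function name, the sum of average and max pooling as a channel score, and the gather-based selection are our illustrative assumptions rather than the exact RRS implementation.

```python
import torch
import torch.nn.functional as F

def select_residuals(residuals, foreground, k):
    # residuals: list of (B, C_i, H_i, W_i) reconstruction residuals, one per layer.
    # foreground: (B, 1, H, W) predicted foreground probability map.
    target = residuals[0].shape[-2:]
    maps = [F.interpolate(r, size=target, mode="bilinear", align_corners=False)
            for r in residuals]                           # bring all layers to one resolution
    x = torch.cat(maps, dim=1)                            # (B, sum C_i, H, W)
    fg = F.interpolate(foreground, size=target, mode="bilinear", align_corners=False)
    x = x * fg                                            # reweight: suppress background residuals
    score = x.mean(dim=(2, 3)) + x.amax(dim=(2, 3))       # global average + global max pooling
    idx = score.topk(k, dim=1).indices                    # (B, k) most informative residuals
    idx = idx[:, :, None, None].expand(-1, -1, *target)
    return torch.gather(x, 1, idx)                        # selected residuals for the discriminator
```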
Data Augmentations
The main goal of incorporating augmentations is to help the model adjust to potential real-world domain shifts during inference. The augmentations used in training ARNet include Gaussian noise, blur, RGB shift, brightness and contrast adjustments, and rotations and translations. Gaussian noise and blur imitate camera sensor noise and defocus, respectively. RGB shift accounts for color reproduction variations across camera brands. Brightness and contrast adjustments simulate different times of day and lighting conditions, which can change with the deployment scenario. To simulate various camera angles, we use a variety of rotations and translations. It's worth noting that for texture categories we only use fixed rotations of 90, 180, or 270 degrees, as intermediate angles would require filling the texture's borders with black pixels. Conversely, for object categories our model can perform background subtraction, so we use shifting, scaling, and rotation, occasionally with border filling. To simulate camera zoom and translation effects, we use random resized crop augmentation. Finally, we apply horizontal and vertical flips to further increase the variation of training images. Figure 3 illustrates some examples of augmented images.
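For concreteness, a pipeline along these lines could be written with the albumentations library as sketched below. The probabilities, limits, and the 256x256 crop size are assumed values rather than ARNet's exact configuration, and the RandomResizedCrop signature follows older albumentations releases.

```python
import albumentations as A

# Object categories: geometric shifts are allowed because the foreground
# prediction can subtract the background.
object_augs = A.Compose([
    A.GaussNoise(p=0.3),                            # camera sensor noise
    A.Blur(blur_limit=5, p=0.3),                    # defocus
    A.RGBShift(p=0.3),                              # color reproduction differences
    A.RandomBrightnessContrast(p=0.3),              # lighting / time of day
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.1, rotate_limit=15, p=0.5),
    A.RandomResizedCrop(height=256, width=256, scale=(0.8, 1.0), p=0.3),  # zoom / translation
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
])

# Texture categories: only right-angle rotations, to avoid black border fill.
texture_augs = A.Compose([
    A.GaussNoise(p=0.3),
    A.Blur(blur_limit=5, p=0.3),
    A.RGBShift(p=0.3),
    A.RandomBrightnessContrast(p=0.3),
    A.RandomRotate90(p=0.5),                        # 90/180/270-degree rotations only
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
])
```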
Training and Evaluation
We mainly used the MVTec AD [12] dataset for this challenge, including anomaly images synthesized with SDAS. We combined the synthesized images with the normal images for training and reserved all test images for evaluation. As the challenge aims to develop a system robust enough to handle data-capture variations, we applied augmentations to both the training and test sets. To ensure our evaluation reflects the model's ability to handle diverse domain shifts, we tripled the test set: we generated three versions of each test image, each with a different augmentation, to provide the most reliable evaluation of our model.
We use a one-class training paradigm, training each class individually. Each category is trained on a single RTX 3090 GPU with a batch size of 16 and a learning rate of 0.0001 using the Adam optimizer. We train for 1500 epochs, 500 more than RealNet, to account for the longer convergence time due to extensive augmentations in the training dataset.
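A minimal sketch of this per-category training setup is given below; `model` is assumed to return the combined loss, and `train_loader` (batch size 16) is a placeholder for the actual data pipeline.

```python
import torch

def train_category(model: torch.nn.Module, train_loader, device: str = "cuda"):
    # Adam with lr 1e-4, as described above; 1500 epochs, 500 more than RealNet.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.to(device).train()
    for epoch in range(1500):
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = model(images.to(device), targets.to(device))  # total loss (assumption)
            loss.backward()
            optimizer.step()
```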
For evaluation, we use the well-known image-level Area Under the Receiver Operating Characteristic curve (AUROC) metric and the F1-max metric at both image and pixel levels. These are also the target metrics for the challenge evaluation.
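For reference, F1-max sweeps all decision thresholds and reports the best achievable F1 score; a minimal computation using scikit-learn could look as follows. The helper name is ours, not part of the challenge toolkit.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

def f1_max(labels, scores):
    # labels: binary ground truth; scores: predicted anomaly scores.
    # Image-level F1-max uses one score per image; pixel-level F1-max
    # flattens every pixel of every anomaly map.
    precision, recall, _ = precision_recall_curve(labels, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-8, None)
    return float(f1.max())

# AUROC is computed directly on the same label/score pairs:
# auroc = roc_auc_score(labels, scores)
```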
Results and Discussions
The performance evaluation of our proposed model, ARNet, is summarized in Table 1, comparing the model's performance on the test set with and without augmentations. In spite of heavy augmentations, a mean image-level F1-max score of 0.962 and a pixel-level F1-max score of 0.672 on the augmented test set demonstrate the robustness of our model. The results show that overall performance remains quite close to that on the original test set, reflecting the adaptability of our model.
Conclusion and Future Work
In conclusion, we propose ARNet, a feature reconstruction model for robust anomaly detection. It uses synthetic data and image augmentations, is trained in a supervised manner, and displays strong image- and pixel-level detection performance on the MVTec AD dataset. Its foreground prediction helps remove unintended disturbances in the background and makes the model more robust to changes in object position. Our carefully designed training data augmentations yield strong image- and pixel-level F1-max scores of 0.962 and 0.672, respectively, showcasing the potential robustness of our model in addressing real-world anomaly detection challenges. For future work, we plan to further study category-specific limitations of our model and make the necessary adjustments to our detection framework.
References
[1] Pang, Guansong, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. "Deep learning for anomaly detection: A review." ACM Computing Surveys (CSUR) 54, no. 2 (2021).
[2] Jeong, Jongheon, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. "WinCLIP: Zero-/few-shot anomaly classification and segmentation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19606-19616. 2023.
[3] Bergmann, Paul, Kilian Batzner, Michael Fauser, David Sattlegger, and Carsten Steger. "Beyond dents and scratches: Logical constraints in unsupervised anomaly detection and localization." International Journal of Computer Vision 130, no. 4 (2022): 947-969.
[4] Hojjati, Hadi, Thi Kieu Khanh Ho, and Narges Armanfard. "Self-supervised anomaly detection in computer vision and beyond: A survey and outlook." Neural Networks (2024): 106106.
[5] Villa-Pérez, Miryam Elizabeth, et al. "Semi-supervised anomaly detection algorithms: A comparative summary and future research directions." Knowledge-Based Systems 218 (2021): 106878.
[6] Cui, Yajie, Zhaoxiang Liu, and Shiguo Lian. "A survey on unsupervised anomaly detection algorithms for industrial images." IEEE Access (2023).
[7] Batzner, Kilian, Lars Heckler, and Rebecca König. "EfficientAD: Accurate visual anomaly detection at millisecond-level latencies." In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024.
[8] Roth, Karsten, et al. "Towards total recall in industrial anomaly detection." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[9] Li, Hanxi, et al. "Target before shooting: Accurate anomaly detection and localization under one millisecond via cascade patch retrieval." arXiv preprint arXiv:2308.06748 (2023).
[10] Zhang, Ximiao, Min Xu, and Xiuzhuang Zhou. "RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection." arXiv preprint arXiv:2403.05897 (2024).
[11] Kirillov, Alexander, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao et al. "Segment anything." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4015-4026. 2023.
[12] Bergmann, Paul, Michael Fauser, David Sattlegger, and Carsten Steger. "MVTec AD: A comprehensive real-world dataset for unsupervised anomaly detection." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9592-9600. 2019.