
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

Zifeng Wang$^{1}$, Zhenbang Wu$^{1}$, Dinesh Agarwal$^{1,3}$, Jimeng Sun$^{1,2}$
$^{1}$ Department of Computer Science, University of Illinois Urbana-Champaign
$^{2}$ Carle Illinois College of Medicine, University of Illinois Urbana-Champaign
$^{3}$ Adobe
{zifengw2, zw12, jimeng}@illinois.edu, diagarwa@adobe.com

Abstract

Existing vision-text contrastive learning methods such as CLIP (Radford et al., 2021) aim to match paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude smaller than the general images and captions available on the internet. Moreover, previous methods encounter many false negatives, i.e., images and reports from separate patients may carry the same semantics but are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning, thus scaling the usable training data combinatorially at low cost. We also propose to replace the InfoNCE loss with a semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We show that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Surprisingly, we observe that with only 20K pre-training data, MedCLIP wins over the state-of-the-art method (which uses $\approx 200\mathrm{K}$ data).

1 Introduction

Medical images such as X-rays, CTs, and MRIs are commonly used to diagnose, monitor, or treat medical conditions in clinical practice (FDA, 2022). With the rapid growth of medical images and the corresponding reports data, researchers have developed various deep learning models to support clinical decision making (Çallı et al., 2021).
Recently, large-scale image-text pre-training, e.g., CLIP (Radford et al., 2021), has achieved considerable success in the computer vision and natural language processing domains. CLIP is trained to predict the correct matching of a batch of image and text training examples. The joint training of image and text representations on large-scale image-text pairs generates transferable representations and supports flexible downstream tasks. Inspired by the success of CLIP, we believe the knowledge jointly learned from medical images and reports should be helpful for downstream clinical tasks.

Figure 1: Zero-shot performance of MedCLIP, ConVIRT (Zhang et al., 2020), and GLoRIA (Huang et al., 2021) when using different amounts of data for pre-training. ConVIRT and GLoRIA are trained on the MIMIC-CXR (369K) and CheXpert (191K) datasets, respectively. Our method yields superior ACC to GLoRIA using nearly 1/10 of the pre-training data.
However, adopting vision-text pre-training in the medical domain is a non-trivial task because of (1) CLIP's (Radford et al., 2021) data-hungry nature: CLIP is trained on a dataset of 400M image-text pairs collected from the internet, while the total number of publicly available medical images and reports is orders of magnitude smaller; and (2) the specificity of medical images and reports: compared to general domains (e.g., "cats" vs. "dogs"), the differences within medical domains are more subtle and fine-grained (e.g., "pneumonia" vs. "consolidation"). In a nutshell, it is necessary to (1) address the data insufficiency issue and (2) capture the subtle yet crucial medical meanings.

Figure 2: Demonstration of challenges in medical image-text contrastive learning. (1) Pre-training data only includes paired images and texts. However, many more image-only and text-only datasets are ignored. (2) False negatives appear. For an anchor image, previous methods treat paired texts (i.e., reports from the same patient’s study) as positives and unpaired texts (i.e., reports from other patients’ studies) as negatives. However, the negative texts can describe the same symptoms as the anchor texts.
Existing works try to tackle the challenges above in different ways. ConVIRT (Zhang et al., 2020) jointly trains the vision and text encoders with the paired medical images and reports via a bidirectional contrastive objective; GLoRIA (Huang et al., 2021) further models both the global and local interactions between medical images and reports to capture the pathology meanings from specific image regions. However, both works have significant limitations, as illustrated in Fig. 2.
  • Limited usable data. Most medical image datasets only provide diagnostic labels rather than raw reports. However, both works need paired images and reports, leaving a vast number of medical image-only and text-only datasets unused.
  • False negatives in contrastive learning. Both methods try to push image and text embeddings from different patients apart. However, even though some reports do not belong to the target patient's study, they can still describe the same symptoms and findings. Simply treating the other reports as negative samples brings noise to the supervision and confuses the model.
To handle the above challenges, we propose a simple yet effective approach, namely MedCLIP. It has the following contributions:
  • Decoupling images and texts for contrastive learning. We extend the pre-training to cover massive unpaired image-only and text-only datasets, which scales the amount of training data in a combinatorial manner. It opens a new direction for expanding multi-modal learning based on medical knowledge instead of expensively scaling up data.
  • Eliminating false negatives via medical knowledge. We observe that images and reports from separate patients’ studies may carry the same semantics but are falsely treated as negatives by previous methods. Hence, we design a soft semantic matching loss that uses the medical semantic similarity between each image and report as the supervision signal. This approach equips the model with the ability to capture the subtle yet crucial medical meanings.
We conduct a comprehensive evaluation of MedCLIP across four public datasets. Results show that MedCLIP reaches extremely high data efficiency, as shown in Fig. 1. Our method obtains better performance than the state-of-the-art GLoRIA (Huang et al., 2021) using only 10% of its pre-training data. Extensive experiments verify MedCLIP's transferability to various downstream tasks. It wins over baselines by a large margin: over 10% improvement in prediction ACC on zero-shot prediction and supervised image classification tasks on average, and over 2% improvement in retrieval precision. Details are in §4.
2 Related Work

Vision-text representation learning has been shown to learn good visual representations (Joulin et al., 2016; Li et al., 2017; Sariyildiz et al., 2020; Desai and Johnson, 2021; Kim et al., 2021; Wang et al., 2021a). However, all of these methods work on paired images and captions from the general domain, e.g., Flickr (Joulin et al., 2016) and COCO Captions (Desai and Johnson, 2021). Moreover, they do not support cross-modal retrieval and hence do not support zero-shot prediction either.
Many works propose to learn visual-semantic embeddings for vision-text retrieval (Liu et al., 2019; Wu et al., 2019; Lu et al., 2019; Huang et al., 2020; Chen et al., 2021) via attention or object detection models, via vision-text contrastive learning (Zhang et al., 2020; Jia et al., 2021; Yuan et al., 2021; Yu et al., 2022), or via multiple vision and text supervision (Singh et al., 2021; Li et al., 2022). They all work in the general domain, where nearly infinite web images and captions are available, which dwarfs the scale of medical image-text data. This challenge hinders self-supervised contrastive learning (CL) for large vision-text transformers. Although remedies like data augmentation (Li et al., 2021) and knowledge graphs (Shen et al., 2022) have been proposed, the amount of data used is still far larger than what is available in the medical domain.
Medical image-text representation learning has also been investigated based on contrastive learning (Zhang et al., 2020; Huang et al., 2021; Wang et al., 2021b). Nonetheless, these works all rely on paired medical images and texts, so they still face the data scarcity challenge. Moreover, they all suffer from false-negative noise when adopting noise contrastive estimation (NCE) (Van den Oord et al., 2018) to perform instance discrimination (Wu et al., 2018), which undermines the representation quality (Arora et al., 2019; Zheng et al., 2021). Our work bridges the gap by making full use of all available medical data to support medical image-text pre-training, and we harness medical knowledge tailored to eliminate false negatives in contrastive learning, improving pre-training data efficiency.

3 Method

In this section, we present the technical details of MedCLIP, following the flow in Fig. 3. MedCLIP consists of three components: (1) knowledge extraction, which builds the semantic similarity matrix; (2) vision and text encoders, which extract embeddings; and (3) the semantic matching loss, which trains the whole model.

3.1 Vision and Text Encoder

MedCLIP consists of one visual encoder and one text encoder.
Vision Encoder. We encode images into embeddings $\mathbf{v} \in \mathbb{R}^{D}$ using a vision encoder $E_{img}$. A projection head then maps the raw embeddings to $\mathbf{v}_{p} \in \mathbb{R}^{P}$:

$$\mathbf{v} = E_{img}(\mathbf{x}_{img}), \qquad \mathbf{v}_{p} = f_{v}(\mathbf{v}),$$

where $f_{v}$ is the projection head of the vision encoder.
Text Encoder. We create clinically meaningful text embeddings $\mathbf{t} \in \mathbb{R}^{M}$ with a text encoder $E_{txt}$ and project them to $\mathbf{t}_{p} \in \mathbb{R}^{P}$:

$$\mathbf{t} = E_{txt}(\mathbf{x}_{txt}), \qquad \mathbf{t}_{p} = f_{t}(\mathbf{t}),$$

where $f_{t}$ is the projection head. This gives the same embedding dimension $P$ as the vision encoder, which is suitable for contrastive learning.
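To make the encoder setup concrete, below is a minimal PyTorch sketch of the two encoders with their projection heads. It assumes HuggingFace `transformers` backbones in the spirit of §4.3; the model identifiers, the use of `pooler_output`, and the 768/512 dimensions are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the MedCLIP encoders (Sec. 3.1); backbone names, pooling,
# and dimensions are assumptions for illustration, not the official code.
import torch
import torch.nn as nn
from transformers import AutoModel

class MedCLIPEncoders(nn.Module):
    def __init__(self,
                 vision_name: str = "microsoft/swin-tiny-patch4-window7-224",
                 text_name: str = "emilyalsentzer/Bio_ClinicalBERT",
                 vision_dim: int = 768, text_dim: int = 768, proj_dim: int = 512):
        super().__init__()
        self.vision = AutoModel.from_pretrained(vision_name)   # E_img
        self.text = AutoModel.from_pretrained(text_name)        # E_txt
        self.f_v = nn.Linear(vision_dim, proj_dim)               # vision projection head
        self.f_t = nn.Linear(text_dim, proj_dim)                 # text projection head

    def encode_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        v = self.vision(pixel_values=pixel_values).pooler_output    # v in R^D
        v_p = self.f_v(v)                                            # v_p in R^P
        return nn.functional.normalize(v_p, dim=-1)                  # v~ used for logits

    def encode_text(self, input_ids, attention_mask) -> torch.Tensor:
        t = self.text(input_ids=input_ids,
                      attention_mask=attention_mask).pooler_output   # t in R^M
        t_p = self.f_t(t)                                             # t_p in R^P
        return nn.functional.normalize(t_p, dim=-1)                   # t~ used for logits
```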

3.2 Decouple Image-Text Pairs with Medical Knowledge Extractor

Paired medical image-text datasets are orders of magnitude smaller than general paired image-text data (e.g., from the internet), due to the significant expense of high-quality annotations by medical specialists as well as privacy and legal concerns. To enhance medical multi-modal learning, we want to make full use of all existing medical image-text, image-only, and text-only datasets. The challenge is that CLIP-like contrastive learning is infeasible for image-only and text-only datasets. Also, we want to dig out all positive pairs to eliminate false negatives.
Suppose we have $n$ paired image-text samples, $m$ labeled images, and $h$ medical sentences. Previous methods are only able to use the $n$ paired samples.

Figure 3: The workflow of MedCLIP. The knowledge extraction module extracts medical entities from raw medical reports. Then, a semantic similarity matrix is built by comparing medical entities (from text) and raw labels (from images), which enables pairing arbitrary two separately sampled images and texts. The extracted image and text embeddings are paired to match the semantic similarity matrix.
By contrast, we decouple the $n$ paired samples into $n$ images and $n$ sentences, respectively. Ultimately, we are able to obtain $(n+m) \times (n+h)$ image-text pairs by traversing all possible combinations, which results in $\frac{(n+m)\times(n+h)}{n}\times$ more supervision. For instance, in Fig. 2, the previous method pre-trains on 2 image-text pairs, while MedCLIP is capable of exploiting $(2+3)\times(2+3)=25$ samples in total.
To supply this additional supervision, we propose to leverage external medical knowledge to build knowledge-driven semantic similarity. Unlike previous works that treat all positive samples equally (Khosla et al., 2020; Zheng et al., 2021; Wang and Sun, 2022), we propose to differentiate samples via their semantic similarities.
In particular, we split raw reports into sentences $\mathbf{x}_{txt}$. MetaMap (Aronson and Lang, 2010) is used to extract entities defined in the Unified Medical Language System (UMLS) (Bodenreider, 2004) from the raw sentences. Following the practice of Peng et al. (2018), we focus on the 14 main entity types shown in Table 5. Likewise, for images with diagnosis labels, we leverage MetaMap to map the raw classes to UMLS concepts so that they are aligned with the entities extracted from texts, e.g., "Normal" maps to "No Findings". We build multi-hot vectors from the extracted entities for images and texts, denoted $\mathbf{l}_{img}$ and $\mathbf{l}_{txt}$, respectively. Therefore, we unify the semantics of images and texts. For any sampled $\mathbf{x}_{img}$ and $\mathbf{x}_{txt}$, we can measure their semantic similarity by comparing the corresponding $\mathbf{l}_{img}$ and $\mathbf{l}_{txt}$.
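As a concrete illustration of this step, the sketch below builds multi-hot entity vectors for a small decoupled batch and computes the knowledge-driven similarity between every image and every sentence. The entity vocabulary and the per-sample tags are placeholders; in MedCLIP they come from MetaMap/UMLS extraction on reports and from the image diagnosis labels.

```python
# Sketch of the knowledge-driven semantic similarity (Sec. 3.2). The entity
# vocabulary and the per-sample tags are placeholders; in MedCLIP they come
# from MetaMap/UMLS extraction on reports and from image diagnosis labels.
import numpy as np

ENTITIES = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
            "Pleural Effusion", "Pneumonia", "Pneumothorax", "No Findings"]

def multi_hot(tags):
    """Turn a list of extracted entity tags into a multi-hot vector l."""
    vec = np.zeros(len(ENTITIES), dtype=np.float32)
    for tag in tags:
        vec[ENTITIES.index(tag)] = 1.0
    return vec

# Decoupled batches: images carry mapped labels, sentences carry extracted entities.
l_img = np.stack([multi_hot(t) for t in [["Edema"], ["No Findings"]]])          # (N_img, K)
l_txt = np.stack([multi_hot(t) for t in [["Edema", "Cardiomegaly"],
                                         ["No Findings"],
                                         ["Pneumonia"]]])                        # (N_txt, K)

# Cosine similarity between every image label vector and every text label
# vector gives the soft pairing target s for arbitrary image-text combinations.
eps = 1e-8
s = (l_img @ l_txt.T) / (
    np.linalg.norm(l_img, axis=1, keepdims=True)
    * np.linalg.norm(l_txt, axis=1, keepdims=True).T + eps)
print(s.shape)  # (2, 3): every sampled image is paired with every sampled sentence
```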

3.3 Semantic Matching Loss

We bridge the images and texts through the constructed semantic labels $\mathbf{l}_{img}$ and $\mathbf{l}_{txt}$. During each iteration, we sample $N_{batch}$ input images $\{\mathbf{x}_{img}\}$ and texts $\{\mathbf{x}_{txt}\}$ separately. Instead of defining positive pairs by searching for exactly matching labels, we propose to build soft targets $s$ by

$$s = \frac{\mathbf{l}_{img}^{\top} \cdot \mathbf{l}_{txt}}{\left\|\mathbf{l}_{img}\right\| \cdot \left\|\mathbf{l}_{txt}\right\|},$$

so $s$ indicates the medical semantic similarity.

For an image $i$, we obtain a set of $s_{ij}$, where $j = 1 \ldots N_{batch}$ indexes the batch of texts. The soft target is computed by normalizing over the texts with a softmax:

$$y_{ij}^{v \rightarrow t} = \frac{\exp s_{ij}}{\sum_{k=1}^{N_{batch}} \exp s_{ik}}.$$

Similarly, the reversed text-to-image soft targets are obtained by

$$y_{ji}^{t \rightarrow v} = \frac{\exp s_{ji}}{\sum_{k=1}^{N_{batch}} \exp s_{jk}}.$$
The logits are obtained from the cosine similarities between image and text embeddings:

$$\hat{s}_{ij} = \tilde{\mathbf{v}}_{i}^{\top} \cdot \tilde{\mathbf{t}}_{j},$$

where $\tilde{\mathbf{v}}_{i}$ and $\tilde{\mathbf{t}}_{j}$ are the normalized $\mathbf{v}_{p}$ and $\mathbf{t}_{p}$, respectively. The predicted similarity is likewise obtained with a softmax,

$$\hat{y}_{ij} = \frac{\exp(\hat{s}_{ij} / \tau)}{\sum_{k=1}^{N_{batch}} \exp(\hat{s}_{ik} / \tau)},$$

where $\tau$ is a learnable temperature initialized at 0.07. The semantic matching loss is then the cross entropy between the predicted similarities and the soft targets:
$$\mathcal{L}^{v \rightarrow l} = -\frac{1}{N_{batch}} \sum_{i=1}^{N_{batch}} \sum_{j=1}^{N_{batch}} y_{ij} \log \hat{y}_{ij}.$$

Likewise, we can compute $\mathcal{L}^{l \rightarrow v}$ and then arrive at

$$\mathcal{L} = \frac{\mathcal{L}^{v \rightarrow l} + \mathcal{L}^{l \rightarrow v}}{2}$$

as the final training objective.
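A compact PyTorch sketch of this objective is given below. It assumes the similarity matrix $s$ from the knowledge extractor and the normalized projected embeddings from §3.1, and applies the soft cross entropy symmetrically in both directions; it is an illustration of the equations above, not the released implementation.

```python
# Sketch of the semantic matching loss (Sec. 3.3): soft targets from the
# knowledge-driven similarity s, logits from cosine similarity of embeddings.
import torch
import torch.nn.functional as F

def semantic_matching_loss(v_tilde: torch.Tensor,   # (N, P) normalized image embeddings
                           t_tilde: torch.Tensor,   # (N, P) normalized text embeddings
                           s: torch.Tensor,         # (N, N) label-based similarity
                           tau: torch.Tensor) -> torch.Tensor:
    logits = v_tilde @ t_tilde.t() / tau             # s_hat_ij / tau

    # Soft targets: softmax of s across texts (v->t) and across images (t->v).
    y_v2t = F.softmax(s, dim=1)
    y_t2v = F.softmax(s.t(), dim=1)

    # Cross entropy between predicted similarities and the soft targets.
    loss_v2t = -(y_v2t * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2v = -(y_t2v * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_v2t + loss_t2v)

# Example usage with a learnable temperature initialized at 0.07:
# tau = torch.nn.Parameter(torch.tensor(0.07))
# loss = semantic_matching_loss(v_tilde, t_tilde, s, tau)
```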

4 Experiments

We conduct extensive experiments on four X-ray datasets to answer the following questions:
  • Q1. Does the proposed pre-training method yield better zero-shot image recognition performance?
  • Q2. Does the knowledge-driven supervision, i.e., the semantic matching task, facilitate contrastive image-text pre-training?
  • Q3. Does MedCLIP bring better performance and label efficiency to downstream classification tasks with fine-tuning?
  • Q4. Are the learned embeddings good at cross-modal retrieval tasks?
  • Q5. What do the learned embeddings look like?

4.1 Datasets

CheXpert (Irvin et al., 2019) is a large dataset of chest X-rays with 14 observation labels collected from Stanford Hospital. Note that this dataset does not provide the corresponding medical reports to the public. We use the training split of this dataset for pre-training. For evaluation, we follow Huang et al. (2021) and sample a multi-class classification dataset from the testing split, namely CheXpert-5x200, which has 200 exclusively positive images for each of the five CheXpert competition tasks: Atelectasis, Cardiomegaly, Consolidation, Edema, and Pleural Effusion.
MIMIC-CXR (Johnson et al., 2019) is a large chest X-ray database with free-text radiology reports collected from the Beth Israel Deaconess Medical Center in Boston, MA. We use the training split of this dataset for pre-training. For evaluation, we also sample a MIMIC-5x200 dataset for the same five tasks above.
COVID (Rahman et al., 2021) is a publicly available X-ray dataset with COVID vs. non-COVID labels. The positive-to-negative ratio is roughly 1:1. We use this dataset for evaluation.
RSNA Pneumonia (Shih et al., 2019) is a collection of pneumonia cases found in the chest X-ray database made public by the National Institutes of Health. This is a binary classification dataset: pneumonia vs. normal. We sample a balanced subset (i.e., 1:1 positive-to-negative ratio) and use it for evaluation.

4.2 Baselines

Random is a ResNet-50 (He et al., 2015) model with its default random initialization.
ImageNet is a ResNet-50 (He et al., 2015) model with weights pretrained on the standard ImageNet ILSVRC-2012 task (Deng et al., 2009).
CLIP (Radford et al., 2021) is a vision-text contrastive learning framework pre-trained on a dataset of 400M image-text pairs collected from the internet.
ConVIRT (Zhang et al., 2020) performs vision-text contrastive learning in medicine. It employs a plain InfoNCE loss (Van den Oord et al., 2018) on paired X-rays and reports. We reproduce it following their paper, using a BioClinicalBERT text encoder and a ResNet-50 (He et al., 2016) vision encoder.
GLoRIA (Huang et al., 2021) entangles image sub-regions and words at inference time via cross-attention, which was argued to better capture key characteristics in images and reports. We implement it based on the official code and the provided pre-trained weights.

4.3 Implementation Details

We use BioClinicalBERT as the backbone text encoder and a Swin Transformer (Liu et al., 2021) with ImageNet (Deng et al., 2009) pre-trained weights as the backbone vision encoder.
Table 1: Results of zero-shot image classification on four datasets. We additionally evaluate a prompt-ensemble version of each method (subscript ENS). We report the mean and standard deviation (STD) of accuracy (ACC) over five runs to account for the randomness of the prompt generation process. Best scores on each dataset are in bold.

| ACC (STD) | CheXpert-5x200 | MIMIC-5x200 | COVID | RSNA |
| :--- | :--- | :--- | :--- | :--- |
| CLIP$_{\text{ENS}}$ | 0.2016 (0.01) | 0.1918 (0.01) | 0.5069 (0.03) | 0.4989 (0.01) |
| ConVIRT$_{\text{ENS}}$ | 0.2036 (0.01) | 0.2254 (0.01) | 0.5090 (<0.01) | 0.5055 (0.01) |
| GLoRIA$_{\text{ENS}}$ | 0.4188 (0.01) | 0.4018 (0.01) | 0.5184 (0.01) | 0.4731 (0.05) |
| MedCLIP-ResNet | 0.4224 (0.02) | 0.4010 (0.02) | 0.6647 (0.05) | 0.4647 (0.08) |
| MedCLIP-ViT$_{\text{ENS}}$ | 0.4328 (0.01) | 0.3306 (0.01) | 0.7090 (0.04) | 0.5808 (0.08) |
| MedCLIP-ViT | 0.4210 (0.03) | 0.3382 (0.01) | 0.5702 (0.06) | 0.4752 (0.06) |
Table 2: Results of medical image classification after fine-tuning. Best scores are in bold.

| ACC | CheXpert-5x200 | MIMIC-5x200 | COVID | RSNA |
| :--- | :--- | :--- | :--- | :--- |
| Random | 0.2500 | 0.2220 | 0.5056 | 0.6421 |
| ImageNet | 0.3200 | 0.2830 | 0.6020 | 0.7560 |
| CLIP | 0.3020 | 0.2780 | 0.5866 | 0.7303 |
| ConVIRT | 0.4770 | 0.4040 | 0.6983 | 0.7846 |
| GLoRIA | 0.5370 | 0.3590 | 0.7623 | 0.7981 |
| MedCLIP | **0.5960** | **0.5650** | **0.7890** | **0.8075** |
Table 3: Statistics of the datasets used. Pos.%: positive sample ratio.

| Pretrain | # Images | # Reports | # Classes |
| :--- | :--- | :--- | :--- |
| MIMIC-CXR | 377,111 | 201,063 | - |
| CheXpert | 223,415 | - | 14 |

| Evaluation | # Train (Pos.%) | # Test (Pos.%) | # Classes |
| :--- | :--- | :--- | :--- |
| CheXpert-5x200 | 1,000 (-) | 1,000 (-) | 5 |
| MIMIC-5x200 | 1,000 (-) | 1,000 (-) | 5 |
| COVID | 2,162 (19%) | 3,000 (49%) | 2 |
| RSNA | 8,486 (50%) | 3,538 (50%) | 2 |
Both transformer-based models are drawn from the transformers library (Wolf et al., 2019). We also provide an ablation study with ResNet-50 (He et al., 2015) as the vision encoder, in line with previous works (Zhang et al., 2020; Huang et al., 2021).
MIMIC-CXR and CheXpert are used for pre-training, where we hold 5,000 samples out for evaluation. All images are padded to squares and then scaled to $224 \times 224$. For MIMIC-CXR, we combine the "Findings" and "Impression" sections of the reports and then split them into sentences, removing sentences with fewer than 3 words. We use a linear projection head with output dimension 512 and a learnable temperature $\tau$ initialized at 0.07. For image augmentation, we first scale the raw images to $256 \times 256$ and then apply: a random crop of size $224 \times 224$; horizontal flipping with probability 0.5; color jittering with brightness and contrast ratios from $[0.8, 1.2]$; and a random affine transformation with rotation degree sampled from $[-10, 10]$, maximum translation rate 0.0625, and scale factor in $[0.8, 1.1]$. Other hyperparameters are: learning rate 5e-5, batch size 100, weight decay 1e-4, 10 epochs, and learning rate warmup ratio 0.1. We employ mixed-precision training so that pre-training finishes in 8 hours on a single RTX 3090 GPU.
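For reference, the augmentation pipeline described above can be written with torchvision transforms roughly as follows; this is a sketch matching the stated hyperparameters, and the actual implementation may differ in details such as interpolation modes.

```python
# Sketch of the image augmentations listed in Sec. 4.3 (torchvision).
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),                                   # scale the (square) image to 256x256
    transforms.RandomCrop(224),                               # random 224x224 crop
    transforms.RandomHorizontalFlip(p=0.5),                   # flip with probability 0.5
    transforms.ColorJitter(brightness=(0.8, 1.2),             # brightness ratio in [0.8, 1.2]
                           contrast=(0.8, 1.2)),              # contrast ratio in [0.8, 1.2]
    transforms.RandomAffine(degrees=10,                       # rotation in [-10, 10] degrees
                            translate=(0.0625, 0.0625),       # max translation rate 0.0625
                            scale=(0.8, 1.1)),                # scale factor in [0.8, 1.1]
    transforms.ToTensor(),
])
```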

4.4 Q1. Zero-Shot Classification

We conduct zero-shot image classification evaluation on four datasets: CheXpert-5x200, MIMIC-5x200, COVID, and RSNA. The learned image and text encoders support zero-shot prediction by matching the encoded image embeddings against the embeddings of prompts created for each disease class. We report the results in Table 1.
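As an illustration of this zero-shot protocol, the sketch below scores images against one fixed prompt per class by cosine similarity of the projected embeddings. The prompt wording and the helper names (`encode_image`, `encode_text`, `tokenizer`) are assumptions, and the paper's prompt-ensemble variant would instead sample multiple prompts per class.

```python
# Sketch of zero-shot classification (Sec. 4.4): match the image embedding
# against embeddings of class prompts. Prompts and helper names are assumed.
import torch

CLASS_PROMPTS = {
    "Atelectasis":      "chest x-ray showing atelectasis",
    "Cardiomegaly":     "chest x-ray showing an enlarged cardiac silhouette",
    "Consolidation":    "chest x-ray showing consolidation",
    "Edema":            "chest x-ray showing pulmonary edema",
    "Pleural Effusion": "chest x-ray showing pleural effusion",
}

@torch.no_grad()
def zero_shot_predict(model, tokenizer, pixel_values):
    classes = list(CLASS_PROMPTS.keys())
    tokens = tokenizer([CLASS_PROMPTS[c] for c in classes],
                       padding=True, return_tensors="pt")
    text_emb = model.encode_text(tokens["input_ids"], tokens["attention_mask"])  # (C, P)
    img_emb = model.encode_image(pixel_values)                                   # (B, P)
    scores = img_emb @ text_emb.t()            # cosine similarity (embeddings are normalized)
    return [classes[i] for i in scores.argmax(dim=1).tolist()]
```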
Our method outperforms all the other baselines by a large margin. MedCLIP is capable of benefiting from the prompt ensemble to yield better performance. By contrast, the ensemble does not always have a positive effect on the other two methods; in particular, GLoRIA is usually harmed by the ensemble. One reason might be that ConVIRT and GLoRIA cannot differentiate false negatives during contrastive pre-training, and those false negatives are incorporated into the prompt ensemble, which confuses the model. Besides, we observe that the original CLIP model yields predictions that are essentially identical to random guessing on all datasets, which demonstrates the discrepancy between general internet image-text pairs and those in the medical domain.
Interestingly, MedCLIP yields over 0.8 ACC on the COVID data even though no COVID-19-positive image was available during pre-training. To enable the model to detect COVID-19 infection, we build the prompts from the descriptions proposed by Smith et al. (2020): the presence of patchy or confluent, band-like ground-glass opacity or consolidation in a peripheral and mid-to-lower lung zone distribution. This result demonstrates that the contrastive pre-training of MedCLIP provides transferability to out-of-domain classes.

4.5 Q2. Pre-training Data Efficiency

Data efficiency is a key challenge for CLIP-based methods. CLIP uses 400M image-text pairs in the training phase, which is not just computationally expensive but also infeasible in the medical domain due to limited data. To evaluate the data efficiency of MedCLIP, we subsample the pre-training data to 20K, 50K, and 200K, pre-train MedCLIP on each subset, and record the resulting zero-shot prediction accuracy on the CheXpert-5x200 data. Results are shown in Fig. 1.
We surprisingly find that with 20K data, MedCLIP yields superior performance over GLoRIA, which learns from the whole CheXpert dataset (around 200K image-text pairs). Likewise, MedCLIP beats ConVIRT, which uses 369K data. When we include more training data, MedCLIP obtains a further lift in accuracy. We do not observe saturation of the zero-shot ACC at 570K samples (MIMIC-CXR plus CheXpert). This signifies the great capacity of MedCLIP for learning from multi-source data.

4.6 Q3. Fine-tune for Classification

We aim to evaluate the transferability of the learned model to downstream supervised tasks. We freeze the image encoder and fine-tune a randomly initialized linear classification head on the training data with a cross-entropy loss. Results are in Table 2. MedCLIP still achieves the best performance among all methods. What is more, contrasting Table 2 with Table 1, we surprisingly find that MedCLIP's zero-shot predictions are comparable with supervised learning models.
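A minimal sketch of this linear-probe protocol is given below: the pre-trained image encoder is frozen and only a randomly initialized linear head is trained with cross entropy. The data-loader interface, learning rate, and `encode_image` helper are assumptions.

```python
# Sketch of the fine-tuning protocol (Sec. 4.6): frozen image encoder,
# trainable linear classification head, cross-entropy loss.
import torch
import torch.nn as nn

def linear_probe(model, train_loader, num_classes, proj_dim=512,
                 epochs=10, lr=5e-4, device="cuda"):
    for p in model.parameters():          # freeze the pre-trained encoders
        p.requires_grad_(False)
    head = nn.Linear(proj_dim, num_classes).to(device)  # randomly initialized head
    optimizer = torch.optim.Adam(head.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = model.encode_image(images.to(device))  # frozen features
            loss = criterion(head(feats), labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return head
```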

Figure 4: Embeddings visualization of CheXpert5x200 images by CLIP and MedCLIP. Dimension reduced by t-SNE.
On all datasets, zero-shot MedCLIP performs better than the other supervised models. Specifically, on COVID, zero-shot MedCLIP performs better than its supervised counterpart, which demonstrates the advantage of MedCLIP in low-resource scenarios. In contrast, without pre-training on medical-domain data, the ResNet baselines achieve inferior performance.

4.7 Q4. Image-Text Retrieval

We choose CheXpert-5x200 to evaluate the semantic richness of the representations learned by all models through the image-text retrieval task. Since CheXpert-5x200 does not have publicly available report data, we use the MIMIC-CXR dataset to obtain reports/sentences. We sample 200 sentences for each of the 5 classes present in the CheXpert-5x200 dataset. This gives 1,000 images and 1,000 sentences as the retrieval dataset. We use Precision@$K$ to calculate the precision among the top $K$ retrieved reports/sentences by checking whether a retrieved report belongs to the same category as the query image.
Table 4: Results of image-text retrieval on the CheXpert-5x200 dataset. We use Precision@{1, 2, 5, 10} to measure performance. Best results are in bold.

| Model | P@1 | P@2 | P@5 | P@10 |
| :--- | :--- | :--- | :--- | :--- |
| CLIP | 0.21 | 0.20 | 0.20 | 0.19 |
| ConVIRT | 0.20 | 0.20 | 0.20 | 0.21 |
| GLoRIA | **0.47** | 0.47 | 0.46 | 0.46 |
| MedCLIP | 0.45 | **0.49** | **0.48** | **0.50** |
We display the results in Table 4. MedCLIP achieves the best performance across all methods, indicating that our method effectively provides the semantic information required to retrieve texts. We also find that MedCLIP's precision increases with larger $K$; an analysis of this phenomenon is presented in Appendix A.
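For clarity, a small sketch of the Precision@$K$ computation used here: for each query image, retrieve the $K$ most similar sentences by embedding similarity and count how many share the query's class label. Variable names are illustrative.

```python
# Sketch of Precision@K for image-to-text retrieval (Sec. 4.7).
import torch

def precision_at_k(img_emb: torch.Tensor,     # (N_img, P) normalized image embeddings
                   txt_emb: torch.Tensor,     # (N_txt, P) normalized sentence embeddings
                   img_labels: torch.Tensor,  # (N_img,) class id of each query image
                   txt_labels: torch.Tensor,  # (N_txt,) class id of each sentence
                   k: int = 10) -> float:
    sims = img_emb @ txt_emb.t()                      # cosine similarities
    topk = sims.topk(k, dim=1).indices                # indices of top-K sentences per image
    hits = (txt_labels[topk] == img_labels[:, None])  # retrieved sentence matches query class
    return hits.float().mean().item()                 # averaged over queries and ranks

# Example: precision_at_k(img_emb, txt_emb, img_labels, txt_labels, k=5)
```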

4.8 Q5. Embedding Visualization

We also demonstrate the effectiveness of our representation learning framework by plotting t-SNE (Van der Maaten and Hinton, 2008) projections of the image embeddings produced for CheXpert-5x200 images, compared with those of the CLIP model. As visible in Fig. 4, our model produces better-clustered representations, whereas the CLIP t-SNE plot is homogeneous. This is because most medical X-rays overlap heavily and differ only in small lesion regions. Nonetheless, MedCLIP still forms clusters by lesion type.
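The qualitative comparison can be reproduced with a standard t-SNE projection of the image embeddings, e.g., with scikit-learn as sketched below; the perplexity and plotting choices are illustrative.

```python
# Sketch of the embedding visualization (Sec. 4.8): t-SNE of image embeddings
# colored by disease class. Hyperparameters are illustrative.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(image_embeddings, class_names, out_path="tsne_chexpert5x200.png"):
    # image_embeddings: (N, P) numpy array; class_names: length-N list of labels
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(image_embeddings)
    for cls in sorted(set(class_names)):
        idx = [i for i, c in enumerate(class_names) if c == cls]
        plt.scatter(coords[idx, 0], coords[idx, 1], s=5, label=cls)
    plt.legend(markerscale=3)
    plt.savefig(out_path, dpi=200)
```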

5 Conclusion

In this work, we propose a decoupled medical image-text contrastive learning framework named MedCLIP. It significantly expands the training data size in a combinatorial manner. Meanwhile, the introduction of medical knowledge helps alleviate false negatives. As a result, MedCLIP yields excellent pre-training data efficiency: it wins over the state-of-the-art baseline by 1% ACC with around 10× fewer data. Moreover, we verify the prominence of MedCLIP on zero-shot prediction, supervised classification, and image-text retrieval tasks. We expect it to support a foundation model for the medical domain and to handle medical diagnosis of diverse diseases with low resource requirements.

Acknowledgement

This work was supported by NSF awards SCH-2205289, SCH-2014438, and IIS-1838042, and NIH award R01 1R01NS107291-01.

Limitations

This work leverages medical domain knowledge to decouple contrastive learning on medical images and texts, which significantly expands the training data available for pre-training. Meanwhile, the proposed knowledge-guided semantic matching loss corrects the false negatives that appear in naive contrastive learning. It still encounters failure cases where incorrect semantic tags are detected, or where negation and uncertainty phrases are missed. A remedy could be to introduce learning-from-noisy-data techniques (Wang et al., 2020, 2022) to alleviate the noise in the extracted semantic similarity matrix.
Another concern is that although MedCLIP reaches zero-shot prediction accuracy comparable to its fine-tuned counterpart, it is still not fully ready for practical use. We suppose the reasons include (1) prompt-based inference relies on prompt quality, and (2) more pre-training data is desired to further enhance pre-training. For point (1), it is promising to leverage prompt-learning methods (Zhou et al., 2022) to automate applying the model to downstream tasks instead of manual prompt engineering.

References

Alan R Aronson and François-Michel Lang. 2010. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3):229-236.
Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. 2019. A theoretical analysis of contrastive unsupervised representation learning. In 36th International Conference on Machine Learning, pages 9904-9923. International Machine Learning Society (IMLS).
O Bodenreider. 2004. The unified medical language system (umls): integrating biomedical terminology. Nucleic Acids Research, 32(Database issue):D267-70.
Erdi Çallı, Ecem Sogancioglu, Bram van Ginneken, Kicky G van Leeuwen, and Keelin Murphy. 2021. Deep learning for chest x-ray analysis: A survey. Medical Image Analysis, 72:102125.
Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, and Changhu Wang. 2021. Learning the best pooling strategy for visual semantic embedding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15789-15798.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255.
Karan Desai and Justin Johnson. 2021. Virtex: Learning visual representations from textual annotations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11162-11173.
FDA. 2022. Medical imaging.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778.
Shih-Cheng Huang, Liyue Shen, Matthew P Lungren, and Serena Yeung. 2021. GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition. In IEEE/CVF International Conference on Computer Vision, pages 3942-3951.
Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849.
Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. 2019. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 590-597.
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904-4916. PMLR.
Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chihying Deng, Roger G Mark, and Steven Horng. 2019. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):1-8.
Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache. 2016. Learning visual features from large weakly supervised data. In European Conference on Computer Vision, pages 67-84. Springer.
Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661-18673.
Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583-5594. PMLR.
Ang Li, Allan Jabri, Armand Joulin, and Laurens van der Maaten. 2017. Learning visual n-grams from web data. In IEEE International Conference on Computer Vision, pages 4183-4192.
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pretraining for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086.
Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. 2021. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208.
Fenglin Liu, Yuanxin Liu, Xuancheng Ren, Xiaodong He, and Xu Sun. 2019. Aligning visual regions and textual concepts for semantic-grounded image representations. Advances in Neural Information Processing Systems, 32.
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows.
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32.
Yifan Peng, Xiaosong Wang, Le Lu, Mohammadhadi Bagheri, Ronald Summers, and Zhiyong Lu. 2018. NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Summits on Translational Science Proceedings, 2018:188.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763. PMLR.
Tawsifur Rahman, Amith Khandakar, Yazan Qiblawey, Anas Tahir, Serkan Kiranyaz, Saad Bin Abul Kashem, Mohammad Tariqul Islam, Somaya Al Maadeed, Susu M Zughaier, Muhammad Salman Khan, et al. 2021. Exploring the effect of image enhancement techniques on covid-19 detection using chest x-ray images. Computers in Biology and Medicine, 132:104319.
Mert Bulent Sariyildiz, Julien Perez, and Diane Larlus. 2020. Learning visual representations with caption annotations. In European Conference on Computer Vision, pages 153-170. Springer.
Sheng Shen, Chunyuan Li, Xiaowei Hu, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Anna Rohrbach, Zhe Gan, Lijuan Wang, Lu Yuan, et al. 2022. K-lite: Learning transferable visual models with external knowledge. arXiv preprint arXiv:2204.09222.
George Shih, Carol C Wu, Safwan S Halabi, Marc D Kohli, Luciano M Prevedello, Tessa S Cook, Arjun Sharma, Judith K Amorosa, Veronica Arteaga, Maya Galperin-Aizenberg, et al. 2019. Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia. Radiology: Artificial Intelligence, 1(1):e180041.
Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. 2021. FLAVA: A foundational language and vision alignment model. arXiv preprint arXiv:2112.04482.
David L Smith, John-Paul Grenier, Catherine Batte, and Bradley Spieler. 2020. A characteristic chest radiographic pattern in the setting of the covid-19 pandemic. Radiology: Cardiothoracic Imaging, 2(5):e200280.
Aaron Van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of Machine Learning Research, 9(11).
Haobo Wang, Ruixuan Xiao, Yixuan Li, Lei Feng, Gang Niu, Gang Chen, and Junbo Zhao. 2022. Pico: Contrastive label disambiguation for partial label learning. In International Conference on Learning Representations.
Wenhui Wang, Hangbo Bao, Li Dong, and Furu Wei. 2021a. VLMo: Unified vision-language pretraining with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358.
Xiaosong Wang, Ziyue Xu, Leo Tam, Dong Yang, and Daguang Xu. 2021b. Self-supervised image-text pre-training with mixed data in chest x-rays. arXiv preprint arXiv:2103.16022.
Zifeng Wang and Jimeng Sun. 2022. Transtab: Learning transferable tabular transformers across tables. arXiv preprint arXiv:2205.09328.
Zifeng Wang, Hong Zhu, Zhenhua Dong, Xiuqiang He, and Shao-Lun Huang. 2020. Less is better: Unweighted data subsampling via influence function. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 6340-6347.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, and Wei-Ying Ma. 2019. Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6609-6618.
Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. 2018. Unsupervised feature learning via nonparametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733-3742.
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. 2022. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917.
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. 2021. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432.
Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. 2020. Contrastive learning of medical visual representations from paired images and text. arXiv preprint arXiv:2010.00747.
Mingkai Zheng, Fei Wang, Shan You, Chen Qian, Changshui Zhang, Xiaogang Wang, and Chang Xu. 2021. Weakly supervised contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10042-10051.
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816-16825.
Table 5: 14 main finding types used in this paper.

Finding types
No Finding
Enlarged Cardiomediastinum
Cardiomegaly
Lung Opacity
Lung Lesion
Edema
Consolidation
Pneumonia
Atelectasis
Pneumothorax
Pleural Effusion
Pleural Other
Fracture
Support Devices

A Analysis of Image-text retrieval results

We observe an increase in the MedCLIP Precision@K metric in Table 4 as K grows, which is somewhat counterintuitive. As mentioned in Section 4.7, our image-text retrieval task uses the CheXpert-5x200 dataset for images and MIMIC-CXR for text. Each side, i.e., the images and the text reports/sentences, contains 1,000 samples, with 200 images and 200 sentences/reports per class in CheXpert-5x200.
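For concreteness, the Precision@K numbers in Table 4 reduce to a few lines of NumPy once the MedCLIP embeddings are extracted. The sketch below is only an illustrative re-implementation under our own assumptions, not the exact evaluation script in our repository: `img_emb`, `txt_emb`, `img_labels`, and `txt_labels` are hypothetical arrays holding L2-normalized embeddings and the class index of each of the 1,000 images and 1,000 sentences/reports.

```python
import numpy as np

def precision_at_k(img_emb, txt_emb, img_labels, txt_labels, k=10):
    """Image-to-text Precision@K with cosine similarity.

    img_emb: (N_img, d) L2-normalized image embeddings.
    txt_emb: (N_txt, d) L2-normalized text embeddings.
    img_labels / txt_labels: class index of each image / text.
    """
    # For L2-normalized vectors, cosine similarity is a plain dot product.
    sim = img_emb @ txt_emb.T                    # (N_img, N_txt)
    # Indices of the top-k most similar texts for every query image.
    topk = np.argsort(-sim, axis=1)[:, :k]       # (N_img, k)
    # A retrieved text is a hit if it shares the class of the query image.
    hits = txt_labels[topk] == img_labels[:, None]
    return hits.mean()

# Placeholder label layout mirroring the 5 classes x 200 samples setup:
# img_labels = txt_labels = np.repeat(np.arange(5), 200)
# print(precision_at_k(img_emb, txt_emb, img_labels, txt_labels, k=10))
```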
After running the main experiment, we plot several graphs to gain insight. The setup is as follows: we pick a class (e.g., Atelectasis), and for each image in that class we take the top-10 retrieved texts (ranked by cosine similarity) and keep the sentences/reports that belong to the same class as the query image, together with their cosine similarity scores. We then plot a per-class histogram whose x-axis is the cosine similarity and whose bin heights count the retrieved same-class texts falling into each similarity range. Several such plots are shown in Figure 5.
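The per-class histograms in Figure 5 come from the same similarity matrix. The following sketch, again using the hypothetical `sim`, `img_labels`, and `txt_labels` arrays introduced above, gathers the similarity scores of the same-class texts that land in each image's top-10 list and plots their distribution for one class.

```python
import numpy as np
import matplotlib.pyplot as plt

def same_class_topk_scores(sim, img_labels, txt_labels, cls, k=10):
    """Similarity scores of same-class texts retrieved in the top-k lists."""
    scores = []
    for i in np.where(img_labels == cls)[0]:     # every image of this class
        topk = np.argsort(-sim[i])[:k]           # its k most similar texts
        keep = topk[txt_labels[topk] == cls]     # keep only same-class texts
        scores.extend(sim[i, keep].tolist())
    return scores

# e.g., cls=0 could index Atelectasis in CheXpert-5x200
# scores = same_class_topk_scores(sim, img_labels, txt_labels, cls=0)
# plt.hist(scores, bins=30)
# plt.xlabel("cosine similarity")
# plt.ylabel("# retrieved same-class texts")
# plt.show()
```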
Intuitively, each histogram counts the texts that would contribute to Precision@K for a given class under a particular cosine similarity threshold. As the histograms show, most same-class texts cluster around relatively small similarity scores. As K increases, the implicit cut-off similarity (threshold) decreases, so more same-class texts from this low-similarity region start showing up in the retrieval results, which explains why Precision@K increases with K. In this sense, it would be worthwhile to further investigate and address the non-smooth, anisotropic distribution of the image and text embeddings.
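To make the threshold argument concrete, the implicit cut-off for Precision@K is the K-th largest similarity score of each query image, which can only decrease as K grows. A minimal sketch under the same hypothetical `sim` matrix used above:

```python
import numpy as np

def implicit_cutoff(sim, k):
    """Mean K-th largest similarity over query images, i.e. the score a text
    must reach to be counted in Precision@K."""
    return np.sort(sim, axis=1)[:, -k].mean()

# for k in (1, 2, 5, 10):
#     print(k, round(implicit_cutoff(sim, k), 3))
# The cut-off is non-increasing in k, so texts from the low-similarity bulk of
# the histograms gradually enter the top-k results as k grows.
```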


Figure 5: Visualization of the similarity distributions computed based on MedCLIP embeddings. (a) Histogram for the Atelectasis class; (b) histogram for the Cardiomegaly class; (c) histogram for the Consolidation class.

1. Our code is available at https://github.com/RyanWangZf/MedCLIP.