
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

Zifeng Wang$^{1}$, Zhenbang Wu$^{1}$, Dinesh Agarwal$^{1,3}$, Jimeng Sun$^{1,2}$
$^{1}$ Department of Computer Science, University of Illinois Urbana-Champaign
$^{2}$ Carle Illinois College of Medicine, University of Illinois Urbana-Champaign
$^{3}$ Adobe
{zifengw2, zw12, jimeng}@illinois.edu, diagarwa@adobe.com

Abstract

Existing vision-text contrastive learning like CLIP (Radford et al., 2021) aims to match the paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude smaller than the general images and captions available from the internet. Moreover, previous methods encounter many false negatives, i.e., images and reports from separate patients probably carry the same semantics but are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning, thus scaling the usable training data combinatorially at low cost. We also propose to replace the InfoNCE loss with a semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We show that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Surprisingly, we observe that with only 20K pre-training data, MedCLIP wins over the state-of-the-art method (using $\approx 200\mathrm{K}$ data)$^{1}$.

1 Introduction

Medical images such as X-rays, CTs, and MRIs are commonly used to diagnose, monitor, or treat medical conditions in clinical practice (FDA, 2022). With the rapid growth of medical images and the corresponding report data, researchers have developed various deep learning models to support clinical decision making (Çallı et al., 2021).
Recently, large-scale image-text pre-training, e.g., CLIP (Radford et al., 2021), has achieved considerable successes in computer vision and natural language processing domains. CLIP is trained to predict the correct matching of a batch of images
Figure 1: Zero-shot performance of MedCLIP, ConVIRT (Zhang et al., 2020), and GLoRIA (Huang et al., 2021) when using different amounts of data for pre-training. ConVIRT and GLoRIA are trained on the MIMIC-CXR (369K) and CheXpert (191K) datasets, respectively. Our method yields superior ACC to GLoRIA using nearly $1/10$ of the pre-training data.

and text training examples. The joint training of image and text representations on large-scale image-text pairs generates transferable representations and supports flexible downstream tasks. Inspired by the success of CLIP, we believe the knowledge jointly learned from medical images and reports should be helpful for downstream clinical tasks.
However, adopting vision-text pre-training in the medical domain is a non-trivial task due to (1) CLIP's (Radford et al., 2021) data-hungry nature: CLIP is trained on a dataset of 400M image-text pairs collected from the internet, while the total number of publicly available medical images and reports is orders of magnitude smaller; and (2) the specificity of medical images and reports: compared to general domains (e.g., "cat" v.s. "dog"), the differences within medical domains are more subtle and fine-grained (e.g., "pneumonia" v.s. "consolidation"). In a nutshell, it is necessary to (1) address the data insufficiency issue; and (2) capture the subtle yet crucial medical meanings.

Figure 2: Demonstration of challenges in medical image-text contrastive learning. (1) Pre-training data only includes paired images and texts. However, many more image-only and text-only datasets are ignored. (2) False negatives appear. For an anchor image, previous methods treat paired texts (i.e., reports from the same patient’s study) as positives and unpaired texts (i.e., reports from other patients’ studies) as negatives. However, the negative texts can describe the same symptoms as the anchor texts.
Existing works try to tackle the challenges above in different ways. ConVIRT (Zhang et al., 2020) jointly trains the vision and text encoders with the paired medical images and reports via a bidirectional contrastive objective; GLoRIA (Huang et al., 2021) further models both the global and local interactions between medical images and reports to capture the pathology meanings from specific image regions. However, both works have significant limitations, as illustrated in Fig. 2.
  • Limited usable data. Most medical image datasets provide only diagnostic labels instead of the raw reports. However, both works need paired images and reports, leaving a vast number of image-only and text-only medical datasets unused.
  • False negatives in contrastive learning. Both methods try to push image and text embeddings from different patients apart. However, even though some reports do not belong to the target patient's study, they can still describe the same symptoms and findings. Simply treating the other reports as negative samples brings noise to the supervision and confuses the model.
To handle the above challenges, we propose a simple yet effective approach, namely MedCLIP. It has the following contributions:
  • Decoupling images and texts for contrastive learning. We extend the pre-training to cover the massive unpaired image and text datasets, which scales the amount of training data in a combinatorial manner. It opens a new direction: expanding multi-modal learning based on medical knowledge instead of expensively scaling up data.
  • Eliminating false negatives via medical knowledge. We observe that images and reports from separate patients’ studies may carry the same semantics but are falsely treated as negatives by previous methods. Hence, we design a soft semantic matching loss that uses the medical semantic similarity between each image and report as the supervision signal. This approach equips the model with the ability to capture the subtle yet crucial medical meanings.
We conduct a comprehensive evaluation of MedCLIP across four public datasets. Results show that MedCLIP reaches extremely high data efficiency, as shown in Fig. 1. Our method obtains better performance than the state-of-the-art GLoRIA (Huang et al., 2021) using only $10\%$ of the pre-training data. Extensive experiments verify MedCLIP's transferability to various downstream tasks. It wins over baselines by a large margin: over $10\%$ improvement of

prediction ACC for zero-shot prediction and supervised image classification tasks on average, and over $2\%$ improvement of retrieval precision. Details are in §4.

2 Related Work

Vision-text representation learning was shown to learn good visual representations (Joulin et al., 2016; Li et al., 2017; Sariyildiz et al., 2020; Desai and Johnson, 2021; Kim et al., 2021; Wang et al., 2021a). However, all of these works use paired images and captions from the general domain, e.g., Flickr (Joulin et al., 2016) and COCO Captions (Desai and Johnson, 2021). Moreover, these methods do not support cross-modal retrieval and hence do not support zero-shot prediction either.
Many works propose to learn visual-semantic embeddings for vision-text retrieval (Liu et al., 2019; Wu et al., 2019; Lu et al., 2019; Huang et al., 2020; Chen et al., 2021) via attention or object detection models; via vision-text contrastive learning (Zhang et al., 2020; Jia et al., 2021; Yuan et al., 2021; Yu et al., 2022); or via multiple vision and text supervision (Singh et al., 2021; Li et al., 2022). They all work on the general domain, where near-infinite web images and captions are available, which dwarfs the scale of medical image-text data. This challenge hinders self-supervised contrastive learning for large vision-text transformers in medicine. Though remedies like data augmentation (Li et al., 2021) and knowledge graphs (Shen et al., 2022) were proposed, the magnitude of data used is still far larger than what is available in the medical domain.
Medical image-text representation learning has also been investigated based on contrastive learning (Zhang et al., 2020; Huang et al., 2021; Wang et al., 2021b). Nonetheless, these methods all work on paired medical images and texts, so they still face the data scarcity challenge. Moreover, they all suffer from false-negative noise when adopting noise contrastive estimation (NCE) (Van den Oord et al., 2018) to perform instance discrimination (Wu et al., 2018), which undermines the representation quality (Arora et al., 2019; Zheng et al., 2021). Our work bridges the gap by making full use of all available medical data to support medical image-text pre-training. We also harness medical knowledge tailored to eliminate false negatives in contrastive learning to improve pre-training data efficiency.

3 Method

In this section, we present the technical details of MedCLIP following the flow in Fig. 3. MedCLIP consists of three components: (1) knowledge extraction, which builds the semantic similarity matrix; (2) vision and text encoders, which extract embeddings; and (3) the semantic matching loss, which trains the whole model.

3.1 Vision and Text Encoder

MedCLIP consists of one visual encoder and one text encoder.
Vision Encoder. We encode images into embeddings $\mathbf{v} \in \mathbb{R}^{D}$ using a vision encoder $E_{img}$. A projection head then maps the raw embeddings to $\mathbf{v}_{p} \in \mathbb{R}^{P}$:
$$\mathbf{v} = E_{img}(\mathbf{x}_{img}), \qquad \mathbf{v}_{p} = f_{v}(\mathbf{v})$$
where $f_{v}$ is the projection head of the vision encoder.
Text Encoder. We create clinically meaningful text embeddings $\mathbf{t} \in \mathbb{R}^{M}$ with a text encoder and project them to $\mathbf{t}_{p} \in \mathbb{R}^{P}$ as
$$\mathbf{t} = E_{txt}(\mathbf{x}_{txt}), \qquad \mathbf{t}_{p} = f_{t}(\mathbf{t})$$
where $f_{t}$ is the projection head and $E_{txt}$ denotes the text encoder. This gives the same embedding dimension $P$ as the vision encoder, suitable for contrastive learning.
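To make the encoder design concrete, here is a minimal PyTorch-style sketch of the dual encoders with their projection heads. The backbone wiring, the 512-dimensional projection size, and the use of the [CLS] token for text pooling are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of MedCLIP-style dual encoders (illustrative; not the official code).
import torch.nn as nn
from torchvision.models import swin_t
from transformers import AutoModel

class ProjectionHead(nn.Module):
    """Linear projection into the shared embedding space R^P."""
    def __init__(self, in_dim: int, proj_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, proj_dim, bias=False)

    def forward(self, x):
        return self.proj(x)

class VisionEncoder(nn.Module):
    def __init__(self, proj_dim: int = 512):
        super().__init__()
        backbone = swin_t(weights="IMAGENET1K_V1")      # Swin Transformer, ImageNet init
        feat_dim = backbone.head.in_features             # D
        backbone.head = nn.Identity()                    # expose raw embedding v
        self.backbone = backbone
        self.f_v = ProjectionHead(feat_dim, proj_dim)    # f_v

    def forward(self, images):                           # images: (B, 3, H, W)
        v = self.backbone(images)                        # v in R^D
        return self.f_v(v)                               # v_p in R^P

class TextEncoder(nn.Module):
    def __init__(self, proj_dim: int = 512,
                 name: str = "emilyalsentzer/Bio_ClinicalBERT"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)  # BioClinicalBERT backbone
        self.f_t = ProjectionHead(self.backbone.config.hidden_size, proj_dim)  # f_t

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        t = out.last_hidden_state[:, 0]                  # [CLS] embedding, t in R^M
        return self.f_t(t)                               # t_p in R^P
```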

3.2 Decouple Image-Text Pairs with Medical Knowledge Extractor

Paired medical image-text datasets are orders of magnitude smaller than general paired image-text data (e.g., from the internet) due to the significant expense of supplying high-quality annotations by medical specialists as well as privacy and legal concerns. To enhance medical multi-modal learning, we want to make full use of all existing medical image-text, image-only, and text-only datasets. The challenge is that for image-only and text-only datasets, CLIP-like contrastive learning is infeasible. Also, we want to uncover all positive pairs to eliminate false negatives.
Suppose we have $n$ paired image-text samples, $m$ labeled images, and $h$ medical sentences. Previous methods are only able to use the $n$ paired samples.

Figure 3: The workflow of MedCLIP. The knowledge extraction module extracts medical entities from raw medical reports. Then, a semantic similarity matrix is built by comparing medical entities (from text) and raw labels (from images), which enables pairing arbitrary two separately sampled images and texts. The extracted image and text embeddings are paired to match the semantic similarity matrix.
By contrast, we decouple the $n$ paired samples into $n$ images and $n$ sentences, respectively. Ultimately, we are able to obtain $(n+m) \times (n+h)$ image-text pairs by traversing all possible combinations, which results in $\frac{(n+m) \times (n+h)}{n}\times$ more supervision. For instance, in Fig. 2, the previous method pre-trains on 2 image-text pairs while MedCLIP is capable of exploiting $(2+3) \times (2+3) = 25$ image-text pairs in total.
To supply this additional supervision, we propose to leverage external medical knowledge to build a knowledge-driven semantic similarity. Unlike previous works that treat all positive samples equally (Khosla et al., 2020; Zheng et al., 2021; Wang and Sun, 2022), here we propose to differentiate samples via their semantic similarities.
In particular, we split raw reports into sentences $\mathbf{x}_{txt}$. MetaMap (Aronson and Lang, 2010) is used to extract entities defined in the Unified Medical Language System (Bodenreider, 2004) from the raw sentences. Following the practice of Peng et al. (2018), we focus on the 14 main entity types shown in Table 5. Likewise, for images with diagnosis labels, we leverage MetaMap to map the raw classes to UMLS concepts so that they are aligned with entities from texts, e.g., "Normal" maps to "No Findings". We build multi-hot vectors from the extracted entities for images and texts, denoted $\mathbf{l}_{img}$ and $\mathbf{l}_{txt}$, respectively. Therefore, we unify the semantics of images and texts. For any sampled $\mathbf{x}_{img}$ and $\mathbf{x}_{txt}$, we can measure their semantic similarity by comparing the corresponding $\mathbf{l}_{img}$ and $\mathbf{l}_{txt}$.
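As a rough illustration of this step, the sketch below builds multi-hot entity vectors over a small entity vocabulary and compares them by cosine similarity; the vocabulary shown and the naive keyword matcher standing in for MetaMap/UMLS extraction are assumptions made purely for demonstration.

```python
# Illustrative sketch of building l_img / l_txt and their semantic similarity.
# A naive keyword matcher stands in for MetaMap/UMLS extraction here (assumption).
import numpy as np

# Hypothetical entity vocabulary; the paper uses 14 UMLS entity types (Table 5).
ENTITIES = ["atelectasis", "cardiomegaly", "consolidation", "edema",
            "pleural effusion", "pneumonia", "pneumothorax", "no findings"]

def to_multihot(found: set) -> np.ndarray:
    """Turn a set of extracted entity names into a multi-hot vector."""
    return np.array([1.0 if e in found else 0.0 for e in ENTITIES])

def extract_entities(sentence: str) -> set:
    """Placeholder for MetaMap: naive substring matching against the vocabulary."""
    s = sentence.lower()
    return {e for e in ENTITIES if e in s}

def semantic_similarity(l_img: np.ndarray, l_txt: np.ndarray) -> float:
    """Cosine similarity between the multi-hot label vectors (the pairing signal)."""
    denom = np.linalg.norm(l_img) * np.linalg.norm(l_txt)
    return float(l_img @ l_txt / denom) if denom > 0 else 0.0

# Example: an image labeled "Edema" and a report sentence from a *different* patient.
l_img = to_multihot({"edema"})
l_txt = to_multihot(extract_entities("There is mild pulmonary edema."))
print(semantic_similarity(l_img, l_txt))  # 1.0: a valid positive despite being unpaired
```

This is exactly the false-negative case in Fig. 2: the unpaired report shares the anchor image's semantics and is assigned a high similarity instead of being treated as a negative.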

3.3 Semantic Matching Loss

We bridge the images and texts through the built semantic labels $\mathbf{l}_{img}$ and $\mathbf{l}_{txt}$. During each iteration, we sample $N_{batch}$ input images $\{\mathbf{x}_{img}\}$ and texts $\{\mathbf{x}_{txt}\}$ separately. Instead of defining positive pairing by searching for equivalent labels, we propose to build soft targets from the similarity $s$:
$$s = \frac{\mathbf{l}_{img}^{\top} \cdot \mathbf{l}_{txt}}{\|\mathbf{l}_{img}\| \cdot \|\mathbf{l}_{txt}\|}$$
$s$ thus indicates the medical semantic similarity.

For an image $i$, we obtain a set of $s_{ij}$, where $j = 1 \ldots N_{batch}$ corresponds to the batch of texts. The soft target is computed by normalizing across $j$ with softmax:
$$y_{ij}^{v \rightarrow t} = \frac{\exp s_{ij}}{\sum_{j=1}^{N_{batch}} \exp s_{ij}}$$
Similarly, the reversed text-to-image soft targets are obtained by
$$y_{ji}^{t \rightarrow v} = \frac{\exp s_{ji}}{\sum_{i=1}^{N_{batch}} \exp s_{ji}}$$
The logits are obtained by cosine similarities between image and text embeddings:
$$\hat{s}_{ij} = \tilde{\mathbf{v}}_{i}^{\top} \cdot \tilde{\mathbf{t}}_{j}$$
where $\tilde{\mathbf{v}}_{i}$ and $\tilde{\mathbf{t}}_{j}$ are the normalized $\mathbf{v}_{p}$ and $\mathbf{t}_{p}$, respectively. The predicted similarity is also obtained by the softmax function
$$\hat{y}_{ij} = \frac{\exp(\hat{s}_{ij} / \tau)}{\sum_{i=1}^{N_{batch}} \exp(\hat{s}_{ij} / \tau)}$$
where $\tau$ is the temperature, initialized at 0.07. The semantic matching loss is hence the cross entropy between the predicted similarities and the soft targets:
$$\mathcal{L}^{v \rightarrow l} = -\frac{1}{N_{batch}} \sum_{i=1}^{N_{batch}} \sum_{j=1}^{N_{batch}} y_{ij} \log \hat{y}_{ij}.$$
Likewise, we can compute $\mathcal{L}^{l \rightarrow v}$ and then arrive at
$$\mathcal{L} = \frac{\mathcal{L}^{v \rightarrow l} + \mathcal{L}^{l \rightarrow v}}{2}$$
as the final training objective.
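A minimal sketch of this objective is given below, assuming the projected embeddings and multi-hot label vectors from the previous subsections are available as batched tensors; the function name and shapes are illustrative only.

```python
# Minimal sketch of the semantic matching loss (illustrative; not the official code).
import torch
import torch.nn.functional as F

def semantic_matching_loss(v_p, t_p, l_img, l_txt, tau: float = 0.07):
    """
    v_p:   (N, P) projected image embeddings
    t_p:   (N, P) projected text embeddings
    l_img: (N, K) multi-hot image labels
    l_txt: (N, K) multi-hot text labels
    """
    # Knowledge-driven similarity s_ij between every sampled image i and text j.
    s = F.normalize(l_img, dim=-1) @ F.normalize(l_txt, dim=-1).T        # (N, N)

    # Soft targets: softmax over texts (image->text) and over images (text->image).
    y_v2t = F.softmax(s, dim=1)
    y_t2v = F.softmax(s.T, dim=1)

    # Predicted similarities from normalized embeddings, scaled by temperature tau.
    s_hat = F.normalize(v_p, dim=-1) @ F.normalize(t_p, dim=-1).T        # (N, N)

    # Cross entropy between predictions and soft targets, averaged over both directions.
    loss_v2t = -(y_v2t * F.log_softmax(s_hat / tau, dim=1)).sum(dim=1).mean()
    loss_t2v = -(y_t2v * F.log_softmax(s_hat.T / tau, dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_v2t + loss_t2v)
```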

4 Experiments

We conduct extensive experiments on four X-ray datasets to answer the following questions:
  • Q1. Does the proposed pre-training method yield better zero-shot image recognition performance?
  • Q2. Does the knowledge-driven supervision, i.e., the semantic matching task, facilitate contrastive image-text pre-training?
  • Q3. Does MedCLIP bring better performance and label efficiency for downstream classification tasks with fine-tuning?
  • Q4. Are the learned embeddings good at cross-modal retrieval tasks?
  • Q5. What do the learned embeddings look like?

4.1 Datasets

CheXpert (Irvin et al., 2019) is a large dataset of chest X-rays with 14 observation labels collected from Stanford Hospital. Note that this dataset does not provide the corresponding medical reports to the public. We use the training split of this dataset for pre-training. For evaluation, we follow (Huang et al., 2021) and sample a multi-class classification dataset from the testing split, namely CheXpert-5x200. This multi-class classification dataset has 200 exclusively positive images for each of the five CheXpert competition tasks: Atelectasis, Cardiomegaly, Consolidation, Edema, and Pleural Effusion.
MIMIC-CXR (Johnson et al., 2019) is a large chest X-ray database with free-text radiology reports collected from the Beth Israel Deaconess Medical Center in Boston, MA. We use the training

split of this dataset for pre-training. For evaluation, we also sample a MIMIC-5x200 dataset for the same five tasks above.
COVID (Rahman et al., 2021) is a publicly available X-ray dataset with COVID v.s. non-COVID labels. The positive-to-negative ratio is roughly 1:1. We use this dataset for evaluation.
RSNA Pneumonia (Shih et al., 2019) is a collection of pneumonia cases found in the database of chest X-rays made public by the National Institutes of Health. This is a binary classification dataset: pneumonia v.s. normal. We sample a balanced subset (i.e., a 1:1 positive-to-negative ratio) and use it for evaluation.

4.2 Baselines

Random is a ResNet-50 (He et al., 2015) model with its default random initialization.
ImageNet is a ResNet-50 (He et al., 2015) model with weights pretrained on the standard ImageNet ILSVRC-2012 task (Deng et al., 2009).
CLIP (Radford et al., 2021) is a vision-text contrastive learning framework pre-trained on a dataset of 400M image-text pairs collected from the internet.
ConVIRT (Zhang et al., 2020) works on vision-text contrastive learning in medicine. It employs a plain InfoNCE loss (Van den Oord et al., 2018) on paired X-rays and reports. We reproduce it following their paper, using a BioClinicalBERT text encoder and a ResNet-50 (He et al., 2016) vision encoder.
GLoRIA (Huang et al., 2021) entangles image sub-regions and words at inference via cross-attention, which was argued to better capture key characteristics in images and reports. We implement it based on the official code and the provided pre-trained weights$^{2}$.

4.3 Implementation Details

We use BioClinicalBERT$^{3}$ as the backbone text encoder and a Swin Transformer (Liu et al., 2021) with ImageNet (Deng et al., 2009) pre-trained
Table 1: Results of zero-shot image classification tasks on four datasets. We include an additional prompt-ensemble version of each method (with subscript ENS). We report the mean and standard deviation (STD) of accuracy (ACC) over five runs, considering the randomness of the prompt generation process. Best scores on each dataset are in bold.
| ACC (STD) | CheXpert-5x200 | MIMIC-5x200 | COVID | RSNA |
| --- | --- | --- | --- | --- |
| CLIP_ENS | 0.2016 (0.01) | 0.1918 (0.01) | 0.5069 (0.03) | 0.4989 (0.01) |
| ConVIRT_ENS | 0.2036 (0.01) | 0.2254 (0.01) | 0.5090 (<0.01) | |