Attention U-Net: Learning Where to Look for the Pancreas

Ozan Oktay$^{1,5}$, Jo Schlemper$^{1}$, Loic Le Folgoc$^{1}$, Matthew Lee$^{4}$, Mattias Heinrich$^{3}$, Kazunari Misawa$^{2}$, Kensaku Mori$^{2}$, Steven McDonagh$^{1}$, Nils Y Hammerla$^{5}$, Bernhard Kainz$^{1}$, Ben Glocker$^{1}$, and Daniel Rueckert$^{1}$

$^{1}$ Biomedical Image Analysis Group, Imperial College London, London, UK
$^{2}$ Dept. of Media Science, Nagoya University & Aichi Cancer Center, JP
$^{3}$ Medical Informatics, University of Luebeck, DE, $^{4}$ HeartFlow, California, USA
$^{5}$ Babylon Health, London, UK

Abstract

We propose a novel attention gate (AG) model for medical imaging that automatically learns to focus on target structures of varying shapes and sizes. Models trained with AGs implicitly learn to suppress irrelevant regions in an input image while highlighting salient features useful for a specific task. This enables us to eliminate the necessity of using explicit external tissue/organ localisation modules of cascaded convolutional neural networks (CNNs). AGs can be easily integrated into standard CNN architectures such as the U-Net model with minimal computational overhead while increasing the model sensitivity and prediction accuracy. The proposed Attention U-Net architecture is evaluated on two large CT abdominal datasets for multi-class image segmentation. Experimental results show that AGs consistently improve the prediction performance of U-Net across different datasets and training sizes while preserving computational efficiency. The source code for the proposed architecture is publicly available.

1 Introduction

Automated medical image segmentation has been extensively studied in the image analysis community due to the fact that manual, dense labelling of large amounts of medical images is a tedious and error-prone task. Accurate and reliable solutions are desired to increase clinical work flow efficiency and support decision making through fast and automatic extraction of quantitative measurements.

With the advent of convolutional neural networks (CNNs), near-radiologist level performance can be achieved in automated medical image analysis tasks including cardiac MR segmentation [3] and cancerous lung nodule detection [17]. High representation power, fast inference, and filter sharing properties have made CNNs the de facto standard for image segmentation. Fully convolutional networks (FCNs) [18] and the U-Net [24] are two commonly used architectures. Despite their good representational power, these architectures rely on multi-stage cascaded CNNs when the target organs show large inter-patient variation in terms of shape and size. Cascaded frameworks extract a region of interest (ROI) and make dense predictions on that particular ROI. The application areas include cardiac MRI [14], cardiac CT [23], abdominal CT [26, 27] segmentation, and lung CT nodule detection [17]. However, this approach leads to excessive and redundant use of computational resources and model parameters; for instance, similar low-level features are repeatedly extracted by all models within the cascade. To address this general problem, we propose a simple and yet effective solution, namely attention gates (AGs). CNN models with AGs can be trained from scratch in a standard way similar to the training of an FCN model, and AGs automatically learn to focus on target structures without additional supervision. At test time, these gates generate soft region proposals implicitly on-the-fly and highlight salient features useful for a specific task. Moreover, they do not introduce significant computational overhead and do not require a large number of model parameters as in the case of multi-model frameworks. In return, the proposed AGs improve model sensitivity and accuracy for dense label predictions by suppressing feature activations in irrelevant regions. In this way, the necessity of using an external organ localisation model can be eliminated while maintaining high prediction accuracy. Similar attention mechanisms have been proposed for natural image classification [11] and captioning [1] to perform adaptive feature pooling, where model predictions are conditioned only on a subset of selected image regions. In this paper, we generalise this design and propose image-grid based gating that allows attention coefficients to be specific to local regions. Moreover, our approach can be used for attention-based dense predictions.

We demonstrate the implementation of AGs in a standard U-Net architecture (Attention U-Net) and apply it to medical images. We choose the challenging CT pancreas segmentation problem to provide experimental evidence for our proposed contributions. This problem constitutes a difficult task due to low tissue contrast and large variability in organ shape and size. We evaluate our implementation on two commonly used benchmarks: TCIA Pancreas CT-82 [25] and multi-class abdominal CT-150. The results show that AGs consistently improve prediction accuracy across different datasets and training sizes while achieving state-of-the-art performance without requiring multiple CNN models.

1.1 Related Work

CT Pancreas Segmentation: Early work on pancreas segmentation from abdominal CT used statistical shape models [5, 28] or multi-atlas techniques [22, 34]. In particular, atlas approaches benefit from implicit shape constraints enforced by propagation of manual annotations. However, in public benchmarks such as the TCIA dataset [25], Dice similarity coefficients (DSC) for atlas-based frameworks range from 69.6% to 73.9% [22, 34]. In [39] a classification based framework is proposed to remove the dependency of the atlas approach on image registration. Recently, cascaded multi-stage CNN models [26, 27, 38] have been proposed to address the problem. Here, an initial coarse-level model (e.g. U-Net or Regression Forest) is used to obtain a ROI and then a cropped ROI is used for segmentation refinement by a second model. Similarly, combinations of 2D-FCN and recurrent neural network (RNN) models are utilised in [4] to exploit dependencies between adjacent axial slices. These approaches achieve state-of-the-art performance in the TCIA benchmark (81.2%-82.4% DSC). Without using a cascaded framework, the performance drops by between 2.0% and 4.4%. Recent work [37] proposed an iterative two-stage model that recursively updates local and global predictions, and both models are trained end-to-end. Besides standard FCNs, dense connections [6] and sparse convolutions [8, 9] have been applied to the CT pancreas segmentation problem. Dense connections and sparse kernels reduce computational complexity by requiring fewer non-zero parameters.

Attention Gates: AGs are commonly used in natural image analysis, knowledge graphs, and natural language processing (NLP) for image captioning [1], machine translation [2, 30], and classification [11, 31, 32] tasks. Initial work explored attention maps by interpreting gradients of output class scores with respect to the input image. Trainable attention, on the other hand, is enforced by design and categorised as hard- and soft-attention. Hard attention [21], e.g. iterative region proposal and cropping, is often non-differentiable and relies on reinforcement learning for parameter updates, which makes model training more difficult. Recursive hard-attention is used in [36] to detect anomalies in chest X-ray scans. Contrarily, soft attention is probabilistic and utilises standard back-propagation without the need for Monte Carlo sampling. For instance, additive soft attention is used in sentence-to-sentence translation [2, 29] and has more recently been applied to image classification [11, 32]. In [10], channel-wise attention is used to highlight important feature dimensions, which was the top performer in the ILSVRC 2017 image classification challenge. Self-attention techniques [11, 33] have been proposed to remove the dependency on external gating information. For instance, non-local self-attention is used in [33] to capture long-range dependencies. In [11, 32] self-attention is used to perform class-specific pooling, which results in more accurate and robust image classification performance.

1.2 Contributions

In this paper, we propose a novel self-attention gating module that can be utilised in standard CNN-based image analysis models for dense label predictions. Moreover, we explore the benefit of AGs for medical image analysis, in particular in the context of image segmentation. The contributions of this work can be summarised as follows:

Figure 1: A block diagram of the proposed Attention U-Net segmentation model. The input image is progressively filtered and downsampled by a factor of 2 at each scale in the encoding part of the network (e.g. $H_4 = H_1/8$). $N_c$ denotes the number of classes. Attention gates (AGs) filter the features propagated through the skip connections. A schematic of the AGs is shown in Figure 2. Feature selectivity in AGs is achieved by use of contextual information (gating) extracted at coarser scales.
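
The gating described in this caption can be written compactly. Below is a minimal PyTorch sketch of such an additive attention gate, consistent with the description above (the full formulation accompanies Figure 2); the module and parameter names, and the choice to resample the gating signal onto the grid of the skip features, are our illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Gates skip-connection features x with a coarser-scale gating signal g,
    producing one attention coefficient per spatial grid location."""
    def __init__(self, x_channels: int, g_channels: int, inter_channels: int):
        super().__init__()
        # 1x1 convolutions project both inputs into a common intermediate space.
        self.W_x = nn.Conv2d(x_channels, inter_channels, kernel_size=1)
        self.W_g = nn.Conv2d(g_channels, inter_channels, kernel_size=1)
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # g comes from the next-coarser scale; resampling it onto the grid of x
        # is one plausible way to compute the attention per local region.
        g = F.interpolate(g, size=x.shape[2:], mode="bilinear", align_corners=False)
        # Additive soft attention: ReLU of the summed projections, squeezed to a
        # single channel, then a sigmoid yields coefficients in (0, 1).
        alpha = torch.sigmoid(self.psi(F.relu(self.W_x(x) + self.W_g(g))))
        return x * alpha  # irrelevant regions are suppressed, salient ones kept
```

Because psi maps to a single channel before the sigmoid, the gate emits one coefficient per grid location, which is what makes the attention grid-based rather than conditioned on a single global feature vector.
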
  • We take the attention approach proposed in [11] a step further by proposing grid-based gating that allows attention coefficients to be more specific to local regions. This improves performance compared to gating based on a global feature vector. Moreover, our approach can be used for dense predictions since we do not perform adaptive pooling.
  • We propose one of the first use cases of a soft-attention technique in a feed-forward CNN model applied to a medical imaging task. The proposed attention gates can replace hard-attention approaches used in image classification [36] and external organ localisation models in image segmentation frameworks [14, 22, 26, 27].
  • An extension to the standard U-Net model is proposed to improve model sensitivity to foreground pixels without requiring complicated heuristics. Accuracy improvements over U-Net are experimentally observed to be consistent across different imaging datasets. A sketch of one such gated decoder stage follows this list.
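
As referenced in the last item above, the following is a hypothetical sketch of how one decoder stage of Figure 1 could wire the gate into a skip connection, reusing the AttentionGate sketch from earlier; the class name, the transposed-convolution upsampling, and the channel arithmetic are our assumptions for illustration.

```python
class GatedDecoderStage(nn.Module):
    """One decoder stage: gate the skip feature with the coarser decoder
    feature, upsample the coarse feature by 2, concatenate, and convolve."""
    def __init__(self, skip_ch: int, coarse_ch: int, out_ch: int):
        super().__init__()
        self.gate = AttentionGate(skip_ch, coarse_ch, max(skip_ch // 2, 1))
        self.up = nn.ConvTranspose2d(coarse_ch, coarse_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(coarse_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, skip: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        gated = self.gate(skip, coarse)  # the AG filters the skip connection
        return self.conv(torch.cat([gated, self.up(coarse)], dim=1))

# E.g. a 64-channel skip feature at 64x64 gated by a 128-channel feature at 32x32:
stage = GatedDecoderStage(skip_ch=64, coarse_ch=128, out_ch=64)
out = stage(torch.randn(1, 64, 64, 64), torch.randn(1, 128, 32, 32))  # (1, 64, 64, 64)
```
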

2 Methodology

Fully Convolutional Network (FCN): Convolutional neural networks (CNNs) outperform traditional approaches in medical image analysis on public benchmark datasets [14, 17] while being an order of magnitude faster than, e.g., graph-cut and multi-atlas segmentation techniques [34]. This is mainly attributed to the fact that (I) domain specific image features are learnt using stochastic gradient descent (SGD) optimisation, (II) learnt kernels are shared across all pixels, and (III) image convolution operations exploit the structural information in medical images well. In particular, fully convolutional networks (FCN) [18] such as U-Net [24], DeepMedic [13] and holistically nested networks [16, 35] have been shown to achieve robust and accurate performance in various tasks including cardiac MR [3], brain tumour [12] and abdominal CT [26, 27] image segmentation.

Convolutional layers progressively extract higher-dimensional image representations ($x^l$) by processing local information layer by layer. Eventually, this separates pixels in a high-dimensional space according to their semantics. Through this sequential process, model predictions are conditioned on information collected from a large receptive field. Hence, the feature map $x^l$ is obtained at the output of layer $l$ by sequentially applying a linear transformation followed by a non-linear activation function, often chosen as the rectified linear unit: $\sigma_1(x_{i,c}^l) = \max(0,\, x_{i,c}^l)$, where $i$ and $c$ denote the spatial and channel dimensions respectively. Feature activations can be formulated as $x_c^l = \sigma_1\big(\sum_{c' \in F_l} x_{c'}^{l-1} * k_{c',c}\big)$, where $*$ denotes the convolution operation and the spatial subscript $(i)$ is omitted for notational clarity. The function $f(x^l; \Phi^l) = x^{(l+1)}$ applied in convolution layer $l$ is characterised by trainable kernel parameters $\Phi^l$, which are learnt by minimising a training objective using stochastic gradient descent (SGD).
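
To make the formulation concrete, the following snippet instantiates one such layer in PyTorch; the channel counts and tensor shape are arbitrary choices for illustration, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x_prev = torch.randn(1, 16, 64, 64)  # x^{l-1}: batch of 1, 16 channels (arbitrary)
conv = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False)  # kernels k_{c',c}
x_l = F.relu(conv(x_prev))  # x^l_c = sigma_1(sum_{c'} x^{l-1}_{c'} * k_{c',c})
assert (x_l >= 0).all()     # sigma_1 (ReLU) clamps negative activations to zero
```
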