
U-KAN Makes Strong Backbone for Medical Image Segmentation and Generation

Chenxin Li$^{1*}$, Xinyu Liu$^{1*}$, Wuyang Li$^{1*}$, Cheng Wang$^{1*}$, Hengyu Liu$^{1}$, Yifan Liu$^{1}$, Zhen Chen$^{2}$, Yixuan Yuan$^{1}$
$^{1}$The Chinese University of Hong Kong, $^{2}$CAIR, HKISI-CAS

Abstract

U-Net has become a cornerstone of various visual applications such as image segmentation and diffusion probability models. While numerous innovative designs and improvements have been introduced by incorporating transformers or MLPs, these networks are still limited to linearly modeling patterns and suffer from deficient interpretability. To address these challenges, we draw on the impressive accuracy and interpretability of Kolmogorov-Arnold Networks (KANs), which reshape neural network learning by stacking non-linear learnable activation functions derived from the Kolmogorov-Arnold representation theorem. Specifically, in this paper, we explore the untapped potential of KANs in improving backbones for vision tasks. We investigate, modify, and re-design the established U-Net pipeline by integrating dedicated KAN layers on the tokenized intermediate representation, yielding a model termed U-KAN. Rigorous medical image segmentation benchmarks verify the superiority of U-KAN, which achieves higher accuracy even at lower computation cost. We further delve into the potential of U-KAN as an alternative U-Net noise predictor in diffusion models, demonstrating its applicability in generating task-oriented model architectures. Project page: https://yes-u-kan.github.io/.

Introduction

Over the past decade, numerous works have focused on developing efficient and robust segmentation methods for medical imaging (Shen, Wu, and Suk 2017; Sun et al. 2022; Li et al. 2022c, 2021b), driven by the need for computer-aided diagnosis and image-guided surgical systems (Liu et al. 2024b, a; Li et al. 2024a; Liu and Yuan 2022; Liu et al. 2021; Ali et al. 2024). Among these, U-Net (Ronneberger, Fischer, and Brox 2015) is a landmark work that initially demonstrated the effectiveness of encoder-decoder convolutional networks with skip connections for medical image segmentation (Wang et al. 2022a; Li et al. 2021a; Ding et al. 2022; Xu et al. 2022), and has also shown promising results in many image translation tasks (Torbunov et al. 2023; Kalantar et al. 2021). Additionally, recent diffusion models have utilized U-Net, training it to iteratively predict the noise to be removed in each denoising step (Ho, Jain, and Abbeel 2020; Rombach et al. 2022; Saharia et al. 2022b).
Since the inception of U-Net (Ronneberger, Fischer, and Brox 2015), a series of crucial modifications have been introduced, especially in the subfield of medical imaging, including U-Net++ (Zhou et al. 2018), 3D U-Net (Çiçek et al. 2016), V-Net (Milletari, Navab, and Ahmadi 2016), and YNet (Mehta et al. 2018). U-NeXt (Valanarasu and Patel 2022) and Rolling U-Net (Liu et al. 2024d) integrate hybrid approaches involving convolutional operations and MLPs to optimize the efficacy of segmentation networks, enabling their deployment in point-of-care settings with limited resources. Recently, numerous transformer-based networks have been utilized to enhance the U-Net backbone for medical image segmentation. These networks have demonstrated effectiveness in capturing global context and long-range dependencies (Raghu et al. 2021; Hatamizadeh et al. 2023; Li, Liu, and Yuan 2022; Li, Guo, and Yuan 2023). Examples include TransUNet (Chen et al. 2021), which adopts the ViT architecture (Dosovitskiy et al. 2021) for 2D medical image segmentation using U-Net, and other transformer-based networks like MedT (Valanarasu et al. 2021) and UNETR (Hatamizadeh et al. 2022). Although transformers show great scaling capacity due to their sophisticated designs, they tend to overfit when dealing with limited datasets, indicating their data-hungry nature (Touvron et al. 2021; Liu et al. 2023). In contrast, structured state-space sequence models (SSMs) (Fu et al. 2022; Peng et al. 2023; Gu and Dao 2023) have recently shown high efficiency and effectiveness in long-sequence modeling. For medical image segmentation, U-Mamba (Ma, Li, and Wang 2024) and SegMamba (Xing et al. 2024) have proposed task-specific architectures with Mamba blocks based on nnU-Net (Isensee et al. 2021) and Swin UNETR (Hatamizadeh et al. 2021), respectively, achieving promising results in various vision tasks.
While existing U-shape variations have advanced in fine-trained medical scenarios, e.g., medical image segmentation, they still face fundamental challenges due to their sub-optimal kernel design and unexplainable nature. Concretely, first, they typically employ conventional kernels$^{1}$ to capture the spatial dependence between local pixels, which are limited to linearly modeling patterns and relationships across different channels in latent space. This makes it challenging to capture complex nonlinear patterns. Such intricate nonlinear patterns among channels are prevalent in visual tasks such as medical imaging, where images often have intricate diagnostic characteristics. This complexity implies that feature channels might possess varying clinical relevance, representing different anatomical components or pathological indicators. Second, they mostly rely on empirical network search and heuristic model design to find the optimal architecture, ignoring the interpretability and explainability of existing black-box U-shape models. This unexplainable property poses a significant risk in clinical decision-making and undermines the trustworthiness of diagnostic system design. Recently, Kolmogorov-Arnold Networks (KANs) have attempted to open the black box of conventional network structures with superior interpretability, revealing the great potential of white-box network research (Yu et al. 2024; Pai et al. 2024). Considering the excellent architectural properties of KANs, it makes sense to leverage KAN to bridge the gap between the network’s physical attributes and empirical performance.
In this endeavor, we explore a universally applicable framework, denoted as U-KAN, marking a first attempt to integrate advanced KAN into the pivotal U-Net visual backbone through a mixed convolution-KAN architectural style. Notably, adhering to the benchmark setup of U-Net, we employ a multi-layer deep encoder-decoder architecture with skip connections, incorporating a novel tokenized KAN block at higher-level representations proximate to the bottleneck. This block projects intermediate features into tokens and subsequently applies the KAN operator to extract informative patterns. The proposed U-KAN benefits from the non-linear modeling capabilities and interpretability of KAN networks, distinguishing it prominently among prevalent U-Net architectures. Empirical evaluations on stringent medical segmentation benchmarks, both quantitative and qualitative, underscore U-KAN’s superior performance, outpacing established U-Net backbones with enhanced accuracy even at lower computation cost. Our investigation further delves into the potential of U-KAN as an alternative U-Net noise predictor in diffusion models, substantiating its relevance in generating task-oriented model architectures. In a nutshell, U-KAN signifies a steady step toward designs that incorporate mathematical theory-inspired operators into efficient visual pipelines and foretells its prospects in extensive visual applications. Our contributions can be summarized as follows:
  • We present the first effort to incorporate the advantages of the emerging KAN into the established U-Net pipeline, making it more accurate, efficient, and interpretable.
  • We propose a tokenized KAN block to effectively steer the KAN operators to be compatible with the existing convolution-based designs.
  • We empirically validate U-KAN on a wide range of medical segmentation benchmarks, achieving impressive accuracy and efficiency.
  • The application of U-KAN to existing diffusion models as an improved noise predictor demonstrates its potential in backbone generative tasks and broader vision settings.

U-Net Backbone for Medical Image Segmentation

Medical image segmentation (Ronneberger, Fischer, and Brox 2015; Myronenko 2019; Li et al. 2024d, 2022b) is a challenging task to which deep learning methods have been extensively applied and achieved breakthrough advancements in recent years (Shen, Wu, and Suk 2017; Liu et al. 2024a; Li et al. 2024a; Yang et al. 2023; Liu, Li, and Yuan 2023; Li et al. 2021c; Chen et al. 2023; Liu, Li, and Yuan 2022; Wuyang et al. 2021). U-Net (Ronneberger, Fischer, and Brox 2015) is a popular network structure for medical image segmentation. Its encoder-decoder architecture effectively captures image features. The CE-Net (Gu et al. 2019) further integrates a contextual information encoding module, enhancing the model’s receptive field and semantic representation capabilities. Unet++ (Zhou et al. 2018) proposes a nested U-Net structure that fuses multi-scale features to improve segmentation accuracy. In addition to convolution-based methods, Transformer-based models have also gained attention. The Vision Transformer (Dosovitskiy et al. 2021) demonstrates the effectiveness of Transformers in image recognition tasks. The Medical Transformer (Valanarasu et al. 2021) and TransUNet (Chen et al. 2021) further incorporate Transformers into medical image segmentation, achieving satisfying performance. Moreover, techniques such as attention mechanism (Schlemper et al. 2019) and multi-scale feature fusion (Huang et al. 2020) are widely used in medical image segmentation tasks. 3D segmentation models like Multi-dimensional Gated Recurrent Units (Andermatt, Pezold, and Cattin 2016) and Efficient Multi-Scale 3D CNN (Kamnitsas et al. 2017) also yield commendable results. In summary, medical image segmentation is an active research field where deep learning methods have made significant progress. 
Recently, Mamba (Gu and Dao 2023) has achieved a groundbreaking milestone with its linear-time inference and efficient training process by integrating a selection mechanism and hardware-aware algorithms into previous works (Gu et al. 2022; Gupta, Gu, and Berant 2022; Mehta et al. 2022). Building on the success of Mamba in visual applications, Vision Mamba (Liu et al. 2024c) and VMamba (Zhu et al. 2024) use the bidirectional Vim Block and the Cross-Scan Module, respectively, to gain data-dependent global visual context. At the same time, U-Mamba (Ma, Li, and Wang 2024) and other works (Xing et al. 2024; Ruan and Xiang 2024) show superior performance in medical image segmentation. As the Kolmogorov-Arnold Network (KAN) (Liu et al. 2024e) has emerged as a promising alternative to MLPs and demonstrates precision, efficiency, and interpretability, we believe now is the right time to explore its broader applications in vision backbones.

U-Net Diffusion Backbone for Image Generation

Diffusion Probability Models, a frontier category of generative models, have emerged as a focal point in the research domain, particularly in tasks related to computer vision (Ho, Jain, and Abbeel 2020; Rombach et al. 2022; Ramesh et al. 2022). Unlike other categories of generative models (Kingma and Welling 2013; Wang, Li, and Vasconcelos 2021; Goodfellow et al. 2014; Mirza and Osindero 2014; Brock, Donahue, and Simonyan 2018; Karras et al. 2018), such as Variational Autoencoders (VAE) (Kingma and Welling 2013), Generative Adversarial Networks (GANs) (Goodfellow et al. 2014; Brock, Donahue, and Simonyan 2018; Karras et al. 2018; Zhang et al. 2021), and vector quantization methods (Van Den Oord, Vinyals et al. 2017; Esser, Rombach, and Ommer 2021), diffusion models introduce a novel generative paradigm. These models employ a fixed Markov chain to map the latent space, fostering complex mappings that capture the intricate structure inherent in datasets. Recently, their impressive generative prowess, from high-level detail to diversity in generated samples, has propelled breakthrough progress in various computer vision applications, such as image synthesis (Ho, Jain, and Abbeel 2020; Rombach et al. 2022; Saharia et al. 2022b), image editing (Avrahami, Lischinski, and Fried 2022; Choi et al. 2021; Meng et al. 2022; Li et al. 2024g), image-to-image translation (Choi et al. 2021; Saharia et al. 2022a; Wang et al. 2022b; Li et al. 2024f), and video generation (Hong et al. 2022; Blattmann et al. 2023; He et al. 2022; Li et al. 2024c). Diffusion models consist of a diffusion process and a denoising process. In the diffusion process, Gaussian noise is gradually added to the input data, eventually corrupting it into approximately pure Gaussian noise. In the denoising process, the original input data is recovered from its noisy state through a learned sequence of inverse diffusion operations. Typically, convolutional U-Nets (Ronneberger, Fischer, and Brox 2015), the de facto choice of backbone architecture, are trained to iteratively predict the noise to be removed at each denoising step.
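The forward (noising) half of this process has a convenient closed form in the DDPM formulation of Ho, Jain, and Abbeel (2020). The following is a minimal sketch in plain Python, assuming a linear variance schedule. The function names, the schedule endpoints, and the step count are illustrative assumptions, not settings taken from this paper.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s): share of signal variance left at step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def diffuse(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) in closed form and return (x_t, noise).

    The returned noise is the regression target for the noise predictor.
    """
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise


def recover_x0(xt, predicted_noise, t, alpha_bars):
    """Invert the closed form given a noise estimate (exact if the estimate is exact)."""
    a = alpha_bars[t]
    return [(x - math.sqrt(1.0 - a) * n) / math.sqrt(a)
            for x, n in zip(xt, predicted_noise)]
```

Given the true noise, `recover_x0` undoes the corruption exactly. A trained noise predictor supplies an estimate in its place, which is why the quality of the backbone that produces this estimate matters.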
Diverging from previous work that focuses on utilizing pre-trained diffusion U-Nets for downstream applications, recent work has committed to exploring the intrinsic features and structural properties of diffusion U-Nets. Free-U strategically reassesses the contributions of U-Net’s skip connections and backbone feature maps to leverage the strengths of both components of the U-Net architecture. RINs (Jabri, Fleet, and Chen 2022) introduced a novel, efficient attention-based architecture for DDPMs. DiT (Peebles and Xie 2023) proposed combining a pure transformer with diffusion, showcasing its scalable nature. In this paper, we demonstrate the potential of a backbone scheme integrating U-Net and KAN for generation, pushing the boundaries and options for generation backbones.

Kolmogorov-Arnold Networks (KANs)

The Kolmogorov-Arnold theorem (Kolmogorov 1957) states that any continuous multivariate function can be expressed as a finite composition of continuous univariate functions and addition, providing a theoretical basis for the construction of universal neural network models. This was further substantiated by Hornik et al. (Hornik, Stinchcombe, and White 1989), who demonstrated that feed-forward neural networks possess universal approximation capabilities, paving the way for the development of deep learning. Drawing on the Kolmogorov-Arnold theorem, scholars proposed a novel neural network architecture known as Kolmogorov-Arnold Networks (KANs) (Huang, Zhao, and Song 2014). KANs consist of a series of concatenated Kolmogorov-Arnold layers, each containing a set of learnable one-dimensional activation functions. This network structure has proven effective in approximating high-dimensional complex functions, demonstrating robust performance across various applications. KANs are characterized by strong theoretical interpretability and explainability. Huang et al. (Huang, Zhao, and Xing 2017) analyzed the optimization characteristics and convergence of KANs, validating their excellent approximation capacity and generalization performance. Liang et al. (Liang, Zhao, and Huang 2018) further introduced a deep KAN model and applied it to tasks such as image classification. Xing et al. (Xing, Zhao, and Huang 2018) deployed KANs for time series prediction and control problems. Despite these advancements, there has been a lack of practical implementations that broadly incorporate KAN, a novel neural network model with strong theoretical foundations, into general-purpose vision networks. In contrast, this paper undertakes an initial exploration, attempting to design a universal visual network architecture that integrates KAN and to validate it on a wide range of segmentation and generative tasks.
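For reference, the representation theorem invoked above is commonly written as follows for a continuous function $f$ on a bounded domain. The symbols $n$, $\Phi_q$, and $\phi_{q,p}$ are notation assumed here for illustration and do not appear in the surrounding text.

```latex
f(x_1, \ldots, x_n) \;=\; \sum_{q=1}^{2n+1} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right),
\qquad
f : [0,1]^n \to \mathbb{R}, \quad
\phi_{q,p} : [0,1] \to \mathbb{R}, \quad
\Phi_q : \mathbb{R} \to \mathbb{R},
```

where every $\phi_{q,p}$ and $\Phi_q$ is a continuous function of a single variable. The only genuinely multivariate operation is addition, which is what motivates placing learnable one-dimensional functions on the connections of a network.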

Method

Overview. Fig. 1 illustrates the overall architecture of the proposed U-KAN, which follows a two-phase encoder-decoder architecture comprising a Convolution Phase and a Tokenized Kolmogorov-Arnold Network (Tok-KAN) Phase. The input image traverses the encoder, where the initial three blocks utilize convolution operations, followed by two tokenized KAN blocks. The decoder comprises two tokenized KAN blocks followed by three convolution blocks. Each encoder block halves the feature resolution, while each decoder block doubles it. Additionally, skip connections are integrated between the encoder and decoder. The channel counts of the blocks in the Convolution Phase and the Tok-KAN Phase are determined by the hyperparameters $C_1$ to $C_3$ and $D_1$ to $D_3$, respectively.
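The resolution bookkeeping described in the overview can be traced with a few lines of plain Python. This is only a sketch of the halving and doubling rule stated above. The function name, the stage labels, and the divisibility check are assumptions made for illustration, and channel widths are deliberately left out.

```python
def trace_resolutions(height, width, num_conv=3, num_tok_kan=2):
    """Return (stage name, height, width) for every encoder and decoder block."""
    stages = []
    h, w = height, width

    # Encoder: convolution blocks first, then tokenized KAN blocks.
    encoder_kinds = ["conv"] * num_conv + ["tok_kan"] * num_tok_kan
    for i, kind in enumerate(encoder_kinds, start=1):
        if h % 2 or w % 2:
            raise ValueError("input size must be divisible by 2 ** (number of encoder blocks)")
        h, w = h // 2, w // 2  # each encoder block halves the resolution
        stages.append((f"enc{i}_{kind}", h, w))

    # Decoder: mirrors the encoder, tokenized KAN blocks first.
    decoder_kinds = ["tok_kan"] * num_tok_kan + ["conv"] * num_conv
    for i, kind in enumerate(decoder_kinds, start=1):
        h, w = h * 2, w * 2  # each decoder block doubles the resolution
        stages.append((f"dec{i}_{kind}", h, w))

    return stages
```

Under these assumptions, a 256×256 input reaches 8×8 after the fifth encoder block and returns to 256×256 after the fifth decoder block, so the input side length must be a multiple of 32.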

KAN as Efficient Embedder

This research aims to incorporate Kolmogorov-Arnold Networks (KANs) into the U-Net framework. The basis of this approach is the proven high efficiency and interpretability of KANs as outlined in (Liu et al. 2024e). A Multi-Layer Perceptron (MLP) comprising $K$ layers can be described as an interplay of transformation matrices $W$ and activation functions $\sigma$. This can be mathematically expressed as:
$$\operatorname{MLP}(\mathbf{Z}) = \left(W_{K-1} \circ \sigma \circ W_{K-2} \circ \sigma \circ \cdots \circ W_{1} \circ \sigma \circ W_{0}\right) \mathbf{Z}$$
where it strives to mimic complex functional mappings through a sequence of nonlinear transformations over multiple layers. Despite its potential, the inherent obscurity within this structure significantly hampers the model’s interpretability, thus posing considerable challenges to intuitively understanding the underlying decision-making mechanisms. In an effort to mitigate the issues of low parameter efficiency and limited interpretability inherent in MLPs, Liu et al. (Liu et al. 2024e) proposed the Kolmogorov-Arnold Network (KAN), drawing inspiration from the Kolmogorov-Arnold representation theorem (Kolmogorov 1961).
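The MLP composition above reads right to left: $W_0$ is applied first and $W_{K-1}$ last, with $\sigma$ between consecutive linear maps. A minimal plain-Python sketch of that ordering follows. It assumes ReLU as a stand-in for the generic $\sigma$ and omits biases, as the equation does. The helper names are hypothetical.

```python
def matvec(weight, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def relu(vector):
    """Elementwise ReLU, standing in for the generic activation sigma."""
    return [max(0.0, v) for v in vector]


def mlp(weights, z, activation=relu):
    """Apply W_0 first and W_{K-1} last, with sigma between consecutive
    linear maps and no activation after the final one."""
    out = matvec(weights[0], z)
    for weight in weights[1:]:
        out = matvec(weight, activation(out))
    return out


# Two layers (K = 2): W_0 maps 2 -> 2, W_1 maps 2 -> 1.
weights = [
    [[1.0, -1.0], [0.5, 2.0]],  # W_0
    [[1.0, 1.0]],               # W_1
]
print(mlp(weights, [1.0, 3.0]))  # [6.5]
```

Here the nonlinearity is fixed and only the matrices are learned. KANs instead make the one-dimensional functions themselves learnable, which is the contrast the text draws next.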

  1. *These authors contributed equally.

    Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
  2. $^{1}$ Such operations include convolutions, Transformers, and MLPs.