
Neural Circuit Diagrams: Robust Diagrams for the Communication, Implementation, and Analysis of Deep Learning Architectures

Vincent Abbott
Australian National University

Reviewed on OpenReview: https://openreview.net/forum?id=RyZB4qXEgt

Abstract

Diagrams matter. Unfortunately, the deep learning community has no standard method for diagramming architectures. The current combination of linear algebra notation and ad-hoc diagrams fails to offer the necessary precision to understand architectures in all their detail. However, this detail is critical for faithful implementation, mathematical analysis, further innovation, and ethical assurances. I present neural circuit diagrams, a graphical language tailored to the needs of communicating deep learning architectures. Neural circuit diagrams naturally keep track of the changing arrangement of data, precisely show how operations are broadcast over axes, and display the critical parallel behavior of linear operations. A lingering issue with existing diagramming methods is the inability to simultaneously express the detail of axes and the free arrangement of data, which neural circuit diagrams solve. Their compositional structure is analogous to code, creating a close correspondence between diagrams and implementation. In this work, I introduce neural circuit diagrams for an audience of machine learning researchers. After introducing neural circuit diagrams, I cover a host of architectures to show their utility and breed familiarity. This includes the transformer architecture, convolution (and its difficult-to-explain extensions), residual networks, the U-Net, and the vision transformer. I include a Jupyter notebook that provides evidence for the close correspondence between diagrams and code. Finally, I examine backpropagation using neural circuit diagrams. I show their utility in providing mathematical insight and analyzing algorithms’ time and space complexities.

1 Introduction

1.1 Necessity of Improved Communication in Deep Learning

Deep learning models are immense statistical engines. They rely on components connected in intricate ways to slowly nudge input data toward some target. Deep learning models convert big data into usable predictions, forming the core of many AI systems. The design of a model, its architecture, can significantly impact performance (Krizhevsky et al., 2012), ease of training (He et al., 2015; Srivastava et al., 2015), generalization (Ioffe & Szegedy, 2015; Ba et al., 2016), and ability to efficiently tackle certain classes of data (Vaswani et al., 2017; Ho et al., 2020).
Architectures can have subtle impacts, such as different image models recognizing patterns at various scales (Ronneberger et al., 2015; Luo et al., 2017). Many significant innovations in deep learning have resulted from architecture design, often from frighteningly simple modifications (He et al., 2015). Furthermore, architecture design is in constant flux. New developments constantly improve on state-of-the-art methods (He et al., 2016; Lee, 2023), often showing that the most common designs are just one of many approaches worth investigating (Liu et al., 2021; Sun et al., 2023).
However, these critical innovations are presented using ad-hoc diagrams and linear algebra notation (Vaswani et al., 2017; Goodfellow et al., 2016). These methods are ill-equipped for the non-linear operations and actions on multi-axis tensors that constitute deep learning models (Xu et al., 2023; Chiang et al., 2023). Furthermore, these tools are insufficient for papers to present their models in full detail. Subtle details such as the order of normalization or activation components can be missing, despite their impact on performance (He et al., 2016).
Works with immense theoretical contributions can fail to communicate equally insightful architectural developments (Rombach et al., 2022; Nichol & Dhariwal, 2021). Many papers cannot be reproduced without reference to the accompanying code. This was quantified by Raff (2019), where only 63.5% of 255 machine learning papers from 1984 to 2017 could be independently reproduced without reference to the author's code. Interestingly, the number of equations present was negatively correlated with reproduction, further highlighting the deficits of how models are currently communicated. The year that papers were published had no correlation to reproducibility, indicating that this problem is not resolving on its own.
Relying on code raises many issues. The reader must understand a specific programming framework, and there is a burden to dissect and reimplement the code if frameworks mismatch. Without reference to a blueprint, mistakes in code cannot be cross-checked. The overall structure of algorithms is obfuscated, raising ethical risks about how data is managed (Kapoor & Narayanan, 2022).
Furthermore, papers that clearly explain their models without resorting to code provide stronger scientific insight. As argued by Drummond (2009), replicating the code associated with experiments leads to weaker scientific results than reproducing a procedure. After all, replicating an experiment perfectly controls all variables, including irrelevant ones, making it difficult to link any independent variable to the observed outcome.
However, in machine learning, papers often cannot be independently reproduced without referencing their accompanying code. As a result, the machine learning community misses out on experiments that provide general insight independent of specific implementations. Improved communication of architectures, therefore, will offer clear scientific value.

1.2 Case Study: Shortfalls of Attention is All You Need

To highlight the problem of insufficient communication of deep learning architectures, I present a case study of Attention is All You Need, the paper that introduced transformer models (Vaswani et al., 2017). Introduced in 2017, transformer models have revolutionized machine learning, finding applications in natural language processing, image processing, and generative tasks (Phuong & Hutter, 2022; Lin et al., 2021).

Transformers’ effectiveness stems partly from their ability to inject external data of arbitrary width into base data. I refer to axes representing the number of items in data as a width, and axes indicating information per item as a depth.
An attention head gives a weighted sum of the injected data's value vectors, $V$. The weights depend on the attention score the base data's query vectors, $Q$, assign to each key vector, $K$, of the injected data. Value and key vectors come in pairs. Fully connected layers, consisting of learned matrix multiplication, generate $Q$, $K$, and $V$ vectors from the original base and injected data. Multi-head attention uses multiple attention heads in parallel, enabling efficient parallel operations and the simultaneous learning of distinct attributes.
Attention is All You Need, which I refer to as the original transformer paper, explains these algorithms using diagrams (see Figure 1) and equations (see Equations 1, 2, and 3) that hinder understandability (Chiang et al., 2023; Phuong & Hutter, 2022).

Figure 1: My annotations of the diagrams of the original transformer model. Critical information is missing regarding the origin of $Q$, $K$, and $V$ values (red and blue), and the axes over which operations act (green).
$$
\begin{aligned}
\operatorname{Attention}(Q, K, V) &= \operatorname{SoftMax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \quad (d_{k} \text{ is the key depth}) && (1) \\
\operatorname{MultiHead}(Q, K, V) &= \operatorname{Concat}\left(\operatorname{head}_{1}, \ldots, \operatorname{head}_{h}\right) W^{O} && (2) \\
\text{where } \operatorname{head}_{i} &= \operatorname{Attention}\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) && (3)
\end{aligned}
$$
The original transformer paper obscures dimension sizes and their interactions. The dimensions over which SoftMax$^{1}$ and matrix multiplication operate are ambiguous (Figure 1.1, green; Equations 1, 2, 3).

Determining the initial and final matrix dimensions is left to the reader. This obscures key facts required to understand transformers. For instance, $K$ and $V$ can have a different width to $Q$, allowing them to inject external information of arbitrary width. This fact is not made clear in the original diagrams or equations. Yet, it is necessary to understand why transformers are so effective at tasks with variable input widths, such as language processing.
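To make the width distinction concrete, the following is a minimal PyTorch sketch of Equation 1 with the widths written out explicitly. It is not taken from the paper's notebook, and the sizes n_q, n_kv, d_k, and d_v are hypothetical.

```python
import torch

# Assumed sizes: the query width n_q differs from the key/value width n_kv.
n_q, n_kv, d_k, d_v = 5, 7, 16, 32
Q = torch.randn(n_q, d_k)        # queries from the base data
K = torch.randn(n_kv, d_k)       # keys from the injected data
V = torch.randn(n_kv, d_v)       # values from the injected data

scores = Q @ K.T / d_k ** 0.5    # shape (n_q, n_kv)
weights = scores.softmax(dim=-1) # SoftMax acts over the n_kv axis
out = weights @ V                # shape (n_q, d_v): weighted sum of value vectors
assert out.shape == (n_q, d_v)
```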
The original transformer paper also has uncertainty regarding $Q$, $K$, and $V$. In Figure 1.1 and Equation 1, they represent separate values fed to each attention head. In Figure 1.2 and Equations 2 and 3, they are all copies of each other at location (A) of the overall model in Figure 1.3, while $Q$ is separate in situation (B).
Annotating makeshift diagrams does not resolve the issue of low interpretability. As they are constructed for a specific purpose by their author, they carry the author’s curse of knowledge (Pinker, 2014; Hayes & Bajzek, 2008; Ross et al., 1977). In Figure 1, low interpretability arises from missing critical information, not from insufficiently annotating the information present. The information about which axes are matrix multiplied or are operated on with the SoftMax is simply not present.
Therefore, we need to develop a framework for diagramming architectures that ensures key information, such as the axes over which operations occur, is automatically shown. Taking full advantage of annotating the critical information already present in neural circuit diagrams, I present alternative diagrams in Figures 20, 21, and 22.
These issues with the current ad-hoc approaches to communicating architectures have been identified in prior works, which have proposed their own solutions (Phuong & Hutter, 2022; Chiang et al., 2023; Xu et al., 2023; Xu & Maruyama, 2022). This shows that this is a known issue of interest to the deep learning community. Non-graphical approaches focus on enumerating all the variables and operations explicitly, whether by extending linear algebra notation (Chiang et al., 2023) or explicitly describing every step with pseudocode (Phuong & Hutter, 2022). Visualization, however, is essential to human comprehension (Pinker, 2014; Borkin et al., 2016; Sadoski, 1993). Standard non-graphical methods are essential to pursue, and the community will benefit significantly from their adoption; however, a standardized graphical language is still needed.
The inclination towards visualizing complex systems has led to many tools being developed for industrial applications. Labview, MATLAB’s Simulink, and Modelica are used in academia and industry to model various systems. For deep learning, TensorBoard and Torchview have become convenient ways to graph architectures. These tools, however, do not offer sufficient detail to implement architectures. They are often dedicated to one programming language or framework, meaning they cannot serve as a general means of communicating new developments. Besides, a rigorously developed framework-independent graphical language for deep learning architectures would help to improve these tools. This requires diagrams equipped with a mathematical framework that captures the changing structure of data, along with key operations such as broadcasting and linear transformations.
Many mathematically rigorous graphical methods exist for a variety of fields. This includes Petri nets, which have been used to model several processes in research and industry (Murata, 1989). Tensor networks were developed for quantum physics and have been successfully extended to deep learning (Biamonte & Bergholm, 2017; Xu et al., 2023; Xu & Maruyama, 2022). Xu et al. (2023) showed that re-implementing models after making them graphically explicit can improve performance by letting parallelized tensor algorithms be employed. Robust diagrams, therefore, can benefit both the communication and performance of architectures. Formal graphical methods have also been developed in physics, logic, and topology (Baez & Stay, 2010; Awodey, 2010).
All these graphical methods have been found to represent an underlying category, a mathematical space with well-defined composition rules (Meseguer & Montanari, 1990; Baez & Stay, 2010). A category theory approach allows a common structure, monoidal products, to define an intuitive graphical language (Selinger, 2009; Fong & Spivak, 2019). Category theory, therefore, provides a robust framework to understand and develop new graphical methods.
However, a noted issue (Chiang et al., 2023) with previous graphical approaches is that they have difficulty expressing non-linear operations. This arises from a tensor approach to monoidal products. Data brought together cannot necessarily be copied or deleted. This represents, for instance, axes brought together to form a matrix, and it makes linear operations elegantly manageable. However, it makes expressing copying and deletion impossible. The alternative Cartesian approach allows copying and deletion, reflecting the mechanics of classical computing.
The Cartesian approach has been used to develop a mathematical understanding of deep learning (Shiebler et al., 2021; Fong et al., 2019; Wilson & Zanasi, 2022; Cruttwell et al., 2022). However, Cartesian monoidal products do not automatically keep track of dimensionality and cannot easily represent broadcasting or linear operations. These works often rely on the most rudimentary model of deep-learning networks as sequential linear layers and activation functions, despite residual networks having become the norm (He et al., 2015; 2016). The graphical language generated by a pure Cartesian approach fails to show the details of architectures, limiting its ability to consider models as they appear in practice.
The issue of only looking at rudimentary, linear layer-activation layer models is pervasive in deep learning research (Zhang et al., 2017; Saxe et al., 2019; Li et al., 2022). There are uncountably many ways of relating inputs to outputs. Every theory or hypothesis about deep learning algorithms has to assume that we are working with some subset of all possible functions. However, specifying this subset means theoretical insights can only apply to that subset. This precludes us from using such theories to compare disparate architectures and make design choices.

The problem of only considering rudimentary models is partially a consequence of us not having the tools to robustly represent more complex models, never mind the tools to confidently analyze them. Category theory-based diagrams can serve as models of intricate systems. Structure-preserving maps allow analyses to scale over entire models. Therefore, developing comprehensive diagrams that correspond to mathematical expressions can be the first step in a rigorous theory of deep learning architectures with clear practical applications.
The literature reveals a combination of problems that need to be solved. Deep learning suffers from poor communication and needs a graphical language to understand and analyze architectures. Category theory can provide a rigorous graphical language but typically forces a choice between tensor or Cartesian approaches. The elegance of tensor products and the flexibility of Cartesian products must both be available to properly represent architectures. A category arises when a system has sufficient compositional structure, meaning a non-category theory approach to diagramming architectures will likely yield a category anyway. The challenge of reconciling Cartesian and tensor approaches, therefore, remains.

1.4 The Philosophy of My Approach

As I am introducing these diagrams, I have a burden to explain how I think they should be used and to address criticisms of creating a diagramming standard in the first place. I will take a brief aside to address these points, which I believe will aid in the adoption of neural circuit diagrams.
These diagrams are intended to express sequential-tensor deep learning models. This is in contrast to machine learning or artificial intelligence systems more generally. Deep learning models are machine learning models with sequential data processing through neural network layers. I do not cover recursive or branching models in this work. Furthermore, I assume data is always in the form of tuples of tensors. Generalizing diagrams to further contexts is an exciting avenue for future research.
By making these assumptions, I develop diagrams specialized for some of the most essential but difficult-to-explain systems in artificial intelligence research. Researchers outside the narrow scope of sequential-tensor deep learning models often rely on these tools. By more clearly communicating them, researchers who may not be up to date on the latest innovations or aware of their options stand to benefit an immense deal.
I do not expect two independent teams to diagram architectures the exact same way. Indeed, I do not believe the appropriate diagramming framework would have this property. Diagrams should have the flexibility to allow for innovations and to appeal to the audience’s level of knowledge. Instead, the benefit of my framework is to have comprehensive, robust diagrams with clear correspondence to implementation and analysis, in contrast to ad-hoc diagrams, which often fail to include critical information.
Neural circuit diagrams can be decomposed into sections that allow for layered abstraction. The exact details of code can be abstracted into single-symbol components. Sections of diagrams can be highlighted for the reader’s clarity, and repeated patterns can be defined as components. Diagrams have an immense compositional structure. The horizontal axis represents sequential composition, and the vertical axis represents parallel composition. Sections and components can be joined like Lego bricks to construct models.
This sectioning allows for a close correspondence between diagrams and implementation. Every highlighted section becomes a module in code. Diagrams, therefore, provide a cross-platform blueprint for architectures. This allows implementations to be cross-checked to a reference, increasing reliability. Furthermore, which components are abstracted and the level of abstraction can vary depending on the audience, leading to clearer, specialized communication.
A common criticism is that introducing a new standard simply increases the number of standards, worsening the issue it is trying to solve (below). I do not believe this is a relevant critique for deep learning diagrams. Currently, there are no standard diagramming methods. Every paper, in a sense, has its own ad-hoc diagramming scheme. Compared to this, neural circuit diagrams only need to be learned once, after which architectures can be clearly and explicitly explained. Furthermore, they build on existing research on robust monoidal string diagrams, which have been found to be a universal standard for various fields (Baez & Stay, 2010).

1.5 Contributions

To address the need for more robust communication and analysis of deep learning architectures, I introduce neural circuit diagrams. Neural circuit diagrams solve the lingering challenge of accommodating both the details of axes (the tensor approach) and the free arrangement of data (the Cartesian approach) in diagrams. They are specialized for sequential algorithms on memory states consisting of tuples of tensors.
Diagramming the details of axes means the shape of data is clear throughout a model. They easily show broadcasting and provide a graphical calculus to rearrange linear functions into equivalent forms. At the same time, they clearly represent tuples, copying, and deletion, processes that typical graphical methods struggle with. This makes them uniquely capable of accurately representing deep learning models.
Inspired by category theory and especially monoidal string diagrams (Selinger, 2009; Baez & Stay, 2010), this work builds on a literature of robust diagramming methods. However, the category theory details are omitted to maximize impact among machine learning researchers.

The benefits of neural circuit diagrams are many. They allow for clearer communication of new developments, making ideas more rapidly disseminated and understood. They offer robust blueprints for designing and implementing models, accelerating innovation and streamlining productivity. Furthermore, they allow for rigorous mathematical analysis of architectures, bringing us closer to a theoretical understanding of deep learning.
These points are evidenced by diagramming a host of architectures. I cover a basic multi-layer perceptron, the transformer architecture, convolution (and its difficult-to-explain permutations), the identity ResNet, the U-Net, and the vision transformer. I provide a Jupyter notebook that implements these diagrams, which provides further evidence for the close relationship between diagrams and implementation. Finally, I offer a novel analysis of backpropagation, which shows the utility of neural circuit diagrams for rigorous analysis of architectures.

2 Reading Neural Circuit Diagrams

2.1 Commutative Diagrams

We aim to craft diagrams that precisely represent deep learning algorithms. While these diagrams will eventually be generalized, we will initially concentrate on common models. Specifically, we will explore models that successively process data of predictable types. To facilitate understanding, we will introduce diagrams of gradually increasing complexity. To begin, let’s delve into an intuitive diagram, where symbols represent data types, and arrows signify the functions connecting them.
Note, I use forward composition with ";", meaning $f:$ str $\rightarrow$ int composes with $g:$ int $\rightarrow$ float by $(f;g):$ str $\rightarrow$ float.

Figure 2: We have two functions: $f:$ str $\rightarrow$ int and $g:$ int $\rightarrow$ float. These functions can be composed into a single function $(f;g):$ str $\rightarrow$ float. In commuting diagrams, we represent data types, such as str, int, and float, with floating symbols, while functions are denoted by arrows connecting them.
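As a small illustrative aside, forward composition can be sketched in plain Python with toy functions matching the signatures above; this sketch is not part of the accompanying notebook.

```python
# A tiny sketch of forward composition ";" (illustrative only).
def f(s: str) -> int:      # f: str -> int
    return len(s)

def g(n: int) -> float:    # g: int -> float
    return n / 2

def compose(f, g):         # (f ; g): apply f first, then g
    return lambda x: g(f(x))

fg = compose(f, g)         # (f ; g): str -> float
assert fg("hello") == 2.5
```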

2.1.1 Tuples and Memory

Algorithms are rarely composed of operations on a single variable. Instead, their steps involve operations on memory states composed of multiple variables. The data type of a memory state is a tuple of the variables which compose it. So, a state containing an int and a str would have a type int $\times$ str.
Consider a single algorithmic step acting on a compound memory state $A \times B \times C$. A function $f: B \times C \rightarrow D$ acting on this memory state would give an overall step with shape $\operatorname{Id}[A] \times f: A \times (B \times C) \rightarrow A \times D$. Note that $\operatorname{Id}[A]$ is the identity. We need to indicate $A$, even though $f$ does not act on it, so that the initial and final memory states are properly shown. In Figure 3, I diagram $f$ along another function $g: A \times D \rightarrow E$.

Figure 3: Here, I diagram two functions, $f: B \times C \rightarrow D$ and $g: A \times D \rightarrow E$, acting together. To represent the full memory states, we are required to amend $f$ into $\operatorname{Id}[A] \times f: A \times (B \times C) \rightarrow A \times D$. The composed function is $(\operatorname{Id}[A] \times f); g: A \times (B \times C) \rightarrow E$.
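A minimal Python sketch of this composite step, using hypothetical functions f and g on plain values, may help connect Figure 3 to code; it is not from the accompanying notebook.

```python
# Sketch of Figure 3's composite step on a tuple memory state (illustrative only).
def f(b, c):                   # f: B x C -> D
    return b + c

def g(a, d):                   # g: A x D -> E
    return a * d

def id_A_times_f(a, b, c):     # Id[A] x f: A x (B x C) -> A x D
    return a, f(b, c)

def composed(a, b, c):         # (Id[A] x f) ; g: A x (B x C) -> E
    a, d = id_A_times_f(a, b, c)
    return g(a, d)

assert composed(2, 3, 4) == 2 * (3 + 4)
```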

2.2 String Diagrams

These commuting diagrams fall short, however. As algorithms scale, operations and memory states get more complex. Usually, functions only act on some variables. However, it is not clear how to diagram these targeted functions. Compound data types and compound functions are better suited by reorienting diagrams as in Figure 4. We will have horizontal wires represent types, and symbols represent functions. Diagrams are forced to horizontally go left to right.

Figure 4: We reorient diagrams to go left to right. Wires represent data types, and symbols represent functions. This expression defines $h$.
This reorientation allows us to represent compound types and functions easily. We can diagram tupled types $A \times B$ as a wire for $A$ and a wire for $B$ vertically stacked, but separated by a dashed line. For increased clarity, we can draw boxes around functions. In Figure 5, we see a clear reexpression of Figure 4. Here, we have the unchanged $A$ variable untouched by $f$, which acts only on $B \times C$.

Figure 5: Tupled data types are diagrammed with wires separated by dashed lines. This clearly shows when functions act on only some variables.
Every vertical section of a diagram represents something. Either it shows which data type is present in memory, or which function is applied at this step. Diagrams can always be decomposed into vertical sections, each of which must compose with adjacent sections to ensure algorithms are well-defined. Diagrams can also be split along dashed lines. Diagrams are built from these vertically and horizontally composed sections, with wires acting like jigsaw indents.

2.3 Tensors

We will specialize our diagrams for deep learning models whose memory states are tuples of tensors. Tensors are numbers arranged along axes. So, a scalar $\mathbb{R}$ is a rank 0 tensor, a vector $\mathbb{R}^{3}$ is a rank 1 tensor, a table $\mathbb{R}^{4 \times 3}$ is a rank 2 tensor, and so on. If our diagram takes tensor data types, we get something like Figure 6.

Figure 6: Similar to Figure 5, but with data types being tensors.
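For reference, a small PyTorch sketch of these ranks, using the sizes mentioned above; it is not from the accompanying notebook.

```python
import torch

# Tensor ranks from Section 2.3 (sizes assumed for illustration).
scalar = torch.tensor(1.0)   # R,         rank 0, shape ()
vector = torch.randn(3)      # R^3,       rank 1, shape (3,)
table  = torch.randn(4, 3)   # R^{4 x 3}, rank 2, shape (4, 3)
```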
However, we benefit from diagramming the details of axes. Instead of diagramming a wire labeled $\mathbb{R}^{a \times b}$, we diagram a wire labeled $a$ and a wire labeled $b$, without a dashed line separating them. This lets us diagram Figure 6 into the clear form of Figure 7.

Figure 7: We can diagram types $\mathbb{R}^{a \times b}$ as two wires labeled $a$ and $b$, without a dashed line separating them. (See cell 2, Jupyter notebook.)

2.3.1 Indexes

Values in tensors are accessed by indexes. A tensor $A \in \mathbb{R}^{4 \times 3}$, for example, has constituent values $A[i_{4}, j_{3}] \in \mathbb{R}$, where $i_{4} \in \{0 \ldots 3\}$ and $j_{3} \in \{0 \ldots 2\}$. Indexes can also be used to access subtensors, so we have expressions $A[i_{4}, :] \in \mathbb{R}^{3}$. This subtensor extraction is therefore an operation $\mathbb{R}^{4 \times 3} \rightarrow \mathbb{R}^{3}$. We diagram it by having indexes act on the relevant axis. Indexes are diagrammed with pointed pentagons, or kets $|{\ldots}\rangle$. This type of subtensor extraction is diagrammed according to Figure 8.

Figure 8: We diagram indexes with pointed pentagons labeled with the index being extracted. (See cell 3, Jupyter notebook.)

Figure 9: These subtensors are defined such that $A[i_{4}, :][j_{3}] = A[i_{4}, j_{3}]$. This expression is the same in the reverse order. (See cell 4, Jupyter notebook.)
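A short PyTorch sketch of this indexing behaviour, assuming the same 4 x 3 shape; it is illustrative and not the notebook's cells 3-4.

```python
import torch

# Index and subtensor extraction from Section 2.3.1.
A = torch.randn(4, 3)      # A in R^{4 x 3}
value = A[1, 2]            # A[i_4, j_3] in R
row = A[1, :]              # A[i_4, :] in R^3, an operation R^{4 x 3} -> R^3
assert row.shape == (3,)
assert row[2] == A[1, 2]   # A[i_4, :][j_3] = A[i_4, j_3], as in Figure 9
```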

2.3.2 Broadcasting

Broadcasting is critical to understanding deep learning models. It lifts an operation to act in parallel over additional axes. Here, we show an operation $G: \mathbb{R}^{3} \rightarrow \mathbb{R}^{2}$ lifted to an operation $G^{\prime}: \mathbb{R}^{4 \times 3} \rightarrow \mathbb{R}^{4 \times 2}$. We diagram this broadcasting by having the 4-length wire pass over $G$, adding a 4-length axis to its input and output shapes. Formally, we define $G^{\prime}(x)[i_{4}, :] = G(x[i_{4}, :])$. This is shown in Figure 10.

Figure 10: An operation is lifted over a 4-length axis by broadcasting. This applies $G$ over corresponding subtensors. Broadcasting can be formally defined by equating indexes before and after an operation. (See cell 5, Jupyter notebook.)
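A minimal sketch of this broadcasting, assuming torch.vmap (PyTorch 2.x) and a hypothetical linear G; it is not the notebook's cell 5.

```python
import torch

# Broadcasting G: R^3 -> R^2 to G': R^{4 x 3} -> R^{4 x 2}, as in Figure 10.
W = torch.randn(2, 3)

def G(v):                     # G: R^3 -> R^2 (hypothetical)
    return W @ v

G_prime = torch.vmap(G)       # broadcast over a leading 4-length axis
x = torch.randn(4, 3)
y = G_prime(x)                # y[i_4, :] = G(x[i_4, :])
assert y.shape == (4, 2)
assert torch.allclose(y[1], G(x[1]))
```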
Inner broadcasting acts within tuple segments. A $\mathbb{R}^{4 \times 3} \times \mathbb{R}^{4}$ collection of data can be reduced to $\mathbb{R}^{3} \times \mathbb{R}^{4}$ in 4 different ways. Therefore, there are 4 ways of applying an operation $H: \mathbb{R}^{3} \times \mathbb{R}^{4} \rightarrow \mathbb{R}^{2}$ to it. This gives a function lifted by "inner broadcasting", which has a shape $\mathbb{R}^{4 \times 3} \times \mathbb{R}^{4} \rightarrow \mathbb{R}^{4 \times 2}$. We diagram this by drawing a wire from the source tuple segment over the function, as shown in Figure 15. This adds an axis of equal length to the target tuple segment and to the output, reflecting the shape of the lifted operation.
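A minimal PyTorch sketch of inner broadcasting, with a hypothetical H and assuming torch.vmap; it is not the notebook's cell 6.

```python
import torch

# Inner broadcasting H: R^3 x R^4 -> R^2 over the 4-length axis of the first segment only.
def H(v3, v4):                               # H: R^3 x R^4 -> R^2 (hypothetical)
    return torch.stack([v3.sum() + v4.sum(), v3.sum() * v4.mean()])

H_inner = torch.vmap(H, in_dims=(0, None))   # R^{4 x 3} x R^4 -> R^{4 x 2}
x, y = torch.randn(4, 3), torch.randn(4)
out = H_inner(x, y)
assert out.shape == (4, 2)
assert torch.allclose(out[2], H(x[2], y))
```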

Broadcasting naturally represents element-wise operations. A function on values $f: \mathbb{R}^{1} \rightarrow \mathbb{R}^{1}$, when broadcast, gives an operation $\mathbb{R}^{1 \times a} \rightarrow \mathbb{R}^{1 \times a}$. One-length axes do not change the shape of data, and can be freely amended or removed from pre-existing shapes by arrows. This means we diagram element-wise functions by drawing incoming and outgoing arrows, which represent the amendment and removal of a 1-length axis. This is shown in Figure 12.

Figure 11: Lifting an operation within a tuple segment gives inner broadcasting. We diagram it by having a wire from the target tuple segment over the function, reflecting the shape of the lifted function. (See cell 6, Jupyter notebook.)

Figure 12: Element-wise operations can be naturally shown with broadcasting. (See cell 7, Jupyter notebook.)

2.4 Linearity

Linear functions are an important class of operations for deep learning. Linear functions can be highly parallelized, especially with GPUs. Previous works have shown how graphically modeling linear functions and reimplementing algorithms can improve performance (Xu et al., 2023). Linear functions have immense regularity. Standard monoidal string diagrams rely on these properties to provide elegant graphical languages for various fields (Baez & Stay, 2010).
Linear functions are required to obey additivity and homogeneity, as shown in Figure 13. These operations are closed under composition, so applying linear maps onto each other gives another linear map. Importantly to us, they are natural with respect to broadcasting. This means for any two linear functions f f ff and g g gg, the equality in Figure 14 holds. This means they can be simultaneously broadcast. This lets a series of linear functions be efficiently parallelized and flexibly rearranged.
线性函数必须遵循加和性和齐次性,如图 13 所示。这些运算在组合下是封闭的,所以对线性映射进行运算会得到另一个线性映射。对我们来说,它们与广播是自然的。这意味着对于任何两个线性函数 f f ff g g gg ,图 14 中的等式成立。这意味着它们可以同时进行广播。这使得一系列线性函数可以高效并行化和灵活重排。

Figure 13: A subset of functions from $\mathbb{R}^{a}$ to $\mathbb{R}^{b}$ are linear, obeying additivity and homogeneity. This class of functions is closed under composition and has many important composition properties.

Figure 14: Linear functions are natural with respect to each other and broadcasting. This means the above equality holds, letting expressions be flexibly rearranged.
However, a pure monoidal string diagram has difficulty representing non-linear operations, a noted issue (Chiang et al., 2023). Neural circuit diagrams have Cartesian products and broadcasting, which are not generally analogous to how monoidal string diagrams combine linear functions. If we know functions are linear, we can use diagrams to efficiently reason about algorithms. By focusing on linear functions, we can take advantage of their parallelization properties.

2.4.1 Multilinearity

There is an important distinction between linear and multilinear operations. Inner products, for example, are multilinear. The inner product $u(\mathbf{x}, \mathbf{y}) = \mathbf{x} \cdot \mathbf{y} = \Sigma_{i} x[i] \cdot y[i]$ is linear with respect to each input. So, $u(\mathbf{x}+\mathbf{z}, \mathbf{y}) = u(\mathbf{x}, \mathbf{y}) + u(\mathbf{z}, \mathbf{y})$, and similarly for the second input. However, it is not linear with respect to element-wise addition over its entire input and output, as $u(\mathbf{x}_{1}+\mathbf{x}_{2}, \mathbf{y}_{1}+\mathbf{y}_{2}) \neq u(\mathbf{x}_{1}, \mathbf{y}_{1}) + u(\mathbf{x}_{2}, \mathbf{y}_{2})$. Compare this to copying $\Delta$, which we can show is linear.
$$
\begin{aligned}
\Delta: \mathbb{R}^{a} \rightarrow \mathbb{R}^{a \times a}, \quad & \mathbf{x}, \mathbf{y} \in \mathbb{R}^{a}, \; \lambda \in \mathbb{R} \\
\Delta(\mathbf{x}) &:= (\mathbf{x}, \mathbf{x}) \\
\Delta(\mathbf{x}+\mathbf{y}) &= (\mathbf{x}+\mathbf{y}, \mathbf{x}+\mathbf{y}) = (\mathbf{x}, \mathbf{x}) + (\mathbf{y}, \mathbf{y}) = \Delta(\mathbf{x}) + \Delta(\mathbf{y}) \\
\Delta(\lambda \cdot \mathbf{x}) &= (\lambda \cdot \mathbf{x}, \lambda \cdot \mathbf{x}) = \lambda \cdot (\mathbf{x}, \mathbf{x}) = \lambda \cdot \Delta(\mathbf{x})
\end{aligned}
$$
To simultaneously broadcast multilinear functions, we note that every multilinear operation equals an outer product followed by a linear function. The outer product is the ur-multilinear operation, taking a tuple input and returning a tensor, which takes the product over one element from each tuple segment. It is given by $\otimes: \mathbb{R}^{a} \times \mathbb{R}^{b} \rightarrow \mathbb{R}^{a \times b}$. All tuple-multilinear functions $M: \mathbb{R}^{a} \times \mathbb{R}^{b} \rightarrow \mathbb{R}^{c}$ have an associated tensor-linear form $M_{\lambda}: \mathbb{R}^{a \times b} \rightarrow \mathbb{R}^{c}$ such that $\otimes; M_{\lambda} = M$. We diagram the outer product by simply having a tuple line ending, which will often occur before a host of linear operations are simultaneously applied.
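A small PyTorch sketch of this factorization for the inner product, where the linear form $M_{\lambda}$ is the diagonal-then-sum map; shapes are hypothetical and this is not the notebook's code.

```python
import torch

# Every multilinear map factors as an outer product followed by a linear map.
a, b = 3, 4
x, y = torch.randn(a), torch.randn(b)
outer = torch.outer(x, y)                            # outer product: R^a x R^b -> R^{a x b}

# The inner product u(x, y) = sum_i x[i] * y[i] (with a = b) factors the same way:
x2, y2 = torch.randn(a), torch.randn(a)
u = torch.dot(x2, y2)
u_via_outer = torch.outer(x2, y2).diagonal().sum()   # linear map M_lambda on the outer product
assert torch.allclose(u, u_via_outer)
```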

2.4.2 Implementing Linearity and Common Operations

Key linear and multilinear operations can be implemented by the einops package (Rogozhnikov, 2021), leading to elegant implementations of algorithms. Some key linear operations are inner products, which sum over an axis; transposing, which swaps axes; views, which rearrange axes; and diagonalization, which makes axes take the same index.
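A hedged sketch of these operations using einops and torch.einsum, on assumed shapes; it is not the notebook's code.

```python
import torch
from einops import rearrange, reduce

A = torch.randn(4, 3)
B = torch.randn(3, 3)

summed     = reduce(A, 'a b -> a', 'sum')   # inner product with a ones vector: sums over the b axis
transposed = rearrange(A, 'a b -> b a')     # transpose: swaps axes
viewed     = rearrange(A, 'a b -> (a b)')   # view: rearranges a 4 x 3 tensor into a 12-vector
diagonal   = torch.einsum('aa->a', B)       # diagonalization: both axes take the same index
```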

With neural circuit diagrams, we can clearly show these operations. We show inner products with cups, transposing by crossing wires, views by solid lines consuming and producing their respective shapes, and diagonalization by wires merging. As these operations are linear, they can be simultaneously applied. The interaction of wires shows how incoming axes coordinate to produce outgoing axes. The einops package symbolically implements these operations by having incoming and outgoing axes correspond to symbols.
An example that combines many of these operations is a section of multi-head attention shown in 16 . It employs an outer product, a transpose, a diagonalization, an inner product, and an element-wise operation. The input to this algorithm is a tuple of tensors. Axes with an overline are a width, representing the amount of rather than detail per thing. Though a complex expression, we can break this figure up as in Figure 7 and implement the interaction of wires using einops, shown in Figure 17.
一个结合这些操作的示例是 16 中展示的多头注意力机制的一个部分。它使用了外积、转置、对角化、内积和逐元素运算。这个算法的输入是一组张量。带有上划线的轴表示的是数量而不是每个事物的细节。尽管这是一个复杂的表达式,但我们可以像图 7 中那样将其分解,并使用图 17 中显示的 einops 来实现线之间的交互。

Figure 15: An in-depth example of matrix multiplication, a key multilinear inner broadcast operation. Inner products are defined on vectors, $\mathbb{R}^{n} \times \mathbb{R}^{n} \rightarrow \mathbb{R}^{1}$. Then, we inner broadcast them to act over matrices. The new $\mathbb{R}^{p \times n} \times \mathbb{R}^{n \times q} \rightarrow \mathbb{R}^{p \times q}$ operation is matrix multiplication. Therefore, we see that matrix multiplication is an instance of an inner broadcast operation.

Figure 16: We can diagram a portion of multi-head attention, a sophisticated algorithm, with clarity using neural circuit diagrams.
Figure 17: This section of multi-head attention can be implemented using the einsum operation. Note the close relationship between diagrams and implementation and how diagrams reflect the memory states and operations of algorithms. (See cell 8, Jupyter notebook.)
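A guessed einsum sketch in the same spirit as Figure 17; the head count and depths are hypothetical, and the paper's cell 8 may differ in shapes and conventions.

```python
import torch

# Attention weights for h heads of key depth k, with data y (width n_kv) injected into x (width n_q).
h, k, n_q, n_kv = 2, 8, 5, 7
q = torch.randn(h, k, n_q)                          # queries per head
keys = torch.randn(h, k, n_kv)                      # keys per head
scores = torch.einsum('hkq,hkn->hqn', q, keys)      # outer product, diagonalize h, inner product over k
weights = (scores / k ** 0.5).softmax(dim=-1)       # element-wise scaling, SoftMax over the n_kv axis
assert weights.shape == (h, n_q, n_kv)
```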

2.4.3 Linear Algebra

All linear functions $f: \mathbb{R}^{a} \rightarrow \mathbb{R}^{b}$ have an associated $\mathbb{R}^{a \times b}$ tensor that uniquely identifies them. This hints at the ability to transpose this associated tensor to get a new linear function, $f^{T}: \mathbb{R}^{b} \rightarrow \mathbb{R}^{a}$. To extract these associated transposes, we use the unit. The unit for a shape $a$, given by $\eta: \mathbb{R}^{1} \rightarrow \mathbb{R}^{a \times a}$, is a linear map which returns $r$ times the $\mathbb{R}^{a \times a}$ identity matrix, for $r \in \mathbb{R}$.
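A minimal PyTorch sketch of an associated tensor, its associated transpose, and the unit, on assumed shapes; it is not the notebook's cell 9.

```python
import torch

a, b = 3, 2
M = torch.randn(a, b)                        # tensor associated with a linear map f: R^a -> R^b

def f(x):                                    # f: R^a -> R^b
    return torch.einsum('a,ab->b', x, M)

def f_T(y):                                  # associated transpose f^T: R^b -> R^a
    return torch.einsum('b,ab->a', y, M)

def eta(r):                                  # unit: returns r times the R^{a x a} identity matrix
    return r * torch.eye(a)

assert torch.allclose(f(torch.ones(a)), M.sum(dim=0))
```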

Note that the associated transpose, which sends a linear function $f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$ to $f^{T}: \mathbb{R}^{m} \rightarrow \mathbb{R}^{n}$ by transposing the associated $\mathbb{R}^{n \times m}$ tensor, is different to a transpose operation which sends $\mathbb{R}^{n \times m}$ to $\mathbb{R}^{m \times n}$. Associated transposes are used for mathematical rearrangement and are not usually directly implemented in code, though I provide code examples in cell 9 of the Jupyter notebook.
The unit and the inner product can be arranged to give the identity map $\mathbb{R}^{a} \rightarrow \mathbb{R}^{a}$, as in Figure 18. This identity map can be freely introduced, split into a unit and the identity matrix, and then used to rearrange operations. For example, this allows us to convert the linear map $F: \mathbb{R}^{a} \rightarrow \mathbb{R}^{b \times c}$ into $F^{T}: \mathbb{R}^{b \times a} \rightarrow \mathbb{R}^{c}$. These associated tensors and transposes can be used to better understand convolution (Section 3.3) and backpropagation (Section 3.6).

Figure 18: Linear operations have a flexible algebra. Simultaneous operations may increase efficiency (Xu et al., 2023). As the height of diagrams is related to the amount of data stored in independent segments, it gives a rough idea of memory usage. This is further explored in Section 3.6. (See cell 9, Jupyter notebook.)
These rearrangements can transpose specific axes. A linear operation $\mathbb{R}^{a \times b} \rightarrow \mathbb{R}^{c}$ has an associated $\mathbb{R}^{a \times b \times c}$ tensor. This tensor can be associated with various linear operations, such as $\mathbb{R}^{b \times a} \rightarrow \mathbb{R}^{c}$. These different forms are often of interest to us, as they can efficiently implement the reverse of operations (see Figures 25 and 30). To extract these rearrangements, we can selectively apply units and the inner product to reorient the direction of wires for linear operations.

3 Results: Key Applied Cases

3.1 Basic Multi-Layer Perceptron

Diagramming a basic multi-layer perceptron will help consolidate knowledge of neural circuit diagrams and show their value as a teaching and implementation tool, as shown in Figure 19. We use pictograms to represent components analogous to traditional circuit diagrams and to create more memorable diagrams (Borkin et al., 2016).

import torch.nn as nn

# Basic Image Recogniser
# This is a close copy of an introductory PyTorch tutorial:
# https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html

class BasicImageRecogniser(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        x = self.linear_relu_stack(x)
        # Convert the 10 class scores into probabilities that sum to 1.
        y_pred = nn.Softmax(dim=-1)(x)
        return y_pred
Figure 19: PyTorch code and a neural circuit diagram for a basic MNIST (digit recognition) neural network taken from an introductory PyTorch tutorial. Note the close correspondence between neural circuit diagrams and PyTorch code. (See cell 10, Jupyter notebook.)
Fully connected layers are shown as a boldface $\mathbf{L}$, with boldface indicating a component with internal learned weights. Their input and output sizes are inferred from the diagrams. If a fully connected layer is biased, we add a "+" in the bottom right. Traditional presentations easily miss this detail. For example, many implementations of the transformer, including those from PyTorch and Harvard NLP, have a bias in the query, key, and value fully connected layers despite Attention is All You Need (Vaswani et al., 2017) not indicating the presence of bias.
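As a hedged illustration (my own code, not the paper's), the presence or absence of the "+" corresponds to a single argument in PyTorch:

import torch.nn as nn

d_model, d_k = 512, 64

# Biased fully connected layer, drawn with a "+" in the diagram; this is
# PyTorch's default and what many transformer implementations use.
proj_biased = nn.Linear(d_model, d_k)

# Unbiased projection, matching a literal reading of Vaswani et al. (2017).
proj_unbiased = nn.Linear(d_model, d_k, bias=False)

print(proj_biased.bias is None, proj_unbiased.bias is None)  # False True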
Activation functions are just element-wise operations. Though ReLU (Krizhevsky et al., 2012) is the traditional choice, other activation functions may yield superior performance (Lee, 2023). With neural circuit diagrams, the activation function employed can be checked at a glance. SoftMax is a common operation that converts scores into probabilities, and we represent it with a left-facing triangle ($\triangleleft$), indicating values being "spread" to sum to 1.
As mentioned in Section 1.2, how operations such as SoftMax are broadcast can be ambiguous in traditional presentations. This is especially worrisome as SoftMax can be applied to shapes of arbitrary size. On the other hand, the neural circuit diagram method of displaying broadcasting makes it clear how SoftMax is applied.
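A small sketch (illustrative shapes, my own naming) of why the broadcast axis matters: the two calls below are different operations, even though both could be written "SoftMax(scores)" in a traditional presentation.

import torch

scores = torch.randn(3, 5, 7)                  # e.g. (heads, queries, keys)

over_keys    = torch.softmax(scores, dim=-1)   # each slice over the key axis sums to 1
over_queries = torch.softmax(scores, dim=-2)   # normalises over the query axis instead

print(over_keys.sum(dim=-1)[0, 0].item())      # ~1.0
print(over_queries.sum(dim=-2)[0, 0].item())   # ~1.0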

3.2 Neural Circuit Diagrams for the Transformer Architecture

In Section 1.2, we covered shortfalls in Attention is All You Need. We now have the tools to address these shortcomings using neural circuit diagrams. Figure 20 shows scaled-dot product attention. Unlike the approach from Attention is All You Need, the size of variables and the axes over which matrix multiplication and broadcasting occur is clearly shown. Figure 21 shows multi-head attention. The origin of queries, keys, and values are clear, and concatenating the separate attention heads using einsum naturally follows. Finally, we show the full transformer model in Figure 22 using neural circuit diagrams. Introducing such a large architecture requires an unavoidable level of description, and we take some artistic license and notate all the additional details.

Figure 20: The original equation for attention against a neural circuit diagram. The descriptions are unnecessary but clarify what is happening. Corresponds to Equation 1 and Figure 1.1. (See cell 11, Jupyter notebook.)

Figure 21: Neural circuit diagram for multi-head attention. Implementing matrix multiplication is clear with the cross-platform einops package (Rogozhnikov, 2021). Corresponds to Equations 2 and 3 and Figure 1.2. (See cell 12, Jupyter notebook.)
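The sketch below is in the spirit of Figure 21 rather than a copy of cell 12: einops.rearrange handles the head split and merge, and torch.einsum handles the contractions. All names and sizes (y_len, x_len, h, k, the unbiased projections) are illustrative assumptions.

import torch
from einops import rearrange

y_len, x_len, d_model, h, k = 6, 8, 512, 8, 64

W_q = torch.nn.Linear(d_model, h * k, bias=False)
W_k = torch.nn.Linear(d_model, h * k, bias=False)
W_v = torch.nn.Linear(d_model, h * k, bias=False)
W_o = torch.nn.Linear(h * k, d_model, bias=False)

def multi_head_attention(Yq, Xkv):
    # Split the fully connected outputs into h heads of depth k.
    Q = rearrange(W_q(Yq),  'y (h k) -> y h k', h=h)
    K = rearrange(W_k(Xkv), 'x (h k) -> x h k', h=h)
    V = rearrange(W_v(Xkv), 'x (h k) -> x h k', h=h)

    scores = torch.einsum('yhk,xhk->hyx', Q, K) / k ** 0.5
    attn = torch.softmax(scores, dim=-1)           # broadcast over the x axis
    heads = torch.einsum('hyx,xhk->yhk', attn, V)

    # Concatenating the heads simply undoes the earlier split.
    return W_o(rearrange(heads, 'y h k -> y (h k)'))

out = multi_head_attention(torch.randn(y_len, d_model), torch.randn(x_len, d_model))
print(out.shape)  # torch.Size([6, 512])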

Neural Circuit Diagram for Transformers

Neural circuit diagrams are a visual and explicit framework for representing deep learning models. Transformer architectures have changed the world, and we provide a novel and comprehensive diagram for the original architecture from Attention is All You Need. We describe all necessary components, enabling technically proficient novices to implement the architecture.

Dropouts ($\downarrow$) are element-wise functions that have a chance of simply returning their input, like an identity function, but other times return 0, deleting information. They reduce overreliance on any one neuron, stabilising the model.

As data travels through each of the 6 decoder stacks, information from the encoded input $A$ is repeatedly injected.

Figure 22: The fully diagrammed architecture from Attention is All You Need (Vaswani et al., 2017).

3.3 Convolution

Convolutions are critical to understanding computer vision architectures. Different architectures extend and use convolution in various ways, so implementing and understanding these architectures requires convolution and its variations to be accurately expressed. However, these extensions are often hard to explain. For example, PyTorch concedes that dilation is “harder to describe”. Transposed convolution is similarly challenging to communicate (Zeiler et al., 2010). A standardized means of notating convolution and its variations would aid in communicating the ideas already developed by the machine learning community and encourage more innovation of sophisticated architectures such as vision transformers (Dosovitskiy et al., 2021; Khan et al., 2022).
In deep learning, convolutions alter a tensor by taking weighted sums over nearby values. With standard bracket notation to access values, a convolution over a vector $v$ of length $\bar{x}$ by a kernel $w$ of length $k$ is given by the following. (Note: we subscript indexes by the axis over which they act.)
$$\operatorname{Conv}(v, w)\left[i_{\bar{y}}\right]=\sum_{j_{k}} v\left[i_{\bar{y}}+j_{k}\right] \cdot w\left[j_{k}\right]$$
The maximum $i_{\bar{y}}$ value is such that it does not exceed the maximum index for $v\left[i_{\bar{y}}+j_{k}\right]$. Starting indexing at 0, we get $\bar{x}-1=i_{\max }+j_{\max }=\bar{y}+k-2$, so the length of the output is therefore $\bar{y}=\bar{x}-k+1$. Note how convolution is a multilinear operation; it is linear with respect to each vector input $v$ and $w$. Therefore, it has a tensor-linear form with an associated tensor, the convolution tensor, that uniquely identifies it.
$$\begin{aligned} \operatorname{Conv}(v, w)\left[i_{\bar{y}}\right] &= \sum_{j_{k}} \sum_{\ell_{\bar{x}}}(\star)\left[i_{\bar{y}}, j_{k}, \ell_{\bar{x}}\right] \cdot v\left[\ell_{\bar{x}}\right] \cdot w\left[j_{k}\right] \\ (\star)\left[i_{\bar{y}}, j_{k}, \ell_{\bar{x}}\right] &= \begin{cases}1 & \text{if } \ell_{\bar{x}}=i_{\bar{y}}+j_{k}, \\ 0 & \text{else.}\end{cases} \end{aligned}$$
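As an informal check of this formula (not part of the paper's notebook), the summation can be compared against torch.nn.functional.conv1d, which computes exactly this un-flipped weighted sum:

import torch
import torch.nn.functional as F

x_bar, k = 10, 3
v = torch.randn(x_bar)
w = torch.randn(k)

# Direct translation of Conv(v, w)[i] = sum_j v[i + j] * w[j].
y_bar = x_bar - k + 1
naive = torch.stack([sum(v[i + j] * w[j] for j in range(k)) for i in range(y_bar)])

# conv1d applies the same sum once v and w are given (batch, channel, length) shape.
builtin = F.conv1d(v.view(1, 1, -1), w.view(1, 1, -1)).view(-1)

print(torch.allclose(naive, builtin, atol=1e-5))  # True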
We diagram convolution with Figure 23 below. We then transpose the linear operation into a more standard form, letting the input be to the left and the kernel be to the right.

Figure 23: Convolution is a multilinear operation, with an associated tensor. This tensor is transposed into a standard form.
We typically work with higher-dimensional convolutions, in which case the indexes act like tuples of indexes. We diagram axes that act in this tandem manner by placing them especially close to each other and labeling their length by one bolded symbol, akin to a vector. In 2 dimensions, the convolution tensor becomes:
$$(\star 2D)\left[i_{\overline{y_0}}, i_{\overline{y_1}}, j_{k_0}, j_{k_1}, \ell_{\overline{x_0}}, \ell_{\overline{x_1}}\right]= \begin{cases}1 & \text{if } \left(\ell_{\overline{x_0}}, \ell_{\overline{x_1}}\right)=\left(i_{\overline{y_0}}, i_{\overline{y_1}}\right)+\left(j_{k_0}, j_{k_1}\right), \\ 0 & \text{else.}\end{cases}$$
Figure 24 shows what convolution does. It takes an input, uses a linear operation to separate it into overlapping blocks, and then broadcasts an operation over each block. Using neural circuit diagrams, we now easily show the extensions of convolution. A standard convolution operation tensors the input with a channel depth axis, and feeds each block and the channel axis through a learned linear map.
Additionally, we can take an average, maximum, or some other operation rather than a linear map on each block. This lets us naturally display average or max pooling, among other operations. Displaying convolutions like this has further benefits for understanding. For example, $1 \times 1$ convolution tensors give a linear operation $\mathbb{R}^{\overline{\mathbf{x}}} \rightarrow \mathbb{R}^{\overline{\mathbf{x}} \times 1}$, which we recognize to be the identity. Therefore, $1 \times 1$ kernels are the same as broadcasting over the input.
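A short sketch (illustrative shapes, my own naming) confirming this: a $1 \times 1$ convolution gives the same result as broadcasting one linear map $\mathbb{R}^{c} \rightarrow \mathbb{R}^{k}$ over every pixel.

import torch

c, k, x0, x1 = 3, 5, 4, 4
image = torch.randn(1, c, x0, x1)

conv1x1 = torch.nn.Conv2d(c, k, kernel_size=1, bias=False)

# The same weights, applied pixel-by-pixel as a plain linear map.
W = conv1x1.weight.view(k, c)
broadcast = torch.einsum('kc,bcxy->bkxy', W, image)

print(torch.allclose(conv1x1(image), broadcast, atol=1e-5))  # True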

Figure 24: Convolution and related operations, clearly shown using neural circuit diagrams.
Stride and dilation scale the contribution of $i_{\bar{y}}$ or $j_{k}$ in the convolution tensor, increasing the speed at which the convolution scans over its inputs. This changes the convolution tensor into the form of Equation 4. We diagram these changes by adding the $s$ or $d$ multiplier where the axis meets the tensor, as in Figure 25. These multipliers also change the size of the output, allowing for downscaling operations.
$$\begin{aligned} (\star s, d)\left[i_{\bar{y}}, j_{k}, \ell_{\bar{x}}\right] &= \begin{cases}1 & \text{if } \ell_{\bar{x}}=s \cdot i_{\bar{y}}+d \cdot j_{k}, \\ 0 & \text{else.}\end{cases} \\ \bar{y} &= \left\lfloor\frac{\bar{x}-d \cdot(k-1)-1}{s}+1\right\rfloor \end{aligned}$$
We often want to make slight adjustments to the output size. This is done by padding the input with zeros around its borders. We can explicitly show the padding operation, but we make it implicit when the output dimension does not match the expectation given the input dimension, kernel dimension, stride, and dilation used.
Stride can make the output axis have a far lower dimension than the input axis. This is perfect for downscaling. We implement upscaling by transposing strided convolution, resulting in an operation with many more output blocks than actual inputs. We broadcast over these blocks to get our high-dimensional output.
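The output-length formula and the transposed, upscaling form can be checked directly; the sizes below are arbitrary, and the conv_transpose1d call is only meant to illustrate the transposed convolution tensor, not a particular figure from the paper:

import torch
import torch.nn.functional as F

x_bar, k, s, d = 17, 3, 2, 2
v = torch.randn(1, 1, x_bar)
w = torch.randn(1, 1, k)

out = F.conv1d(v, w, stride=s, dilation=d)
expected = (x_bar - d * (k - 1) - 1) // s + 1
print(out.shape[-1], expected)          # 7 7

# Transposing the strided convolution yields an upscaling operation: the
# output axis is longer than the input axis, (7 - 1) * s + d * (k - 1) + 1 = 17.
up = F.conv_transpose1d(out, w, stride=s, dilation=d)
print(up.shape[-1])                     # 17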

Figure 25: Stride, dilation, padding, and transposed convolution shown with neural circuit diagrams.
Transposed convolution is challenging to intuit in the typical approach to convolutions, which focuses on visualizing the scanning action rather than the decomposition of an image’s data structure into overlapping blocks. The blocks generated by transposed convolution can be broadcast with linear maps, maximum, average, or other operations, all easily shown using neural circuit diagrams.

3.4 Computer Vision

In computer vision, the design of deep learning architectures is critical. Computer vision tasks often have enormous inputs that are only tractable with a high degree of parallelization (Krizhevsky et al., 2012). Architectures can relate information at different scales (Luo et al., 2017), making architecture design task-dependent. Sophisticated architectures such as vision transformers combine the complexity of convolution and transformer architectures (Khan et al., 2022; Dehghani et al., 2023).

Figure 26: Residual networks with identity mappings and full pre-activation (IdResNet) (He et al., 2016) offered improvements over the original ResNet architecture. These improvements, however, are often missing from implementations. By making the design of the improved model clear, neural circuit diagrams can motivate common packages to be updated. (See cells 13-15, Jupyter notebook.)
These cases show why clear architecture design is promising for enhancing computer vision research. Neural circuit diagrams, therefore, are in a unique position to accelerate computer vision research, motivating parallelization, task-appropriate architecture design, and further innovation of sophisticated architectures.
As examples of neural circuit diagrams applied to computer vision architectures, I have diagrammed the identity residual network architecture (He et al., 2016) in Figure 26, which shows many innovations of ResNets not included in common implementations, as well as the UNet architecture (Ronneberger et al., 2015) in Figure 27, which shows how saving and loading variables may be displayed.
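For concreteness, a minimal sketch of the pre-activation ordering that Figure 26 depicts (normalize and activate before each convolution, leave the skip connection as an identity); the fixed channel count and 3x3 kernels are my choices, not prescribed by the diagram:

import torch.nn as nn

class PreActBlock(nn.Module):
    # Pre-activation residual block in the style of He et al. (2016).
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # The identity skip is untouched; only the body transforms the data.
        return x + self.body(x)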

Figure 27: The UNet architecture (Ronneberger et al., 2015) forms the basis of probabilistic diffusion models, state-of-the-art image generation tools (Rombach et al., 2022). UNets rearrange data in intricate ways, which we can show with neural circuit diagrams. Note that in this diagram we have modified the UNet architecture to pad the input of convolution layers. To get the original UNet architecture, the $\overline{\mathbf{x}}_{\lambda}$ values can be further distinguished as $\overline{\mathbf{x}}_{\lambda, j}$, the sizes of which can be added to the legend. (See cell 16, Jupyter notebook.)
Architectures often comprise sub-components, which we show as blocks that accept configurations. This is analogous to classes or functions that may appear in code. The code associated with this work implements these algorithms guided by the blocks from the diagrams.

3.5 Vision Transformer

Neural circuit diagrams reveal the degrees of freedom of architectures, motivating experimentation and innovation. A case study that reveals this is the vision transformer, which brings together many of the cases we have already covered. Its explanations (Khan et al., 2022, see Figure 2) suffer from the same issues as explanations of the original transformer (see Section 1.2), made worse by even more axes being present.
With neural circuit diagrams, visual attention mechanisms are as simple as replacing the $\bar{y}$ and $\bar{x}$ axes in Figure 21 with tandem $\overline{\mathbf{y}}$ and $\overline{\mathbf{x}}$ axes and setting $h=1$. As $1 \times 1$ convolutions are simply the identity map, $\operatorname{Conv}(v,[1])=v$, broadcasting a linear map $\mathbb{R}^{c} \rightarrow \mathbb{R}^{k}$ over each of the $\overline{\mathbf{y}}$ pixels is a $1 \times 1$ convolution. This leaves us with Figure 28 for a visual attention mechanism.

Figure 28: Using neural circuit diagrams, visual attention (Dosovitskiy et al., 2021) is shown to be a simple modification of multi-head attention (See Figure 21, Figure 16, cell 17, Jupyter notebook.)
This highly suggestive diagram calls us to experiment with the convolutions' stride, dilation, and kernel sizes, potentially streamlining models. The diagram clarifies how to implement multi-head visual attention with $h \neq 1$, especially using einsum similar to Figure 16. Additionally, $\overline{\mathbf{y}}$ does not need to match $\overline{\mathbf{x}}$. We could have $\overline{\mathbf{y}}$ be image data, and $\bar{x}$ be textual data without convolutions.
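A sketch of the single-head ($h=1$) case that Figure 28 suggests, with illustrative shapes: $1 \times 1$ convolutions produce per-pixel queries, keys, and values, and attention then runs over the flattened pixel positions.

import torch
import torch.nn as nn

c, k, H, W = 16, 8, 7, 7
to_q = nn.Conv2d(c, k, kernel_size=1, bias=False)   # per-pixel linear maps
to_k = nn.Conv2d(c, k, kernel_size=1, bias=False)
to_v = nn.Conv2d(c, k, kernel_size=1, bias=False)

def visual_attention(img):                            # img: (1, c, H, W)
    Q = to_q(img).flatten(2).squeeze(0).T             # (H*W, k)
    K = to_k(img).flatten(2).squeeze(0).T
    V = to_v(img).flatten(2).squeeze(0).T
    attn = torch.softmax(Q @ K.T / k ** 0.5, dim=-1)  # attention over pixel positions
    return attn @ V                                   # (H*W, k)

print(visual_attention(torch.randn(1, c, H, W)).shape)  # torch.Size([49, 8])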
This case study shows how neural circuit diagrams reveal the degrees of freedom of architectures and, therefore, motivate innovation while being precise in how algorithms should be implemented.

3.6 Differentiation: A Clear Improvement over Prior Methods

I leave the most mathematically dense part of this work for last. Neural circuit diagrams are intended to be used for the communication, implementation, tinkering, and analysis of architectures. These aims appeal to distinct audiences, and each should conceptualize neural circuit diagrams differently. The theoretical study of deep learning models requires understanding how individual components are composed into models and how properties scale during composition. Neural circuit diagrams are highly composed systems (see Figure 7) and thus provide a framework for studying composition. They have an underlying category, which is not the focus of this work.
Differentiation is an example of a property that is agreeable under composition. Differentiation is key to understanding information flows through architectures (He et al., 2016). The chain rule relates the derivative of composed functions to the composition of their derivatives and, therefore, provides a case study of how studying composition allows models to be understood. This analysis, however, is hampered by the fact that symbolically expressing the chain rule has quadratic length complexity relative to the number of composed functions.

$$\begin{aligned} h^{\prime}(x) &= h^{\prime}(x) \\ (g \circ h)^{\prime}(x) &= \left(g^{\prime} \circ h\right)(x) \cdot h^{\prime}(x) \\ (f \circ g \circ h)^{\prime}(x) &= \left(f^{\prime} \circ g \circ h\right)(x) \cdot\left(g^{\prime} \circ h\right)(x) \cdot h^{\prime}(x) \end{aligned}$$
This issue of symbolic methods proliferating symbols to keep track of relationships between objects was noted in the introduction. To understand how differentiation is composed and encourage more innovations like that of identity ResNets, which used differentiation to understand data flows (He et al., 2016), we need a graphical differentiation method.
Some graphical methods have been developed and applied to understanding differentiation in the context of deep learning, drawing on monoidal string diagrams from category theory (Shiebler et al., 2021; Cockett et al., 2019). As linearity cannot be completely ensured, these graphical methods are Cartesian, not expressing the details of axes. Other graphical approaches to neural networks could not incorporate differentiation, showing the significance of neural circuit diagrams being able to incorporate differentiation (Xu & Maruyama, 2022).

Differentiation, however, has key linear properties. Transposing differentiation is very important. These prior graphical methods require redefining differentiation for each transpose, making the relationships between these forms unclear. By detailing tensors and Cartesian products, our graphical presentation can show these linear relationships clearly. While drawing on their many theoretical contributions (Shiebler et al., 2021; Cockett et al., 2019), this work provides a significant advantage over these previous works.
In addition to theoretical understanding, clearly expressing differentiation is key to efficient implementations. Mathematically equivalent algorithms may have different time or memory complexities. The rules of linear algebra we have developed (see Figure 18) allow mathematically equivalent algorithms to be rearranged into more time or memory-efficient forms. To show the potential of neural circuit diagrams, we focus on backpropagation and analyze its time and memory complexity with neural circuit diagrams.

3.6.1 Modeling Differentiation

To model differentiation, consider a once differentiable function $F: \mathbb{R}^{a} \rightarrow \mathbb{R}^{b}$. It has a Jacobian which assigns to every point in $\mathbb{R}^{a}$ an $\mathbb{R}^{a \times b}$ tensor that describes its derivative, $JF: \mathbb{R}^{a} \rightarrow \mathbb{R}^{b \times a}$. Functions answer questions, and $JF$ answers how much a function responds to an infinitesimal change. The questions we ask $JF$ are where the change is happening ($\mathbb{R}^{a}$ input), how much it is changing by ($\mathbb{R}^{b}$ output axis), and which direction we are moving in ($\mathbb{R}^{a}$ output axis). Inner products over the output axes "ask" these questions. The chain rule can be defined with respect to the Jacobian and is diagrammed in Figure 29.
$$\left.\frac{\partial}{\partial x^{a}}(G F)^{c}\right|_{\mathbf{x}}=\sum_{b}\left(\left.\frac{\partial}{\partial x^{b}} G^{c}\right|_{F(\mathbf{x})} \cdot \left.\frac{\partial}{\partial x^{a}} F^{b}\right|_{\mathbf{x}}\right)$$
Figure 29: The chain rule expressed symbolically with index notation, and with neural circuit diagrams.
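As an informal numerical check of this rule (my own sketch, not a construction from the paper), autograd's Jacobians can be contracted over the shared $b$ axis:

import torch
from torch.autograd.functional import jacobian

a, b, c = 3, 4, 2
F = torch.nn.Linear(a, b)
G = torch.nn.Sequential(torch.nn.Linear(b, c), torch.nn.Tanh())
x = torch.randn(a)

# Left: Jacobian of the composite G(F(x)) at x, a (c, a) tensor.
lhs = jacobian(lambda t: G(F(t)), x)

# Right: contract the shared b axis, exactly the sum over b above.
rhs = jacobian(G, F(x)) @ jacobian(F, x)

print(torch.allclose(lhs, rhs, atol=1e-5))  # True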
This expression is convoluted, and will struggle to scale. Instead, we transpose JF into the forward derivative as per Cockett et al. (2019)'s definition 4. This form is more agreeable for the chain rule, and is the first transpose we employ.

Figure 30: Definition of the forward derivative, and how functions compose under it.
This naturally scales with depth. Furthermore, we can define a $(\_, D\_)$ functor, a composition-preserving map, from once differentiable functions $F: \mathbb{R}^{a} \rightarrow \mathbb{R}^{b}$ to $(F, DF): \mathbb{R}^{a} \times \mathbb{R}^{a} \rightarrow \mathbb{R}^{b} \times \mathbb{R}^{b}$ (Fong et al., 2019; Cruttwell et al., 2021; Cockett et al., 2019). Per the chain rule, $(\_, D\_)[F ; G]=(\_, D\_) F ;(\_, D\_) G$. This is shown in Figure 31.

Figure 31: We have a composition preserving map, the $(\_, D\_)$ functor, that maps vertical sections to vertical sections, implementing the chain rule.
This composes elegantly. However, when optimizing an algorithm, we are not interested in how much a known infinitesimal change will alter an output. Rather, given some target change in output, we are interested in which direction will best achieve it. We can do this by calculating the change in the output for each element of $a$ in parallel, effectively running the algorithm multiple times. This is done by applying the unit and broadcasting. Furthermore, we sum the infinitesimal change over some target $\mathbb{R}^{c}$ value. The inner product does this. For an algorithm $F ; \mathcal{L}$, where $\mathcal{L}$ is a loss function to $\mathbb{R}^{1}$, we can do optimization according to Figure 32.

Figure 32: We turn a small chain rule expression into an optimization function by applying the inner product over the target direction and derivative output. An inner product over an axis of length 1 is just multiplication. Using the unit, we run this algorithm for every input degree of freedom, broadcasting over the $a$ axis.
However, the forward derivative has large time complexity. Applying a linear function amounts to matrix multiplication. Therefore, a linear map $f: \mathbb{R}^{a} \rightarrow \mathbb{R}^{b}$ applied onto $\mathbb{R}^{a}$ will require $a \times b$ operations. In general, broadcasting multiplies time and memory complexity. The memory usage of an algorithm is related to the number of elements it stores at any step in the algorithm. We use these tricks to analyze the order of the time and space complexity for the above process.

Figure 33: An analysis of the space and time complexity of the naive optimization algorithm.
We observe that this has a high time complexity, quadratic with respect to the size of $X$. In practice, we avoid the forward derivative, also called the Jacobian-vector product or JVP, in favor of the reverse derivative, or VJP, which more directly implements the above process. We define it in relation to the Jacobian and forward derivative in Figure 34. In Figure 36, we use our rules of linear algebra to re-express the optimization algorithm in terms of the reverse derivative and show the far lower memory and time complexity required.
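For readers who want to poke at this distinction in code, PyTorch exposes both derivatives; the sketch below (illustrative model and sizes) contrasts one forward-derivative (JVP) pass per input direction with a single reverse-derivative (VJP) pass that recovers the whole gradient, i.e. backpropagation:

import torch
from torch.autograd.functional import jvp, vjp

a, c = 1000, 1
model = torch.nn.Sequential(torch.nn.Linear(a, 64), torch.nn.ReLU(), torch.nn.Linear(64, c))
x = torch.randn(a)

# Forward derivative: one pass per input direction, so recovering the full
# gradient this way needs a passes of the network, one per input coordinate.
direction = torch.zeros(a)
direction[0] = 1.0
_, dy = jvp(model, x, direction)

# Reverse derivative: a single pass pulls the target change on the scalar
# output back through the network, giving the whole gradient at once.
_, grad = vjp(model, x, torch.ones(c))
print(dy.shape, grad.shape)   # torch.Size([1]) torch.Size([1000])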

Figure 34: The definition of the forward and reverse derivative with respect to the Jacobian. This aligns with the Jacobian-vector product and the vector-Jacobian product, respectively.

Figure 35: A full expression of how the unit and the forward derivative give the Jacobian. This demonstrates how linear algebra principles can illustrate the relationships between different forms of the derivative.

Figure 36: Using the previously developed linear algebra rules (see Figure 18), we rearrange our optimization algorithm to use the reverse instead of the forward derivative. The diagrams then reveal the computational advantages of the new algorithm, called backpropagation.

4 Conclusion

Neural circuit diagrams are a method of addressing the lingering problem of unclear communication in the deep learning community. My introduction showed why this is a concern and argued why a system of axis wires and dashed lines is needed, solving the challenge of reconciling the detail of tensor axes and the flexibility of tuples. This work covered a host of architectures to breed familiarity, encourage adoption, and evidence the utility of neural circuit diagrams for understanding and implementing models.
Neural circuit diagrams are appealing to diverse users, from students first learning neural networks to specialized theoretical researchers investigating their mathematical foundations. This leads to immense future potential. Future work can see neural circuit diagrams explained in a concise and accessible manner for a lay audience. More diagrams can be modelled and formal standards developed. Finally, their mathematical foundation can be more fully expressed. The underlying category theory structure can be fully investigated (Abbott, 2023, Chapter 3), allowing models to incorporate probabilistic functions (Perrone, 2022; Fritz et al., 2023), additional data types, or even quantum circuits.

Acknowledgements

Mathcha was used to write equations and draw diagrams. The Harvard NLP annotated transformer was invaluable for drawing Figure 22. I thank the anonymous TMLR reviewers for providing useful feedback on the paper throughout its many rewrites, and my supervisor Dr Yoshihiro Maruyama for his support, and pointing me towards applied category theory for machine learning. This work was supported by JST (JPMJMS2033-02; JPMJFR206P).

References

Vincent Abbott. Robust Diagrams for Deep Learning Architectures: Applications and Theory. Honours Thesis, The Australian National University, Canberra, October 2023.
Steve Awodey. Category Theory. Oxford University Press, Inc., USA, 2nd edition, July 2010. ISBN 978-0-19-923718-0.
Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization. CoRR, abs/1607.06450, 2016. URL http://arxiv.org/abs/1607.06450. arXiv: 1607.06450.
John C. Baez and Mike Stay. Physics, Topology, Logic and Computation: A Rosetta Stone. volume 813, pp. 95-172. 2010. doi: 10.1007/978-3-642-12821-9_2. URL http://arxiv.org/abs/0903.0340. arXiv:0903.0340 [quant-ph].
Jacob Biamonte and Ville Bergholm. Tensor Networks in a Nutshell, July 2017. URL http://arxiv.org/abs/ 1708.00006. arXiv:1708.00006 [cond-mat, physics:gr-qc, physics:hep-th, physics:math-ph, physics:quantph].
Michelle A. Borkin, Zoya Bylinskii, Nam Wook Kim, Constance May Bainbridge, Chelsea S. Yeh, Daniel Borkin, Hanspeter Pfister, and Aude Oliva. Beyond Memorability: Visualization Recognition and Recall. IEEE Transactions on Visualization and Computer Graphics, 22(1):519-528, January 2016. ISSN 19410506. doi: 10.1109/TVCG.2015.2467732. Conference Name: IEEE Transactions on Visualization and Computer Graphics.
David Chiang, Alexander M. Rush, and Boaz Barak. Named Tensor Notation, January 2023. URL http: //arxiv.org/abs/2102.13196. arXiv:2102.13196 [cs].
Robin Cockett, Geoffrey Cruttwell, Jonathan Gallagher, Jean-Simon Pacaud Lemay, Benjamin MacAdam, Gordon Plotkin, and Dorette Pronk. Reverse derivative categories, October 2019. URL http://arxiv.org/ abs/1910.07065. arXiv:1910.07065 [cs, math].

G. S. H. Cruttwell, Bruno Gavranović, Neil Ghani, Paul Wilson, and Fabio Zanasi. Categorical Foundations of Gradient-Based Learning, July 2021. URL http://arxiv.org/abs/2103.01931. arXiv:2103.01931 [cs, math].
Geoffrey S. H. Cruttwell, Bruno Gavranovic, Neil Ghani, Paul W. Wilson, and Fabio Zanasi. Categorical Foundations of Gradient-Based Learning. In Ilya Sergey (ed.), Programming Languages and Systems - 31st European Symposium on Programming, ESOP 2022, Held as Part of the European Joint Conferences on
Theory and Practice of Software, ETAPS 2022, Munich, Germany, April 2-7, 2022, Proceedings, volume 13240 of Lecture Notes in Computer Science, pp. 1-28. Springer, 2022. doi: 10.1007/978-3-030-99336-8_1. URL https://doi.org/10.1007/978-3-030-99336-8_1.
Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Lučić, and Neil Houlsby. Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution, July 2023. URL http://arxiv.org/abs/2307.06304. arXiv:2307.06304 [cs].
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.
Chris Drummond. Replicability is not reproducibility: Nor is it good science. Proceedings of the Evaluation Methods for Machine Learning Workshop at the 26th ICML, January 2009.
Brendan Fong and David I. Spivak. An Invitation to Applied Category Theory: Seven Sketches in Compositionality. Cambridge University Press, 1 edition, July 2019. ISBN 978-1-108-66880-4 978-1-108-48229-5 978-1-108-71182-1. doi: 10.1017/9781108668804. URL https://www.cambridge.org/core/product/ identifier/9781108668804/type/book.
Brendan Fong, David I. Spivak, and Rémy Tuyéras. Backprop as Functor: A compositional perspective on supervised learning. In 34th Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2019, Vancouver, BC, Canada, June 24-27, 2019, pp. 1-13. IEEE, 2019. doi: 10.1109/LICS.2019.8785665.
Tobias Fritz, Tomáš Gonda, Paolo Perrone, and Eigil Fjeldgren Rischel. Representable Markov Categories and Comparison of Statistical Experiments in Categorical Probability. Theoretical Computer Science, 961: 113896, June 2023. ISSN 03043975. doi: 10.1016/j.tcs.2023.113896. URL http://arxiv.org/abs/2010.07416. arXiv:2010.07416 [cs, math, stat].
Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT Press, 2016.
John Hayes and Diana Bajzek. Understanding and Reducing the Knowledge Effect: Implications for Writers. Written Communication, 25:104-118, January 2008.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385. arXiv: 1512.03385.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity Mappings in Deep Residual Networks. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (eds.), Computer Vision - ECCV 2016-14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, volume 9908 of Lecture Notes in Computer Science, pp. 630-645. Springer, 2016. doi: 10.1007/978-3-319-46493-0_ 38. URL https://doi.org/10.1007/978-3-319-46493-0_38.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/ hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html.
Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167. arXiv: 1502.03167.
Sayash Kapoor and Arvind Narayanan. Leakage and the Reproducibility Crisis in ML-based Science, July 2022. URL http://arxiv.org/abs/2207.07048. arXiv:2207.07048 [cs, stat].
Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in Vision: A Survey. ACM Computing Surveys, 54(10s):1-41, January 2022. ISSN 0360-0300, 1557-7341. doi: 10.1145/3505244. URL https://dl.acm.org/doi/10.1145/3505244.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger (eds.), Advances in neural information processing systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings. neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
Minhyeok Lee. GELU Activation Function in Deep Learning: A Comprehensive Mathematical Analysis and Performance, May 2023. URL http://arxiv.org/abs/2305.12073. arXiv:2305.12073 [cs].
Yuqing Li, Tao Luo, and Nung Kwan Yip. Towards an Understanding of Residual Networks Using Neural Tangent Hierarchy (NTH). CSIAM Transactions on Applied Mathematics, 3(4):692-760, June 2022. ISSN 2708-0560, 2708-0579. doi: 10.4208/csiam-am.SO-2021-0053. URL http://arxiv.org/abs/2007.03714. arXiv:2007.03714 [cs, math, stat].
Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A Survey of Transformers, June 2021. URL http://arxiv.org/abs/2106.04554. arXiv:2106.04554 [cs].
Hanxiao Liu, Zihang Dai, David R. So, and Quoc V. Le. Pay Attention to MLPs, June 2021. URL http://arxiv.org/abs/2105.08050. arXiv:2105.08050 [cs].
Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks, January 2017. URL https://arxiv.org/abs/1701.04128v2.
José Meseguer and Ugo Montanari. Petri nets are monoids. Information and Computation, 88(2):105-155, October 1990. ISSN 0890-5401. doi: 10.1016/0890-5401(90)90013-8. URL https://www.sciencedirect. com/science/article/pii/0890540190900138.

T. Murata. Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4):541-580, April 1989. ISSN 1558-2256. doi: 10.1109/5.24143. Conference Name: Proceedings of the IEEE.
Alex Nichol and Prafulla Dhariwal. Improved Denoising Diffusion Probabilistic Models, February 2021. URL http://arxiv.org/abs/2102.09672. arXiv:2102.09672 [cs, stat].
Paolo Perrone. Markov Categories and Entropy. CoRR, abs/2212.11719, 2022. doi: 10.48550/ARXIV.2212. 11719. URL https://doi.org/10.48550/arXiv.2212.11719. arXiv: 2212.11719.
Mary Phuong and Marcus Hutter. Formal Algorithms for Transformers. CoRR, abs/2207.09238, 2022. doi: 10.48550/ARXIV.2207.09238. URL https://doi.org/10.48550/arXiv.2207.09238. arXiv: 2207.09238.

S. Pinker. The sense of style: The thinking person’s guide to writing in the 21st century. Penguin Publishing Group, 2014. ISBN 978-0-698-17030-8. URL https://books.google.com.au/books?id=FzRBAwAAQBAJ.
Edward Raff. A Step Toward Quantifying Independently Reproducible Machine Learning Research. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https:// proceedings.neurips.cc/paper_files/paper/2019/hash/c429429bf1f2af051f2021dc92a8ebea-Abstract.html.
Alex Rogozhnikov. Einops: Clear and Reliable Tensor Manipulations with Einstein-like Notation. October 2021. URL https://openreview.net/forum?id=oapKSVM2bcj.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models, April 2022. URL http://arxiv.org/abs/2112.10752. arXiv:2112.10752 [cs].
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. CoRR, abs/1505.04597, 2015. URL http://arxiv.org/abs/1505.04597. arXiv: 1505.04597.
Lee Ross, David Greene, and Pamela House. The “false consensus effect”: An egocentric bias in social perception and attribution processes. Journal of Experimental Social Psychology, 13(3):279-301, 1977.
李•罗斯、大卫•格林和帕米拉•豪斯。"虚假共识效应":社会感知和归因过程中的自我中心偏差。《实验社会心理学杂志》,13(3):279-301, 1977。
Sadoski. Impact of concreteness on comprehensibility, interest. Journal of Educational Psychology, 85(2): 291-304, 1993.
萨多斯基。具体性对可理解性和兴趣的影响。教育心理学杂志,85(2):291-304,1993。
Andrew M Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan D Tracey, and David D Cox. On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124020, December 2019. ISSN 1742-5468. doi: 10.1088/1742-5468/ ab3985. URL https://iopscience.iop.org/article/10.1088/1742-5468/ab3985.
萨克斯、班萨尔、达佩洛、阿德瓦尼、科尔钦斯基、特雷西和科克斯关于深度学习信息瓶颈理论的论文。发表于 2019 年 12 月的《统计物理学报:理论与实验》,ISSN 1742-5468, DOI 10.1088/1742-5468/ab3985。网址 https://iopscience.iop.org/article/10.1088/1742-5468/ab3985。
Peter Selinger. A survey of graphical languages for monoidal categories, August 2009. URL https://arxiv. org/abs/0908.3347v1.
彼得·塞林格。单位范畴的图形语言综述,2009 年 8 月。URL https://arxiv.org/abs/0908.3347v1。
Dan Shiebler, Bruno Gavranovic, and Paul W. Wilson. Category Theory in Machine Learning. CoRR, abs/2106.07032, 2021. URL https://arxiv.org/abs/2106.07032. arXiv: 2106.07032.
丹·希布勒、布鲁诺·加夫拉诺维奇和保罗·W·威尔逊。机器学习中的范畴论。CoRR, abs/2106.07032, 2021. URL https://arxiv.org/abs/2106.07032. arXiv: 2106.07032.
Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway Networks. CoRR, abs/1505.00387, 2015. URL http://arxiv.org/abs/1505.00387. arXiv: 1505.00387.
卢佩什·库马尔·斯里瓦斯塔瓦、克劳斯·格雷夫和尤尔根·施密德胡柏。Highway Networks。CoRR, abs/1505.00387, 2015。网址 http://arxiv.org/abs/1505.00387。arXiv: 1505.00387。
Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive Network: A Successor to Transformer for Large Language Models, August 2023. URL http://arxiv.org/abs/2307.08621. arXiv:2307.08621 [cs].
孙宇涛、董力、黄绍汉、马舒铭、夏玉清、薛建龙、王建勇和魏富如。保持性网络:一个大型语言模型的继任者,2023 年 8 月。 URL http://arxiv.org/abs/2307.08621. arXiv:2307.08621 [cs].
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998-6008, 2017. URL https://proceedings.neurips.cc/ paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa- Abstract.html.
阿什什·瓦萨尼、诺亚·沙泽、尼基·帕玛尔、雅各·乌斯库雷特、利安·琼斯、艾登·恩·戈麦斯、卢卡斯·凯撒和伊利亚·波罗苏金。注意力是你所需要的一切。收录在伊莎贝尔·吉恩、乌尔里克·冯·卢克斯堡、萨米·本吉奥、汉娜·M·沃勒奇、罗布·弗格斯、S.V.N.维什瓦纳坦和罗曼·加内特(编)的《神经信息处理系统进展 30》中,2017 年 12 月 4-9 日在美国加利福尼亚州长滩召开的神经信息处理系统年会上,第 5998-6008 页,2017 年。URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html。
Paul Wilson and Fabio Zanasi. Categories of Differentiable Polynomial Circuits for Machine Learning, May 2022. URL http://arxiv.org/abs/2203.06430. arXiv:2203.06430 [cs, math].
保罗·威尔逊和法比奥·扎纳西。 用于机器学习的可微分多项式电路类别,2022 年 5 月。 URL http://arxiv.org/abs/2203.06430。 arXiv:2203.06430 [cs, math]。
Tom Xu and Yoshihiro Maruyama. Neural String Diagrams: A Universal Modelling Language for Categorical Deep Learning. In Ben Goertzel, Matthew Iklé, and Alexey Potapov (eds.), Artificial General Intelligence, Lecture Notes in Computer Science, pp. 306-315, Cham, 2022. Springer International Publishing. ISBN 978-3-030-93758-4. doi: 10.1007/978-3-030-93758-4_32.
徐凯和丸山义弘。神经弦图:面向分类深度学习的通用建模语言。收录于 Ben Goertzel、Matthew Iklé和 Alexey Potapov 主编的《人工通用智能》,计算机科学讲座论文集,第 306-315 页,2022 年,斯普林格国际出版集团。ISBN 978-3-030-93758-4。doi: 10.1007/978-3-030-93758-4_32。
Yao Lei Xu, Kriton Konstantinidis, and Danilo P. Mandic. Graph Tensor Networks: An Intuitive Framework for Designing Large-Scale Neural Learning Systems on Multiple Domains. CoRR, abs/2303.13565, 2023. doi: 10.48550/ARXIV.2303.13565. URL https://doi.org/10.48550/arXiv.2303.13565. arXiv: 2303.13565 .
姚磊栩、Kriton Konstantinidis 和 Danilo P. Mandic。图张量网络:一种直观的框架,用于在多个领域设计大规模神经网络学习系统。CoRR,abs/2303.13565,2023。doi: 10.48550/ARXIV.2303.13565。URL https://doi.org/10.48550/arXiv.2303.13565。arXiv: 2303.13565。
Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Rob Fergus. Deconvolutional networks. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2528-2535, San Francisco, CA, USA, June 2010. IEEE. ISBN 978-1-4244-6984-0. doi: 10.1109/CVPR.2010.5539957. URL http://ieeexplore.ieee.org/document/5539957/.
马修·D·泽勒、迪利普·克里希南、格雷厄姆·W·泰勒和罗布·弗格斯。去卷积网络。在 2010 年 6 月于旧金山,加利福尼亚州,美国召开的 2010 IEEE 计算机视觉和模式识别学会会议上,第 2528-2535 页。IEEE。ISBN 978-1-4244-6984-0。doi: 10.1109/CVPR.2010.5539957。网址 http://ieeexplore.ieee.org/document/5539957/。
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization, February 2017. URL http://arxiv.org/abs/1611.03530. arXiv:1611.03530 [cs].
张驰远、萨米·本吉奥、莫里茨·哈尔特、本杰明·雷赫特和奥里奥尔·维尼亚尔斯。理解深度学习需要重新思考泛化,2017 年 2 月。网址 http://arxiv.org/abs/1611.03530。arXiv:1611.03530 [cs]。

A Jupyter Notebook (see: vtabbott/Neural-Circuit-Diagrams)

import torch
import torch.nn as nn  # used by the nn.Module-based examples below
import typing
import functorch
import itertools

2.3 Tensors

We diagram tensors, which can be vertically and horizontally decomposed.


# This diagram shows a function h: 3, 4 2, 6 -> 1 2
# constructed out of f: 4 2, 6 -> 3 3 and g: 3, 3 3 -> 1 2.
# We use assertions and random outputs to represent
# generic functions, and how diagrams relate to code.
T = torch.Tensor

def f(x0: T, x1: T):
    """ f: 4 2, 6 -> 3 3 """
    assert x0.size() == torch.Size([4, 2])
    assert x1.size() == torch.Size([6])
    return torch.rand([3, 3])

def g(x0: T, x1: T):
    """ g: 3, 3 3 -> 1 2 """
    assert x0.size() == torch.Size([3])
    assert x1.size() == torch.Size([3, 3])
    return torch.rand([1, 2])

def h(x0: T, x1: T, x2: T):
    """ h: 3, 4 2, 6 -> 1 2 """
    assert x0.size() == torch.Size([3])
    assert x1.size() == torch.Size([4, 2])
    assert x2.size() == torch.Size([6])
    return g(x0, f(x1, x2))

h(torch.rand([3]), torch.rand([4, 2]), torch.rand([6]))
tensor([[0.6837, 0.6853]])

2.3.1 Indexes

Figure 8: Indexes

We express subtensor extractions, grabbing A[2,:], by an index applied on the relevant axis.

# Extracting a subtensor is a process we are familiar with. Consider,
# A (4 3) tensor
table = torch.arange(0, 12).view(4, 3)
row = table[2,:]
row
tensor([6, 7, 8])

Figure 9: Subtensors

# Different orders of access give the same result.
# Set up a random (5 7) tensor
a, b = 5, 7
Xab = torch.rand([a] + [b])
# Show that all pairs of indexes give the same result
for ia, jb in itertools.product(range(a), range(b)):
    assert Xab[ia, jb] == Xab[ia, :][jb]
    assert Xab[ia, jb] == Xab[:, jb][ia]

2.3.2 Broadcasting

Figure 10: Broadcasting
a, b, c, d = [3], [2], [4], [3]
T = torch.Tensor
# We have some function from a to b;
def G(Xa: T) -> T:
    """ G: a -> b """
    return sum(Xa**2) + torch.ones(b)
# We could bootstrap a definition of broadcasting,
# Note that we are using spaces to indicate tensoring.
# We will use commas for tupling, which is in line with
# standard notation while writing code.
def Gc(Xac: T) -> T:
    """ G c: a c -> b c """
    Ybc = torch.zeros(b + c)
    for jc in range(c[0]):
        Ybc[:, jc] = G(Xac[:, jc])
    return Ybc
# Or use a PyTorch command,
# G *: a * -> b *
Gs = torch.vmap(G, -1, -1)
# We feed a random input, and see whether applying an
# index before or after gives the same result.
Xac = torch.rand(a + c)
for jc in range(c[0]):
    assert torch.allclose(G(Xac[:, jc]), Gc(Xac)[:, jc])
    assert torch.allclose(G(Xac[:, jc]), Gs(Xac)[:, jc])
# This shows how our definition of broadcasting lines up with that used by PyTorch vmap.
Figure 11: Inner Broadcasting


a, b, c, d = [3], [2], [4], [3]
T = torch.Tensor
# We have some function which can be inner broadcast,
def H(Xa: T, Xd: T) -> T:
    """ H: a, d -> b """
    return torch.sum(torch.sqrt(Xa**2)) + torch.sum(torch.sqrt(Xd**2)) + torch.ones(b)
# We can bootstrap inner broadcasting,
def Hc0(Xca: T, Xd: T) -> T:
    """ c0 H: c a, d -> c b """
    # Recall that we defined a, b, c, d as [_] arrays.
    Ycb = torch.zeros(c + b)
    for ic in range(c[0]):
        Ycb[ic, :] = H(Xca[ic, :], Xd)
    return Ycb
# But vmap offers a clear way of doing it,
# *0 H: * a, d -> * b
Hs0 = torch.vmap(H, (0, None), 0)
# We can show this satisfies Definition 2.14 by,
Xca = torch.rand(c + a)
Xd = torch.rand(d)
for ic in range(c[0]):
    assert torch.allclose(Hc0(Xca, Xd)[ic, :], H(Xca[ic, :], Xd))
    assert torch.allclose(Hs0(Xca, Xd)[ic, :], H(Xca[ic, :], Xd))
Figure 12: Elementwise operations

# Elementwise operations are implemented as usual, ie
def f(x):
    """ f: 1 -> 1 """
    return x ** 2
# We broadcast an elementwise operation,
# f *: * -> *
fs = torch.vmap(f)
Xa = torch.rand(a)
for i in range(a[0]):
    # And see that it aligns with the index before = index after framework.
    assert torch.allclose(f(Xa[i]), fs(Xa)[i])
    # But, elementwise operations are implied, so no special implementation is needed.
    assert torch.allclose(f(Xa[i]), f(Xa)[i])

2.4 Linearity

2.4.2 Implementing Linearity and Common Operations

Figure 17: Multi-head Attention and Einsum

import math
import einops
x, y, k, h = 5, 3, 4, 2
Q = torch.rand([y, k, h])
K = torch.rand([x, k, h])
# Local memory contains,
# Q: y k h # K: x k h
# Outer products, transposes, inner products, and
# diagonalization reduce to einops expressions.
# Transpose K,
K = einops.einsum(K, 'x k h -> k x h')
# Outer product and diagonalize,
X = einops.einsum(Q, K, 'y k1 h, k2 x h -> y k1 k2 x h')
# Inner product,
X = einops.einsum(X, 'y k k x h -> y x h')
# Scale,
X = X / math.sqrt(k)
Q = torch.rand([y, k, h])
K = torch.rand([x, k, h])
# Local memory contains,
# Q: y k h # K: x k h
X = einops.einsum(Q, K, 'y k h, x k h -> y x h')
X = X / math.sqrt(k)

2.4.3 Linear Algebra

Figure 18: Graphical Linear Algebra


# We will do an exercise implementing some of these equivalences.
# The reader can follow this exercise to get a better sense of how linear functions can be implemented,
# and how different forms are equivalent.
a, b, c, d = [3], [4], [5], [3]
# We will be using this function a lot,
es = einops.einsum
# F: a b c
F_matrix = torch.rand(a + b + c)

# As an exercise we will show that the linear map F: a -> b c can be transposed in two ways.
# Either, we can broadcast, or take an outer product. We will show these are the same.

# Transposing by broadcasting
#
def F_func(Xa: T):
    """ F: a -> b c """
    return es(Xa, F_matrix, 'a, a b c -> b c')
# * F: * a -> * b c
F_broadcast = torch.vmap(F_func, 0, 0)
# We then reduce it, as in the diagram,
# b a -> b b c -> c
def F_broadcast_transpose(Xba: T):
    """ (b F) (.b c): b a -> c """
    Xbbc = F_broadcast(Xba)
    return es(Xbbc, 'b b c -> c')

# Transposing by linearity
#
# We take the outer product of Id(b) and F, and follow up with an inner product.
# This gives us,
F_outerproduct = es(torch.eye(b[0]), F_matrix, 'b0 b1, a b2 c -> b0 b1 a b2 c')
# Think of this as Id(b) F: b0 a -> b1 b2 c arranged into an associated b0 b1 a b2 c tensor.
# We then take the inner product. This gives a (b a c) matrix, which can be used for a (b a -> c) map.
F_linear_transpose = es(F_outerproduct, 'b B a B c -> b a c')
# We contend that these are the same.
#
Xba = torch.rand(b + a)
assert torch.allclose(
    F_broadcast_transpose(Xba),
    es(Xba, F_linear_transpose, 'b a, b a c -> c'))

# Furthermore, let's prove the unit-inner product identity.
#
# The first step is an outer product with the unit,
outerUnit = lambda Xb: es(Xb, torch.eye(b[0]), 'b0, b1 b2 -> b0 b1 b2')
# The next is an inner product over the first two axes,
dotOuter = lambda Xbbb: es(Xbbb, 'b0 b0 b1 -> b1')
# Applying both of these *should* be the identity, and hence leave any input unchanged.
Xb = torch.rand(b)
assert torch.allclose(
    Xb,
    dotOuter(outerUnit(Xb)))
# Therefore, we can confidently use the expressions in Figure 18 to manipulate expressions.

3.1 Basic Multi-Layer Perceptron

Figure 19: Implementing a Basic Multi-Layer Perceptron
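The notebook cell for this figure did not survive the snapshot. As a stand-in, here is a minimal sketch of a basic multi-layer perceptron in the same nn.Sequential style; the class name BasicMLP and the layer widths (784, 256, 10) are illustrative assumptions, not taken from the original notebook.

# A minimal sketch only; the widths and names here are illustrative assumptions.
class BasicMLP(nn.Sequential):
    def __init__(self, x=784, h=256, y=10):
        super().__init__(
            nn.Linear(x, h),
            nn.ReLU(),
            nn.Linear(h, y),
            nn.Softmax(-1),
        )
# Shape check: x -> y
assert list(BasicMLP()(torch.rand([784])).size()) == [10]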

3.2 Neural Circuit Diagrams for the Transformer Architecture

Figure 20: Scaled Dot-Product Attention


# Note that we need to accommodate batches, hence the ... to capture additional axes.
# We can do the algorithm step by step,
def ScaledDotProductAttention(q: T, k: T, v: T) -> T:
    ''' yk, xk, xk -> yk '''
    klength = k.size()[-1]
    # Transpose
    k = einops.einsum(k, '... x k -> ... k x')
    # Matrix Multiply / Inner Product
    x = einops.einsum(q, k, '... y k, ... k x -> ... y x')
    # Scale
    x = x / math.sqrt(klength)
    # SoftMax
    x = torch.nn.Softmax(-1)(x)
    # Matrix Multiply / Inner Product
    x = einops.einsum(x, v, '... y x, ... x k -> ... y k')
    return x

# Alternatively, we can simultaneously broadcast linear functions.
def ScaledDotProductAttention(q: T, k: T, v: T) -> T:
    ''' yk, xk, xk -> yk '''
    klength = k.size()[-1]
    # Inner Product and Scale
    x = einops.einsum(q, k, '... y k, ... x k -> ... y x')
    # Scale and SoftMax
    x = torch.nn.Softmax(-1)(x / math.sqrt(klength))
    # Final Inner Product
    x = einops.einsum(x, v, '... y x, ... x k -> ... y k')
    return x
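As a quick sanity check (not from the original notebook), we can confirm that batch axes captured by '...' pass through unchanged; the sizes below are arbitrary assumptions.

# Sanity check with assumed sizes: yk, xk, xk -> yk with a leading batch axis of 8.
qb, kb, vb = torch.rand([8, 5, 4]), torch.rand([8, 7, 4]), torch.rand([8, 7, 4])
assert list(ScaledDotProductAttention(qb, kb, vb).size()) == [8, 5, 4]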
Figure 21: Multi-Head Attention

We will be implementing this algorithm. This shows us how we go from diagrams to implementations, and begins to give an idea of how organized diagrams lead to organized code.
def MultiHeadDotProductAttention(q: T, k: T, v: T) -> T:
    ''' ykh, xkh, xkh -> ykh '''
    klength = k.size()[-2]
    x = einops.einsum(q, k, '... y k h, ... x k h -> ... y x h')
    x = torch.nn.Softmax(-2)(x / math.sqrt(klength))
    x = einops.einsum(x, v, '... y x h, ... x k h -> ... y k h')
    return x
# We implement this component as a neural network model.
# This is necessary when there are bold, learned components that need to be initialized.
class MultiHeadAttention(nn.Module):
    # Multi-Head attention has various settings, which become variables
    # for the initializer.
    def __init__(self, m, k, h):
        super().__init__()
        self.m, self.k, self.h = m, k, h
        # Set up all the boldface, learned components
        # Note how they bind axes we want to split, which we do later with einops.
        self.Lq = nn.Linear(m, k*h, False)
        self.Lk = nn.Linear(m, k*h, False)
        self.Lv = nn.Linear(m, k*h, False)
        self.Lo = nn.Linear(k*h, m, False)

    # We have endogenous data (Eym) and external / injected data (Xxm)
    def forward(self, Eym, Xxm):
        """ y m, x m -> y m """
        # We first generate query, key, and value vectors.
        # Linear layers are automatically broadcast.
        # However, the k and h axes are bound. We define an unbinder to handle the outputs,
        unbind = lambda x: einops.rearrange(x, '... (k h) -> ... k h', h=self.h)
        q = unbind(self.Lq(Eym))
        k = unbind(self.Lk(Xxm))
        v = unbind(self.Lv(Xxm))
        # We feed q, k, and v to standard Multi-Head inner product Attention
        o = MultiHeadDotProductAttention(q, k, v)
        # Rebind to feed to the final learned layer,
        o = einops.rearrange(o, '... k h -> ... (k h)', h=self.h)
        return self.Lo(o)

# Now we can run it on fake data;
y, x, m, jc, heads = [20], [22], [128], [16], 4
# Internal Data
Eym = torch.rand(y + m)
# External Data
Xxm = torch.rand(x + m)
mha = MultiHeadAttention(m[0], jc[0], heads)
assert list(mha.forward(Eym, Xxm).size()) == y + m
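Since the einsum expressions use '...', the same module also accepts batched inputs; a quick check with an assumed batch size of 8 (not in the original notebook):

# Batched inputs pass straight through the '...' axes (batch size 8 is assumed).
batch = [8]
assert list(mha.forward(torch.rand(batch + y + m), torch.rand(batch + x + m)).size()) == batch + y + m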

3.4 Computer Vision

Here, we really start to understand why splitting diagrams into "fenced off" blocks aids implementation. In addition to making diagrams easier to understand and patterns clearer, blocks indicate how code can be structured and organized.
Figure 26: Identity Residual Network

# For Figure 26, every fenced off region is its own module.
# Batch norm and then activate is a repeated motif,
class NormActivate(nn.Sequential):
    def __init__(self, nf, Norm=nn.BatchNorm2d, Activation=nn.ReLU):
        super().__init__(Norm(nf), Activation())

def size_to_string(size):
    return " ".join(map(str, list(size)))

# The Identity ResNet block breaks down into a manageable sequence of components.
class IdentityResNet(nn.Sequential):
    def __init__(self, N=3, n_mu=[16, 64, 128, 256], y=10):
        super().__init__(
            nn.Conv2d(3, n_mu[0], 3, padding=1),
            Block(1, N, n_mu[0], n_mu[1]),
            Block(2, N, n_mu[1], n_mu[2]),
            Block(2, N, n_mu[2], n_mu[3]),
            NormActivate(n_mu[3]),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(n_mu[3], y),
            nn.Softmax(-1),
            )
The Block can be defined as a separate module, keeping the code manageable and closely connected to the diagram.

# We then follow how diagrams define each ``block''
class Block(nn.Sequential):
    def __init__(self, s, N, n0, n1):
        """ n0 and n1 as inputs to the initializer are implicit
        from having them in the domain and codomain in the diagram. """
        nb = n1 // 4
        super().__init__(
            *[
                NormActivate(n0),
                ResidualConnection(
                    nn.Sequential(
                        nn.Conv2d(n0, nb, 1, s),
                        NormActivate(nb),
                        nn.Conv2d(nb, nb, 3, padding=1),
                        NormActivate(nb),
                        nn.Conv2d(nb, n1, 1),
                    ),
                    nn.Conv2d(n0, n1, 1, s),
                )
            ] + [
                ResidualConnection(
                    nn.Sequential(
                        NormActivate(n1),
                        nn.Conv2d(n1, nb, 1),
                        NormActivate(nb),
                        nn.Conv2d(nb, nb, 3, padding=1),
                        NormActivate(nb),
                        nn.Conv2d(nb, n1, 1)
                    ),
                )
            ] * N
        )
# Residual connections are a repeated pattern in the diagram. So, we are
# motivated to encapsulate them as a separate module.

class ResidualConnection(nn.Module):
    def __init__(self, mainline: nn.Module, connection: nn.Module | None = None) -> None:
        super().__init__()
        self.main = mainline
        self.secondary = nn.Identity() if connection is None else connection
    def forward(self, x):
        return self.main(x) + self.secondary(x)

# A standard image processing algorithm has inputs shaped b c h w.
b, c, hw = [3], [3], [16, 16]
idresnet = IdentityResNet()
Xbchw = torch.rand(b + c + hw)
# And we see if the overall size is maintained,
assert list(idresnet.forward(Xbchw).size()) == b + [10]
The UNet is a more complicated algorithm than residual networks. The "fenced off" sections keep our code organized, and the diagram streamlines implementation.

Figure 27: The UNet architecture

# We notice that double convolution where the numbers of channels change is a repeated motif.
# We denote the input with c0 and output with c1.
# This can also be done for subsequent members of an iteration.
# When we go down an iteration eg. 5, 4, etc. we may have the input be c1 and the output c0.
class DoubleConvolution(nn.Sequential):
    def __init__(self, c0, c1, Activation=nn.ReLU):
        super().__init__(
            nn.Conv2d(c0, c1, 3, padding=1),
            Activation(),
            nn.Conv2d(c1, c1, 3, padding=1),
            Activation(),
        )
# The model is specified for a very specific number of layers,
# so we will not make it very flexible.

class UNet(nn.Module):
    def __init__(self, y=2):
        super().__init__()
        # Set up the channel sizes;
        c = [1 if i == 0 else 64 * 2 ** i for i in range(6)]
        # Saving and loading from memory means we can not use a single,
        # sequential chain.
        # Set up and initialize the components;
        self.DownScaleBlocks = [
            DownScaleBlock(c[i], c[i+1])
            for i in range(0, 4)
        ]  # Note how this imitates the lambda operators in the diagram.
        self.middleDoubleConvolution = DoubleConvolution(c[4], c[5])
        self.middleUpscale = nn.ConvTranspose2d(c[5], c[4], 2, 2, 1)
        self.upScaleBlocks = [
            UpScaleBlock(c[5-i], c[4-i])
            for i in range(1, 4)
        ]
        # A final 1x1 convolution produces the y output channels.
        self.finalConvolution = nn.Conv2d(c[1], y, 1)

    def forward(self, x):
        cLambdas = []
        for dsb in self.DownScaleBlocks:
            x, cLambda = dsb(x)
            cLambdas.append(cLambda)
        x = self.middleDoubleConvolution(x)
        x = self.middleUpscale(x)
        for usb in self.upScaleBlocks:
            cLambda = cLambdas.pop()
            x = usb(x, cLambda)
        x = self.finalConvolution(x)
        return x

class DownScaleBlock(nn.Module):
    def __init__(self, c0, c1) -> None:
        super().__init__()
        self.doubleConvolution = DoubleConvolution(c0, c1)
        self.downScaler = nn.MaxPool2d(2, 2, 1)
    def forward(self, x):
        cLambda = self.doubleConvolution(x)
        x = self.downScaler(cLambda)
        return x, cLambda

class UpScaleBlock(nn.Module):
    def __init__(self, c1, c0) -> None:
        super().__init__()
        self.doubleConvolution = DoubleConvolution(2*c1, c1)
        self.upScaler = nn.ConvTranspose2d(c1, c0, 2, 2, 1)
    def forward(self, x, cLambda):
        # Concatenation occurs over the C channel axis (dim=1)
        x = torch.concat((x, cLambda), 1)
        x = self.doubleConvolution(x)
        x = self.upScaler(x)
        return x

3.5 Vision Transformer

We adapt our code for Multi-Head Attention to apply it to the vision case. This is a good exercise in how neural circuit diagrams allow code to be easily adapted for new modalities.

Figure 28: Visual Attention

class VisualAttention(nn.Module):
    def __init__(self, c, k, heads=1, kernel=1, stride=1):
        super().__init__()
        # w gives the kernel size, which we make adjustable.
        self.c, self.k, self.h, self.w = c, k, heads, kernel
        # Set up all the boldface, learned components
        # Note how standard components may not have axes bound in
        # the same way as diagrams. This requires us to rearrange
        # using the einops package.
        # The learned layers form convolutions
        self.Cq = nn.Conv2d(c, k * heads, kernel, stride)
        self.Ck = nn.Conv2d(c, k * heads, kernel, stride)
        self.Cv = nn.Conv2d(c, k * heads, kernel, stride)
        self.Co = nn.ConvTranspose2d(k * heads, c, kernel, stride)

    # Defined previously, closely follows the diagram.
    def MultiHeadDotProductAttention(self, q: T, k: T, v: T) -> T:
        ''' ykh, xkh, xkh -> ykh '''
        klength = k.size()[-2]
        x = einops.einsum(q, k, '... y k h, ... x k h -> ... y x h')
        x = torch.nn.Softmax(-2)(x / math.sqrt(klength))
        x = einops.einsum(x, v, '... y x h, ... x k h -> ... y k h')
        return x

    # We have endogenous data (EYc) and external / injected data (XXc)
    def forward(self, EcY, XcX):
        """ cY, cX -> cY
        The visual attention algorithm. Injects information from Xc into Yc. """
        # query, key, and value vectors.
        # We unbind the k h axes which were produced by the convolutions,
        # and feed them in the normal manner to MultiHeadDotProductAttention.
        unbind = lambda x: einops.rearrange(x, 'N (k h) H W -> N (H W) k h', h=self.h)
        # Save size to recover it later
        q = self.Cq(EcY)
        W = q.size()[-1]
        # By appropriately managing the axes, minimal changes to our previous code
        # are necessary.
        q = unbind(q)
        k = unbind(self.Ck(XcX))
        v = unbind(self.Cv(XcX))
        o = self.MultiHeadDotProductAttention(q, k, v)
        # Rebind to feed to the transposed convolution layer.
        o = einops.rearrange(o, 'N (H W) k h -> N (k h) H W', h=self.h, W=W)
        return self.Co(o)

# Single batch element,
b = [1]
Y, X, c, k = [16, 16], [16, 16], [33], 8
# The additional configurations,
heads, kernel, stride = 4, 3, 3
# Internal Data,
EYc = torch.rand(b + c + Y)
# External Data,
XXc = torch.rand(b + c + X)
# We can now run the algorithm,
visualAttention = VisualAttention(c[0], k, heads, kernel, stride)
# Interestingly, the height/width reduces by 1 for stride
# values above 1. Otherwise, it stays the same.
visualAttention.forward(EYc, XXc).size()
torch.Size([1, 33, 15, 15])

Appendix
# A container to track the size of modules,
# Replace a module definition eg.
# > self.Cq = nn.Conv2d(c, k * heads, kernel, stride)
# With;
# > self.Cq = Tracker(nn.Conv2d(c, k * heads, kernel, stride), "Query convolution")
# And the input / output sizes (to check diagrams) will be printed.
class Tracker(nn.Module):
    def __init__(self, module: nn.Module, name: str = ""):
        super().__init__()
        self.module = module
        if name:
            self.name = name
        else:
            self.name = self.module._get_name()
    def forward(self, x):
        x_size = size_to_string(x.size())
        x = self.module.forward(x)
        y_size = size_to_string(x.size())
        print(f"{self.name}: \t {x_size} -> {y_size}")
        return x
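A usage sketch follows; the wrapped convolution and input sizes are illustrative assumptions, not from the original notebook.

# Wrap a layer to print its input/output sizes (sizes here are illustrative).
tracked = Tracker(nn.Conv2d(3, 8, 3, padding=1), "Example convolution")
_ = tracked(torch.rand([1, 3, 16, 16]))
# Prints something like: Example convolution:    1 3 16 16 -> 1 8 16 16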

  1. Using $i$ and $k$ to index over data, we have $\operatorname{SoftMax}(\mathbf{v})[i] = \exp(\mathbf{v}[i]) / \sum_{k} \exp(\mathbf{v}[k])$.
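A quick numerical check of this formula against PyTorch (a sketch, not part of the original notebook):

# Verify the footnote's SoftMax formula against torch.nn.Softmax.
vec = torch.rand([5])
manual = torch.exp(vec) / torch.exp(vec).sum()
assert torch.allclose(manual, torch.nn.Softmax(-1)(vec))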