
What is a Convolutional Neural Network?

In machine learning, a classifier assigns a class label to a data point. For example, an image classifier produces a class label (e.g., bird, plane) for what objects exist within an image. A convolutional neural network, or CNN for short, is a type of classifier, which excels at solving this problem!

A CNN is a neural network: an algorithm used to recognize patterns in data. Neural Networks in general are composed of a collection of neurons that are organized in layers, each with their own learnable weights and biases. Let’s break down a CNN into its basic building blocks.

  1. A tensor can be thought of as an n-dimensional matrix. In the CNN above, tensors will be 3-dimensional with the exception of the output layer.
  2. A neuron can be thought of as a function that takes in multiple inputs and yields a single output. The outputs of neurons are represented above as the red-to-blue activation maps.
  3. A layer is simply a collection of neurons with the same operation, including the same hyperparameters.
  4. Kernel weights and biases, while unique to each neuron, are tuned during the training phase, and allow the classifier to adapt to the problem and dataset provided. They are encoded in the visualization with a yellow-to-green diverging colorscale. The specific values can be viewed in the Interactive Formula View by clicking a neuron or by hovering over the kernel/bias in the Convolutional Elastic Explanation View.
  5. A CNN conveys a differentiable score function, which is represented as class scores in the visualization on the output layer.

If you have studied neural networks before, these terms may sound familiar to you. So what makes a CNN different? CNNs utilize a special type of layer, aptly named a convolutional layer, that makes them well-positioned to learn from image and image-like data. Regarding image data, CNNs can be used for many different computer vision tasks, such as image processing, classification, segmentation, and object detection.

In CNN Explainer, you can see how a simple CNN can be used for image classification. Because of the network’s simplicity, its performance isn’t perfect, but that’s okay! The network architecture, Tiny VGG, used in CNN Explainer contains many of the same layers and operations used in state-of-the-art CNNs today, but on a smaller scale. This way, it will be easier to understand as you get started.

What does each layer of the network do?

Let’s walk through each layer in the network. Feel free to interact with the visualization above by clicking and hovering over various parts of it as you read.

Input Layer

The input layer (leftmost layer) represents the input image into the CNN. Because we use RGB images as input, the input layer has three channels, corresponding to the red, green, and blue channels, respectively, which are shown in this layer. Use the color scale when you click on the network details icon above to display detailed information (on this layer, and others).
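To make the channel decomposition concrete, here is a minimal sketch of splitting an RGB image into the three channel arrays shown in the input layer. It assumes NumPy and Pillow are available, and the file name is hypothetical; it is not code from CNN Explainer.

```python
import numpy as np
from PIL import Image

# Hypothetical file name; any RGB image works.
img = np.asarray(Image.open("espresso.png").convert("RGB"))  # shape: (height, width, 3)

# One 2-D array per color channel, as displayed in the input layer.
red, green, blue = img[..., 0], img[..., 1], img[..., 2]
print(img.shape, red.shape)
```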

Convolutional Layers

The convolutional layers are the foundation of a CNN, as they contain the learned kernels (weights), which extract features that distinguish different images from one another; this is what we want for classification! As you interact with the convolutional layer, you will notice links between the previous layers and the convolutional layers. Each link represents a unique kernel, which is used for the convolution operation to produce the current convolutional neuron’s output or activation map.

Each convolutional neuron performs an elementwise dot product between a unique kernel and the output of the previous layer’s corresponding neuron. This yields as many intermediate results as there are unique kernels. The neuron’s output, its activation map, is the elementwise sum of all of these intermediate results and the learned bias.

For example, let’s look at the first convolutional layer in the Tiny VGG architecture above. Notice that there are 10 neurons in this layer, but only 3 neurons in the previous layer. In the Tiny VGG architecture, convolutional layers are fully-connected, meaning each neuron is connected to every neuron in the previous layer. Focusing on the output of the topmost convolutional neuron from the first convolutional layer, we see that there are 3 unique kernels when we hover over the activation map.

Figure 1. As you hover over the activation map of the topmost node from the first convolutional layer, you can see that 3 kernels were applied to yield this activation map. After clicking this activation map, you can see the convolution operation occurring with each unique kernel.

The size of these kernels is a hyperparameter specified by the designers of the network architecture. In order to produce the output of the convolutional neuron (activation map), we must perform an elementwise dot product with the output of the previous layer and the unique kernel learned by the network. In Tiny VGG, the dot product operation uses a stride of 1, which means that the kernel is shifted over 1 pixel per dot product, but this is a hyperparameter that the network architecture designer can adjust to better fit their dataset. We must do this for all 3 kernels, which will yield 3 intermediate results.

Figure 2. The kernel being applied to yield the topmost intermediate result for the discussed activation map.

Then, an elementwise sum is performed over all 3 intermediate results along with the bias the network has learned. After this, the resulting 2-dimensional tensor will be the activation map viewable on the interface above for the topmost neuron in the first convolutional layer. This same operation must be applied to produce each neuron’s activation map.
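Below is a minimal NumPy sketch of the operation just described, not the CNN Explainer implementation itself: one convolutional neuron slides each unique kernel over its corresponding input with a stride of 1, computes an elementwise dot product at every position, sums the 3 intermediate results elementwise, and adds the learned bias. The sizes and values are illustrative.

```python
import numpy as np

def conv_neuron(inputs, kernels, bias, stride=1):
    """Activation map of one convolutional neuron.

    inputs  : list of 2-D arrays, one per neuron in the previous layer
    kernels : list of 2-D arrays, one unique kernel per input
    bias    : the single learned scalar bias for this neuron
    """
    k = kernels[0].shape[0]
    out_h = (inputs[0].shape[0] - k) // stride + 1
    out_w = (inputs[0].shape[1] - k) // stride + 1
    out = np.full((out_h, out_w), bias, dtype=float)  # start from the learned bias
    for x, w in zip(inputs, kernels):                 # one intermediate result per kernel
        for i in range(out_h):
            for j in range(out_w):
                window = x[i * stride:i * stride + k, j * stride:j * stride + k]
                out[i, j] += np.sum(window * w)       # elementwise dot product
    return out

# Toy example: 3 input channels, 3 unique 3x3 kernels, one learned bias.
rng = np.random.default_rng(0)
inputs = [rng.random((6, 6)) for _ in range(3)]
kernels = [rng.random((3, 3)) for _ in range(3)]
print(conv_neuron(inputs, kernels, bias=0.1).shape)   # (4, 4)
```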

With some simple math, we are able to deduce that there are 3 x 10 = 30 unique kernels, each of size 3x3, applied in the first convolutional layer. The connectivity between the convolutional layer and the previous layer is a design decision when building a network architecture, which will affect the number of kernels per convolutional layer. Click around the visualization to better understand the operations behind the convolutional layer. See if you can follow the example above!
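As a quick sanity check of that arithmetic, here is a sketch using only the numbers stated above, plus the one learned bias per neuron mentioned earlier:

```python
in_neurons, out_neurons, k = 3, 10, 3   # first convolutional layer of Tiny VGG
kernels = in_neurons * out_neurons      # 30 unique kernels
weights = kernels * k * k               # 270 learned kernel weights (3x3 each)
biases = out_neurons                    # one learned bias per neuron
print(kernels, weights + biases)        # 30 kernels, 280 learnable parameters
```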

Understanding Hyperparameters

[Interactive hyperparameter visualization. Example configuration: Input (5, 5), After-padding (5, 5), Output (4, 4). Hover over the matrices to change the kernel position.]

  1. Padding is often necessary when the kernel extends beyond the activation map. Padding conserves data at the borders of activation maps, which leads to better performance, and it can help preserve the input's spatial size, which allows an architecture designer to build deeper, higher performing networks. There exist many padding techniques, but the most commonly used approach is zero-padding because of its performance, simplicity, and computational efficiency. The technique involves adding zeros symmetrically around the edges of an input. This approach is adopted by many high-performing CNNs such as AlexNet.
  2. Kernel size, often also referred to as filter size, refers to the dimensions of the sliding window over the input. Choosing this hyperparameter has a massive impact on the image classification task. For example, small kernel sizes are able to extract a much larger amount of information containing highly local features from the input. As you can see on the visualization above, a smaller kernel size also leads to a smaller reduction in layer dimensions, which allows for a deeper architecture. Conversely, a large kernel size extracts less information, which leads to a faster reduction in layer dimensions, often leading to worse performance. Large kernels are better suited to extracting larger features. At the end of the day, choosing an appropriate kernel size will be dependent on your task and dataset, but generally, smaller kernel sizes lead to better performance for the image classification task because an architecture designer is able to stack more and more layers together to learn more and more complex features!
  3. Stride indicates how many pixels the kernel should be shifted over at a time. For example, as described in the convolutional layer example above, Tiny VGG uses a stride of 1 for its convolutional layers, which means that the dot product is performed on a 3x3 window of the input to yield an output value, then is shifted to the right by one pixel for every subsequent operation. The impact stride has on a CNN is similar to kernel size. As stride is decreased, more features are learned because more data is extracted, which also leads to larger output layers. On the contrary, as stride is increased, this leads to more limited feature extraction and smaller output layer dimensions. One responsibility of the architecture designer is to ensure that the kernel slides across the input symmetrically when implementing a CNN. Use the hyperparameter visualization above to alter stride on various input/kernel dimensions to understand this constraint! The output-size arithmetic behind padding, kernel size, and stride is sketched right after this list.
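Padding, kernel size, and stride combine into the standard output-size formula sketched below. This is a generic calculation, not code from CNN Explainer; the 2x2-kernel setting in the comment is one configuration consistent with the interactive example above, and the 64-pixel dimension is only an assumed example size.

```python
def conv_output_size(in_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution (or pooling) output along one dimension."""
    return (in_size + 2 * padding - kernel_size) // stride + 1

# A 5x5 input with no padding, a 2x2 kernel, and stride 1 gives a 4x4 output,
# matching the Input (5, 5) -> Output (4, 4) example above.
print(conv_output_size(5, kernel_size=2, stride=1, padding=0))   # 4

# Tiny VGG's 3x3 kernels with stride 1 and no padding shrink each dimension by 2,
# e.g. an assumed 64-pixel dimension becomes 62.
print(conv_output_size(64, kernel_size=3, stride=1, padding=0))  # 62
```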

Activation Functions

ReLU

Neural networks are extremely prevalent in modern technology because they are so accurate! The highest performing CNNs today consist of an absurd number of layers, which are able to learn more and more features. Part of the reason these groundbreaking CNNs are able to achieve such tremendous accuracies is because of their non-linearity. ReLU applies much-needed non-linearity into the model. Non-linearity is necessary to produce non-linear decision boundaries, so that the output cannot be written as a linear combination of the inputs. If a non-linear activation function were not present, deep CNN architectures would devolve into a single, equivalent convolutional layer, which would not perform nearly as well. The ReLU activation function is specifically used as the non-linear activation function, as opposed to other non-linear functions such as Sigmoid, because it has been empirically observed that CNNs using ReLU are faster to train than their counterparts.

The ReLU activation function is a one-to-one mathematical operation: ReLU(x)=max(0,x)

Figure 3. The ReLU activation function graphed, which disregards all negative data.

This activation function is applied elementwise on every value from the input tensor. For example, if ReLU is applied to the value 2.24, the result would be 2.24, since 2.24 is larger than 0. You can observe how this activation function is applied by clicking a ReLU neuron in the network above. The Rectified Linear Activation function (ReLU) is performed after every convolutional layer in the network architecture outlined above. Notice the impact this layer has on the activation map of various neurons throughout the network!
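A minimal sketch of this elementwise application, with illustrative values:

```python
import numpy as np

def relu(x):
    """Elementwise ReLU(x) = max(0, x)."""
    return np.maximum(0, x)

activations = np.array([[2.24, -1.30],
                        [0.00, -0.75]])
print(relu(activations))  # [[2.24 0.  ]
                          #  [0.   0.  ]]
```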

Softmax

Softmax(x_i) = exp(x_i) / Σ_j exp(x_j). A softmax operation serves a key purpose: making sure the CNN outputs sum to 1. Because of this, softmax operations are useful to scale model outputs into probabilities. Clicking on the last layer reveals the softmax operation in the network. Notice how the logits after flatten aren’t scaled between zero and one. For a visual indication of the impact of each logit (unscaled scalar value), they are encoded using a light orange to dark orange color scale. After passing through the softmax function, each class now corresponds to an appropriate probability!

You might be wondering what the difference is between standard normalization and softmax; after all, both rescale the logits between 0 and 1. Remember that backpropagation is a key aspect of training neural networks: we want the correct answer to have the largest “signal.” By using softmax, we are effectively “approximating” argmax while gaining differentiability. Rescaling doesn’t weigh the max significantly higher than other logits, whereas softmax does. Simply put, softmax is a “softer” argmax (see what we did there?).
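The small NumPy sketch below contrasts plain min-max rescaling with softmax on a few illustrative logits; the values are made up and not taken from the network above.

```python
import numpy as np

def softmax(logits):
    """Softmax(x_i) = exp(x_i) / sum_j exp(x_j)."""
    exps = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])          # illustrative unscaled class scores

# Min-max rescaling keeps the relative spacing and does not sum to 1 ...
rescaled = (logits - logits.min()) / (logits.max() - logits.min())
print(rescaled)                              # approximately [1.0, 0.47, 0.0]

# ... while softmax weighs the largest logit much more heavily and sums to 1.
probs = softmax(logits)
print(probs, probs.sum())                    # approximately [0.66, 0.24, 0.10], 1.0
```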

Figure 4. The Softmax Interactive Formula View allows a user to interact with both the color encoded logits and formula to understand how the prediction scores after the flatten layer are normalized to yield classification scores.

Pooling Layers

There are many types of pooling layers in different CNN architectures, but they all have the purpose of gradually decreasing the spatial extent of the network, which reduces the parameters and overall computation of the network. The type of pooling used in the Tiny VGG architecture above is Max-Pooling.

The Max-Pooling operation requires selecting a kernel size and a stride length during architecture design. Once selected, the operation slides the kernel with the specified stride over the input while only selecting the largest value at each kernel slice from the input to yield a value for the output. This process can be viewed by clicking a pooling neuron in the network above.

In the Tiny VGG architecture above, the pooling layers use a 2x2 kernel and a stride of 2. With these specifications, this operation discards 75% of activations. By discarding so many values, Tiny VGG is more computationally efficient and avoids overfitting.
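A minimal NumPy sketch of max-pooling with the 2x2 kernel and stride of 2 described above (the toy activation map is illustrative):

```python
import numpy as np

def max_pool(x, kernel_size=2, stride=2):
    """Max-pooling over a single 2-D activation map."""
    out_h = (x.shape[0] - kernel_size) // stride + 1
    out_w = (x.shape[1] - kernel_size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + kernel_size,
                          j * stride:j * stride + kernel_size].max()
    return out

x = np.arange(16.0).reshape(4, 4)  # toy 4x4 activation map
print(max_pool(x))                 # [[ 5.  7.]
                                   #  [13. 15.]] -- 4 of 16 values kept (75% discarded)
```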

Flatten Layer

This layer converts a three-dimensional layer in the network into a one-dimensional vector to fit the input of a fully-connected layer for classification. For example, a 5x5x2 tensor would be converted into a vector of size 50. The previous convolutional layers of the network extracted the features from the input image, but now it is time to classify the features. We use the softmax function to classify these features, which requires a 1-dimensional input. This is why the flatten layer is necessary. This layer can be viewed by clicking any output class.
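A minimal sketch of the 5x5x2 example above, assuming NumPy:

```python
import numpy as np

tensor = np.arange(5 * 5 * 2).reshape(5, 5, 2)  # toy 5x5x2 activation tensor
flat = tensor.flatten()                          # 1-D vector fed to the fully-connected layer
print(tensor.shape, flat.shape)                  # (5, 5, 2) (50,)
```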

Interactive features

  1. Upload your own image by selecting the upload image icon to understand how your image is classified into the 10 classes. By analyzing the neurons throughout the network, you can understand the activation maps and extracted features.
  2. Change the activation map colorscale to better understand the impact of activations at different levels of abstraction by adjusting the heatmap.
  3. Understand network details such as layer dimensions and colorscales by clicking the network details icon.
  4. Simulate network operations by clicking the play button, or interact with the layer slice in the Interactive Formula View by hovering over portions of the input or output to understand the mappings and underlying operations.
  5. Learn layer functions by clicking the info icon in the Interactive Formula View to read layer details from the article.

Video Tutorial

How is CNN Explainer implemented?

CNN Explainer uses TensorFlow.js, an in-browser GPU-accelerated deep learning library, to load the pretrained model for visualization. The entire interactive system is written in JavaScript using Svelte as a framework and D3.js for visualizations. You only need a web browser to get started learning CNNs today!

Who developed CNN Explainer?

CNN Explainer was created by Jay Wang, Robert Turko, Omar Shaikh, Haekyu Park, Nilaksh Das, Fred Hohman, Minsuk Kahng, and Polo Chau, and was the result of a research collaboration between Georgia Tech and Oregon State. We thank Anmol Chhabria, Kaan Sancak, Kantwon Rogers, and the Georgia Tech Visualization Lab for their support and constructive feedback. This work was supported in part by NSF grants IIS-1563816, CNS-1704701, NASA NSTRF, DARPA GARD, and gifts from Intel, NVIDIA, Google, and Amazon.