In machine learning, a classifier assigns a class label to a data point. For example, an image classifier produces a class label (e.g., bird, plane) for what objects exist within an image. A convolutional neural network, or CNN for short, is a type of classifier that excels at solving this problem!
A CNN is a neural network: an algorithm used to recognize patterns in data. Neural Networks in general are composed of a collection of neurons that are organized in layers, each with their own learnable weights and biases. Let’s break down a CNN into its basic building blocks.
If you have studied neural networks before, these terms may sound familiar to you. So what makes a CNN different? CNNs utilize a special type of layer, aptly named a convolutional layer, that makes them well-positioned to learn from image and image-like data. Regarding image data, CNNs can be used for many different computer vision tasks, such as image processing, classification, segmentation, and object detection.
In CNN Explainer, you can see how a simple CNN can be used for image classification. Because of the network’s simplicity, its performance isn’t perfect, but that’s okay! The network architecture, Tiny VGG, used in CNN Explainer contains many of the same layers and operations used in state-of-the-art CNNs today, but on a smaller scale. This way, it will be easier to understand as you get started.
Let’s walk through each layer in the network. Feel free to interact with the visualization above by clicking and hovering over various parts of it as you read.
The input layer (leftmost layer) represents the input image into the CNN. Because we use RGB images as input, the input layer has three channels, corresponding to the red, green, and blue channels, respectively, which are shown in this layer. Use the color scale when you click on the icon above to display detailed information (on this layer, and others).
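To make this concrete, here is a minimal NumPy sketch of how an RGB input could be represented as a three-channel tensor. The 64x64 size is an assumption for illustration; any RGB image works the same way.

```python
import numpy as np

# Hypothetical 64x64 RGB input image with values in [0, 1]
# (the exact size is an assumption for illustration).
image = np.random.rand(64, 64, 3)

# The input layer simply exposes the three color channels.
red_channel = image[:, :, 0]
green_channel = image[:, :, 1]
blue_channel = image[:, :, 2]

print(red_channel.shape)  # (64, 64) -- one 2D map per channel
```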
The convolutional layers are the foundation of a CNN, as they contain the learned kernels (weights), which extract features that distinguish different images from one another; this is what we want for classification! As you interact with the convolutional layer, you will notice links between the previous layers and the convolutional layers. Each link represents a unique kernel, which is used for the convolution operation to produce the current convolutional neuron’s output or activation map.
The convolutional neuron performs an elementwise dot product between a unique kernel and the output of the previous layer’s corresponding neuron. This yields as many intermediate results as there are unique kernels. The convolutional neuron’s output is the result of summing all of the intermediate results together with the learned bias.
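At a single position, the elementwise dot product multiplies matching entries of the kernel and the underlying patch of the input, then sums them. Here is a minimal NumPy sketch; the patch and kernel values are made up for illustration.

```python
import numpy as np

# A 3x3 patch of the previous neuron's output and one learned 3x3 kernel
# (values are made up for illustration).
patch = np.array([[0.1, 0.2, 0.0],
                  [0.5, 0.9, 0.3],
                  [0.4, 0.0, 0.7]])
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

# Elementwise dot product: multiply matching entries, then sum.
value = np.sum(patch * kernel)
print(value)  # one entry of this kernel's intermediate result
```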
For example, let’s look at the first convolutional layer in the Tiny VGG architecture above. Notice that there are 10 neurons in this layer, but only 3 neurons in the previous layer. In the Tiny VGG architecture, convolutional layers are fully-connected, meaning each neuron is connected to every neuron in the previous layer. Focusing on the output of the topmost convolutional neuron from the first convolutional layer, we see that there are 3 unique kernels when we hover over the activation map.
The size of these kernels is a hyperparameter specified by the designers of the network architecture. In order to produce the output of the convolutional neuron (activation map), we must perform an elementwise dot product between the output of the previous layer and the unique kernel learned by the network. In Tiny VGG, the dot product operation uses a stride of 1, which means that the kernel is shifted over 1 pixel per dot product, but this is a hyperparameter that the network architecture designer can adjust to better fit their dataset. We must do this for all 3 kernels, which will yield 3 intermediate results.
Then, an elementwise sum is performed over all 3 intermediate results along with the bias the network has learned. After this, the resulting 2-dimensional tensor will be the activation map viewable in the interface above for the topmost neuron in the first convolutional layer. The same operation must be applied to produce each neuron’s activation map.
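Putting the pieces together, here is a minimal NumPy sketch of one neuron’s activation map: each of the 3 input channels is convolved with its own 3x3 kernel at stride 1, the 3 intermediate results are summed elementwise, and the learned bias is added. The kernel values, bias, and 64x64 input size are placeholders, not values from the actual Tiny VGG model.

```python
import numpy as np

def slide_kernel(channel, kernel, stride=1):
    """Slide a kernel over a 2D input (no padding), taking an
    elementwise dot product at each position."""
    kh, kw = kernel.shape
    out_h = (channel.shape[0] - kh) // stride + 1
    out_w = (channel.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = channel[i * stride:i * stride + kh,
                            j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Previous layer: 3 channels (R, G, B), each 64x64 (placeholder values).
prev_outputs = [np.random.rand(64, 64) for _ in range(3)]

# One unique 3x3 kernel per incoming channel, plus one learned bias (placeholders).
kernels = [np.random.randn(3, 3) for _ in range(3)]
bias = 0.1

# 3 intermediate results, summed elementwise, plus the bias.
intermediates = [slide_kernel(ch, k) for ch, k in zip(prev_outputs, kernels)]
activation_map = np.sum(intermediates, axis=0) + bias
print(activation_map.shape)  # (62, 62) -- the map shrinks slightly without padding
```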
With some simple math, we are able to deduce that there are 3 x 10 = 30 unique kernels, each of size 3x3, applied in the first convolutional layer. The connectivity between the convolutional layer and the previous layer is a design decision when building a network architecture, which will affect the number of kernels per convolutional layer. Click around the visualization to better understand the operations behind the convolutional layer. See if you can follow the example above!
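As a quick sanity check on that arithmetic, and assuming one learned bias per output neuron, the kernel and parameter counts for this layer work out as follows (the parameter total is an inference for illustration, not a number stated above):

```python
in_channels, out_channels = 3, 10
kernel_h, kernel_w = 3, 3

unique_kernels = in_channels * out_channels      # 3 x 10 = 30
weights = unique_kernels * kernel_h * kernel_w   # 30 x 9  = 270
biases = out_channels                            # one learned bias per neuron
print(unique_kernels, weights + biases)          # 30 280
```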
Neural networks are extremely prevalent in modern technology because they are so accurate! The highest performing CNNs today consist of an absurd number of layers, which are able to learn more and more features. Part of the reason these groundbreaking CNNs are able to achieve such tremendous accuracies is their non-linearity. ReLU applies much-needed non-linearity to the model. Non-linearity is necessary to produce non-linear decision boundaries, so that the output cannot be written as a linear combination of the inputs. If a non-linear activation function were not present, deep CNN architectures would devolve into a single, equivalent convolutional layer, which would not perform nearly as well. ReLU is used as the non-linear activation function, as opposed to other non-linear functions such as Sigmoid, because it has been empirically observed that CNNs using ReLU are faster to train than their counterparts.
The ReLU activation function is a one-to-one mathematical operation: ReLU(x) = max(0, x).
This activation function is applied elementwise on every value from the input tensor. For example, if ReLU is applied to the value 2.24, the result would be 2.24, since 2.24 is larger than 0. You can observe how this activation function is applied by clicking a ReLU neuron in the network above. The Rectified Linear Activation function (ReLU) is performed after every convolutional layer in the network architecture outlined above. Notice the impact this layer has on the activation map of various neurons throughout the network!
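A minimal sketch of ReLU applied elementwise to a small made-up tensor, reproducing the 2.24 example above:

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), applied elementwise.
    return np.maximum(0, x)

activation_map = np.array([[2.24, -1.30],
                           [0.00, -0.57]])
print(relu(activation_map))
# [[2.24 0.  ]
#  [0.   0.  ]]
```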
You might be wondering what the difference is between standard normalization and softmax; after all, both rescale the logits between 0 and 1. Remember that backpropagation is a key aspect of training neural networks: we want the correct answer to have the largest “signal.” By using softmax, we are effectively “approximating” argmax while gaining differentiability. Rescaling doesn’t weigh the max significantly higher than other logits, whereas softmax does. Simply put, softmax is a “softer” argmax (see what we did there?).
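A small sketch contrasting the two on the same made-up logits shows the difference: simple rescaling only modestly favors the largest logit, while softmax lets it dominate.

```python
import numpy as np

logits = np.array([1.0, 2.0, 5.0])  # made-up class scores

# "Standard" normalization: rescale so the values sum to 1.
rescaled = logits / np.sum(logits)

# Softmax: exponentiate, then normalize.
softmax = np.exp(logits) / np.sum(np.exp(logits))

print(rescaled)  # [0.125 0.25  0.625] -- max is only modestly higher
print(softmax)   # [0.017 0.047 0.936] -- max dominates, like a "soft" argmax
```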
There are many types of pooling layers in different CNN architectures, but they all have the purpose of gradually decreasing the spatial extent of the network, which reduces the parameters and overall computation of the network. The type of pooling used in the Tiny VGG architecture above is Max-Pooling.
The Max-Pooling operation requires selecting a kernel size and a stride length during architecture design. Once selected, the operation slides the kernel with the specified stride over the input while only selecting the largest value at each kernel slice from the input to yield a value for the output. This process can be viewed by clicking a pooling neuron in the network above.
In the Tiny VGG architecture above, the pooling layers use a 2x2 kernel and a stride of 2. With these specifications, this operation discards 75% of the activations. By discarding so many values, Tiny VGG is more computationally efficient and avoids overfitting.
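A minimal sketch of 2x2 max-pooling with a stride of 2 on a small made-up activation map; each 2x2 block keeps only its largest value, so 3 out of every 4 activations (75%) are discarded.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    h, w = x.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = np.max(x[i:i + 2, j:j + 2])
    return out

activation_map = np.array([[1., 3., 2., 0.],
                           [4., 2., 1., 1.],
                           [0., 0., 5., 6.],
                           [1., 2., 7., 8.]])
print(max_pool_2x2(activation_map))
# [[4. 2.]
#  [2. 8.]]
```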
This layer converts a three-dimensional layer in the network into a one-dimensional vector to fit the input of a fully-connected layer for classification. For example, a 5x5x2 tensor would be converted into a vector of size 50. The previous convolutional layers of the network extracted the features from the input image, but now it is time to classify the features. We use the softmax function to classify these features, which requires a 1-dimensional input. This is why the flatten layer is necessary. This layer can be viewed by clicking any output class.
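A minimal sketch of the flatten step for the 5x5x2 example above: the 3D tensor is unrolled into a length-50 vector that the fully-connected (softmax) layer can consume.

```python
import numpy as np

tensor = np.random.rand(5, 5, 2)  # e.g., output of the last pooling layer
flattened = tensor.reshape(-1)    # unroll into a 1D vector

print(flattened.shape)            # (50,) -- ready for the fully-connected layer
```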
CNN Explainer uses TensorFlow.js, an in-browser GPU-accelerated deep learning library, to load the pretrained model for visualization. The entire interactive system is written in JavaScript using Svelte as a framework and D3.js for visualizations. You only need a web browser to get started learning CNNs today!
CNN Explainer was created by Jay Wang, Robert Turko, Omar Shaikh, Haekyu Park, Nilaksh Das, Fred Hohman, Minsuk Kahng, and Polo Chau, and was the result of a research collaboration between Georgia Tech and Oregon State. We thank Anmol Chhabria, Kaan Sancak, Kantwon Rogers, and the Georgia Tech Visualization Lab for their support and constructive feedback. This work was supported in part by NSF grants IIS-1563816 and CNS-1704701, the NASA NSTRF, DARPA GARD, and gifts from Intel, NVIDIA, Google, and Amazon.