The Mystery of Machine Learning
It’s surprising how little is known about the foundations of machine learning. Yes, from an engineering point of view, an immense amount has been figured out about how to build neural nets that do all kinds of impressive and sometimes almost magical things. But at a fundamental level we still don’t really know why neural nets “work”—and we don’t have any kind of “scientific big picture” of what’s going on inside them.
The basic structure of neural networks can be pretty simple. But by the time they’re trained up with all their weights, etc. it’s been hard to tell what’s going on—or even to get any good visualization of it. And indeed it’s far from clear even what aspects of the whole setup are actually essential, and what are just “details” that have perhaps been “grandfathered” all the way from when computational neural nets were first invented in the 1940s.
Well, what I’m going to try to do here is to get “underneath” this—and to “strip things down” as much as possible. I’m going to explore some very minimal models—that, among other things, are more directly amenable to visualization. At the outset, I wasn’t at all sure that these minimal models would be able to reproduce any of the kinds of things we see in machine learning. But, rather surprisingly, it seems they can.
And the simplicity of their construction makes it much easier to “see inside them”—and to get more of a sense of what essential phenomena actually underlie machine learning. One might have imagined that even though the training of a machine learning system might be circuitous, somehow in the end the system would do what it does through some kind of identifiable and “explainable” mechanism. But we’ll see that in fact that’s typically not at all what happens.
Instead it looks much more as if the training manages to home in on some quite wild computation that “just happens to achieve the right results”. Machine learning, it seems, isn’t building structured mechanisms; rather, it’s basically just sampling from the typical complexity one sees in the computational universe, picking out pieces whose behavior turns out to overlap what’s needed. And in a sense, therefore, the possibility of machine learning is ultimately yet another consequence of the phenomenon of computational irreducibility.
Why is that? Well, it’s only because of computational irreducibility that there’s all that richness in the computational universe. And, more than that, it’s because of computational irreducibility that things end up being effectively random enough that the adaptive process of training a machine learning system can reach success without getting stuck.
But the presence of computational irreducibility also has another important implication: that even though we can expect to find limited pockets of computational reducibility, we can’t expect a “general narrative explanation” of what a machine learning system does. In other words, there won’t be a traditional (say, mathematical) “general science” of machine learning (or, for that matter, probably also neuroscience). Instead, the story will be much closer to the fundamentally computational “new kind of science” that I’ve explored for so long, and that has brought us our Physics Project and the ruliad.
In many ways, the problem of machine learning is a version of the general problem of adaptive evolution, as encountered for example in biology. In biology we typically imagine that we want to adaptively optimize some overall “fitness” of a system; in machine learning we typically try to adaptively “train” a system to make it align with certain goals or behaviors, most often defined by examples. (And, yes, in practice this is often done by trying to minimize a quantity normally called the “loss”.)
And while in biology there’s a general sense that “things arise through evolution”, quite how this works has always been rather mysterious. But (rather to my surprise) I recently found a very simple model that seems to do well at capturing at least some of the most essential features of biological evolution. And while the model isn’t the same as what we’ll explore here for machine learning, it has some definite similarities. And in the end we’ll find that the core phenomena of machine learning and of biological evolution appear to be remarkably aligned—and both fundamentally connected to the phenomenon of computational irreducibility.
Most of what I’ll do here focuses on foundational, theoretical questions. But in understanding more about what’s really going on in machine learning—and what’s essential and what’s not—we’ll also be able to begin to see how in practice machine learning might be done differently, potentially with more efficiency and more generality.
Traditional Neural Nets
Note: Click any diagram to get Wolfram Language code to reproduce it.
To begin the process of understanding the essence of machine learning, let’s start from a very traditional—and familiar—example: a fully connected (“multilayer perceptron”) neural net that’s been trained to compute a certain function f[x]:
If one gives a value x as input at the top, then after “rippling through the layers of the network” one gets a value at the bottom that (almost exactly) corresponds to our function f[x]:
Scanning through different inputs x, we see different patterns of intermediate values inside the network:
And here’s (on a linear and log scale) how each of these intermediate values changes with x. And, yes, the way the final value (highlighted here) emerges looks very complicated:
So how is the neural net ultimately put together? How are these values that we’re plotting determined? We’re using the standard setup for a fully connected multilayer network. Each node (“neuron”) on each layer is connected to all nodes on the layer above—and values “flow” down from one layer to the next, being multiplied by the (positive or negative) “weight” (indicated by color in our pictures) associated with the connection through which they flow. The value of a given neuron is found by totaling up all its (weighted) inputs from the layer before, adding a “bias” value for that neuron, and then applying to the result a certain (nonlinear) “activation function” (here ReLU or Ramp[z], i.e. If[z < 0, 0, z]).
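As a minimal Wolfram Language sketch of this computation (the 1 → 4 → 4 → 1 layer sizes and the random weights here are placeholders for illustration, not the trained values shown in the pictures):

  (* one layer: weighted sum of the inputs from the layer above, plus a bias, then ReLU *)
  layer[x_, w_, b_] := Ramp[w . x + b]

  (* placeholder random weights and biases for a 1 -> 4 -> 4 -> 1 network *)
  {w1, w2, w3} = {RandomReal[{-1, 1}, {4, 1}], RandomReal[{-1, 1}, {4, 4}], RandomReal[{-1, 1}, {1, 4}]};
  {b1, b2, b3} = {RandomReal[{-1, 1}, 4], RandomReal[{-1, 1}, 4], RandomReal[{-1, 1}, 1]};

  (* a value x "rippling through the layers"; the final layer here is left linear *)
  net[x_] := First[w3 . layer[layer[{x}, w1, b1], w2, b2] + b3]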
What overall function a given neural net will compute is determined by the collection of weights and biases that appear in the neural net (along with its overall connection architecture, and the activation function it’s using). The idea of machine learning is to find weights and biases that produce a particular function by adaptively “learning” from examples of that function. Typically we might start from a random collection of weights, then successively tweak weights and biases to “train” the neural net to reproduce the function:
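In the Wolfram Language neural net framework, this kind of training can be sketched roughly as follows (the target function f and the layer sizes here are placeholder assumptions, not the exact setup used for the pictures):

  (* placeholder target function standing in for the f[x] used in the pictures *)
  f[x_] := If[Abs[x] < 1/2, 1., 0.];

  (* a small fully connected network, and training data sampled from f *)
  net = NetChain[{LinearLayer[4], Ramp, LinearLayer[4], Ramp, LinearLayer[1]}, "Input" -> 1];
  data = Table[{x} -> {f[x]}, {x, -1, 1, 0.02}];

  (* adaptively adjust the weights and biases to reproduce f *)
  trained = NetTrain[net, data, LossFunction -> MeanSquaredLossLayer[]]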
We can get a sense of how this progresses (and, yes, it’s complicated) by plotting successive changes in individual weights over the course of the training process (the spikes near the end come from “neutral changes” that don’t affect the overall behavior):
The overall objective in the training is progressively to decrease the “loss”—the average (squared) difference between true values of f[x] and those generated by the neural net. The evolution of the loss defines a “learning curve” for the neural net, with the downward glitches corresponding to points where the neural net in effect “made a breakthrough” in being able to represent the function better:
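Concretely, the "loss" being minimized can be thought of as something like this (a sketch, assuming the training data is given as {x, f[x]} pairs):

  (* mean squared difference between the true values and the values the net produces *)
  loss[net_, pairs_] := Mean[(net[First[#]] - Last[#])^2 & /@ pairs]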
It’s important to note that typically there’s randomness injected into neural net training. So if one runs the training multiple times, one will get different networks—and different learning curves—every time:
But what’s really going on in neural net training? Effectively we’re finding a way to “compile” a function (at least to some approximation) into a neural net with a certain number of (real-valued) parameters. And in the example here we happen to be using about 100 parameters.
But what happens if we use a different number of parameters, or set up the architecture of our neural net differently? Here are a few examples, indicating that for the function we’re trying to generate, the network we’ve been using so far is pretty much the smallest that will work:
And, by the way, here’s what happens if we change our activation function from ReLU
to the smoother ELU:
Later we’ll talk about what happens when we do machine learning with discrete systems. And in anticipation of that, it’s interesting to see what happens if we take a neural net of the kind we’ve discussed here, and “quantize” its weights (and biases) in discrete levels:
The result is that (as recent experience with large-scale neural nets has also shown) the basic “operation” of the neural net does not require precise real numbers, but survives even when the numbers are at least somewhat discrete—as this 3D rendering as a function of the discreteness level δ also indicates:
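A minimal sketch of one way to do such quantization (the article doesn't spell out the scheme beyond the discreteness level δ, so rounding to the nearest multiple of δ is an assumption):

  (* snap a weight (or bias) to the nearest multiple of the discreteness level δ *)
  quantize[w_, δ_] := δ Round[w/δ]

  quantize[0.637, 0.25]   (* gives 0.75 *)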
Simplifying the Topology: Mesh Neural Nets
So far we’ve been discussing very traditional neural nets. But to do machine learning, do we really need systems that have all those details? For example, do we really need every neuron on each layer to get an input from every neuron on the previous layer? What happens if instead every neuron just gets input from at most two others—say with the neurons effectively laid out in a simple mesh? Quite surprisingly, it turns out that such a network is still perfectly able to generate a function like the one we’ve been using as an example:
And one advantage of such a “mesh neural net” is that—like a cellular automaton—its “internal behavior” can readily be visualized in a rather direct way. So, for example, here are visualizations of “how the mesh net generates its output”, stepping through different input values x:
And, yes, even though we can visualize it, it’s still hard to understand “what’s going on inside”. Looking at the intermediate values of each individual node in the network as a function of x doesn’t help much, though we can “see something happening” at places where our function f[x] has jumps:
So how do we train a mesh neural net? Basically we can use the same procedure as for a fully connected network of the kind we saw above (ReLU activation functions don’t seem to work well for mesh nets, so we’re using ELU here):
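A rough sketch of what one step of such a mesh net might look like (the handling of the edges and the exact layout of weights in the pictures are assumptions here; ELU is used as the activation, as noted above):

  (* ELU activation *)
  elu[z_] := If[z >= 0, z, Exp[z] - 1]

  (* one mesh layer: each new value combines just two neighboring values from the layer above;
     w has one pair of weights, and b one bias, for each neighboring pair *)
  meshLayer[x_List, w_List, b_List] :=
    MapThread[elu[#2 . #1 + #3] &, {Partition[x, 2, 1], w, b}]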
Here’s the evolution of differences in each individual weight during the training process:
And here are results for different random seeds:
At the size we’re using, our mesh neural nets have about the same number of connections (and thus weights) as our main example of a fully connected network above. And we see that if we try to reduce the size of our mesh neural net, it doesn’t do well at reproducing our function:
Making Everything Discrete: A Biological Evolution Analog
Mesh neural nets simplify the topology of neural net connections. But, somewhat surprisingly at first, it seems as if we can go much further in simplifying the systems we’re using—and still successfully do versions of machine learning. And in particular we’ll find that we can make our systems completely discrete.
The typical methodology of neural net training involves progressively tweaking real-valued parameters, usually using methods based on calculus, and on finding derivatives. And one might imagine that any successful adaptive process would ultimately have to rely on being able to make arbitrarily small changes, of the kind that are possible with real-valued parameters.
But in studying simple idealizations of biological evolution I recently found striking examples where this isn’t the case—and where completely discrete systems seemed able to capture the essence of what’s going on.
As an example consider a (3-color) cellular automaton. The rule is shown on the left, and the behavior one generates by repeatedly applying that rule (starting from a single-cell initial condition) is shown on the right:
The rule has the property that the pattern it generates (from a single-cell initial condition) survives for exactly 40 steps, and then dies out (i.e. every cell becomes white). And the important point is that this rule can be found by a discrete adaptive process. The idea is to start, say, from a null rule, and then at each step to randomly change a single outcome out of the 27 in the rule (i.e. make a “single-point mutation” in the rule). Most such changes will cause the “lifetime” of the pattern to get further from our target of 40—and these we discard. But gradually we can build up “beneficial mutations”
that through “progressive adaptation” eventually get to our original lifetime-40 rule:
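Here is a rough Wolfram Language sketch of this kind of discrete adaptive process (the digit ordering of the rule, the cap on the number of steps, and the exact lifetime convention are all simplifications):

  (* lifetime of the pattern grown from a single cell, capped at maxT steps *)
  lifetime[ruleDigits_, maxT_ : 100] :=
    Module[{evol = CellularAutomaton[{FromDigits[ruleDigits, 3], 3}, {{1}, 0}, maxT]},
     First[FirstPosition[Total /@ evol, 0, {maxT + 2}]] - 1]

  (* adaptive evolution: make random single-point mutations of the 27 rule cases,
     keeping a mutation only if it doesn't move the lifetime further from 40 *)
  adapt[nSteps_] := Module[{rule = ConstantArray[0, 27], new},
    Do[
     new = ReplacePart[rule, RandomInteger[{1, 27}] -> RandomInteger[{0, 2}]];
     If[Abs[lifetime[new] - 40] <= Abs[lifetime[rule] - 40], rule = new],
     {nSteps}];
    rule]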
We can make a plot of all the attempts we made that eventually let us reach lifetime 40—and we can think of this progressive “fitness” curve as being directly analogous to the loss curves in machine learning that we saw before:
If we make different sequences of random mutations, we’ll get different paths of adaptive evolution, and different “solutions” for rules that have lifetime 40:
Two things are immediately notable about these. First, that they essentially all seem to be “using different ideas” to reach their goal (presumably analogous to the phenomenon of different branches in the tree of life). And second, that none of them seem to be using a clear “mechanical procedure” (of the kind we might construct through traditional engineering) to reach their goal. Instead, they seem to be finding “natural” complicated behavior that just “happens” to achieve the goal.
It’s nontrivial, of course, that this behavior can achieve a goal like the one we’ve set here, as well as that simple selection based on random point mutations can successfully reach the necessary behavior. But as I discussed in connection with biological evolution, this is ultimately a story of computational irreducibility—particularly in generating diversity both in behavior, and in the paths necessary to reach it.
But, OK, so how does this model of adaptive evolution relate to systems like neural nets? In the standard language of neural nets, our model is like a discrete analog of a recurrent convolutional network. It’s “convolutional” because at any given step the same rule is applied—locally—throughout an array of elements. It’s “recurrent” because in effect data is repeatedly “passed through” the same rule. The kinds of procedures (like “backpropagation”) typically used to train traditional neural nets wouldn’t be able to train such a system. But it turns out that—essentially as a consequence of computational irreducibility—the very simple method of successive random mutation can be successful.
Machine Learning in Discrete Rule Arrays
Let’s say we want to set up a system like a neural net—or at least a mesh neural net—but we want it to be completely discrete. (And I mean “born discrete”, not just discretized from an existing continuous system.) How can we do this? One approach (that, as it happens, I first considered in the mid-1980s—but never seriously explored) is to make what we can call a “rule array”. Like in a cellular automaton there’s an array of cells. But instead of these cells always being updated according to the same rule, each cell at each place in the cellular automaton analog of “spacetime” can make a different choice of what rule it will use. (And although it’s a fairly extreme idealization, we can potentially imagine that these different rules represent a discrete analog of different local choices of weights in a mesh neural net.)
As a first example, let’s consider a rule array in which there are two possible choices of rules: k = 2, r = 1 cellular automaton rules 4 and 146 (class 2 and class 3, respectively):
A particular rule array is defined by which of these rules is going to be used at each (“spacetime”) position in the array. Here are a few examples. In all cases we’re starting from the same single-cell initial condition. But in each case the rule array has a different arrangement of rule choices—with cells “running” rule 4 shown on one background color, and those running rule 146 on another:
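As a minimal sketch of how such a rule array can be run (each cell is assigned an ordinary k = 2, r = 1 rule number, here 4 or 146; padding with 0 at the edges is an assumption, and some of the pictures use cyclic boundaries instead):

  (* outcome of a k = 2, r = 1 rule (given by its rule number) for a 3-cell neighborhood *)
  ruleOutcome[n_, {a_, b_, c_}] := IntegerDigits[n, 2, 8][[8 - (4 a + 2 b + c)]]

  (* one step of rule-array evolution: each cell applies its own assigned rule *)
  ruleArrayStep[cells_List, rules_List] :=
    With[{padded = Join[{0}, cells, {0}]},
     Table[ruleOutcome[rules[[i]], padded[[i ;; i + 2]]], {i, Length[cells]}]]

  (* run a whole rule array (a matrix of rule numbers, one row per step) from an initial row *)
  ruleArrayEvolve[init_List, ruleArray_List] := FoldList[ruleArrayStep, init, ruleArray]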
We can see that different choices of rule array can yield very different behaviors. But (in the spirit of machine learning) can we in effect “invert this”, and find a rule array that will give some particular behavior we want?
A simple approach is to do the direct analog of what we did in our minimal modeling of biological evolution: progressively make random “single-point mutations”—here “flipping” the identity of just one rule in the rule array—and then keeping only those mutations that don’t make things worse.
As our sample objective, let’s ask to find a rule array that makes the pattern generated from a single cell using that rule array “survive” for exactly 50 steps. At first it might not be obvious that we’d be able to find such a rule array. But in fact our simple adaptive procedure easily manages to do this:
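The “single-point mutation” itself can be sketched like this (assuming the two-rule setup above, with lifetimeFn standing for whatever lifetime-measuring function is in use):

  (* flip the rule choice (4 <-> 146) at one randomly chosen position in the rule array *)
  mutate[ruleArray_] :=
    Module[{i = RandomInteger[{1, Length[ruleArray]}], j = RandomInteger[{1, Length[First[ruleArray]]}]},
     ReplacePart[ruleArray, {i, j} -> If[ruleArray[[i, j]] == 4, 146, 4]]]

  (* accept a mutation only if it doesn't move the pattern's lifetime further from 50 *)
  accept[old_, new_, lifetimeFn_] :=
    If[Abs[lifetimeFn[new] - 50] <= Abs[lifetimeFn[old] - 50], new, old]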
As the dots here indicate, many mutations don’t lead to longer lifetimes. But every so often, the adaptive process has a “breakthrough” that increases the lifetime—eventually reaching 50:
Just as in our model of biological evolution, different random sequences of mutations lead to different “solutions”, here to the problem of “living for exactly 50 steps”:
Some of these are in effect “simple solutions” that require only a few mutations. But most—like most of our examples in biological evolution—seem more as if they just “happen to work”, effectively by tapping into just the right, fairly complex behavior.
Is there a sharp distinction between these cases? Looking at the collection of “fitness” (AKA “learning”) curves for the examples above, it doesn’t seem so:
It’s not too difficult to see how to “construct a simple solution” just by strategically placing a single instance of the second rule in the rule array:
But the point is that adaptive evolution by repeated mutation normally won’t “discover” this simple solution. And what’s significant is that the adaptive evolution can nevertheless still successfully find some solution—even though it’s not one that’s “understandable” like this.
The cellular automaton rules we’ve been using so far take 3 inputs. But it turns out that we can make things even simpler by just putting ordinary 2-input Boolean functions into our rule array. For example, we can make a rule array from And and Xor functions (r = 1/2 rules 8 and 6):
Different And+Xor rule arrays show different behavior:
But are there for example And+Xor rule arrays that will compute any of the 16 possible (2-input) functions? We can’t get Not or any of the 8 other functions with —but it turns out we can get all 8 functions with (additional inputs here are assumed to be ):
And in fact we can also set up And+Xor rule arrays for all other “even” Boolean functions. For example, here are rule arrays for the 3-input rule 30 and rule 110 Boolean functions:
It may be worth commenting that the ability to set up such rule arrays is related to functional completeness of the underlying rules we’re using—though it’s not quite the same thing. Functional completeness is about setting up arbitrary formulas that can in effect allow long-range connections between intermediate results. Here, all information has to explicitly flow through the array. But for example the functional completeness of Nand (r = 1/2 rule 7) allows it to generate all Boolean functions when combined for example with First (r = 1/2 rule 12), though sometimes the rule arrays required are quite large:
OK, but what happens if we try to use our adaptive evolution process—say to solve the problem of finding a pattern that survives for exactly 30 steps? Here’s a result for And+Xor rule arrays:
And here are examples of other “solutions” (none of which in this case look particularly “mechanistic” or “constructed”):
But what about learning our original f[x] = function? Well, first we have to decide how we’re going to represent the numbers x and f[x] in our discrete rule array system. And one approach is to do this simply in terms of the position of a black cell (“one-hot encoding”). So, for example, in this case there’s an initial black cell at a position corresponding to about x = –1.1. And then the result after passing through the rule array is a black cell at a position corresponding to f[x] = 1.0:
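A sketch of this kind of one-hot position encoding (the range of x and the width of the array here are assumptions chosen just for illustration):

  (* encode a value x in roughly [-1, 1] as a single black (1) cell in a width-w row of white (0) cells *)
  encode[x_, w_] :=
    ReplacePart[ConstantArray[0, w], Clip[Round[(x + 1)/2 (w - 1)] + 1, {1, w}] -> 1]

  (* decode: read off the position of the black cell and map it back to a number *)
  decode[row_, w_] := 2 (First[FirstPosition[row, 1, {1}]] - 1)/(w - 1) - 1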
So now the question is whether we can find a rule array that successfully maps initial to final cell positions according to the mapping x → f[x] we want. Well, here’s an example that comes at least close to doing this (note that the array is taken to be cyclic):
So how did we find this? Well, we just used a simple adaptive evolution process. In direct analogy to the way it’s usually done in machine learning, we set up “training examples”, here of the form:
Then we repeatedly made single-point mutations in our rule array, keeping those mutations where the total difference from all the training examples didn’t increase. And after 50,000 mutations this gave the final result above.
We can get some sense of “how we got there” by showing the sequence of intermediate results where we got closer to the goal (as opposed to just not getting further from it):
Here are the corresponding rule arrays, in each case highlighting elements that have changed (and showing the computation of f[0] in the arrays):
Different sequences of random mutations will lead to different rule arrays. But with the setup defined here, the resulting rule arrays will almost always succeed in accurately computing f[x]. Here are a few examples—in which we’re specifically showing the computation of f[0]:
And once again an important takeaway is that we don’t see “identifiable mechanism” in what’s going on. Instead, it looks more as if the rule arrays we’ve got just “happen” to do the computations we want. Their behavior is complicated, but somehow we can manage to “tap into it” to compute our f[x].
But how robust is this computation? A key feature of typical machine learning is that it can “generalize” away from the specific examples it’s been given. It’s never been clear just how to characterize that generalization (when does an image of a cat in a dog suit start being identified as an image of a dog?). But—at least when we’re talking about classification tasks—we can think of what’s going on in terms of basins of attraction that lead to attractors corresponding to our classes.
It’s all considerably easier to analyze, though, in the kind of discrete system we’re exploring here. For example, we can readily enumerate all our training inputs (i.e. all initial states containing a single black cell), and then see how frequently these cause any given cell to be black:
By the way, here’s what happens to this plot at successive “breakthroughs” during training:
But what about all possible inputs, including ones that don’t just contain a single black cell? Well, we can enumerate all of them, and compute the overall frequency for each cell in the array to be black:
As we would expect, the result is considerably “fuzzier” than what we got purely with our training inputs. But there’s still a strong trace of the discrete values for f[x] that appeared in the training data. And if we plot the overall probability for a given final cell to be black, we see peaks at positions corresponding to the values 0 and 1 that f[x] takes on:
But because our system is discrete, we can explicitly look at what outcomes occur:
The most common overall is the “meaningless” all-white state—that basically occurs when the computation from the input “never makes it” to the output. But the next most common outcomes correspond exactly to f[x] = 0 and f[x] = 1. After that is the “superposition” outcome where f[x] is in effect “both 0 and 1”.
But, OK, so what initial states are “in the basins of attraction of” (i.e. will evolve to) the various outcomes here? The fairly flat plots in the last column above indicate that the overall density of black cells gives little information about what attractor a particular initial state will evolve to.
So this means we have to look at specific configurations of cells in the initial conditions. As an example, start from the initial condition
which evolves to:
Now we can ask what happens if we look at a sequence of slightly different initial conditions. And here we show in black and white initial conditions that still evolve to the original “attractor” state, and in pink ones that evolve to some different state:
What’s actually going on inside here? Here are a few examples, highlighting cells whose values change as a result of changing the initial condition:
As is typical in machine learning, there doesn’t seem to be any simple characterization of the form of the basin of attraction. But now we have a sense of what the reason for this is: it’s another consequence of computational irreducibility. Computational irreducibility gives us the effective randomness that allows us to find useful results by adaptive evolution, but it also leads to changes having what seem like random and unpredictable effects. (It’s worth noting, by the way, that we could probably dramatically improve the robustness of our attractor basins by specifically including in our training data examples that have “noise” injected.)
Multiway Mutation Graphs
In doing machine learning in practice, the goal is typically to find some collection of weights, etc. that successfully solve a particular problem. But in general there will be many such collections of weights, etc. With typical continuous weights and random training steps it’s very difficult to see what the whole “ensemble” of possibilities is. But in our discrete rule array systems, this becomes more feasible.
Consider a tiny 2×2 rule array with two possible rules. We can make a graph whose edges represent all possible “point mutations” that can occur in this rule array:
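A sketch of how such a point-mutation graph can be built (using rules 4 and 146 as the two possible choices purely for illustration; the result is the 4-dimensional hypercube of the 16 possible 2×2 assignments):

  (* all 2x2 assignments of the two rules, and edges between assignments differing in exactly one position *)
  configs = Tuples[{4, 146}, {2, 2}];
  edges = Select[Subsets[configs, {2}],
     Count[Flatten[#[[1]] - #[[2]]], Except[0]] == 1 &];
  mutationGraph = Graph[UndirectedEdge @@@ edges]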
In our adaptive evolution process, we’re always moving around a graph like this. But typically most “moves” will end up in states that are rejected because they increase whatever loss we’ve defined.
Consider the problem of generating an And+Xor rule array in which we end with lifetime-4 patterns. Defining the loss as how far we are from this lifetime, we can draw a graph that shows all possible adaptive evolution paths that always progressively decrease the loss:
The result is a multiway graph of the type we’ve now seen in a great many kinds of situations—notably our recent study of biological evolution.
And although this particular example is quite trivial, the idea in general is that different parts of such a graph represent “different strategies” for solving a problem. And—in direct analogy to our Physics Project and our studies of things like game graphs—one can imagine such strategies being laid out in a “branchial space” defined by common ancestry of configurations in the multiway graph.
And one can expect that while in some cases the branchial graph will be fairly uniform, in other cases it will have quite separated pieces—that represent fundamentally different strategies. Of course, the fact that underlying strategies may be different doesn’t mean that the overall behavior or performance of the system will be noticeably different. And indeed one expects that in most cases computational irreducibility will lead to enough effective randomness that there’ll be no discernable difference.
But in any case, here’s an example starting with a rule array that contains both And and Xor—where we observe distinct branches of adaptive evolution that lead to different solutions to the problem of finding a configuration with a lifetime of exactly 4:
Optimizing the Learning Process
How should one actually do the learning in machine learning? In practical work with traditional neural nets, learning is normally done using systematic algorithmic methods like backpropagation. But so far, all we’ve done here is something much simpler: we’ve “learned” by successively making random point mutations, and keeping only ones that don’t lead us further from our goal. And, yes, it’s interesting that such a procedure can work at all—and (as we’ve discussed elsewhere) this is presumably very relevant to understanding phenomena like biological evolution. But, as we’ll see, there are more efficient (and probably much more efficient) methods of doing machine learning, even for the kinds of discrete systems we’re studying.
Let’s start by looking again at our earlier example of finding an And+Xor rule array that gives a “lifetime” of exactly 30. At each step in our adaptive (“learning”) process we make a single-point mutation (changing a single rule in the rule array), keeping the mutation if it doesn’t take us further from our goal. The mutations gradually accumulate—every so often reaching a rule array that gives a lifetime closer to 30. Just as above, here’s a plot of the lifetime achieved by successive mutations—with the “internal” red dots corresponding to rejected mutations:
We see a series of “plateaus” at which mutations are accumulating but not changing the overall lifetime. And between these we see occasional “breakthroughs” where the lifetime jumps. Here are the actual rule array configurations for these breakthroughs, with mutations since the last breakthrough highlighted:
But in the end the process here is quite wasteful; in this example, we make a total of 1705 mutations, but only 780 of them actually contribute to generating the final rule array; all the others are discarded along the way.
So how can we do better? One strategy is to try to figure out at each step which mutation is “most likely to make a difference”. And one way to do this is to try every possible mutation in turn at every step (as in multiway evolution)—and see what effect each of them has on the ultimate lifetime. From this we can construct a “change map” in which we give the change of lifetime associated with a mutation at every particular cell. The results will be different for every configuration of rule array, i.e. at every step in the adaptive evolution. But for example here’s what they are for the particular “breakthrough” configurations shown above (elements in regions that are colored gray won’t affect the result if they are changed; ones colored red will have a positive effect (with more intense red being more positive), and ones colored blue a negative one):
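A sketch of constructing such a change map by brute force (lifetimeFn stands for whatever lifetime-measuring function is being used, and otherRule for the flip between the two rule choices; both names are assumptions):

  (* flip between the two rules in use, e.g. And (rule 8) and Xor (rule 6) *)
  otherRule[r_] := If[r == 8, 6, 8]

  (* for every position, the change in lifetime produced by flipping the rule at that position *)
  changeMap[ruleArray_, lifetimeFn_] :=
    Module[{base = lifetimeFn[ruleArray]},
     Table[lifetimeFn[ReplacePart[ruleArray, {i, j} -> otherRule[ruleArray[[i, j]]]]] - base,
      {i, Length[ruleArray]}, {j, Length[First[ruleArray]]}]]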
Let’s say we start from a random rule array, then repeatedly construct the change map and apply the mutation that it implies gives the most positive change—in effect at each step following the “path of steepest descent” to get to the lifetime we want (i.e. reduce the loss). Then the sequence of “breakthrough” configurations we get is:
And this in effect corresponds to a slightly more direct “path to a solution” than our sequence of pure single-point mutations.
By the way, the particular problem of reaching a certain lifetime has a simple enough structure that this “steepest descent” method—when started from a simple uniform rule array—finds a very “mechanical” (if slow) path to a solution:
What about the problem of learning f[x] = ? Once again we can make a change map based on the loss we define. Here are the results for a sequence of “breakthrough” configurations. The gray regions are ones where changes will be “neutral”, so that there’s still exploration that can be done without affecting the loss. The red regions are ones that are in effect “locked in” and where any changes would be deleterious in terms of loss:
So what happens in this case if we follow the “path of steepest descent”, always making the change that would be best according to the change map? Well, the results are actually quite unsatisfactory. From almost any initial condition the system quickly gets stuck, and never finds any satisfactory solution. In effect it seems that deterministically following the path of steepest descent leads us to a “local minimum” from which we cannot escape. So what are we missing in just looking at the change map? Well, the change map as we’ve constructed it has the limitation that it’s separately assessing the effect of each possible individual mutation. It doesn’t deal with multiple mutations at a time—which could well be needed in general if one’s going to find the “fastest path to success”, and avoid getting stuck.
But even in constructing the change map there’s already a problem. Because at least the direct way of computing it scales quite poorly. In an n×n rule array we have to check the effect of flipping about n² values, and for each one we have to run the whole system—taking altogether about n⁴ operations. And one has to do this separately for each step in the learning process.
So how do traditional neural nets avoid this kind of inefficiency? The answer in a sense involves a mathematical trick. And at least as it’s usually presented it’s all based on the continuous nature of the weights and values in neural nets—which allow us to use methods from calculus.
Let’s say we have a neural net like this
that computes some particular function f[x]:
We can ask how this function changes as we change each of the weights in the network:
And in effect this gives us something like our “change map” above. But there’s an important difference. Because the weights are continuous, we can think about infinitesimal changes to them. And then we can ask questions like “How does f[x] change when we make an infinitesimal change to a particular weight wᵢ?”—or equivalently, “What is the partial derivative of f with respect to wᵢ at the point x?” But now we get to use a key feature of infinitesimal changes: that they can always be thought of as just “adding linearly” (essentially because ε² can always be ignored compared to ε). Or, in other words, we can summarize any infinitesimal change just by giving its “direction” in weight space, i.e. a vector that says how much of each weight should be (infinitesimally) changed. So if we want to change f[x] (infinitesimally) as quickly as possible, we should go in the direction of steepest descent defined by all the derivatives of f with respect to the weights.
In machine learning, we’re typically trying in effect to set the weights so that the form of f[x] we generate successfully minimizes whatever loss we’ve defined. And we do this by incrementally “moving in weight space”—at every step computing the direction of steepest descent to know where to go next. (In practice, there are all sorts of tricks like “ADAM” that try to optimize the way to do this.)
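As a minimal sketch of this "move in the direction of steepest descent" idea (using brute-force numerical differences for the gradient rather than the backpropagation discussed below; loss is assumed to be a function of the whole list of weights):

  (* one gradient-descent step: estimate each partial derivative by a small finite difference,
     then move the weights a little in the direction that decreases the loss fastest *)
  gradientStep[loss_, w_List, η_] :=
    Module[{ε = 10.^-4, grad},
     grad = Table[(loss[ReplacePart[w, i -> w[[i]] + ε]] - loss[w])/ε, {i, Length[w]}];
     w - η grad]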
But how do we efficiently compute the partial derivative of f with respect to each of the weights? Yes, we could do the analog of generating pictures like the ones above, separately for each of the weights. But it turns out that a standard result from calculus gives us a vastly more efficient procedure that in effect “maximally reuses” parts of the computation that have already been done.
It all starts with the textbook chain rule for the derivative of nested (i.e. composed) functions:
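Written out explicitly, this standard identity reads (the article's own rendering of it isn't reproduced here):

  D[d[c[b[a[x]]]], x] == d'[c[b[a[x]]]] c'[b[a[x]]] b'[a[x]] a'[x]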
This basically says that the (infinitesimal) change in the value of the “whole chain” d[c[b[a[x]]]] can be computed as a product of (infinitesimal) changes associated with each of the “links” in the chain. But the key observation is then that when we get to the computation of the change at a certain point in the chain, we’ve already had to do a lot of the computation we need—and so long as we stored those results, we always have only an incremental computation to perform.
So how does this apply to neural nets? Well, each layer in a neural net is in effect doing a function composition. So, for example, our d[c[b[a[x]]]] is like a trivial neural net:
But what about the weights, which, after all, are what we are trying to find the effect of changing? Well, we could include them explicitly in the function we’re computing:
And then we could in principle symbolically compute the derivatives with respect to these weights:
For our network above
the corresponding expression (ignoring biases) is
where ϕ denotes our activation function. Once again we’re dealing with nested functions, and once again—though it’s a bit more intricate in this case—the computation of derivatives can be done by incrementally evaluating terms in the chain rule and in effect using the standard neural net method of “backpropagation”.
So what about the discrete case? Are there similar methods we can use there? We won’t discuss this in detail here, but we’ll give some indications of what’s likely to be involved.
As a potentially simpler case, let’s consider ordinary cellular automata. The analog of our change map asks how the value of a particular “output” cell is affected by changes in other cells—or in effect what the “partial derivative” of the output value is with respect to changes in values of other cells.
For example, consider the highlighted “output” cell in this cellular automaton evolution:
Now we can look at each cell in this array, and make a change map based on seeing whether flipping the value of just that cell (and then running the cellular automaton forwards from that point) would change the value of the output cell:
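A sketch of that construction (cyclic boundaries are assumed, since the state is given as an explicit finite list; outPos is the position of the chosen output cell in the final row):

  (* 1 if flipping cell i of row t changes the chosen output cell in the final row, 0 otherwise *)
  changeMapCA[rule_, init_List, steps_, outPos_] :=
    Module[{evol = CellularAutomaton[rule, init, steps]},
     Table[
      Boole[CellularAutomaton[rule, ReplacePart[evol[[t]], i -> 1 - evol[[t, i]]],
          steps - t + 1][[-1, outPos]] =!= evol[[-1, outPos]]],
      {t, steps}, {i, Length[init]}]]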
The form of the change map is different if we look at different “output cells”:
Here, by the way, are some larger change maps for this and a couple of other cellular automaton rules:
But is there a way to construct such change maps incrementally? One might have thought that there would immediately be at least for cellular automata that (unlike the cases here) are fundamentally reversible. But actually such reversibility doesn’t seem to help much—because although it allows us to “backtrack” whole states of the cellular automaton, it doesn’t allow us to trace the separate effects of individual cells.
So how about using discrete analogs of derivatives and the chain rule? Let’s for example call the function computed by one step in rule 30 cellular automaton evolution w[x, y, z]. We can think of the “partial derivative” of this function with respect to x at the point x as representing whether the output of w changes when x is flipped starting from the value given:
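A sketch of this Boolean derivative in Wolfram Language (booleanD is a made-up helper name; it just tests whether flipping the chosen variable flips the value of the expression):

  (* discrete analog of a partial derivative: True exactly when flipping var changes the value of expr *)
  booleanD[expr_, var_] := BooleanMinimize[Xor[expr /. var -> True, expr /. var -> False]]

  booleanD[And[x, y], x]   (* gives y *)
  booleanD[Xor[x, y], x]   (* gives True *)
  booleanD[Or[x, y], x]    (* gives ! y *)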
(Note that “no change” is indicated as False, while a change is indicated as True. And, yes, one can either explicitly compute the rule outcomes here, and then deduce from them the functional form, or one can use symbolic rules to directly deduce the functional form.)
One can compute a discrete analog of a derivative for any Boolean function. For example, we have
and
which we can write as:
We also have:
And here is a table of “Boolean derivatives” for all 2-input Boolean functions:
And indeed there’s a whole “Boolean calculus” one can set up for these kinds of derivatives. And in particular, there’s a direct analog of the chain rule:
where Xnor[x,y] is effectively the equality test x == y:
But, OK, how do we use this to create our change maps? In our simple cellular automaton case, we can think of our change map as representing how a change in an output cell “propagates back” to previous cells. But if we just try to apply our discrete calculus rules we run into a problem: different “chain rule chains” can imply different changes in the value of the same cell. In the continuous case this path dependence doesn’t happen because of the way infinitesimals work. But in the discrete case it does. And ultimately we’re doing a kind of backtracking that can really be represented faithfully only as a multiway system. (Though if we just want probabilities, for example, we can consider averaging over branches of the multiway system—and the change maps we showed above are effectively the result of thresholding over the multiway system.)
But despite the appearance of such difficulties in the “simple” cellular automaton case, such methods typically seem to work better in our original, more complicated rule array case. There’s a bunch of subtlety associated with the fact that we’re finding derivatives not only with respect to the values in the rule array, but also with respect to the choice of rules (which are the analog of weights in the continuous case).
Let’s consider the And+Xor rule array:
Our loss is the number of cells whose values disagree with the row shown at the bottom. Now we can construct a change map for this rule array both in a direct “forward” way, and “backwards” using our discrete derivative methods (where we effectively resolve the small amount of “multiway behavior” by always picking “majority” values):
The results are similar, though in this case not exactly the same. Here are a few other examples:
And, yes, in detail there are essentially always local differences between the results from the forward and backward methods. But the backward method—like in the case of backpropagation in ordinary neural nets—can be implemented much more efficiently. And for purposes of practical machine learning it’s actually likely to be perfectly satisfactory—especially given that the forward method is itself only providing an approximation to the question of which mutations are best to do.
And as an example, here are the results of the forward and backward methods for the problem of learning the function f[x] = , for the “breakthrough” configurations that we showed above:
What Can Be Learned?
We’ve now shown quite a few examples of machine learning in action. But a fundamental question we haven’t yet addressed is what kind of thing can actually be learned by machine learning. And even before we get to this, there’s another question: given a particular underlying type of system, what kinds of functions can it even represent?
As a first example consider a minimal neural net of the form (essentially a single-layer perceptron):
With ReLU (AKA Ramp) as the activation function and the first set of weights all taken to be 1, the function computed by such a neural net has the form:
With enough weights and biases this form can represent any piecewise linear function—essentially just by moving around ramps using biases, and scaling them using weights. So for example consider the function:
This is the function computed by the neural net above—and here’s how it’s built up by adding in successive ramps associated with the individual intermediate nodes (neurons):
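As an illustrative sketch of this sum-of-ramps construction (the particular weights and kink positions are made up, not the ones in the pictures):

  (* a piecewise linear function assembled from shifted, scaled ReLU ramps *)
  pwl[x_] := -1 + 1.5 Ramp[x + 1] - 3 Ramp[x] + 2.5 Ramp[x - 1/2]

  Plot[pwl[x], {x, -2, 2}]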
(It’s similarly possible to get all smooth functions from activation functions like ELU, etc.)
Things get slightly more complicated if we try to represent functions with more than one argument. With a single intermediate layer we can only get “piecewise (hyper)planar” functions (i.e. functions that change direction only at linear “fault lines”):
But already with a total of two intermediate layers—and sufficiently many nodes in each of these layers—we can generate any piecewise linear function of any number of arguments.
If we limit the number of nodes, then roughly we limit the number of boundaries between different linear regions in the values of the functions. But as we increase the number of layers with a given number of nodes, we basically increase the number of sides that polygonal regions within the function values can have:
So what happens with the mesh nets that we discussed earlier? Here are a few random examples, showing results very similar to shallow, fully connected networks with a comparable total number of nodes:
OK, so how about our fully discrete rule arrays? What functions can they represent? We already saw part of the answer earlier when we generated rule arrays to represent various Boolean functions. It turns out that there is a fairly efficient procedure based on Boolean satisfiability for explicitly finding rule arrays that can represent a given function—or determine that no rule array (say of a given size) can do this.
Using this procedure, we can find minimal And+Xor rule arrays that represent all (“even”) 3-input Boolean functions (i.e. r = 1 cellular automaton rules):
It’s always possible to specify any n-input Boolean function by an array of 2^n bits, as in:
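As a small illustration (in Python, assuming the usual cellular automaton rule-number convention), here is how such a bit array can be read off from a rule number and used to evaluate the corresponding Boolean function:

def boolean_function_from_rule(rule_number, n=3):
    # the truth table is the 2^n bits of the rule number: bit v gives the output
    # for the input whose binary digits (here p, q, r) form the number v
    table = [(rule_number >> v) & 1 for v in range(2 ** n)]
    def f(*inputs):                       # e.g. f(p, q, r)
        v = 0
        for bit in inputs:
            v = 2 * v + bit
        return table[v]
    return f

rule30 = boolean_function_from_rule(30)
print([rule30((v >> 2) & 1, (v >> 1) & 1, v & 1) for v in range(8)])   # -> [0, 1, 1, 1, 1, 0, 0, 0]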
But we see from the pictures above that when we “compile” Boolean functions into And+Xor rule arrays, they can take different numbers of bits (i.e. different numbers of elements in the rule array). (In effect, the “algorithmic information content” of the function varies with the “language” we’re using to represent it.) And, for example, in the n = 3 case shown here, the distribution of minimal rule array sizes is:
There are some functions that are difficult to represent as And+Xor rule arrays (and seem to require 15 rule elements)—and others that are easier. And this is similar to what happens if we represent Boolean functions as Boolean expressions (say in conjunctive normal form) and count the total number of (unary and binary) operations used:
OK, so we know that there is in principle an And+Xor rule array that will compute any (even) Boolean function. But now we can ask whether an adaptive evolution process can actually find such a rule array—say with a sequence of single-point mutations. Well, if we do such adaptive evolution—with a loss that counts the number of “wrong outputs” for, say, rule 254—then here’s a sequence of successive breakthrough configurations that can be produced:
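To make the mutate-and-keep procedure concrete, here is a small Python sketch. The geometry used here (a fixed-width cyclic grid in which each cell applies And or Xor to the cell above it and that cell’s right-hand neighbor, with the output read from the first cell of the last row) is just a stand-in for the rule arrays pictured in the text, and the target is rule 90 rather than rule 254, because this toy geometry can certainly represent rule 90. On an array this tiny the search quite often gets stuck at a nonzero loss, so the sketch simply restarts a few times; the point is the adaptive loop itself, with a loss that just counts wrong outputs:

import random

AND, XOR = 0, 1

def evaluate(ops, inputs):
    # ops: depth x width grid of AND/XOR labels; cell (t, i) combines row[i] and row[(i + 1) % width]
    row, w = list(inputs), len(inputs)
    for op_row in ops:
        row = [(row[i] & row[(i + 1) % w]) if op_row[i] == AND else (row[i] ^ row[(i + 1) % w])
               for i in range(w)]
    return row[0]

def loss(ops, rule_number, n=3):
    # the number of "wrong outputs" over all 2^n possible inputs
    wrong = 0
    for v in range(2 ** n):
        inputs = [(v >> (n - 1 - k)) & 1 for k in range(n)]
        if evaluate(ops, inputs) != (rule_number >> v) & 1:
            wrong += 1
    return wrong

def adapt(target, width=3, depth=2, steps=2000, rng=None):
    rng = rng or random.Random()
    ops = [[rng.choice([AND, XOR]) for _ in range(width)] for _ in range(depth)]
    best = loss(ops, target)
    for _ in range(steps):
        if best == 0:
            break
        t, i = rng.randrange(depth), rng.randrange(width)        # single-point mutation
        trial = [row[:] for row in ops]
        trial[t][i] = AND if trial[t][i] == XOR else XOR
        l = loss(trial, target)
        if l <= best:                                            # keep it if the loss doesn't increase
            ops, best = trial, l
    return best, ops

# with high probability at least one restart reaches zero loss for this target
results = [adapt(90, rng=random.Random(s)) for s in range(40)]
best, ops = min(results, key=lambda r: r[0])
print("best loss over 40 restarts:", best)
print(ops)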
The results aren’t as compact as the minimal solution above. But it seems to always be possible to find at least some And+Xor rule array that “solves the problem” just by using adaptive evolution with single-point mutations.
Here are results for some other Boolean functions:
And so, yes, not only are all (even) Boolean functions representable in terms of And+Xor rule arrays, they’re also learnable in this form, just by adaptive evolution with single-point mutations.
In what we did above, we were looking at how machine learning works with our rule arrays in specific cases, like for the function f[x] we discussed above. But now we’ve got a case where we can explicitly enumerate all possible functions, at least of a given class. And in a sense what we’re seeing is evidence that machine learning tends to be very broad—and capable at least in principle of learning pretty much any function.
Of course, there can be specific restrictions. For example, the And+Xor rule arrays we’re using here can’t represent (“odd”) functions that give output 1 when all their inputs are 0 (since both And and Xor give 0 on all-zero inputs). (The Nand+First rule arrays we discussed above nevertheless can.) But in general it seems to be a reflection of the Principle of Computational Equivalence that pretty much any setup is capable of representing any function—and also adaptively “learning” it.
By the way, it’s a lot easier to discuss questions about representing or learning “any function” when one’s dealing with discrete (countable) functions—because one can expect to either be able to “exactly get” a given function, or not. But for continuous functions, it’s more complicated, because one’s pretty much inevitably dealing with approximations (unless one can use symbolic forms, which are basically discrete). So, for example, while we can say (as we did above) that (ReLU) neural nets can represent any piecewise-linear function, in general we’ll only be able to imagine successively approaching an arbitrary function, much like when you progressively add more terms in a simple Fourier series:
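For instance, here is a small numerical sketch (in Python) of that kind of progressive Fourier approximation, using a square wave as an arbitrary stand-in for the target function:

import numpy as np

xs = np.linspace(0.0, 2.0 * np.pi, 512, endpoint=False)
target = np.sign(np.sin(xs))          # an arbitrary target function, for illustration only

def fourier_partial_sum(values, xs, n_terms):
    # approximate the Fourier coefficients by averaging over the uniform grid,
    # then sum the first n_terms harmonics
    approx = np.full_like(xs, values.mean())
    for k in range(1, n_terms + 1):
        a = 2.0 * np.mean(values * np.cos(k * xs))
        b = 2.0 * np.mean(values * np.sin(k * xs))
        approx += a * np.cos(k * xs) + b * np.sin(k * xs)
    return approx

for n in (1, 3, 10, 30):
    err = np.mean(np.abs(fourier_partial_sum(target, xs, n) - target))
    print(n, "terms -> mean abs error", round(float(err), 3))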
Looking back at our results for discrete rule arrays, one notable observation is that while we can successfully reproduce all these different Boolean functions, the actual rule array configurations that achieve this tend to look quite messy. And indeed it’s much the same as we’ve seen throughout: machine learning can find solutions, but they’re not “structured solutions”; they’re in effect just solutions that “happen to work”.
Are there more structured ways of representing Boolean functions with rule arrays? Here are the two possible minimum-size And+Xor rule arrays that represent rule 30:
At the next-larger size there are more possibilities for rule 30:
And there are also rule arrays that can represent rule 110:
But in none of these cases is there obvious structure that allows us to immediately see how these computations work, or what function is being computed. But what if we try to explicitly construct—effectively by standard engineering methods—a rule array that computes a particular function? We can start by taking something like the function for rule 30 and writing it in terms of And and Xor (i.e. in ANF, or “algebraic normal form”):
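For reference, the algebraic normal form of any Boolean function can be read off from its truth table with a Möbius (XOR) transform. Here is a small Python sketch that does this; writing the three cell values as p, q, r it recovers Xor[p, q, r, And[q, r]] for rule 30, and (as used below) Xor[q, r, And[q, r], And[p, q, r]] for rule 110:

def anf_coefficients(truth_table):
    # Möbius (XOR) transform: entry m ends up 1 iff the AND-monomial with variable mask m appears in the ANF
    coeffs = list(truth_table)
    n = len(coeffs).bit_length() - 1
    for i in range(n):
        for m in range(len(coeffs)):
            if m & (1 << i):
                coeffs[m] ^= coeffs[m ^ (1 << i)]
    return coeffs

# truth tables indexed by v = 4p + 2q + r, using the standard rule-number convention
rule30  = [(30  >> v) & 1 for v in range(8)]
rule110 = [(110 >> v) & 1 for v in range(8)]
print(anf_coefficients(rule30))    # -> [0, 1, 1, 1, 1, 0, 0, 0]: terms r, q, q.r, p, i.e. Xor[p, q, r, And[q, r]]
print(anf_coefficients(rule110))   # -> [0, 1, 1, 1, 0, 0, 0, 1]: terms r, q, q.r, p.q.r, i.e. Xor[q, r, And[q, r], And[p, q, r]]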
We can imagine implementing this using an “evaluation graph”:
But now it’s easy to turn this into a rule array (and, yes, we haven’t gone all the way and arranged to copy inputs, etc.):
“Evaluating” this rule array for different inputs, we can see that it indeed gives rule 30:
Doing the same thing for rule 110, the And+Xor expression is
the evaluation graph is
and the rule array is:
And at least with the evaluation graph as a guide, we can readily “see what’s happening” here. But the rule array we’re using is considerably larger than our minimal solutions above—or even than the solutions we found by adaptive evolution.
It’s a typical situation that one sees in many other kinds of systems (like for example sorting networks): it’s possible to have a “constructed solution” that has clear structure and regularity and is “understandable”. But minimal solutions—or ones found by adaptive evolution—tend to be much smaller. But they almost always look in many ways random, and aren’t readily understandable or interpretable.
So far, we’ve been looking at rule arrays that compute specific functions. But in getting a sense of what rule arrays can do, we can consider rule arrays that are “programmable”, in that their input specifies what function they should compute. So here, for example, is an And+Xor rule array—found by adaptive evolution—that takes the “bit pattern” of any (even) Boolean function as input on the left, then applies that Boolean function to the inputs on the right:
And with this same rule array we can now compute any possible (even) Boolean function. So here, for example, it’s evaluating Or:
Other Kinds of Models and Setups
Our general goal here has been to set up models that capture the most essential features of neural nets and machine learning—but that are simple enough in their structure that we can readily “look inside” and get a sense of what they are doing. Mostly we’ve concentrated on rule arrays as a way to provide a minimal analog of standard “perceptron-style” feed-forward neural nets. But what about other architectures and setups?
In effect, our rule arrays are “spacetime-inhomogeneous” generalizations of cellular automata—in which adaptive evolution determines which rule (say from a finite set) should be used at every (spatial) position and every (time) step. A different idealization (that in fact we already used in one section above) is to have an ordinary homogeneous cellular automaton—but with a single “global rule” determined by adaptive evolution. Rule arrays are the analog of feed-forward networks in which a given rule in the rule array is in effect used only once as data “flows through” the system. Ordinary homogeneous cellular automata are like recurrent networks in which a single stream of data is in effect subjected over and over again to the same rule.
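As a minimal Python sketch of the distinction (with an arbitrary width and cyclic boundary conditions assumed), the only difference is whether one rule number is reused everywhere, or a separate rule number is looked up for each position and step:

import random

def step(cells, rule_at):
    # rule_at(i) gives the rule number to apply at position i; cyclic boundary conditions
    n = len(cells)
    return [(rule_at(i) >> (4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n])) & 1
            for i in range(n)]

# ordinary (homogeneous) cellular automaton: the same rule, here rule 30, at every position and step
state = [0, 0, 0, 1, 0, 0, 0]
for _ in range(3):
    state = step(state, lambda i: 30)

# rule array: each position and step gets its own rule, here chosen at random from {4, 146}
random.seed(0)
rule_rows = [[random.choice([4, 146]) for _ in range(7)] for _ in range(3)]
state = [0, 0, 0, 1, 0, 0, 0]
for row in rule_rows:
    state = step(state, lambda i, row=row: row[i])
print(state)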
There are various interpolations between these cases. For example, we can imagine a “layered rule array” in which the rules at different steps can be different, but those on a given step are all the same. Such a system can be viewed as an idealization of a convolutional neural net in which a given layer applies the same kernel to elements at all positions, but different layers can apply different kernels.
A layered rule array can’t encode as much information as a general rule array. But it’s still able to show machine-learning-style phenomena. And here, for example, is adaptive evolution for a layered And+Xor rule array progressively solving the problem of generating a pattern that lives for exactly 30 steps:
One could also imagine “vertically layered” rule arrays, in which different rules are used at different positions, but any given position keeps running the same rule forever. However, at least for the kinds of problems we’ve considered here, it doesn’t seem sufficient to just be able to pick the positions at which different rules are run. One seems to either need to change rules at different (time) steps, or one needs to be able to adaptively evolve the underlying rules themselves.
Rule arrays and ordinary cellular automata share the feature that the value of each cell depends only on the values of neighboring cells on the step before. But in neural nets it’s standard for the value at a given node to depend on the values of lots of nodes on the layer before. And what makes this straightforward in neural nets is that (weighted, and perhaps otherwise transformed) values from previous nodes are taken to be combined just by simple numerical addition—and addition (being n-ary and associative) can take any number of “inputs”. In a cellular automaton (or Boolean function), however, there’s always a definite number of inputs, determined by the structure of the function. In the most straightforward case, the inputs come only from nearest-neighboring cells. But there’s no requirement that this is how things need to work—and for example we can pick any “local template” to bring in the inputs for our function. This template could either be the same at every position and every step, or it could be picked from a certain set differently at different positions—in effect giving us “template arrays” as well as rule arrays.
So what about having a fully connected network, as we did in our very first neural net examples above? To set up a discrete analog of this we first need some kind of discrete n-ary associative “accumulator” function to fill the place of numerical addition. And for this we could pick a function like And, Or, Xor—or Majority. And if we’re not just going to end up with the same value at each node on a given layer, we need to set up some analog of a weight associated with each connection—which we can achieve by applying either Identity or Not (i.e. flip or not) to the value flowing through each connection.
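Here is a minimal Python sketch of that kind of discrete fully connected network, using Majority as the accumulator and a flip / no-flip choice on every connection. The layer sizes and the random connection choices are placeholders, not the trained network shown next:

import random

def majority(bits):
    # value 1 if the majority of inputs are 1 (ties, with an even number of inputs, go to 0 here)
    return 1 if 2 * sum(bits) > len(bits) else 0

def layer(values, flips):
    # flips[j][i] = 1 means the connection from input i to node j applies Not; 0 means Identity
    return [majority([v ^ f for v, f in zip(values, row)]) for row in flips]

random.seed(0)
sizes = [8, 6, 6, 8]                                   # made-up layer widths
flips = [[[random.randint(0, 1) for _ in range(sizes[k])]
          for _ in range(sizes[k + 1])] for k in range(len(sizes) - 1)]

x = [0, 0, 1, 0, 0, 0, 0, 0]                           # a "one-hot" input, as in the text
for flip_layer in flips:
    x = layer(x, flip_layer)
print(x)                                               # the (untrained) network's output vector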
Here’s an example of a network of this type, trained to compute the function we discussed above:
There are just two kinds of connections here: ones that flip the value and ones that don’t. And at each node we’re computing the majority function—giving value 1 if the majority of its inputs are 1, and 0 otherwise. With the “one-hot encoding” of input and output that we used before, here are a few examples of how this network evaluates our function:
This was trained just using 1000 steps of single-point mutation applied to the connection types. The loss systematically goes down—but the configuration of the connection types continues to look quite random even as it achieves zero loss (i.e. even after the function has been completely learned):
In what we’ve just done we assume that all connections continue to be present, though their types (or effectively signs) can change. But we can also consider a network where connections can end up being zeroed out during training—so that they are effectively no longer present.
Much of what we’ve done here with machine learning has centered around trying to learn transformations of the form x → f[x]. But another typical application of machine learning is autoencoding—or in effect learning how to compress data representing a certain set of examples. And once again it’s possible to do such a task using rule arrays, with learning achieved by a series of single-point mutations.
As a starting point, consider training a rule array (of cellular automaton rules 4 and 146) to reproduce unchanged a block of black cells of any width. One might have thought this would be trivial. But it’s not, because in effect the initial data inevitably gets “ground up” inside the rule array, and has to be reconstituted at the end. But, yes, it’s nevertheless possible to train a rule array to at least roughly do this—even though once again the rule arrays we find that manage to do this look quite random:
But to set up a nontrivial autoencoder let’s imagine that we progressively “squeeze” the array in the middle, creating an increasingly narrow “bottleneck” through which the data has to flow. At the bottleneck we effectively have a compressed version of the original data. And we find that at least down to some width of bottleneck, it’s possible to create rule arrays that—with reasonable probability—can act as successful autoencoders of the original data:
The success of LLMs has highlighted the use of machine learning for sequence continuation—and the effectiveness of transformers for this. But just as with other neural nets, the forms of transformers that are used in practice are typically very complicated. But can one find a minimal model that nevertheless captures the “essence of transformers”?
Let’s say that we have a sequence that we want to continue, like:
We want to encode each possible value by a vector, as in
so that, for example, our original sequence is encoded as:
Then we have a “head” that reads a block of consecutive vectors, picking off certain values and feeding pairs of them into And and Xor functions, to get a vector of Boolean values:
Ultimately this head is going to “slide” along our sequence, “predicting” what the next element in the sequence will be. But somehow we have to go from our vector of Boolean values to (probabilities of) sequence elements. Potentially we might be able to do this just with a rule array. But for our purposes here we’ll use a fully connected single-layer Identity+Not network in which at each output node we just find the sum of the number of values that come to it—and treat this as determining (through a softmax) the probability of the corresponding element:
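Putting these pieces together, here is a minimal Python sketch of the forward pass of such a “transformer”. All the sizes, the encodings and the head’s And/Xor wiring are random placeholders here; in the text they are what gets adaptively trained:

import math
import random

random.seed(1)
n_values, enc_len, window, head_size = 8, 4, 5, 12        # made-up sizes

# a Boolean-vector encoding for each of the possible sequence values
encodings = [[random.randint(0, 1) for _ in range(enc_len)] for _ in range(n_values)]

# the "head": pairs of positions in the flattened window, each combined with And or Xor
flat_len = window * enc_len
head = [(random.randrange(flat_len), random.randrange(flat_len), random.choice(["and", "xor"]))
        for _ in range(head_size)]

# output layer: an Identity/Not flag per head output, for each possible next value
out_flips = [[random.randint(0, 1) for _ in range(head_size)] for _ in range(n_values)]

def next_value_probabilities(seq_window):
    bits = [b for v in seq_window for b in encodings[v]]                       # encode and flatten the window
    h = [(bits[a] & bits[b]) if op == "and" else (bits[a] ^ bits[b])           # the head's And/Xor pairs
         for a, b, op in head]
    scores = [sum(x ^ flip for x, flip in zip(h, flips)) for flips in out_flips]   # Identity/Not, then sum
    exps = [math.exp(s) for s in scores]
    return [e / sum(exps) for e in exps]                                       # softmax

probs = next_value_probabilities([3, 5, 6, 5, 3])        # a hypothetical context window
print(max(range(n_values), key=lambda v: probs[v]))      # the "zero temperature" prediction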
In this case, the element with the maximum value is 5, so at “zero temperature” this would be our “best prediction” for the next element.
To train this whole system we just make a sequence of random point mutations to everything, keeping mutations that don’t increase the loss (where the loss is basically the difference between predicted next values and actual next values, or, more precisely, the “categorical cross-entropy”). Here’s how this loss progresses in a typical such training:
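The loss used here can be written in a few lines of Python (a standalone sketch; the probability lists would come from the forward pass above, and training just makes random point mutations to the encodings, head and output flips, keeping any mutation that does not increase this loss):

import math

def categorical_cross_entropy(probability_lists, actual_next_values):
    # mean negative log-probability that the model assigned to the value that actually came next
    total = -sum(math.log(probs[v]) for probs, v in zip(probability_lists, actual_next_values))
    return total / len(actual_next_values)

print(categorical_cross_entropy([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], [0, 1]))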
At the end of this training, here are the components of our minimal transformer:
First come the encodings of the different possible elements in the sequence. Then there’s the head, here shown applied to the encoding of the first elements of the original sequence. Finally there’s a single-layer discrete network that takes the output from the head, and deduces relative probabilities for different elements to come next. In this case the highest-probability prediction for the next element is that it should be element 6.
To do the analog of an LLM we start from some initial “prompt”, i.e. an initial sequence that fits within the width (“context window”) of the head. Then we progressively apply our minimal transformer, for example at each step taking the next element to be the one with the highest predicted probability (i.e. operating “at zero temperature”). With this setup the collection of “prediction strengths” is shown in gray, with the “best prediction” shown in red:
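The generation loop itself is then just a few lines (a sketch; the prediction function would be something like next_value_probabilities from the earlier snippet, and the dummy predictor here is only to make the example runnable):

def continue_sequence(seq, predict, window, n_new):
    # greedy ("zero temperature") continuation: always append the highest-probability next value
    seq = list(seq)
    for _ in range(n_new):
        probs = predict(seq[-window:])
        seq.append(max(range(len(probs)), key=lambda v: probs[v]))
    return seq

# usage with a dummy constant predictor over 8 possible values (always predicts value 5)
print(continue_sequence([3, 5, 6], lambda w: [0.0] * 5 + [1.0] + [0.0] * 2, window=3, n_new=4))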
Running this even far beyond our original training data, we see that we get a “prediction” of a continued sine wave:
As we might expect, the fact that our minimal transformer can make such a plausible prediction relies on the simplicity of our sine curve. If we use “more complicated” training data, such as the “mathematically defined” blue curve in
the result of training and running a minimal transformer is now:
And, not surprisingly, it can’t “figure out the computation” to correctly continue the curve. By the way, different training runs will involve different sequences of mutations, and will yield different predictions (often with periodic “hallucinations”):
In looking at “perceptron-style” neural nets we wound up using rule arrays—or, in effect, spacetime-inhomogeneous cellular automata—as our minimal models. Here we’ve ended up with a slightly more complicated minimal model for transformer neural nets. But if we were to simplify it further, we would end up not with something like a cellular automaton but instead with something like a tag system, in which one has a sequence of elements, and at each step removes a block from the beginning, and—depending on its form—adds a certain block at the end, as in:
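For reference, such a tag system is only a few lines of code (here in Python, with a made-up rule set purely for illustration):

def tag_system_step(sequence, deletion_number, rules):
    # read the first element, delete `deletion_number` elements from the front,
    # and append the block that the rules associate with the element that was read
    return sequence[deletion_number:] + rules[sequence[0]]

rules = {"a": list("cb"), "b": list("a"), "c": list("aaa")}   # made-up rules
seq = list("abaa")
for _ in range(12):
    if len(seq) < 2:                     # halt if the sequence gets too short
        break
    seq = tag_system_step(seq, 2, rules)
    print("".join(seq))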
And, yes, such systems can generate extremely complex behavior—reinforcing the idea (that we have repeatedly seen here) that machine learning works by selecting complexity that aligns with goals that have been set.
And along these lines, one can consider all sorts of different computational systems as foundations for machine learning. Here we’ve been looking at cellular-automaton-like and tag-system-like examples. But for example our Physics Project has shown us the power and flexibility of systems based on hypergraph rewriting. And from what we’ve seen here, it seems very plausible that something like hypergraph rewriting can serve as a yet more powerful and flexible substrate for machine learning.
So in the End, What’s Really Going On in Machine Learning?
There are, I think, several quite striking conclusions from what we’ve been able to do here. The first is just that models much simpler than traditional neural nets seem capable of capturing the essential features of machine learning—and indeed these models may well be the basis for a new generation of practical machine learning.
But from a scientific point of view, one of the things that’s important about these models is that they are simple enough in structure that it’s immediately possible to produce visualizations of what they’re doing inside. And studying these visualizations, the most immediately striking feature is how complicated they look.
It could have been that machine learning would somehow “crack systems”, and find simple representations for what they do. But that doesn’t seem to be what’s going on at all. Instead what seems to be happening is that machine learning is in a sense just “hitching a ride” on the general richness of the computational universe. It’s not “specifically building up behavior one needs”; rather what it’s doing is to harness behavior that’s “already out there” in the computational universe.
The fact that this could possibly work relies on the crucial—and at first unexpected—fact that in the computational universe even very simple programs can ubiquitously produce all sorts of complex behavior. And the point then is that this behavior has enough richness and diversity that it’s possible to find instances of it that align with machine learning objectives one’s defined. In some sense what machine learning is doing is to “mine” the computational universe for programs that do what one wants.
It’s not that machine learning nails a specific precise program. Rather, it’s that in typical successful applications of machine learning there are lots of programs that “do more or less the right thing”. If what one’s trying to do involves something computationally irreducible, machine learning won’t typically be able to “get well enough aligned” to correctly “get through all the steps” of the irreducible computation. But it seems that many “human-like tasks” that are the particular focus of modern machine learning can successfully be done.
And by the way, one can expect that with the minimal models explored here, it becomes more feasible to get a real characterization of what kinds of objectives can successfully be achieved by machine learning, and what cannot. Critical to the operation of machine learning is not only that there exist programs that can do particular kinds of things, but also that they can realistically be found by adaptive evolution processes.
In what we’ve done here we’ve often used what’s essentially the very simplest possible process for adaptive evolution: a sequence of point mutations. And what we’ve discovered is that even this is usually sufficient to lead us to satisfactory machine learning solutions. It could be that our paths of adaptive evolution would always be getting stuck—and not reaching any solution. But the fact that this doesn’t happen seems crucially connected to the computational irreducibility that’s ubiquitous in the systems we’re studying, and that leads to effective randomness that with overwhelming probability will “give us a way out” of anywhere we got stuck.
In some sense computational irreducibility “levels the playing field” for different processes of adaptive evolution, and lets even simple ones be successful. Something similar seems to happen for the whole framework we’re using. Any of a wide class of systems seem capable of successful machine learning, even if they don’t have the detailed structure of traditional neural nets. We can see this as a typical reflection of the Principle of Computational Equivalence: that even though systems may differ in their details, they are ultimately all equivalent in the computations they can do.
The phenomenon of computational irreducibility leads to a fundamental tradeoff, of particular importance in thinking about things like AI. If we want to be able to know in advance—and broadly guarantee—what a system is going to do or be able to do, we have to set the system up to be computationally reducible. But if we want the system to be able to make the richest use of computation, it’ll inevitably be capable of computationally irreducible behavior. And it’s the same story with machine learning. If we want machine learning to be able to do the best it can, and perhaps give us the impression of “achieving magic”, then we have to allow it to show computational irreducibility. And if we want machine learning to be “understandable” it has to be computationally reducible, and not able to access the full power of computation.
At the outset, though, it’s not obvious whether machine learning actually has to access such power. It could be that there are computationally reducible ways to solve the kinds of problems we want to use machine learning to solve. But what we’ve discovered here is that even in solving very simple problems, the adaptive evolution process that’s at the heart of machine learning will end up sampling—and using—what we can expect to be computationally irreducible processes.
Like biological evolution, machine learning is fundamentally about finding things that work—without the constraint of “understandability” that’s forced on us when we as humans explicitly engineer things step by step. Could one imagine constraining machine learning to make things understandable? To do so would effectively prevent machine learning from having access to the power of computationally irreducible processes, and from the evidence here it seems unlikely that with this constraint the kind of successes we’ve seen in machine learning would be possible.
So what does this mean for the “science of machine learning”? One might have hoped that one would be able to “look inside” machine learning systems and get detailed narrative explanations for what’s going on; that in effect one would be able to “explain the mechanism” for everything. But what we’ve seen here suggests that in general nothing like this will work. All one will be able to say is that somewhere out there in the computational universe there’s some (typically computationally irreducible) process that “happens” to be aligned with what we want.
Yes, we can make general statements—strongly based on computational irreducibility—about things like the findability of such processes, say by adaptive evolution. But if we ask “How in detail does the system work?”, there won’t be much of an answer to that. Of course we can trace all its computational steps and see that it behaves in a certain way. But we can’t expect what amounts to a “global human-level explanation” of what it’s doing. Rather, we’ll basically just be reduced to looking at some computationally irreducible process and observing that it “happens to work”—and we won’t have a high-level explanation of “why”.
But there is one important loophole to all this. Within any computationally irreducible system, there are always inevitably pockets of computational reducibility. And—as I’ve discussed at length particularly in connection with our Physics Project—it’s these pockets of computational reducibility that allow computationally bounded observers like us to identify things like “laws of nature” from which we can build “human-level narratives”.
So what about machine learning? What pockets of computational reducibility show up there, from which we might build “human-level scientific laws”? Much as with the emergence of “simple continuum behavior” from computationally irreducible processes happening at the level of molecules in a gas or ultimate discrete elements of space, we can expect that at least certain computationally reducible features will be more obvious when one’s dealing with larger numbers of components. And indeed in sufficiently large machine learning systems, it’s routine to see smooth curves and apparent regularity when one’s looking at the kind of aggregated behavior that’s probed by things like training curves.
But the question about pockets of reducibility is always whether they end up being aligned with things we consider interesting or useful. Yes, it could be that machine learning systems would exhibit some kind of collective (“EEG-like”) behavior. But what’s not clear is whether this behavior will tell us anything about the actual “information processing” (or whatever) that’s going on in the system. And if there is to be a “science of machine learning” what we have to hope for is that we can find in machine learning systems pockets of computational reducibility that are aligned with things we can measure, and care about.
So given what we’ve been able to explore here about the foundations of machine learning, what can we say about the ultimate power of machine learning systems? A key observation has been that machine learning works by “piggybacking” on computational irreducibility—and in effect by finding “natural pieces of computational irreducibility” that happen to fit with the objectives one has. But what if those objectives involve computational irreducibility—as they often do when one’s dealing with a process that’s been successfully formalized in computational terms (as in math, exact science, computational X, etc.)? Well, it’s not enough that our machine learning system “uses some piece of computational irreducibility inside”. To achieve a particular computationally irreducible objective, the system would have to do something closely aligned with that actual, specific objective.
It has to be said, however, that by laying bare more of the essence of machine learning here, it becomes easier to at least define the issues of merging typical “formal computation” with machine learning. Traditionally there’s been a tradeoff between the computational power of a system and its trainability. And indeed in terms of what we’ve seen here this seems to reflect the sense that “larger chunks of computational irreducibility” are more difficult to fit into something one’s incrementally building up by a process of adaptive evolution.
So how should we ultimately think of machine learning? In effect its power comes from leveraging the “natural resource” of computational irreducibility. But when it uses computational irreducibility it does so by “foraging” pieces that happen to advance its objectives. Imagine one’s building a wall. One possibility is to fashion bricks of a particular shape that one knows will fit together. But another is just to look at stones one sees lying around, then to build the wall by fitting these together as best one can.
And if one then asks “Why does the wall have such-and-such a pattern?” the answer will end up being basically “Because that’s what one gets from the stones that happened to be lying around”. There’s no overarching theory to it in itself; it’s just a reflection of the resources that were out there. Or, in the case of machine learning, one can expect that what one sees will be to a large extent a reflection of the raw characteristics of computational irreducibility. In other words, the foundations of machine learning are as much as anything rooted in the science of ruliology. And it’s in large measure to that science we should look in our efforts to understand more about “what’s really going on” in machine learning, and quite possibly also in neuroscience.
Historical & Personal Notes
In some ways it seems like a quirk of intellectual history that the kinds of foundational questions I’ve been discussing here weren’t already addressed long ago—and in some ways it seems like an inexorable consequence of the only rather recent development of certain intuitions and tools.
The idea that the brain is fundamentally made of connected nerve cells was considered in the latter part of the nineteenth century, and took hold in the first decades of the twentieth century—with the formalized concept of a neural net that operates in a computational way emerging in full form in the work of Warren McCulloch and Walter Pitts in 1943. By the late 1950s there were hardware implementations of neural nets (typically for image processing) in the form of “perceptrons”. But despite early enthusiasm, practical results were mixed, and at the end of the 1960s it was announced that simple cases amenable to mathematical analysis had been “solved”—leading to a general belief that “neural nets couldn’t do anything interesting”.
Ever since the 1940s there had been a trickle of general analyses of neural nets, particularly using methods from physics. But typically these analyses ended up with things like continuum approximations—that could say little about the information-processing aspects of neural nets. Meanwhile, there was an ongoing undercurrent of belief that somehow neural networks would both explain and reproduce how the brain works—but no methods seemed to exist to say quite how. Then at the beginning of the 1980s there was a resurgence of interest in neural networks, coming from several directions. Some of what was done concentrated on very practical efforts to get neural nets to do particular “human-like” tasks. But some was more theoretical, typically using methods from statistical physics or dynamical systems.
Before long, however, the buzz died down, and for several decades only a few groups were left working with neural nets. Then in 2011 came a surprise breakthrough in using neural nets for image analysis. It was an important practical advance. But it was driven by technological ideas and development—not any significant new theoretical analysis or framework.
And this was also the pattern for almost all of what followed. People spent great effort to come up with neural net systems that worked—and all sorts of folklore grew up about how this should best be done. But there wasn’t really even an attempt at an underlying theory; this was a domain of engineering practice, not basic science.
And it was in this tradition that ChatGPT burst onto the scene in late 2022. Almost everything about LLMs seemed to be complicated. Yes, there were empirically some large-scale regularities (like scaling laws). And I quickly suspected that the success of LLMs was a strong hint of general regularities in human language that hadn’t been clearly identified before. But beyond a few outlier examples, almost nothing about “what’s going on inside LLMs” has seemed easy to decode. And efforts to put “strong guardrails” on the operation of the system—in effect so as to make it in some way “predictable” or “understandable”—typically seem to substantially decrease its power (a point that now makes sense in the context of computational irreducibility).
My own interaction with machine learning and neural nets began in 1980 when I was developing my SMP symbolic computation system, and wondering whether it might be possible to generalize the symbolic pattern-matching foundations of the system to some kind of “fuzzy pattern matching” that would be closer to human thinking. I was aware of neural nets but thought of them as semi-realistic models of brains, not for example as potential sources of algorithms of the kind I imagined might “solve” fuzzy matching.
And it was partly as a result of trying to understand the essence of systems like neural nets that in 1981 I came up with what I later learned could be thought of as one-dimensional cellular automata. Soon I was deeply involved in studying cellular automata and developing a new intuition about how complex behavior could arise even from simple rules. But when I learned about recent efforts to make idealized models of neural nets using ideas from statistical mechanics, I was at least curious enough to set up simulations to try to understand more about these models.
But what I did wasn’t a success. I could neither get the models to do anything of significant practical interest—nor did I manage to derive any good theoretical understanding of them. I kept wondering, though, what relationship there might be between cellular automata that “just run”, and systems like neural nets that can also “learn”. And in fact in 1985 I tried to make a minimal cellular-automaton-based model to explore this. It was what I’m now calling a “vertically layered rule array”. And while in many ways I was already asking the right questions, this was an unfortunate specific choice of system—and my experiments on it didn’t reveal the kinds of phenomena we’re now seeing.
Years went by. I wrote a section on “Human Thinking” in A New Kind of Science, which discussed the possibility of simple foundational rules for the essence of thinking, and even included a minimal discrete analog of a neural net. At the time, though, I didn’t develop these ideas. By 2017, 15 years after the book was published—and knowing about the breakthroughs in deep learning—I had begun to think more concretely about neural nets as getting their power by sampling programs from across the computational universe. But still I didn’t see quite how this would work.
Meanwhile, there was a new intuition emerging from practical experience with machine learning: that if you “bashed” almost any system “hard enough”, it would learn. Did that mean that perhaps one didn’t need all the details of neural networks to successfully do machine learning? And could one perhaps make a system whose structure was simple enough that its operation would for example be accessible to visualization? I particularly wondered about this when I was writing an exposition of ChatGPT and LLMs in early 2023. And I kept talking about “LLM science”, but didn’t have much of a chance to work on it.
But then, a few months ago, as part of an effort to understand the relation between what science does and what AI does, I tried a kind of “throwaway experiment”—which, to my considerable surprise, seemed to successfully capture some of the essence of what makes biological evolution possible. But what about other adaptive evolution—and in particular, machine learning? The models that seemed to be needed were embarrassingly close to what I’d studied in 1985. But now I had a new intuition—and, thanks to Wolfram Language, vastly better tools. And the result has been my effort here.
Of course this is only a beginning. But I’m excited to be able to see what I consider to be the beginnings of foundational science around machine learning. Already there are clear directions for practical applications (which, needless to say, I plan to explore). And there are signs that perhaps we may finally be able to understand just why—and when—the “magic” of machine learning works.
Thanks
Thanks to Richard Assar of the Wolfram Institute for extensive help. Thanks also to Brad Klee, Tianyi Gu, Nik Murzin and Max Niederman for specific results, to George Morgan and others at Symbolica for their early interest, and to Kovas Boguta for suggesting many years ago to link machine learning to the ideas in A New Kind of Science.