I. From GPT-4 to AGI: Counting the OOMs

AGI by 2027 is strikingly plausible. GPT-2 to GPT-4 took us from ~preschooler to ~smart high-schooler abilities in 4 years.

Tracing trendlines in compute (~0.5 orders of magnitude or OOMs/year), algorithmic efficiencies (~0.5 OOMs/year), and “unhobbling” gains (from chatbot to agent), we should expect another preschooler-to-high-schooler-sized qualitative jump by 2027.

 


Look. The models, they just want to learn. You have to understand this. The models, they just want to learn.

Ilya Sutskever (circa 2015, via Dario Amodei)

GPT-4’s capabilities came as a shock to many: an AI system that could write code and essays, could reason through difficult math problems, and ace college exams. A few years ago, most thought these were impenetrable walls.

But GPT-4 was merely the continuation of a decade of breakneck progress in deep learning.

A decade earlier, models could barely identify simple images of cats and dogs; four years earlier, GPT-2 could barely string together semi-plausible sentences.

Now we are rapidly saturating all the benchmarks we can come up with.

And yet this dramatic progress has merely been the result of consistent trends in scaling up deep learning. 

There have been people who have seen this coming for far longer. They were scoffed at, but all they did was trust the trendlines.

The trendlines are intense, and they were right. The models, they just want to learn; you scale them up, and they learn more.

I make the following claim: it is strikingly plausible that by 2027, models will be able to do the work of an AI researcher/engineer. That doesn’t require believing in sci-fi; it just requires believing in straight lines on a graph. 

Rough estimates of past and future scaleup of effective compute (both physical compute and algorithmic efficiencies), based on the public estimates discussed in this piece.

As we scale models, they consistently get smarter, and by “counting the OOMs” we get a rough sense of what model intelligence we should expect in the (near) future.

(This graph shows only the scaleup in base models; “unhobblings” are not pictured.)

In this piece, I will simply “count the OOMs” (OOM = order of magnitude, 10x = 1 order of magnitude): look at the trends in 1) compute, 2) algorithmic efficiencies (algorithmic progress that we can think of as growing “effective compute”), and 3) ”unhobbling” gains (fixing obvious ways in which models are hobbled by default, unlocking latent capabilities and giving them tools, leading to step-changes in usefulness).

We trace the growth in each over four years before GPT-4, and what we should expect in the four years after, through the end of 2027.

Given deep learning’s consistent improvements for every OOM of effective compute, we can use this to project future progress. 
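
To make that projection concrete, here is a minimal sketch of the arithmetic; the per-year rates and the four-year horizon are the trendline figures discussed in this piece, and the helper name is mine:

```python
def effective_compute_ooms(years, compute_ooms_per_year=0.5, algo_ooms_per_year=0.5):
    """Total OOMs of effective compute gained over a horizon, ignoring unhobbling."""
    return years * (compute_ooms_per_year + algo_ooms_per_year)

ooms = effective_compute_ooms(4)  # GPT-4 (2023) through the end of 2027
print(f"{ooms:.1f} OOMs ~ {10 ** ooms:,.0f}x effective compute")
# -> 4.0 OOMs ~ 10,000x on the bare per-year trendlines; the fuller estimates later
#    in this piece (2-3 OOMs of compute plus 1-3 OOMs of algorithmic gains) give
#    3-6 OOMs, with a best guess of ~5 OOMs (~100,000x), before counting unhobbling.
```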

Publicly, things have been quiet for a year since the GPT-4 release, as the next generation of models has been in the oven—leading some to proclaim stagnation and that deep learning is hitting a wall. But by counting the OOMs, we get a peek at what we should actually expect.

The upshot is pretty simple.

GPT-2 to GPT-4—from models that were impressive for sometimes managing to string together a few coherent sentences, to models that ace high-school exams—was not a one-time gain.

We are racing through the OOMs extremely rapidly, and the numbers indicate we should expect another ~100,000x effective compute scaleup—resulting in another GPT-2-to-GPT-4-sized qualitative jump—over four years.

Moreover, and critically, that doesn’t just mean a better chatbot; picking the many obvious low-hanging fruit on “unhobbling” gains should take us from chatbots to agents, from a tool to something that looks more like drop-in remote worker replacements.

 

While the inference is simple, the implication is striking.

Another jump like that very well could take us to AGI, to models as smart as PhDs or experts that can work beside us as coworkers.

Perhaps most importantly, if these AI systems could automate AI research itself, that would set in motion intense feedback loops—the topic of the next piece in the series.

Even now, barely anyone is pricing all this in. But situational awareness on AI isn’t actually that hard, once you step back and look at the trends.

If you keep being surprised by AI capabilities, just start counting the OOMs.

The last four years

We have machines now that we can basically talk to like humans. It’s a remarkable testament to the human capacity to adjust that this seems normal, that we’ve become inured to the pace of progress.

But it’s worth stepping back and looking at the progress of just the last few years.

GPT-2 to GPT-4

Let me remind you of how far we came in just the ~4 (!) years leading up to GPT-4. 

GPT-2 (2019) ~ preschooler: “Wow, it can string together a few plausible sentences.” A very-cherry-picked example of a semi-coherent story about unicorns in the Andes it generated was incredibly impressive at the time. And yet GPT-2 could barely count to 5 without getting tripped up; when summarizing an article, it just barely outperformed selecting 3 random sentences from the article.

Some examples of what people found impressive about GPT-2 at the time. Left: GPT-2 does an ok job on extremely basic reading comprehension questions.

Right: In a cherry-picked sample (best of 10 tries), GPT-2 can write a semi-coherent paragraph that says some semi-relevant things about the Civil War.

 

Comparing AI capabilities with human intelligence is difficult and flawed, but I think it’s informative to consider the analogy here, even if it’s highly imperfect.

GPT-2 was shocking for its command of language, and its ability to occasionally generate a semi-cohesive paragraph, or occasionally answer simple factual questions correctly.

It’s what would have been impressive for a preschooler. 

GPT-3 (2020) ~ elementary schooler: “Wow, with just some few-shot examples it can do some simple useful tasks.” It started being cohesive over even multiple paragraphs much more consistently, and could correct grammar and do some very basic arithmetic.

For the first time, it was also commercially useful in a few narrow ways: for example, GPT-3 could generate simple copy for SEO and marketing.

Some examples of what people found impressive about GPT-3 at the time. Top: After a simple instruction, GPT-3 can use a made-up word in a new sentence. Bottom-left: GPT-3 can engage in rich storytelling back-and-forth. Bottom-right: GPT-3 can generate some very simple code. 

Again, the comparison is imperfect, but what impressed people about GPT-3 is perhaps what would have been impressive for an elementary schooler: it wrote some basic poetry, could tell richer and coherent stories, could start to do rudimentary coding, could fairly reliably learn from simple instructions and demonstrations, and so on.

GPT-4 (2023) ~ smart high schooler: “Wow, it can write pretty sophisticated code and iteratively debug, it can write intelligently and sophisticatedly about complicated subjects, it can reason through difficult high-school competition math, it’s beating the vast majority of high schoolers on whatever tests we can give it, etc.” From code to math to Fermi estimates, it can think and reason.

GPT-4 is now useful in my daily tasks, from helping write code to revising drafts. 

Some of what people found impressive about GPT-4 when it was released, from the “Sparks of AGI” paper.

Top: It’s writing very complicated code (producing the plots shown in the middle) and can reason through nontrivial math problems. Bottom-left: Solving an AP math problem.

Bottom-right: Solving a fairly complex coding problem. More interesting excerpts from that exploration of GPT-4’s capabilities here.

On everything from AP exams to the SAT, GPT-4 scores better than the vast majority of high schoolers. 


Of course, even GPT-4 is still somewhat uneven; for some tasks it’s much better than smart high-schoolers, while there are other tasks it can’t yet do.

That said, I tend to think most of these limitations come down to obvious ways models are still hobbled, as I’ll discuss in-depth later.

The raw intelligence is (mostly) there, even if the models are still artificially constrained; it’ll take extra work to unlock models being able to fully apply that raw intelligence across applications.

 

Progress over just four years. Where are you on this line? 

The pace of deep learning progress in the last decade has simply been extraordinary.

A mere decade ago it was revolutionary for a deep learning system to identify simple images.

Today, we keep trying to come up with novel, ever harder tests, and yet each new benchmark is quickly cracked.

It used to take decades to crack widely-used benchmarks; now it feels like mere months.

Deep learning systems are rapidly reaching or exceeding human-level in many domains. Graphic: Our World in Data

We’re literally running out of benchmarks.  As an anecdote, my friends Dan and Collin made a benchmark called MMLU a few years ago, in 2020.

They hoped to finally make a benchmark that would stand the test of time, equivalent to all the hardest exams we give high school and college students.

Just three years later, it’s basically solved: models like GPT-4 and Gemini get ~90%. 

More broadly, GPT-4 mostly cracks all the standard high school and college aptitude tests. (And even the one year from GPT-3.5 to GPT-4 often took us from well below median human performance to the top of the human range.)

GPT-4 scores on standardized tests.

Note also the large jump from GPT-3.5 to GPT-4 in human percentile on these tests, often from well below the median human to the very top of the human range.

(And this is GPT-3.5, a fairly recent model released less than a year before GPT-4, not the clunky old elementary-school-level GPT-3 we were talking about earlier!)

Gray: Professional forecasts, made in August 2021, for June 2022 performance on the MATH benchmark (difficult mathematics problems from high-school math competitions).

Red star: actual state-of-the-art performance by June 2022, far exceeding even the upper range forecasters gave. The median ML researcher was even more pessimistic.

Or consider the MATH benchmark, a set of difficult mathematics problems from high-school math competitions. When the benchmark was released in 2021, the best models only got ~5% of problems right.

And the original paper noted: “Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue […].

To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community”—we would need fundamental new breakthroughs to solve MATH, or so they thought.

A survey of ML researchers predicted minimal progress over the coming years; and yet within just a year (by mid-2022), the best models went from ~5% to 50% accuracy; now, MATH is basically solved, with recent performance over 90%.

Over and over again, year after year, skeptics have claimed “deep learning won’t be able to do X” and have been quickly proven wrong. If there’s one lesson we’ve learned from the past decade of AI, it’s that you should never bet against deep learning.

Now the hardest unsolved benchmarks are tests like GPQA, a set of PhD-level biology, chemistry, and physics questions.

Many of the questions read like gibberish to me, and even PhDs in other scientific fields spending 30+ minutes with Google barely score above random chance.

Claude 3 Opus currently gets ~60%, compared to in-domain PhDs who get ~80%—and I expect this benchmark to fall as well, in the next generation or two.

Example GPQA questions. Models are already better at this than I am, and we’ll probably crack expert-PhD-level soon…

Counting the OOMs

How did this happen? The magic of deep learning is that it just works—and the trendlines have been astonishingly consistent, despite naysayers at every turn. 

The effects of scaling compute, in the example of OpenAI Sora.

With each OOM of effective compute, models predictably, reliably get better. If we can count the OOMs, we can (roughly, qualitatively) extrapolate capability improvements. That’s how a few prescient individuals saw GPT-4 coming.

We can decompose the progress in the four years from GPT-2 to GPT-4 into three categories of scaleups:

  1. Compute: We’re using much bigger computers to train these models.
  2. Algorithmic efficiencies: There’s a continuous trend of algorithmic progress. Many of these act as “compute multipliers,” and we can put them on a unified scale of growing effective compute.
  3. ”Unhobbling” gains: By default, models learn a lot of amazing raw capabilities, but they are hobbled in all sorts of dumb ways, limiting their practical value.

    With simple algorithmic improvements like reinforcement learning from human feedback (RLHF), chain-of-thought (CoT), tools, and scaffolding, we can unlock significant latent capabilities.

We can “count the OOMs” of improvement along these axes: that is, trace the scaleup for each in units of effective compute.

3x is 0.5 OOMs; 10x is 1 OOM; 30x is 1.5 OOMs; 100x is 2 OOMs; and so on.
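
Since everything here is denominated in OOMs, a minimal helper for converting between raw multipliers and OOMs (the function names are mine, purely for illustration):

```python
import math

def to_ooms(multiplier: float) -> float:
    """Orders of magnitude corresponding to a raw multiplier."""
    return math.log10(multiplier)

def to_multiplier(ooms: float) -> float:
    """Raw factor corresponding to a number of OOMs."""
    return 10 ** ooms

print(to_ooms(3))              # ~0.48, i.e. roughly 0.5 OOMs
print(to_ooms(100))            # 2.0 OOMs
print(to_multiplier(0.5 * 4))  # ~100x, e.g. 0.5 OOMs/year compounded over 4 years
```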

We can also look at what we should expect on top of GPT-4, from 2023 to 2027.

I’ll go through each one-by-one, but the upshot is clear: we are rapidly racing through the OOMs. There are potential headwinds in the data wall, which I’ll address—but overall, it seems likely that we should expect another GPT-2-to-GPT-4-sized jump, on top of GPT-4, by 2027.

Compute

I’ll start with the most commonly-discussed driver of recent progress: throwing (a lot) more compute at models. 

Many people assume that this is simply due to Moore’s Law. But even in the old days when Moore’s Law was in its heyday, it was comparatively glacial—perhaps 1-1.5 OOMs per decade.

We are seeing much more rapid scaleups in compute—close to 5x the speed of Moore’s law—instead because of mammoth investment.

(Spending even a million dollars on a single model used to be an outrageous thought nobody would entertain, and now that’s pocket change!)

Model           Estimated compute      Growth
GPT-2 (2019)    ~4e21 FLOP
GPT-3 (2020)    ~3e23 FLOP             + ~2 OOMs
GPT-4 (2023)    8e24 to 4e25 FLOP      + ~1.5–2 OOMs

Estimates of compute for GPT-2 to GPT-4 by Epoch AI
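
As a quick sanity check on the “Growth” column, here is the same arithmetic spelled out, using the Epoch AI FLOP estimates from the table:

```python
import math

flop = {
    "GPT-2 (2019)": 4e21,
    "GPT-3 (2020)": 3e23,
    "GPT-4 (2023) low": 8e24,
    "GPT-4 (2023) high": 4e25,
}

print(math.log10(flop["GPT-3 (2020)"] / flop["GPT-2 (2019)"]))       # ~1.9 OOMs
print(math.log10(flop["GPT-4 (2023) low"] / flop["GPT-3 (2020)"]))   # ~1.4 OOMs
print(math.log10(flop["GPT-4 (2023) high"] / flop["GPT-3 (2020)"]))  # ~2.1 OOMs
print(math.log10(flop["GPT-4 (2023) high"] / flop["GPT-2 (2019)"]))  # ~4.0 OOMs (~10,000x GPT-2, at the high end)
```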

We can use public estimates from Epoch AI (a source widely respected for its excellent analysis of AI trends) to trace the compute scaleup from 2019 to 2023.

GPT-2 to GPT-3 was a quick scaleup; there was a large overhang of compute, scaling from a smaller experiment to using an entire datacenter to train a large language model.

With the scaleup from GPT-3 to GPT-4, we transitioned to the modern regime: having to build an entirely new (much bigger) cluster for the next model.

And yet the dramatic growth continued. Overall, Epoch AI estimates suggest that GPT-4 training used ~3,000x-10,000x more raw compute than GPT-2.

In broad strokes, this is just the continuation of a longer-running trend.

For the last decade and a half, primarily because of broad scaleups in investment (and specializing chips for AI workloads in the form of GPUs and TPUs), the training compute used for frontier AI systems has grown at roughly ~0.5 OOMs/year.

Training compute of notable deep learning models over time. Source: Epoch AI

The compute scaleup from GPT-2 to GPT-3 in a year was an unusual overhang, but all the indications are that the longer-run trend will continue.

The SF-rumor-mill is abreast with dramatic tales of huge GPU orders. The investments involved will be extraordinary—but they are in motion.

I go into this more later in the series, in IIIa. Racing to the Trillion-Dollar Cluster; based on that analysis, an additional 2 OOMs of compute (a cluster in the $10s of billions) seems very likely to happen by the end of 2027; even a cluster closer to +3 OOMs of compute ($100 billion+) seems plausible (and is rumored to be in the works at Microsoft/OpenAI).

Algorithmic efficiencies

While massive investments in compute get all the attention, algorithmic progress is probably a similarly important driver of progress (and has been dramatically underrated).

To see just how big of a deal algorithmic progress can be, consider the following illustration of the drop in price to attain ~50% accuracy on the MATH benchmark (high school competition math) over just two years. (For comparison, a computer science PhD student who didn’t particularly like math scored 40%, so this is already quite good.) Inference efficiency improved by nearly 3 OOMs—1,000x—in less than two years.

Rough estimate on relative inference cost of attaining ~50% MATH performance.

Though these are numbers just for inference efficiency (which may or may not correspond to training efficiency improvements, where numbers are harder to infer from public data), they make clear there is an enormous amount of algorithmic progress possible and happening. 

In this piece, I’ll separate out two kinds of algorithmic progress.

Here, I’ll start by covering “within-paradigm” algorithmic improvements—those that simply result in better base models, and that straightforwardly act as compute efficiencies or compute multipliers. For example, a better algorithm might allow us to achieve the same performance but with 10x less training compute.

In turn, that would act as a 10x (1 OOM) increase in effective compute.

(Later, I’ll cover “unhobbling,” which you can think of as “paradigm-expanding/application-expanding” algorithmic progress that unlocks capabilities of base models.)

If we step back and look at the long-term trends, we seem to find new algorithmic improvements at a fairly consistent rate.

Individual discoveries seem random, and at every turn, there seem insurmountable obstacles—but the long-run trendline is predictable, a straight line on a graph.

Trust the trendline.

We have the best data for ImageNet (where algorithmic research has been mostly public and we have data stretching back a decade), for which we have consistently improved compute efficiency by roughly ~0.5 OOMs/year across the 9-year period between 2012 and 2021.

We can measure algorithmic progress: how much less compute is needed in 2021 compared to 2012 to train a model with the same performance?

We see a trend of ~0.5 OOMs/yr of algorithmic efficiency. Source: Erdil and Besiroglu 2022.

That’s a huge deal: that means 4 years later, we can achieve the same performance for ~100x less compute (and concomitantly, much higher performance for the same compute!).

Unfortunately, since labs don’t publish internal data on this, it’s harder to measure algorithmic progress for frontier LLMs over the last four years.

Epoch AI has new work replicating their ImageNet results for language modeling, and estimates a similar ~0.5 OOMs/year algorithmic efficiency trend in LLMs from 2012 to 2023.

(This has wider error bars though, and doesn’t capture some more recent gains, since the leading labs have stopped publishing their algorithmic efficiencies.)

Estimates by Epoch AI of algorithmic efficiencies in language modeling. Their estimates suggest we’ve made ~4 OOMs of efficiency gains in 8 years. 

More directly looking at the last 4 years, GPT-2 to GPT-3 was basically a simple scaleup (according to the paper), but there have been many publicly-known and publicly-inferable gains since GPT-3:

  • We can infer gains from API costs:
    • GPT-4, on release, cost ~the same as GPT-3 when it was released, despite the absolutely enormous performance increase. (If we do a naive and oversimplified back-of-the-envelope estimate based on scaling laws, this suggests that perhaps roughly half the effective compute increase from GPT-3 to GPT-4 came from algorithmic improvements.)
    • Since the GPT-4 release a year ago, OpenAI prices for GPT-4-level models have fallen another 6x/4x (input/output) with the release of GPT-4o.
    • Gemini 1.5 Flash, recently released, offers between “GPT-3.75-level” and GPT-4-level performance, while costing 85x/57x (input/output) less than the original GPT-4 (extraordinary gains!).
  • Chinchilla scaling laws give a 3x+ (0.5 OOMs+) efficiency gain.
  • Gemini 1.5 Pro claimed major compute efficiency gains (outperforming Gemini 1.0 Ultra, while using “significantly less” compute), with Mixture of Experts (MoE) as a highlighted architecture change. Other papers also claim a substantial multiple on compute from MoE.
  • There have been many tweaks and gains on architecture, data, training stack, etc., all the time.

Put together, public information suggests that the GPT-2 to GPT-4 jump included 1-2 OOMs of algorithmic efficiency gains.

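As a rough illustration of the flavor of these inferences, here is how the publicly quoted ratios above translate into OOMs (treating price or efficiency ratios as effective-compute ratios is, of course, a big simplification):

```python
import math

# GPT-4o vs. the original GPT-4 API pricing (input, output), cited above
print(math.log10(6), math.log10(4))    # ~0.8 and ~0.6 OOMs cheaper
# Gemini 1.5 Flash vs. the original GPT-4 (input, output), cited above
print(math.log10(85), math.log10(57))  # ~1.9 and ~1.8 OOMs cheaper
# Chinchilla scaling laws, cited above
print(math.log10(3))                   # ~0.5 OOMs
```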

Over the 4 years following GPT-4, we should expect the trend to continue: on average 0.5 OOMs/yr of compute efficiency, i.e. ~2 OOMs of gains compared to GPT-4 by 2027.

While compute efficiencies will become harder to find as we pick the low-hanging fruit, AI lab investments in money and talent to find new algorithmic improvements are growing rapidly. (The publicly-inferable inference cost efficiencies, at least, don’t seem to have slowed down at all.) On the high end, we could even see more fundamental, Transformer-like breakthroughs with even bigger gains.

Put together, this suggests we should expect something like 1-3 OOMs of algorithmic efficiency gains (compared to GPT-4) by the end of 2027, maybe with a best guess of ~2 OOMs.

The data wall

There is a potentially important source of variance for all of this: we’re running out of internet data.

That could mean that, very soon, the naive approach to pretraining larger language models on more scraped data could start hitting serious bottlenecks.

  

Frontier models are already trained on much of the internet. Llama 3, for example, was trained on over 15T tokens. Common Crawl, a dump of much of the internet used for LLM training, is >100T  tokens raw, though much of that is spam and duplication (e.g., a relatively simple deduplication leads to 30T tokens, implying Llama 3 would already be using basically all the data).

Moreover, for more specific domains like code, there are many fewer tokens still, e.g. public github repos are estimated to be in low trillions of tokens. 

You can go somewhat further by repeating data, but academic work on this suggests that repetition only gets you so far, finding that after 16 epochs (a 16-fold repetition), returns diminish extremely fast to nil.
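
To get a feel for why this bites, here is a minimal back-of-the-envelope sketch. It leans on the common Chinchilla-style rules of thumb (training FLOP ≈ 6 × params × tokens, compute-optimal tokens ≈ 20 × params); those rules of thumb are my assumption for illustration, not something this piece derives:

```python
import math

def chinchilla_optimal_tokens(train_flop: float) -> float:
    """Compute-optimal tokens under C ~= 6*N*D with D ~= 20*N (rough rule of thumb)."""
    params = math.sqrt(train_flop / 120)  # C = 6*N*(20*N) = 120*N^2  =>  N = sqrt(C/120)
    return 20 * params

web_tokens = 30e12  # ~30T tokens after simple deduplication, per the figure quoted above

for flop in (4e25, 1e26, 1e27):  # the high-end GPT-4 estimate above, then 2.5x and 25x more raw compute
    tokens = chinchilla_optimal_tokens(flop)
    print(f"{flop:.0e} FLOP wants ~{tokens / 1e12:.0f}T tokens "
          f"({tokens / web_tokens:.1f}x the deduplicated web)")
```

On this naive accounting, a model a couple of OOMs of raw compute beyond GPT-4 already wants more unique text than a deduplicated crawl contains, which is exactly the constraint that repetition (per the result above) cannot simply buy you out of.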

At some point, even with more (effective) compute, making your models better can become much tougher because of the data constraint.

This isn’t to be understated: we’ve been riding the scaling curves, riding the wave of the language-modeling-pretraining-paradigm, and without something new here, this paradigm will (at least naively) run out.

Despite the massive investments, we’d plateau. All of the labs are rumored to be making massive research bets on new algorithmic improvements or approaches to get around this.

Researchers are purportedly trying many strategies, from synthetic data to self-play and RL approaches. Industry insiders seem to be very bullish: Dario Amodei (CEO of Anthropic) recently said on a podcast: “if you look at it very naively we’re not that far from running out of data […] My guess is that this will not be a blocker […] There’s just many different ways to do it.” Of course, any research results on this are proprietary and not being published these days.

In addition to insider bullishness, I think there’s a strong intuitive case for why it should be possible to find ways to train models with much better sample efficiency (algorithmic improvements that let them learn more from limited data).

Consider how you or I would learn from a really dense math textbook: 

  • What a modern LLM does during training is, essentially, very very quickly skim the textbook, the words just flying by, not spending much brain power on it.  
  • Rather, when you or I read that math textbook, we read a couple pages slowly; then have an internal monologue about the material in our heads and talk about it with a few study-buddies; read another page or two; then try some practice problems, fail, try them again in a different way, get some feedback on those problems, try again until we get a problem right; and so on, until eventually the material “clicks.” 
  • You or I also wouldn’t learn much at all from a pass through a dense math textbook if all we could do was breeze through it like LLMs.
  • But perhaps, then, there are ways to incorporate aspects of how humans would digest a dense math textbook to let the models learn much more from limited data.

    In a simplified sense, this sort of thing—having an internal monologue about material, having a discussion with a study-buddy, trying and failing at problems until it clicks—is what many synthetic data/self-play/RL approaches are trying to do.

The old state of the art of training models was simple and naive, but it worked, so nobody really tried hard to crack these approaches to sample efficiency.

Now that it may become more of a constraint, we should expect all the labs to invest billions of dollars and their smartest minds into cracking it.

A common pattern in deep learning is that it takes a lot of effort (and many failed projects) to get the details right, but eventually some version of the obvious and simple thing just works.

Given how deep learning has managed to crash through every supposed wall over the last decade, my base case is that it will be similar here.

Moreover, it actually seems possible that cracking one of these algorithmic bets like synthetic data could dramatically improve models. Here’s an intuition pump.

Current frontier models like Llama 3 are trained on the internet—and the internet is mostly crap, like e-commerce or SEO or whatever.

Many LLMs spend the vast majority of their training compute on this crap, rather than on really high-quality data (e.g. reasoning chains of people working through difficult science problems).

Imagine if you could spend GPT-4-level compute on entirely extremely high-quality data—it could be a much, much more capable model. 

A look back at AlphaGo—the first AI system that beat the world champions at the game of Go, decades before it was thought possible—is useful here as well.


  • In step 1, AlphaGo was trained by imitation learning on expert human Go games. This gave it a foundation.
  • In step 2, AlphaGo played millions of games against itself.

    This let it become superhuman at Go: remember the famous move 37 in the game against Lee Sedol, an extremely unusual but brilliant move a human would never have played.

Developing the equivalent of step 2 for LLMs is a key research problem for overcoming the data wall (and, moreover, will ultimately be the key to surpassing human-level intelligence).

 

All of this is to say that data constraints seem to inject large error bars either way into forecasting the coming years of AI progress.

There’s a very real chance things stall out (LLMs might still be as big of a deal as the internet, but we wouldn’t get to truly crazy AGI).

But I think it’s reasonable to guess that the labs will crack it, and that doing so will not just keep the scaling curves going, but possibly enable huge gains in model capability.

As an aside, this also means that we should expect more variance between the different labs in coming years compared to today.

Up until recently, the state of the art techniques were published, so everyone was basically doing the same thing.

(And new upstarts or open source projects could easily compete with the frontier, since the recipe was published.) Now, key algorithmic ideas are becoming increasingly proprietary.

I’d expect labs’ approaches to diverge much more, and some to make faster progress than others—even a lab that seems on the frontier now could get stuck on the data wall while others make a breakthrough that lets them race ahead.

And open source will have a much harder time competing. It will certainly make things interesting.

(And if and when a lab figures it out, their breakthrough will be the key to AGI, key to superintelligence—one of the United States’ most prized secrets.)

Unhobbling

Finally, the hardest to quantify—but no less important—category of improvements: what I’ll call “unhobbling.”

Imagine if when asked to solve a hard math problem, you had to instantly answer with the very first thing that came to mind.

It seems obvious that you would have a hard time, except for the simplest problems. But until recently, that’s how we had LLMs solve math problems. Instead, most of us work through the problem step-by-step on a scratchpad, and are able to solve much more difficult problems that way.

“Chain-of-thought” prompting unlocked that for LLMs. Despite excellent raw capabilities, they were much worse at math than they could be because they were hobbled in an obvious way, and it took a small algorithmic tweak to unlock much greater capabilities.

We’ve made huge strides in “unhobbling” models over the past few years.

These are algorithmic improvements beyond just training better base models—and often only use a fraction of pretraining compute—that unleash model capabilities:

  • Reinforcement learning from human feedback (RLHF). Base models have incredible latent capabilities, but they’re raw and incredibly hard to work with.

    While the popular conception of RLHF is that it merely censors swear words, RLHF has been key to making models actually useful and commercially valuable (rather than making models predict random internet text, get them to actually apply their capabilities to try to answer your question!).

    This was the magic of  ChatGPT—well-done RLHF made models usable and useful to real people for the first time. The original InstructGPT paper has a great quantification of this: an RLHF’d small model was equivalent to a non-RLHF’d >100x larger model in terms of human rater preference.
  • Chain of Thought (CoT). As discussed. CoT started being widely used just 2 years ago and can provide the equivalent of a >10x effective compute increase on math/reasoning problems.
  • Scaffolding.
    Think of CoT++: rather than just asking a model to solve a problem, have one model make a plan of attack, have another propose a bunch of possible solutions, have another critique it, and so on (see the sketch after this list).

    For example, on HumanEval (coding problems), simple scaffolding enables GPT-3.5 to outperform un-scaffolded GPT-4.

    On SWE-Bench (a benchmark of solving real-world software engineering tasks), GPT-4 can only solve ~2% correctly, while with Devin’s agent scaffolding it jumps to 14-23%. (Unlocking agency is only in its infancy though, as I’ll discuss more later.)
  • Tools: Imagine if humans weren’t allowed to use calculators or computers.

    We’re only at the beginning here, but ChatGPT can now use a web browser, run some code, and so on. 
  • Context length. Models have gone from 2k token context (GPT-3) to 32k context (GPT-4 release) to 1M+ context (Gemini 1.5 Pro). This is a huge deal.

    A much smaller base model with, say, 100k tokens of relevant context can outperform a model that is much larger but only has, say, 4k relevant tokens of context—more context is effectively a large compute efficiency gain.
    More generally, context is key to unlocking many applications of these models: for example, many coding applications require understanding large parts of a codebase in order to usefully contribute new code; or, if you’re using a model to help you write a document at work, it really needs the context from lots of related internal docs and conversations.

    Gemini 1.5 Pro, with its 1M+ token context, was even able to learn a new language (a low-resource language not on the internet) from scratch, just by putting a dictionary and grammar reference materials in context!
  • Posttraining improvements. The current GPT-4 has substantially improved compared to the original GPT-4 when released, due (according to John Schulman) to posttraining improvements that unlocked latent model capability: on reasoning evals it’s made substantial gains (e.g., ~50% -> 72% on MATH, ~40% to ~50% on GPQA) and on the LMSys leaderboard, it’s made nearly a 100-point elo jump (comparable to the difference in elo between Claude 3 Haiku and the much larger Claude 3 Opus, models that have a ~50x price difference).
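
Here is a minimal sketch of the “CoT++” scaffolding pattern described above: one call plans, several propose candidate solutions, and a final call critiques and picks. The `llm()` helper and the prompts are hypothetical placeholders of my own, not how Devin or any particular product actually works:

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to whatever chat model you use (hypothetical interface)."""
    raise NotImplementedError

def scaffolded_solve(task: str, n_candidates: int = 3) -> str:
    # 1. One call makes a plan of attack.
    plan = llm(f"Make a short step-by-step plan for solving this task:\n{task}")
    # 2. Several calls propose candidate solutions that follow the plan.
    candidates = [
        llm(f"Task:\n{task}\n\nPlan:\n{plan}\n\nWrite a complete solution.")
        for _ in range(n_candidates)
    ]
    # 3. A final call critiques the candidates, picks the best one, and repairs it.
    return llm(
        f"Task:\n{task}\n\nCandidate solutions:\n" + "\n---\n".join(candidates)
        + "\n\nCritique these, pick the best one, fix any errors, and return the final answer."
    )
```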

A survey by Epoch AI of some of these techniques, like scaffolding, tool use, and so on, finds that techniques like this can typically result in effective compute gains of 5-30x on many benchmarks.

METR (an organization that evaluates models) similarly found very large performance improvements on their set of agentic tasks, via unhobbling from the same GPT-4 base model: from 5% with just the base model, to 20% with the GPT-4 as posttrained on release, to nearly 40% today from better posttraining, tools, and agent scaffolding.

Performance on METR’s agentic tasks, over time via better unhobbling. Source: Model Evaluation and Threat Research

While it’s hard to put these on a unified effective compute scale with compute and algorithmic efficiencies, it’s clear these are huge gains, at least on a roughly similar magnitude as the compute scaleup and algorithmic efficiencies.

(It also highlights the central role of algorithmic progress: the 0.5 OOMs/year of compute efficiencies, already significant, are only part of the story, and put together with unhobbling algorithmic progress overall is maybe even a majority of the gains on the current trend.) 

“Unhobbling” is what actually enabled these models to become useful—and I’d argue that much of what is holding back many commercial applications today is the need for further “unhobbling” of this sort.

Indeed, models today are still incredibly hobbled! For example:

  • They don’t have long-term memory. 
  • They can’t use a computer (they still only have very limited tools). 
  • They still mostly don’t think before they speak.

    When you ask ChatGPT to write an essay, that’s like expecting a human to write an essay via their initial stream-of-consciousness.
  • They can (mostly) only engage in short back-and-forth dialogues, rather than going away for a day or a week, thinking about a problem, researching different approaches, consulting other humans, and then writing you a longer report or pull request. 
  • They’re mostly not personalized to you or your application (just a generic chatbot with a short prompt, rather than having all the relevant background on your company and your work).

     

The possibilities here are enormous, and we’re rapidly picking low-hanging fruit here. This is critical: it’s completely wrong to just imagine “GPT-6 ChatGPT.”  With continued unhobbling progress, the improvements will be step-changes compared to GPT-6 + RLHF.

By 2027, rather than a chatbot, you’re going to have something that looks more like an agent, like a coworker.

From chatbot to agent-coworker

What could ambitious unhobbling over the coming years look like? The way I think about it, there are three key ingredients:

1. Solving the “onboarding problem”

GPT-4 has the raw smarts to do a decent chunk of many people’s jobs, but it’s sort of like a smart new hire that just showed up 5 minutes ago: it doesn’t have any relevant context, hasn’t read the company docs or Slack history or had conversations with members of the team, or spent any time understanding the company-internal codebase.

A smart new hire isn’t that useful 5 minutes after arriving—but they are quite useful a month in!

It seems like it should be possible, for example via very-long-context, to “onboard” models like we would a new human coworker.

This alone would be a huge unlock. 

2. The test-time compute overhang (reasoning/error correction/system II for longer-horizon problems)

Right now, models can basically only do short tasks: you ask them a question, and they give you an answer. But that’s extremely limiting.

Most useful cognitive work humans do is longer horizon—it doesn’t just take 5 minutes, but hours, days, weeks, or months. 

A scientist that could only think about a difficult problem for 5 minutes couldn’t make any scientific breakthroughs.

A software engineer that could only write skeleton code for a single function when asked wouldn’t be very useful—software engineers are given a larger task, and they then go make a plan, understand relevant parts of the codebase or technical tools, write different modules and test them incrementally, debug errors, search over the space of possible solutions, and eventually submit a large pull request that’s the culmination of weeks of work.

And so on. 

In essence, there is a large test-time compute overhang. Think of each GPT-4 token as a word of internal monologue when you think about a problem.

Each GPT-4 token is quite smart, but it can currently only really effectively use on the order of ~hundreds of tokens for chains of thought coherently (effectively as though you could only spend a few minutes of internal monologue/thinking on a problem or project).

What if it could use millions of tokens to think about and work on really hard problems or bigger projects?

Number of tokens    Equivalent to me working on something for…
100s                A few minutes            ChatGPT (we are here)
1,000s              Half an hour             +1 OOMs test-time compute
10,000s             Half a workday           +2 OOMs
100,000s            A workweek               +3 OOMs
Millions            Multiple months          +4 OOMs
Assuming a human thinking at ~100 tokens/minute and working 40 hours/week, translating “how long a model thinks” in tokens to human-time on a given problem/project.
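
The table’s conversion is simple enough to write down directly; the assumptions are the ones in the caption (~100 tokens/minute of “thinking”, 40 hours per workweek), and the helper below is just that arithmetic:

```python
def tokens_to_human_time(tokens: float, tokens_per_minute: float = 100, hours_per_week: float = 40) -> str:
    hours = tokens / tokens_per_minute / 60
    weeks = hours / hours_per_week
    return f"{tokens:>12,.0f} tokens ~ {hours:8.1f} hours ~ {weeks:6.2f} workweeks"

for t in (1e2, 1e3, 1e4, 1e5, 1e6):
    print(tokens_to_human_time(t))
# 100 tokens is about a minute ("100s" in the table means a few hundred, i.e. a few minutes);
# 1e6 tokens works out to ~167 hours, i.e. ~4 workweeks of full-time human thinking.
```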

Even if the “per-token” intelligence were the same, it’d be the difference between a smart person spending a few minutes vs. a few months on a problem. I don’t know about you, but there’s much, much, much more I am capable of in a few months vs. a few minutes.

If we could unlock “being able to think and work on something for months-equivalent, rather than a few-minutes-equivalent” for models, it would unlock an insane jump in capability. There’s a huge overhang here, many OOMs worth.  

Right now, models can’t do this yet.

Even with recent advances in long-context, this longer context mostly only works for the consumption of tokens, not the production of tokens—after a while, the model goes off the rails or gets stuck.

It’s not yet able to go away for a while to work on a problem or project on its own.


But unlocking test-time compute might merely be a matter of relatively small “unhobbling” algorithmic wins.

Perhaps a small amount of RL helps a model learn to error correct (“hm, that doesn’t look right, let me double check that”), make plans, search over possible solutions, and so on.

In a sense, the model already has most of the raw capabilities, it just needs to learn a few extra skills on top to put it all together. 

In essence, we just need to teach the model a sort of System II outer loop that lets it reason through difficult, long-horizon projects.

If we succeed at teaching this outer loop, instead of a short chatbot answer of a couple paragraphs, imagine a stream of millions of words (coming in more quickly than you can read them) as the model thinks through problems, uses tools, tries different approaches, does research, revises its work, coordinates with others, and completes big projects on its own.

 

In other domains, like AI systems for board games, it’s been demonstrated that you can use more test-time compute (also called inference-time compute) to substitute for training compute. 

Jones (2021): A smaller model can do as well as a much larger model at the game of Hex if you give it more test-time compute (“more time to think”). In this domain, they find that one can spend ~1.2 OOMs more compute at test-time to get performance equivalent to a model with ~1 OOM more training compute.

If a similar relationship held in our case, if we could unlock +4 OOMs of test-time compute, that might be equivalent to +3 OOMs of pretraining compute, i.e. very roughly something like the jump between GPT-3 and GPT-4. (I.e., solving this “unhobbling” would be equivalent to a huge OOM scaleup.)
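
Using the exchange rate from Jones (2021) quoted above (~1.2 OOMs of extra test-time compute buying roughly what ~1 OOM of extra training compute buys), the +4 OOMs scenario works out as follows; the helper name is mine, and this is only the rough proportionality the paragraph describes:

```python
def test_time_ooms_to_train_ooms(test_time_ooms: float, exchange_rate: float = 1.2) -> float:
    """~1.2 OOMs of test-time compute bought ~1 OOM of training compute in Jones (2021)."""
    return test_time_ooms / exchange_rate

print(test_time_ooms_to_train_ooms(4))  # ~3.3, i.e. roughly the +3 OOMs of pretraining compute cited above
```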

3. Using a computer

This is perhaps the most straightforward of the three. ChatGPT right now is basically like a human that sits in an isolated box that you can text.

While early unhobbling improvements teach models to use individual isolated tools, I expect that with multimodal models we will soon be able to do this in one fell swoop: we will simply enable models to use a computer like a human would.

 

That means joining your Zoom calls, researching things online, messaging and emailing people, reading shared docs, using your apps and dev tooling, and so on.

(Of course, for models to make the most use of this in longer-horizon loops, this will go hand-in-hand with unlocking test-time compute.)

By the end of this, I expect us to get something that looks a lot like a drop-in remote worker.

An agent that joins your company, is onboarded like a new human hire, messages you and colleagues on Slack and uses your software, makes pull requests, and that, given big projects, can do the model-equivalent of a human going away for weeks to independently complete the project.

You’ll probably need somewhat better base models than GPT-4 to unlock this, but possibly not even that much better—a lot of juice is in fixing the clear and basic ways models are still hobbled. 

A very early peek at what this might look like is Devin, an early prototype of unlocking the “agency-overhang”/”test-time compute overhang” on models on the path to creating a fully automated software engineer.

I don’t know how well Devin works in practice, and this demo is still very limited compared to what proper chatbot → agent unhobbling would yield, but it’s a useful teaser of the sort of thing coming soon.
我不知道 Devin 在实际中的工作效果如何,与真正把聊天机器人解除限制为代理所能达到的效果相比,这个演示仍然非常有限,但它是对即将到来事物的一个有用预告。

By the way, I expect the centrality of unhobbling to lead to a somewhat interesting “sonic boom” effect in terms of commercial applications.
顺便说一下,我预计解除限制的核心地位将在商业应用方面带来一种有趣的“声爆”效应。

Intermediate models between now and the drop-in remote worker will require tons of schlep to change workflows and build infrastructure to integrate and derive economic value from.
在现在和可插入式远程工作者之间的中间模型将需要大量的努力来改变工作流程和构建基础设施以整合并从中获取经济价值。

The drop-in remote worker will be dramatically easier to integrate—just, well, drop them in to automate all the jobs that could be done remotely.
集成即插即用的远程工作人员将大大简化——只需,嗯,将他们放入以自动化所有可以远程完成的工作。

It seems plausible that the schlep will take longer than the unhobbling, that is, by the time the drop-in remote worker is able to automate a large number of jobs, intermediate models won’t yet have been fully harnessed and integrated—so the jump in economic value generated could be somewhat discontinuous.
看起来很有可能这些繁琐的集成工作会比解除限制耗时更长,也就是说,到即插即用的远程工作人员能够自动化大量工作的时候,中间模型可能还没有被完全利用和集成——因此产生的经济价值增长可能会有些不连续。

 

The next four years 未来的四年

Summary of the estimates on drives of progress in the four years preceding GPT-4, and what we should expect in the four years following GPT-4. 
GPT-4 之前四年的进步驱动概述,以及我们应期待 GPT-4 之后四年的情况。

Putting the numbers together, we should (roughly) expect another GPT-2-to-GPT-4-sized jump in the 4 years following GPT-4, by the end of 2027.
将数字结合起来,我们大致可以预期在 GPT-4 之后的 4 年里,即在 2027 年年底前,会有一个与 GPT-2 到 GPT-4 规模相当的跃升。

  • GPT-2 to GPT-4 was roughly a 4.5–6 OOM base effective compute scaleup (physical compute and algorithmic efficiencies), plus major “unhobbling” gains (from base model to chatbot).
    GPT-2 到 GPT-4 大致是一个 4.5–6 个数量级的基数有效计算规模提升(物理计算和算法效率),加上主要的“解绑”增益(从基模型到聊天机器人)。
  • In the subsequent 4 years, we should expect 3–6 OOMs of base effective compute scaleup (physical compute and algorithmic efficiencies)—with perhaps a best guess of ~5 OOMs—plus step-changes in utility and applications unlocked by “unhobbling” (from chatbot to agent/drop-in remote worker).
    在接下来的 4 年里,我们预期会有 3–6 个数量级的基数有效计算规模提升(物理计算和算法效率)——或许最佳猜测是大约 5 个数量级——加上通过“解绑”(从聊天机器人到代理人/即插即用的远程工作者)解锁的效用和应用上的阶跃变化。
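To make those ranges concrete, here is a back-of-the-envelope conversion of the OOM figures above into raw effective-compute multipliers (pure arithmetic; no numbers beyond the ones just listed):

    # OOMs -> raw multipliers for the effective-compute ranges listed above.
    # Pure arithmetic; the labels and variable names are mine.
    ranges = {
        "GPT-2 -> GPT-4 (4.5-6 OOMs)": (4.5, 6.0),
        "GPT-4 -> 2027 (3-6 OOMs)": (3.0, 6.0),
    }
    for label, (lo, hi) in ranges.items():
        print(f"{label}: {10**lo:,.0f}x to {10**hi:,.0f}x effective compute")
    print(f"Best guess for GPT-4 -> 2027 (~5 OOMs): {10**5:,.0f}x")  # 100,000x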

     

To put this in perspective, suppose GPT-4 training took 3 months. In 2027, a leading AI lab will be able to train a GPT-4-level model in a minute.[31] The OOM effective compute scaleup will be dramatic.
为了更全面地了解这一点,假设 GPT-4 的训练耗时 3 个月。到 2027 年,一家领先的 AI 实验室将能够在一分钟内训练出 GPT-4 级别的模型。[31] OOM 有效计算规模的提升将是显著的。
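A quick back-of-the-envelope check on that claim, under the crude simplifying assumption (mine, not spelled out in the text) that wall-clock training time at a fixed model scale shrinks in proportion to the effective-compute scaleup:

    import math

    # "3 months of GPT-4 training compressed into 1 minute" expressed in OOMs.
    three_months_in_minutes = 3 * 30 * 24 * 60   # ~129,600 minutes
    speedup = three_months_in_minutes / 1        # same training run in one minute
    print(f"Speedup: {speedup:,.0f}x ≈ {math.log10(speedup):.1f} OOMs")
    # ~5.1 OOMs, consistent with the ~5 OOM best-guess effective-compute scaleup above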

Where will that take us?
这将把我们带向何方?

Summary of counting the OOMs.
计算 OOMs 的摘要。

GPT-2 to GPT-4 took us from ~preschooler to ~smart high-schooler; from barely being able to output a few cohesive sentences to acing high-school exams and being a useful coding assistant.
GPT-2 到 GPT-4 使我们从 ~幼儿园水平跃升至 ~聪明的高中生;从几乎只能输出几个连贯的句子到轻松通过高中考试并成为有用的编程助手。

That was an insane jump. If this is the intelligence gap we’ll cover once more, where will that take us?[32] We should not be surprised if that takes us very, very far.
那是一个疯狂的飞跃。如果这就是我们将再次跨越的智能差距,那会带我们走向何方?[32] 如果那把我们带到非常非常远的地方,我们也不应该感到惊讶。

Likely, it will take us to models that can outperform PhDs and the best experts in a field.
很可能,它将带我们走向能够超越博士和领域内最佳专家的模型。

(One neat way to think about this is that the current trend of AI progress is proceeding at roughly 3x the pace of child development.
(一种思考这个问题的简洁方式是,当前 AI 进展的趋势大约是以儿童发展速度的 3 倍进行。

Your 3x-speed-child just graduated high school; it’ll be taking your job before you know it!)
你的 3 倍速孩子刚刚高中毕业;在你意识到之前,它将会接手你的工作!)

Again, critically, don’t just imagine an incredibly smart ChatGPT: unhobbling gains should mean that this looks more like a drop-in remote worker, an incredibly smart agent that can reason and plan and error-correct and knows everything about you and your company and can work on a problem independently for weeks.
再次强调,关键在于不要仅仅想象一个极其聪明的 ChatGPT:解除限制的增益应该意味着它更像是一个即插即用的远程工作者,一个能够推理、计划、纠错且对你和你的公司了如指掌的极其聪明的代理,能够独立地在一个问题上工作数周。

We are on course for AGI by 2027.
我们将在 2027 年实现通用人工智能(AGI)。

These AI systems will be able to automate basically all cognitive jobs (think: all jobs that could be done remotely).
这些人工智能系统将能够自动化基本上所有的认知工作(想想:所有可以远程完成的工作)。

To be clear—the error bars are large.
谨此说明——误差条幅很大。

Progress could stall as we run out of data if the algorithmic breakthroughs necessary to crash through the data wall prove harder than expected.
如果突破数据壁垒所需的算法突破比预期更困难,进展可能会随着数据耗尽而停滞。

Maybe unhobbling doesn’t go as far, and we are stuck with merely expert chatbots, rather than expert coworkers.
或许解除限制并没有那么大的效果,我们可能只能停留在拥有专家级聊天机器人,而不是专家级同事。

Perhaps the decade-long trendlines break, or scaling deep learning hits a wall for real this time.
也许长达十年的趋势线会中断,或者这次深度学习的扩展真的遇到了壁垒。

(Or an algorithmic breakthrough, even simple unhobbling that unleashes the test-time compute overhang, could be a paradigm-shift, accelerating things further and leading to AGI even earlier.) 
(或者算法突破,即便是简单的解除限制,释放测试时间的计算过剩,也可能引发范式转变,进一步加速进程,甚至更早实现通用人工智能 AGI。)

In any case, we are racing through the OOMs, and it requires no esoteric beliefs, merely trend extrapolation of straight lines, to take the possibility of AGI—true AGI—by 2027 extremely seriously.
无论如何,我们正在快速穿越数量级(OOMs),这不需要深奥的信仰,只需对直线趋势的外推,就可以非常认真地看待到 2027 年实现真正的通用人工智能 AGI 的可能性。

It seems like many are in the game of downward-defining AGI these days, as just a really good chatbot or whatever.
如今似乎很多人都在玩着将通用人工智能 AGI 向下定义的游戏,把它说成只是一个真正优秀的聊天机器人之类的东西。

What I mean is an AI system that could fully automate my or my friends’ job, that could fully do the work of an AI researcher or engineer.
我的意思是一个能够完全自动化我或我朋友的工作的 AI 系统,能够完全承担 AI 研究员或工程师的工作。

Perhaps some areas, like robotics, might take longer to figure out by default.
或许某些领域,比如机器人技术,可能默认情况下需要更长的时间来弄清楚。

And the societal rollout, e.g. in medical or legal professions, could easily be slowed by societal choices or regulation.
而在社会推广方面,例如在医疗或法律职业中,可能会很容易因为社会选择或法规而放缓。

But once models can automate AI research itself, that’s enough—enough to kick off intense feedback loops—and we could very quickly make further progress, the automated AI engineers themselves solving all the remaining bottlenecks to fully automating everything.
但一旦模型能够自动化 AI 研究本身,这就足够了——足以引发强烈的反馈循环——我们可能会很快取得进一步的进展,自动化的 AI 工程师自己解决所有剩余的瓶颈,从而实现一切的完全自动化。

In particular, millions of automated researchers could very plausibly compress a decade of further algorithmic progress into a year or less.
特别是,数百万个自动化研究者很可能将十年的算法进步压缩到一年或更短的时间内。

AGI will merely be a small taste of the superintelligence soon to follow. (More on that in the next piece.)
AGI 只不过是即将到来的超级智能的一小部分。(下一部分会有更多相关内容。)

In any case, do not expect the vertiginous pace of progress to abate. The trendlines look innocent, but their implications are intense.
无论如何,不要期待进步的眩晕速度会减缓。趋势线看起来无害,但它们的含义却是强烈的。

As with every generation before them, every new generation of models will dumbfound most onlookers; they’ll be incredulous when, very soon, models solve incredibly difficult science problems that would take PhDs days, when they’re whizzing around your computer doing your job, when they’re writing codebases with millions of lines of code from scratch, when every year or two the economic value generated by these models 10xs.
与之前的每一代模型一样,每一代新模型都会让大多数旁观者目瞪口呆;他们将会难以置信:当模型很快解决那些即使是博士也需要数天才能解决的极其困难的科学问题时,当它们在你的电脑上飞快地完成你的工作时,当它们从零开始编写数百万行代码的代码库时,当这些模型产生的经济价值每隔一两年就增长 10 倍时。

Forget scifi, count the OOMs: it’s what we should expect. AGI is no longer a distant fantasy.
忘记科幻,计算数量级:这是我们应当期待的。通用人工智能(AGI)不再是遥远的幻想。

Scaling up simple deep learning techniques has just worked, the models just want to learn, and we’re about to do another 100,000x+ by the end of 2027.
扩展简单的深度学习技术已经取得了成功,模型只是想要学习,我们将在 2027 年年底之前实现另一个 100,000 倍以上的增长。

It won’t be long before they’re smarter than us. 
它们很快就会比我们更聪明。

GPT-4 is just the beginning—where will we be four years later?
GPT-4 只是一个开始——四年后我们将在哪里?

Do not make the mistake of underestimating the rapid pace of deep learning progress (as illustrated by progress in GANs).
不要犯低估深度学习进展快速步伐的错误(如 GANs 的进展所示)。

Next post in series:
系列中的下一篇文章:

II. From AGI to Superintelligence: the Intelligence Explosion
II. 从 AGI 到超级智能:智能大爆炸


Addendum. Racing through the OOMs: It’s this decade or bust
附录。穿越 OOMs 的竞赛:要么这个十年,要么失败

I used to be more skeptical of short timelines to AGI.
我曾经对 AGI 的短期时间线持怀疑态度。

One reason is that it seemed unreasonable to privilege this decade, concentrating so much AGI-probability-mass on it (it seemed like a classic fallacy to think “oh we’re so special”).
一个原因是,将这个十年特别看待,将大量的 AGI 可能性质量集中于此(这看起来像是典型的谬误,认为“我们很特殊”)。

I thought we should be uncertain about what it takes to get AGI, which should lead to a much more “smeared-out” probability distribution over when we might get AGI.
我认为我们应该对实现 AGI 所需的条件不确定,这应该导致我们对可能实现 AGI 的时间有一个更加“分散”的概率分布。

However, I’ve changed my mind: critically, our uncertainty over what it takes to get AGI should be over OOMs (of effective compute), rather than over years.
然而,我改变了我的想法:关键的是,我们对实现 AGI 所需条件的未知应该是关于有效计算量的数量级(OOMs),而不是关于年份。

We’re racing through the OOMs this decade. Even at its bygone heyday, Moore’s law was only 1–1.5 OOMs/decade.
我们在这个十年里正飞速穿越 OOM(数量级)。即使在它过去的黄金时代,摩尔定律也只有 1-1.5 个 OOM/十年。

I estimate that we will do ~5 OOMs in 4 years, and over ~10 this decade overall.
我估计我们在 4 年内将实现大约 5 个 OOM,而这个十年总体上超过大约 10 个 OOM。
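To make the rate comparison explicit, here is a small sketch using only the figures just quoted:

    # Rate of effective-compute scaleup vs. heyday Moore's law, per the numbers above.
    moores_law_ooms_per_decade = 1.5          # upper end of "1-1.5 OOMs/decade"
    forecast_ooms, forecast_years = 5.0, 4    # "~5 OOMs in 4 years"

    ai_rate = forecast_ooms / forecast_years          # OOMs per year
    moore_rate = moores_law_ooms_per_decade / 10      # OOMs per year
    print(f"Effective compute: {ai_rate:.2f} OOMs/year; "
          f"Moore's law (heyday): {moore_rate:.2f} OOMs/year; "
          f"ratio ≈ {ai_rate / moore_rate:.0f}x")     # roughly 8x faster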

We’ve been racing through the OOMs this decade; after the early 2030s, we will face a slow slog.
我们在这个十年里一直在飞速穿越数量级(OOMs);2030 年代初之后,我们将面临缓慢的跋涉。

In essence, we’re in the middle of a huge scaleup reaping one-time gains this decade, and progress through the OOMs will be multiples slower  thereafter.
实质上,我们正处于一个大规模扩张阶段,在这个十年里获得了一次性收益,而之后的运算量级(OOMs)进步将会慢得多。

If this scaleup doesn’t get us to AGI in the next 5-10 years, it might be a long way out.
如果这次扩张在未来 5-10 年内不能让我们达到通用人工智能(AGI),那么可能还有很长的路要走。

  • Spending scaleup: Spending a million dollars on a model used to be outrageous; by the end of the decade, we will likely have $100B or $1T clusters.
    投资规模升级:过去花一百万美元训练一个模型曾是不可思议的;但到本十年末,我们可能会有 1000 亿美元或 1 万亿美元级的集群。

    Going much higher than that will be hard; that’s already basically the feasible limit (both in terms of what big business can afford, and even just as a fraction of GDP).
    超过这个规模将会很困难;这已经基本上是可行的极限(无论是从大企业能负担的角度,还是仅作为 GDP 的一部分)。

    Thereafter all we have is glacial 2%/year trend real GDP growth to increase this.
    此后,我们只能依靠每年 2%的缓慢实际 GDP 增长来增加这个数字。
  • Hardware gains: AI hardware has been improving much more quickly than Moore’s law. That’s because we’ve been specializing chips for AI workloads.
    硬件增益:AI 硬件的改进速度已经远远超过了摩尔定律。这是因为我们一直在为 AI 工作负载专门化芯片。

    For example, we’ve gone from CPUs to GPUs; adapted chips for Transformers; and we’ve gone down to much lower precision number formats, from fp64/fp32 for traditional supercomputing to fp8 on H100s.
    例如,我们从 CPU 转向了 GPU;为 Transformer 适应了芯片;并且我们已经将数字格式降低到更低的精度,从传统超级计算的 fp64/fp32 降到了 H100 上的 fp8。

    These are large gains, but by the end of the decade we’ll likely have totally-specialized AI-specific chips, without much further beyond-Moore’s law gains possible.
    这些是巨大的增益,但到本世纪末,我们可能拥有完全专门的 AI 特定芯片,而没有太多超出摩尔定律增益的可能。
  • Algorithmic progress: In the coming decade, AI labs will invest tens of billions in algorithmic R&D, and all the smartest people in the world will be working on this; from tiny efficiencies to new paradigms, we’ll be picking lots of the low-hanging fruit.
    算法进步:在未来十年,AI 实验室将在算法研发上投入数十亿美元,全世界最聪明的人都将致力于此;从微小的效率提升到新的范例,我们将摘取许多触手可及的果实。

    We probably won’t reach any sort of hard limit (though “unhobblings” are likely finite), but at the very least the pace of improvements should slow down, as the rapid growth (in $ and human capital investments) necessarily slows down (e.g., most of the smart STEM talent will already be working on AI).
    我们可能不会达到任何硬性限制(尽管“解除限制”的可能性是有限的),但至少改进的速度应该会放缓,因为随着资金和人力资本投资的快速增长必然放缓(例如,大部分聪明的 STEM 人才可能已经在从事 AI 工作了)。

    (That said, this is the most uncertain to predict, and the source of most of the uncertainty on the OOMs in the 2030s on the plot above.)
    (话虽如此,这是最不确定的预测,也是上面图表中 2030 年代数量级不确定性(OOMs)的主要来源。)

Put together, this means we are racing through many more OOMs in the next decade than we might in multiple decades thereafter.
综合来看,这意味着在未来十年我们将穿越的数量级(OOMs)比之后几十年加起来可能穿越的还要多。

Maybe it’s enough—and we get AGI soon—or we might be in for a long, slow slog. You and I can reasonably disagree on the median time to AGI, depending on how hard we think achieving AGI will be—but given how we’re racing through the OOMs right now, certainly your modal AGI year should sometime later this decade or so.
或许这已经足够——我们很快就会得到 AGI(通用人工智能),或者我们可能面临一场漫长而缓慢的苦战。您和我可以合理地就到达 AGI 的中位时间持不同意见,这取决于我们认为实现 AGI 的难度有多大——但鉴于我们目前正快速经历 OOMs,您预期的 AGI 年份应该在本十年的某个时候。

Matthew Barnett has a nice related visualization of this, considering just compute and biological bounds.
马修·巴内特有一个很好的相关可视化,仅考虑计算和生物学的边界。

  1. Predictions they’ve made every year for the last decade, and which they’ve been consistently wrong about…↩
    他们过去十年每年所做的预测,而这些预测他们一直都在错误中… ↩

  2. From SSC: Janelle Shane asks GPT-2 its ten favorite animals:
    来自 SSC:Janelle Shane 问 GPT-2 它最喜欢的十种动物:

    Prompt: My 10 favorite animals are: 1.
    提示:我最喜欢的十种动物是:1.

    My ten favorite animals are:
    我最喜欢的十种动物是:

    Zebras with a white scar on the back
    背后有白色疤痕的斑马

    Insiduous spiders and octopus
    潜伏的蜘蛛和章鱼

    Frog with large leaves, hopefully black
    青蛙带有大叶子,希望是黑色的

    Cockatiel with scales
    鹦鹉带有鳞片

    Razorbill with wings hanging about 4 inches from one’s face and a heart tattoo on a frog
    剪刀鸟的翅膀悬停在距离脸部大约 4 英寸的位置,青蛙身上有一个心形纹身

    Cockatric interlocking tetrabods that can be blind, cut, and eaten raw:
    鹦鹉螺互锁四足体,可以盲目、切割、生食:

    Black and white desert crocodiles living in sunlight
    黑白沙漠鳄鱼在阳光下生活

    Zebra and many other pea bugs↩
    斑马和其他许多豌豆虫 ↩

  3. From the GPT-2 paper, Section 3.6.↩
    来自 GPT-2 论文,第 3.6 节。 ↩

  4. I mean clunky old GPT-3 here, not the dramatically-improved GPT-3.5 you might know from ChatGPT.↩
    我指的是笨重的旧版 GPT-3,而不是你可能从 ChatGPT 了解到的显著改进后的 GPT-3.5。 ↩

  5. And no, these tests aren’t in the training set.
    并且不,这些测试不在训练集中。

    AI labs put real effort into ensuring these evals are uncontaminated, because they need good measurements in order to do good science. A recent analysis on this by ScaleAI confirmed that the leading labs aren’t overfitting to the benchmarks (though some smaller LLM developers might be juicing their numbers).↩
    AI 实验室确实付出了很大努力来确保这些评估是不受污染的,因为它们需要好的测量结果来进行好的科学研究。ScaleAI 最近的分析证实,领先的实验室并没有对基准过度拟合(尽管一些较小的LLM开发者可能会篡改他们的数据)。 ↩

  6. In the original paper, it was noted: “We also evaluated humans on MATH, and found that a computer science PhD student who does not especially like mathematics attained approximately 40% on MATH, while a three-time IMO gold medalist attained 90%, indicating that MATH can be challenging for humans as well.”↩
    在原始论文中,提到:“我们还评估了人类在 MATH 上的表现,发现一个不喜欢数学的计算机科学博士生的得分约为 40%,而一位三次 IMO 金牌得主得分达到 90%,这表明 MATH 对人类来说同样具有挑战性。” ↩

  7. A coauthor notes: “When our group first released the MATH dataset, at least one [ML researcher colleague] told us that it was a pointless dataset because it was too far outside the range of what ML models could accomplish (indeed, I was somewhat worried about this myself).”↩
    一位合作作者指出:“当我们的团队首次发布 MATH 数据集时,至少有一位机器学习研究同事告诉我们这个数据集没有意义,因为它超出了机器学习模型所能达到的范围(实际上,我对此也有些担心)。” ↩

  8. Here’s Yann LeCun predicting in 2022 that even GPT-5000 won’t be able to reason about physical interactions with the real world; GPT-4 obviously does it with ease a year later.
    这是 Yann LeCun 在 2022 年预测,即使 GPT-5000 也无法推理关于现实世界的物理互动;而一年后的 GPT-4 显然能够轻松做到这一点。


    Here’s Gary Marcus’s walls predicted after GPT-2 being solved by GPT-3, and the walls he predicted after GPT-3 being solved by GPT-4.
    这是加里·马库斯在 GPT-2 之后预测的壁垒被 GPT-3 攻克,以及他在 GPT-3 之后预测的壁垒被 GPT-4 攻克。


    Here’s Prof. Bryan Caplan losing his first-ever public bet (after previously famously having a perfect public betting track record).
    这是布莱恩·卡普兰教授首次在公开场合输掉赌注(此前他曾在公开投注中保持不败记录)。

    In January 2023, after GPT-3.5 got a D on his economics midterm, Prof. Caplan bet Matthew Barnett that no AI would get an A on his economics midterms by 2029. Just two months later, when GPT-4 came out, it promptly scored an A on his midterm (and it would have been one of the highest scores in his class).↩
    在 2023 年 1 月,GPT-3.5 在经济学期中考试中得了 D 之后,卡普兰教授和马修·巴内特打赌,到 2029 年之前没有 AI 能在他的经济学期中考试中得到 A。仅仅两个月后,当 GPT-4 发布时,它立刻在他的期中考试中得到了 A(并且这将是班上最高分之一)。 ↩

  9. On the diamond set, majority voting of the model trying 32 times with chain-of-thought.↩
    在钻石集上,模型尝试 32 次进行链式思维的大多数投票。 ↩

  10. And it’s worth noting just how consistent these trendlines are. Combining the original scaling laws paper with some of the estimates on compute and compute efficiency scaling since then implies a consistent scaling trend for over 15 orders of magnitude (over 1,000,000,000,000,000x in effective compute)!↩
    值得注意的是这些趋势线的连贯性有多么一致。将原始的扩展定律论文与自那时以来的计算和计算效率扩展的估计相结合,表明超过 15 个数量级(超过 1,000,000,000,000,000 倍的有效计算)的持续扩展趋势! ↩

  11. A common misconception is that scaling only holds for perplexity loss, but we see very clear and consistent scaling behavior on downstream performance on benchmarks as well.
    一个常见的误解是,扩展性仅适用于困惑度损失,但我们同样在基准测试的下游性能上观察到非常清晰且一致的扩展行为。

    It’s usually just a matter of finding the right log-log graph.
    这通常只是找到正确的对数-对数图表的问题。

    For example, in the GPT-4 blog post, they show consistent scaling behavior for performance on coding problems over 6 OOMs (1,000,000x) of compute, using MLPR (mean log pass rate).
    例如,在 GPT-4 的博客文章中,他们展示了在编码问题上的性能在 6 个数量级(1,000,000x)的计算扩展下具有一致的扩展行为,使用 MLPR(平均对数通过率)。

    The “Are Emergent Abilities a Mirage?” paper makes a similar point; with the right choice of metric, there is almost always a consistent trend for performance on downstream tasks.
    “《涌现能力是否海市蜃楼?》这篇论文提出了类似观点;选择合适的评价指标,几乎总是存在一个在下游任务性能上的一致趋势。”

    More generally, the “scaling hypothesis” qualitative observation—very clear trends on model capability with scale—predates loss-scaling-curves; the “scaling laws” work was just a formal measurement of this.
    更普遍地说,“缩放假设”的定性观察——模型能力随规模增长呈现非常清晰的趋势——早于损失缩放曲线;而“缩放定律”的工作只是对此的正式度量。

    ↩

  12. 1. Gemini 1.5 Flash scores 54.9% on MATH, and costs $0.35/$1.05 (input/output) per million tokens.
    1. Gemini 1.5 Flash 在 MATH 上得分 54.9%,每个百万标记的成本为$0.35/$1.05(输入/输出)。

    GPT-4 scored 42.5% on MATH prerelease and 52.9% on MATH in early 2023, and cost $30/$60 (input/output) per million tokens; that’s 85x/57x (input/output) more expensive per token than Gemini 1.5 Flash.
    GPT-4 在发布前的 MATH 上得分 42.5%,2023 年初在 MATH 上得分 52.9%,每个百万标记的成本为$30/$60(输入/输出);这比 Gemini 1.5 Flash 每个标记的成本高出 85 倍/57 倍(输入/输出)。

    To be conservative, I use an estimate of 30x cost decrease above (accounting for Gemini 1.5 Flash possibly using more tokens to reason through problems).
    为了保守起见,我使用上述 30 倍的成本降低估计(考虑到 Gemini 1.5 Flash 可能使用更多标记来进行问题推理)。

    2. Minerva540B scores 50.3% on MATH, using majority voting among 64 samples.
    2. Minerva540B 在 MATH 上的得分为 50.3%,通过在 64 个样本中采用多数投票。

    A knowledgeable friend estimates the base model here is probably 2-3x more expensive to inference than GPT-4.
    一位知识渊博的朋友估计这里的基模型在推理上可能比 GPT-4 贵 2-3 倍。

    However, Minerva seems to use somewhat fewer tokens per answer on a quick spot check.
    然而,在快速检查中,Minerva 似乎在每答案中使用的标记数要少一些。

    More importantly, Minerva needed 64 samples to achieve that performance, naively implying a 64x multiple on cost if you e.g. naively ran this via an inference API.
    更重要的是,Minerva 需要 64 个样本才能达到这种性能,天真地暗示如果你例如天真地通过推理 API 运行这个,成本将增加 64 倍。

    In practice, prompt tokens can be cached when running an eval; given a few-shot prompt, prompt tokens are likely a majority of the cost, even accounting for output tokens.
    在实际操作中,当运行评估时可以将提示令牌缓存;给定一个少量样本的提示,提示令牌很可能是成本的主要部分,即使考虑到输出令牌。

    Supposing output tokens are a third of the cost for getting a single sample, that would imply only a ~20x increase in cost from the maj@64 with caching.
    假设输出令牌占获取单个样本成本的三分之一,那么这意味着即使有缓存,maj@64 也只会使成本增加约 20 倍。

    To be conservative, I use the rough number of a 20x cost decrease in the above (even if the naive decrease in inference cost from running this via an API would be larger).↩
    为了保守起见,我使用了一个粗略的数字,即上述成本的 20 倍减少(即使通过 API 运行此操作产生的推理成本减少可能会更大)。 ↩

  13. Though these are inference efficiencies (rather than necessarily training efficiencies), and to some extent will reflect inference-specific optimizations, a) they suggest enormous amounts of algorithmic progress is possible and happening in general, and b) it’s often the case that an algorithmic improvements is both a training efficiency gain and an inference efficiency, for example by reducing the number of parameters necessary.↩
    虽然这些都是推理效率(而不是必然的训练效率),在某种程度上反映了针对推理的特定优化,但 a)它们表明在一般情况下算法进步的可能性巨大且正在发生,b)算法改进通常既带来训练效率的提升,也带来推理效率的提升,例如通过减少必要的参数数量。 ↩

  14. GPT-3: $60/1M tokens, GPT-4: $30/1M input tokens and $60/1M output tokens.↩
    GPT-3:$60/100 万 tokens,GPT-4:$30/100 万输入 tokens 和$60/100 万输出 tokens。 ↩

  15. Chinchilla scaling laws say that one should scale parameter count and data equally.
    Chinchilla 缩放法则指出,应当同等比例地缩放参数数量和数据。

    That is, parameter count grows “half the OOMs” of the OOMs that effective training compute grows.
    即,参数数量的增长是有效训练计算增长的数量级的一半。

    At the same time, parameter count is intuitively roughly proportional to inference costs.
    同时,参数数量直观上大致与推理成本成正比。

    All else equal, constant inference costs thus implies that half of the OOMs of effective compute growth were “canceled out” by algorithmic win.
    在其他条件不变的情况下,常数推理成本因此意味着有效计算增长的一半 OOMs 被算法优势“抵消”了。

    That said, to be clear, this is a very naive calculation (just meant for a rough illustration) that is wrong in various ways.
    话虽如此,需要明确的是,这是一个非常天真的计算(仅用于粗略说明),在各个方面都是错误的。

    There may be inference-specific optimizations (that don’t translate into training efficiency); there may be training efficiencies that don’t reduce parameter count (and thus don’t translate into inference efficiency); and so on.↩
    可能存在特定于推理的优化(这些优化不转化为训练效率);可能存在不减少参数数量的训练效率(因此不转化为推理效率);等等。 ↩

  16. Gemini 1.5 Flash ranks similarly to GPT-4 (higher than original GPT-4, lower than updated versions of GPT-4) on LMSys, a chatbot leaderboard, and has similar performance on MATH and GPQA (evals that measure reasoning) as the original GPT-4, while landing roughly in the middle between GPT-3.5 and GPT-4 on MMLU (an eval that more heavily weights towards measuring knowledge).↩
    Gemini 1.5 Flash 在 LMSys 上排名与 GPT-4 相似(高于原始 GPT-4,低于更新版本的 GPT-4),这是一个聊天机器人排行榜,同时在 MATH 和 GPQA(衡量推理能力的评估)上的性能与原始 GPT-4 相似,而在 MMLU(一个更侧重于衡量知识的评估)上则大致位于 GPT-3.5 和 GPT-4 之间。 ↩

  17. At ~GPT-3 scale, more than 3x at larger scales.↩
    在 ~GPT-3 规模时,大于 3 倍,在更大规模时更是如此。 ↩

  18. For example, this paper contains a comparison of a GPT-3-style vanilla Transformer to various simple changes to architecture and training recipe published over the years (RMSnorms instead of layernorm, different positional embeddings, SwiGlu activation, AdamW optimizer instead of Adam, etc.), what they call “Transformer++”, implying a 6x gain at least at small scale.↩
    例如,本文包含了对 GPT-3 风格的 vanilla Transformer 与多年来发布的各种架构和训练配方简单变化(如使用 RMSnorms 代替 layernorm,不同的位置编码,SwiGlu 激活函数,AdamW 优化器代替 Adam 等)的比较,他们称之为“Transformer++”,意味着在小规模上至少有 6 倍的提升。 ↩

  19. If we take the trend of 0.5 OOMs/year, and 4 years between GPT-2 and GPT-4 release, that would be 2 OOMs. However, GPT-2 to GPT-3 was a simple scaleup (after big gains from e.g. Transformers), and OpenAI claims GPT-4 pretraining finished in 2022, which could mean we’re looking at closer to 2 years worth of algorithmic progress that we should be counting here.
    如果我们按照每年 0.5 个 OOMs 的增长趋势,以及 GPT-2 和 GPT-4 发布之间 4 年的时间,那么将是 2 个 OOMs。然而,GPT-2 到 GPT-3 只是一个简单的规模扩大(在例如 Transformers 带来的巨大收益之后),OpenAI 声称 GPT-4 的预训练在 2022 年完成,这意味着我们在这里可能要计算接近 2 年的算法进步。

    1 OOM of algorithmic efficiency seems like a conservative lower bound.↩
    算法效率提升 1 个数量级似乎是一个保守的下限。 ↩

  20. At the very least, given over a decade of consistent algorithmic improvements, the burden of proof would be on those who would suggest it will all suddenly come to a halt!↩
    至少,鉴于超过十年的持续算法改进,举证责任将在于那些认为这一切会突然停止的人! ↩

  21. The economic returns to a 3x compute efficiency will be measured in the $10s of billions or more, given cluster costs.↩
    计算效率提升 3 倍所带来的经济回报将以数十亿美元甚至更多来衡量,考虑到集群成本。 ↩

  22. Very roughly something like a ~10x gain.↩
    非常粗略地说,大概有 10 倍的收益。 ↩

  23.  And just rereading the same textbook over and over again might result in memorization, not understanding.
    而仅仅是一次又一次地重读同一本教科书可能会导致记忆,而不是理解。

    I take it that’s how many wordcels pass math classes!↩
    我猜这就是多少 wordcels 通过数学课的方式! ↩

  24.  One other way of thinking about it I find interesting: there is a “missing-middle” between pretraining and in-context learning.
    我觉得另一种思考方式也很有趣:在预训练和上下文学习之间存在“缺失的中层”。

    In-context learning is incredible (and competitive with human sample efficiency). For example, the Gemini 1.5 Pro paper discusses giving the model instructional materials (a textbook, a dictionary) on Kalamang, a language spoken by fewer than 200 people and basically not present on the internet, in context—and the model learns to translate from English to Kalamang at human-level!
    上下文学习非常惊人(并且与人类样本效率具有竞争力)。例如,Gemini 1.5 Pro 的论文讨论了在 Kalamang 语言上下文中给模型教学材料(教科书、词典),Kalamang 是一种只有不到 200 人说的语言,基本上不在互联网上出现,模型能够学会在上下文中将英语翻译成 Kalamang,达到人类水平!

    In context, the model is able to learn from the textbook as well as a human could (and much better than it would learn from just chucking that one textbook into pretraining).
    在上下文中,模型能够像人类一样从教科书中学习(并且比仅仅将那一本教科书投入预训练中学习得好得多)。


    When a human learns from a textbook, they’re able to distill their short-term memory/learnings into long-term memory/long-term skills with practice; however, we don’t have an equivalent way to distill in-context learning “back to the weights.” Synthetic data/self-play/RL/etc are trying to fix that: let the model learn by itself, then think about it and practice what it learned, distilling that learning back into the weights.↩
    当人类从教科书中学习时,他们能够通过练习将短期记忆/学习转化为长期记忆/长期技能;然而,我们没有一个相当的方法将上下文学习“蒸馏回权重”。合成数据/自我游戏/强化学习等正在尝试解决这个问题:让模型自己学习,然后思考并练习它所学的内容,将那种学习再蒸馏回权重中。 ↩

  25. See also Andrej Karpathy’s talk discussing this here.↩
    参见 Andrej Karpathy 的演讲,讨论这个问题。 ↩

  26. That’s the magic of unsupervised learning, in some sense: to better predict the next token, to make perplexity go down, models learn incredibly rich internal representations, everything from (famously) sentiment to complex world models.
    这就是无监督学习的魔力所在,从某种意义上来说:为了更好地预测下一个标记,降低困惑度,模型学习极其丰富的内部表示,从(著名的)情感到复杂的世界模型无所不包。

    But, out of the box, they’re hobbled: they’re using their incredible internal representations merely to predict the next token in random internet text, and rather than applying them in the best way to actually try to solve your problem.↩
    但是,开箱即用的情况下,它们受到了限制:它们仅仅是在使用它们惊人的内部表示来预测随机互联网文本中的下一个标记,而没有以最佳方式应用它们来真正尝试解决你的问题。 ↩

  27. See Figure 7 from the updated Gemini 1.5 whitepaper, comparing perplexity vs. context for Gemini 1.5 Pro and Gemini 1.5 Flash (a much cheaper and presumably smaller model).↩
    参见图 7,来自更新的 Gemini 1.5 白皮书,对比了 Gemini 1.5 Pro 和 Gemini 1.5 Flash(一个更便宜且假设规模更小的模型)的困惑度与上下文。 ↩

  28. People are working on this though!↩
    人们正在这方面努力! ↩

  29. Which makes sense—why would it have learned the skills for longer-horizon reasoning and error correction?
    这是有道理的——为什么它会学会更长远推理和错误校正的技能呢?

    There’s very little data on the internet in the form of “my complete internal monologue, reasoning, all the relevant steps over the course of a month as I work on a project.” Unlocking this capability will require a new kind of training, for it to learn these extra skills.
    互联网上几乎没有以“我在一个月内对一个项目工作的完整内心独白、推理、所有相关步骤”形式的数据。解锁这种能力将需要一种新的训练方式,让它学会这些额外的技能。


    Or as Gwern put it (private correspondence): “‘Brain the size of a galaxy, and what do they ask me to do?
    或者正如 Gwern 所说(私人通信):“‘大脑的体积相当于一个星系,他们却让我做什么?

    Predict the misspelled answers on benchmarks!’ Marvin the depressed neural network moaned.”↩
    "预测基准测试上的拼写错误答案!" 马文这个沮丧的神经网络哀叹。” ↩

  30.  System I vs. System II is a useful way of thinking about current capabilities of LLMs—including their limitations and dumb mistakes—and what might be possible with RL and unhobbling.
    系统 I 与系统 II 是思考当前 LLMs 能力的一种有用方式——包括它们的局限性和愚蠢错误——以及通过强化学习(RL)和解除限制可能实现的内容。

    Think of this way: when you are driving, most of the time you are on autopilot (system I, what models mostly do right now).
    这样思考:当你驾驶时,大部分时间你都是在自动驾驶(系统 I,目前模型主要在做的事情)。

    But when you encounter a complex construction zone or novel intersection, you might ask your passenger-seat-companion to pause your conversation for a moment while you figure out—actually think about—what’s going on and what to do.
    但当您遇到复杂的施工区域或新颖的交叉口时,您可能会要求坐在副驾驶座位上的同伴暂停一下您的对话,以便您思考一下当前的情况和应该怎么做。

    If you were forced to go about life with only system I (closer to models today), you’d have a lot of trouble.
    如果您被迫只依赖系统 I(更接近当今的模型)来生活,您会遇到很多麻烦。

    Creating the ability for system II reasoning loops is a central unlock.↩
    创建系统 II 推理循环的能力是一个关键解锁。 ↩

  31. On the best guess assumptions on physical compute and algorithmic efficiency scaleups described above, and simplifying parallelism considerations (in reality, it might look more like “1440 (60*24) GPT-4-level models in a day” or similar).↩
    在关于物理计算和算法效率规模提升的最佳猜测假设下,以及简化并行性考虑(实际上,可能更像是“一天内 1440(60*24)个 GPT-4 级别模型”或类似)。 ↩

  32. Of course, any benchmark we have today will be saturated.
    当然,我们今天拥有的任何基准都会被饱和。

    But that’s not saying much; it’s mostly a reflection on the difficulty of making hard-enough benchmarks.↩
    但这并没有说明太多;这主要是因为制作足够困难的基准的难度。 ↩