Dario Amodei

On DeepSeek and Export Controls

A few weeks ago I made the case for stronger US export controls on chips to China. Since then DeepSeek, a Chinese AI company, has managed to — at least in some respects — come close to the performance of US frontier AI models at lower cost.

Here, I won't focus on whether DeepSeek is or isn't a threat to US AI companies like Anthropic (although I do believe many of the claims about their threat to US AI leadership are greatly overstated)¹. Instead, I'll focus on whether DeepSeek's releases undermine the case for those export control policies on chips. I don't think they do. In fact, I think they make export control policies even more existentially important than they were a week ago².

Export controls serve a vital purpose: keeping democratic nations at the forefront of AI development. To be clear, they’re not a way to duck the competition between the US and China. In the end, AI companies in the US and other democracies must have better models than those in China if we want to prevail. But we shouldn't hand the Chinese Communist Party technological advantages when we don't have to.

Three Dynamics of AI Development

Before I make my policy argument, I'm going to describe three basic dynamics of AI systems that it's crucial to understand:

  1. Scaling laws. A property of AI — which I and my co-founders were among the first to document back when we worked at OpenAI — is that, all else equal, scaling up the training of AI systems leads to smoothly better results on a range of cognitive tasks, across the board. So, for example, a $1M model might solve 20% of important coding tasks, a $10M model might solve 40%, a $100M model might solve 60%, and so on. These differences tend to have huge implications in practice — another factor of 10 may correspond to the difference between an undergraduate and PhD skill level — and thus companies are investing heavily in training these models.

  2. Shifting the curve. The field is constantly coming up with ideas, large and small, that make things more effective or efficient: it could be an improvement to the architecture of the model (a tweak to the basic Transformer architecture that all of today's models use) or simply a way of running the model more efficiently on the underlying hardware. New generations of hardware also have the same effect. What this typically does is shift the curve: if the innovation is a 2x "compute multiplier" (CM), then it allows you to get 40% on a coding task for $5M instead of $10M; or 60% for $50M instead of $100M, etc. Every frontier AI company regularly discovers many of these CMs: frequently small ones (~1.2x), sometimes medium-sized ones (~2x), and every once in a while very large ones (~10x). Because the value of having a more intelligent system is so high, this shifting of the curve typically causes companies to spend more, not less, on training models: the gains in cost efficiency end up entirely devoted to training smarter models, limited only by the company's financial resources. People are naturally attracted to the idea that "first something is expensive, then it gets cheaper" — as if AI is a single thing of constant quality, and when it gets cheaper, we'll use fewer chips to train it. But what's important is the scaling curve: when it shifts, we simply traverse it faster, because the value of what's at the end of the curve is so high. In 2020, my team published a paper suggesting that the shift in the curve due to algorithmic progress is ~1.68x/year. That has probably sped up significantly since; it also doesn't take efficiency and hardware improvements into account. I'd guess the number today is maybe ~4x/year. Another estimate is here. Shifts in the training curve also shift the inference curve, and as a result large decreases in price, holding model quality constant, have been occurring for years. For instance, Claude 3.5 Sonnet, which was released 15 months later than the original GPT-4, outscores GPT-4 on almost all benchmarks, while having a ~10x lower API price. (A toy numerical sketch of this curve-shifting arithmetic follows this list.)

  3. Shifting the paradigm. Every once in a while, the underlying thing that is being scaled changes a bit, or a new type of scaling is added to the training process. From 2020-2023, the main thing being scaled was pretrained models: models trained on increasing amounts of internet text with a tiny bit of other training on top. In 2024, the idea of using reinforcement learning (RL) to train models to generate chains of thought has become a new focus of scaling. Anthropic, DeepSeek, and many other companies (perhaps most notably OpenAI, who released their o1-preview model in September) have found that this training greatly increases performance on certain select, objectively measurable tasks like math and coding competitions, and on reasoning that resembles these tasks. This new paradigm involves starting with the ordinary type of pretrained models, and then as a second stage using RL to add the reasoning skills. Importantly, because this type of RL is new, we are still very early on the scaling curve: the amount being spent on the second, RL stage is small for all players. Spending $1M instead of $0.1M is enough to get huge gains. Companies are now working very quickly to scale up the second stage to hundreds of millions and billions, but it's crucial to understand that we're at a unique "crossover point" where there is a powerful new paradigm that is early on the scaling curve and therefore can make big gains quickly.
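
To make the arithmetic in dynamics #1 and #2 concrete, here is a toy Python sketch. It assumes, purely for illustration, a log-linear curve that gains 20 percentage points per 10x of effective training compute (matching the $1M/20%, $10M/40%, $100M/60% example above); the function and its numbers are illustrative stand-ins, not a fit to any real benchmark.

    # Toy sketch of a scaling curve (dynamic #1) and of how a "compute multiplier"
    # shifts it (dynamic #2). The 20-points-per-10x slope, the 2x CM, and the
    # ~4x/year figure are the essay's illustrative numbers, not measured values.
    import math

    def task_success(train_cost_usd: float, compute_multiplier: float = 1.0) -> float:
        """Illustrative curve: +20 percentage points per 10x of effective compute."""
        effective = train_cost_usd * compute_multiplier
        return min(100.0, 20.0 + 20.0 * math.log10(effective / 1e6))

    for cost in (1e6, 1e7, 1e8):
        print(f"${cost:>11,.0f}: {task_success(cost):3.0f}% baseline, "
              f"{task_success(cost, 2.0):3.0f}% with a 2x CM")

    # A 2x CM lets a $5M run match the old $10M run (both score 40% on this toy curve):
    assert abs(task_success(5e6, 2.0) - task_success(1e7)) < 1e-9

    # Compounding ~4x/year of progress for three years buys the same result as
    # spending 4**3 = 64x more on the unimproved recipe:
    assert abs(task_success(1e7, 4.0 ** 3) - task_success(64 * 1e7)) < 1e-9

The sketch is only meant to show why efficiency gains shift where a given budget lands on the curve rather than shrinking the budgets themselves; as noted above, frontier labs have historically responded to such gains by training smarter models, not cheaper ones.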

DeepSeek's Models

The three dynamics above can help us understand DeepSeek's recent releases. About a month ago, DeepSeek released a model called "DeepSeek-V3" that was a pure pretrained model³ — the first stage described in #3 above. Then last week, they released "R1", which added a second stage. It's not possible to determine everything about these models from the outside, but the following is my best understanding of the two releases.

DeepSeek-V3 was actually the real innovation and what should have made people take notice a month ago (we certainly did). As a pretrained model, it appears to come close to the performance of⁴ state-of-the-art US models on some important tasks, while costing substantially less to train (although we find that Claude 3.5 Sonnet in particular remains much better on some other key tasks, such as real-world coding). DeepSeek's team did this via some genuine and impressive innovations, mostly focused on engineering efficiency. There were particularly innovative improvements in the management of an aspect called the "Key-Value cache", and in enabling a method called "mixture of experts" to be pushed further than it had been before.
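
For readers unfamiliar with the first of those terms, the following is a minimal, generic sketch of what a key-value (KV) cache does during Transformer decoding: each token's attention keys and values are computed once and stored, so generating the next token only requires attending over the cache instead of re-processing the whole prefix. It illustrates the standard mechanism the term refers to, not DeepSeek's particular improvements to it; the toy sizes and variable names are mine.

    # Generic, minimal illustration of a key-value (KV) cache in single-head attention.
    # It shows only the standard mechanism; all sizes and weights are toy values.
    import numpy as np

    d = 8                                        # toy hidden size
    rng = np.random.default_rng(0)
    W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

    def attend(q, K, V):
        """One query attends over all cached keys/values (softmax-weighted sum)."""
        scores = K @ q / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V

    K_cache = np.empty((0, d))                   # the cache starts empty...
    V_cache = np.empty((0, d))
    for step in range(5):                        # ...and grows by one row per generated token
        x = rng.standard_normal(d)               # stand-in for the newest token's embedding
        K_cache = np.vstack([K_cache, x @ W_k])  # keys/values are computed once, then reused
        V_cache = np.vstack([V_cache, x @ W_v])
        out = attend(x @ W_q, K_cache, V_cache)
        print(f"step {step}: attended over {len(K_cache)} cached positions")

Because this cache grows with context length, its memory footprint accounts for much of the cost of serving long prompts, which is why managing it more cleverly is a meaningful efficiency win.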

However, it's important to look closer:

R1, which is the model that was released last week and which triggered an explosion of public attention (including a ~17% decrease in Nvidia's stock price), is much less interesting from an innovation or engineering perspective than V3. It adds the second phase of training — reinforcement learning, described in #3 in the previous section — and essentially replicates what OpenAI has done with o1 (they appear to be at similar scale with similar results)⁸. However, because we are on the early part of the scaling curve, it’s possible for several companies to produce models of this type, as long as they’re starting from a strong pretrained model. Producing R1 given V3 was probably very cheap. We’re therefore at an interesting “crossover point”, where it is temporarily the case that several companies can produce good reasoning models. This will rapidly cease to be true as everyone moves further up the scaling curve on these models.

Export Controls

All of this is just a preamble to my main topic of interest: the export controls on chips to China. In light of the above facts, I see the situation as follows:

Given my focus on export controls and US national security, I want to be clear on one thing. I don't see DeepSeek themselves as adversaries and the point isn't to target them in particular. In interviews they've done, they seem like smart, curious researchers who just want to make useful technology.

But they're beholden to an authoritarian government that has committed human rights violations, has behaved aggressively on the world stage, and will be far more unfettered in these actions if they're able to match the US in AI. Export controls are one of our most powerful tools for preventing this, and the idea that the technology getting more powerful, having more bang for the buck, is a reason to lift our export controls makes no sense at all.

Footnotes

  1. I’m not taking any position on reports of distillation from Western models in this essay. Here, I’ll just take DeepSeek at their word that they trained it the way they said in the paper.

  2. Incidentally, I think the release of the DeepSeek models is clearly not bad for Nvidia, and that a double-digit (~17%) drop in their stock in reaction to this was baffling. The case for this release not being bad for Nvidia is even clearer than it not being bad for AI companies. But my main goal in this piece is to defend export control policies.

  3. To be completely precise, it was a pretrained model with the tiny amount of RL training typical of models before the reasoning paradigm shift.

  4. It is stronger on some very narrow tasks.

  5. This is the number quoted in DeepSeek's paper — I am taking it at face value, and not doubting this part of it, only the comparison to US company model training costs, and the distinction between the cost to train a specific model (which is the $6M) and the overall cost of R&D (which is much higher). However, we also can't be completely sure of the $6M — model size is verifiable, but other aspects, like quantity of tokens, are not.

  6. In some interviews I said they had "50,000 H100's", which was a subtly incorrect summary of the reporting and which I want to correct here. By far the best known "Hopper chip" is the H100 (which is what I assumed was being referred to), but Hopper also includes H800s and H20s, and DeepSeek is reported to have a mix of all three, adding up to 50,000. That doesn't change the situation much, but it's worth correcting. I'll discuss the H800 and H20 more when I talk about export controls.

  7. Note: I expect this gap to grow greatly on the next generation of clusters, because of export controls.

  8. I suspect one of the principal reasons R1 gathered so much attention is that it was the first model to show the user the chain-of-thought reasoning that the model exhibits (OpenAI's o1 only shows the final answer). DeepSeek showed that users find this interesting. To be clear, this is a user interface choice and is not related to the model itself.

  9. Note that China's own chips won't be able to compete with US-made chips any time soon. As I wrote in my recent op-ed with Matt Pottinger: "China's best AI chips, the Huawei Ascend series, are substantially less capable than the leading chip made by U.S.-based Nvidia. China also may not have the production capacity to keep pace with growing demand. There's not a single noteworthy cluster of Huawei Ascend chips outside China today, suggesting that China is struggling to meet its domestic needs...".

  10. To be clear, the goal here is not to deny China or any other authoritarian country the immense benefits in science, medicine, quality of life, etc. that come from very powerful AI systems. Everyone should be able to benefit from AI. The goal is to prevent them from gaining military dominance.

  11. Several links, as there have been several rounds. To cover some of the major actions: One, two, three, four.