On DeepSeek and Export Controls
A few weeks ago I
made the case
for stronger US export controls on chips to China. Since then
DeepSeek, a Chinese AI company, has managed to — at least in some
respects —
come close to the performance
of US frontier AI models at lower cost.
Here, I won't focus on whether DeepSeek is or isn't a threat to US AI
companies like Anthropic (although I do believe many of the claims
about their threat to US AI leadership are greatly overstated)1. Instead, I'll focus on whether DeepSeek's releases undermine the
case for those export control policies on chips. I don't think they
do. In fact,
I think they make export control policies even more existentially
important than they were a week ago2.
Export controls serve a vital purpose: keeping democratic nations at
the forefront of AI development. To be clear, they’re not a way to
duck the competition between the US and China. In the end, AI
companies in the US and other democracies must have better models than
those in China if we want to prevail. But we shouldn't hand the
Chinese Communist Party technological advantages when we don't have
to.
Three Dynamics of AI Development
Before I make my policy argument, I'm going to describe three basic
dynamics of AI systems that it's crucial to understand:
-
Scaling laws. A property of AI — which I and my co-founders were among the first to document back when we worked at OpenAI — is that, all else equal, scaling up the training of AI systems leads to smoothly better results on a range of cognitive tasks, across the board. So, for example, a $1M model might solve 20% of important coding tasks, a $10M model might solve 40%, a $100M model might solve 60%, and so on (a toy sketch of this curve, and of the curve-shifting described next, follows this list). These differences tend to have huge implications in practice — another factor of 10 may correspond to the difference between an undergraduate and PhD skill level — and thus companies are investing heavily in training these models.
-
Shifting the curve. The field is constantly coming up with ideas, large and small, that make things more effective or efficient: it could be an improvement to the architecture of the model (a tweak to the basic Transformer architecture that all of today's models use) or simply a way of running the model more efficiently on the underlying hardware. New generations of hardware also have the same effect. What this typically does is shift the curve: if the innovation is a 2x "compute multiplier" (CM), then it allows you to get 40% on a coding task for $5M instead of $10M; or 60% for $50M instead of $100M, etc. Every frontier AI company regularly discovers many of these CM's: frequently small ones (~1.2x), sometimes medium-sized ones (~2x), and every once in a while very large ones (~10x). Because the value of having a more intelligent system is so high, this shifting of the curve typically causes companies to spend more, not less, on training models: the gains in cost efficiency end up entirely devoted to training smarter models, limited only by the company's financial resources. People are naturally attracted to the idea that "first something is expensive, then it gets cheaper" — as if AI is a single thing of constant quality, and when it gets cheaper, we'll use fewer chips to train it. But what's important is the scaling curve: when it shifts, we simply traverse it faster, because the value of what's at the end of the curve is so high. In 2020, my team published a paper suggesting that the shift in the curve due to algorithmic progress is ~1.68x/year. That has probably sped up significantly since; it also doesn't take efficiency and hardware into account. I'd guess the number today is maybe ~4x/year. Another estimate is here. Shifts in the training curve also shift the inference curve, and as a result large decreases in price, holding constant the quality of model, have been occurring for years. For instance, Claude 3.5 Sonnet, which was released 15 months later than the original GPT-4, outscores GPT-4 on almost all benchmarks, while having a ~10x lower API price.
-
Shifting the paradigm. Every once in a while, the underlying thing that is being scaled changes a bit, or a new type of scaling is added to the training process. From 2020-2023, the main thing being scaled was pretrained models: models trained on increasing amounts of internet text with a tiny bit of other training on top. In 2024, the idea of using reinforcement learning (RL) to train models to generate chains of thought has become a new focus of scaling. Anthropic, DeepSeek, and many other companies (perhaps most notably OpenAI who released their o1-preview model in September) have found that this training greatly increases performance on certain select, objectively measurable tasks like math, coding competitions, and on reasoning that resembles these tasks. This new paradigm involves starting with the ordinary type of pretrained models, and then as a second stage using RL to add the reasoning skills. Importantly, because this type of RL is new, we are still very early on the scaling curve: the amount being spent on the second, RL stage is small for all players. Spending $1M instead of $0.1M is enough to get huge gains. Companies are now working very quickly to scale up the second stage to hundreds of millions and billions, but it's crucial to understand that we're at a unique "crossover point" where there is a powerful new paradigm that is early on the scaling curve and therefore can make big gains quickly.
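To make the first two dynamics concrete, here is a toy sketch in Python. Every number in it is illustrative, anchored only to the $1M/20%, $10M/40%, $100M/60% example in the first bullet; real scaling exponents are not public, and the log-linear form is a simplification.

```python
import math

def task_success(train_cost_usd: float, compute_multiplier: float = 1.0) -> float:
    """Toy scaling law: ~20 points of task success per 10x of effective compute.

    compute_multiplier is a "CM" as described above: a 2x CM makes every
    training dollar worth two dollars of effective compute, shifting the curve.
    """
    effective_compute = train_cost_usd * compute_multiplier
    return min(100.0, 20.0 + 20.0 * math.log10(effective_compute / 1e6))

for cost in (1e6, 1e7, 1e8):
    print(f"${cost:>11,.0f} -> baseline {task_success(cost):3.0f}%, "
          f"with a 2x CM {task_success(cost, 2.0):3.0f}%")

# The shifted curve reaches 40% at $5M instead of $10M, as in the bullet above:
print(f"$5M with a 2x CM: {task_success(5e6, 2.0):.0f}%")
```

On this toy curve a CM buys the same capability at a proportionally lower cost at every point, which is why efficiency gains get reinvested rather than pocketed: the curve shifts, and companies simply traverse it faster.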
DeepSeek's Models
The three dynamics above can help us understand DeepSeek's recent
releases. About a month ago, DeepSeek released a model called "DeepSeek-V3" that was a pure pretrained model3 —
the first stage described in #3 above. Then last week, they released
"R1", which
added a second stage. It's not possible to determine everything about
these models from the outside, but the following is my best
understanding of the two releases.
DeepSeek-V3 was actually the real innovation and what
should have made people take notice a month ago (we certainly
did). As a pretrained model, it appears to come
close to the performance of4
state-of-the-art US models on some important tasks, while costing
substantially less to train (although, we find that Claude 3.5 Sonnet
in particular remains much better on some other key tasks, such as
real-world coding). DeepSeek's team did this via some genuine and
impressive innovations, mostly focused on engineering efficiency.
There were particularly innovative improvements in the management of
an aspect called the "Key-Value cache", and in enabling a method
called "mixture of experts" to be pushed further than it had before.
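For readers unfamiliar with the two techniques named above, here is a toy illustration of the underlying ideas. This is not DeepSeek's design (their variants of both, described in their papers, are considerably more sophisticated); it is only a minimal sketch of what a Key-Value cache and a mixture-of-experts layer do.

```python
import numpy as np

# Key-Value cache: during generation, keys/values for past tokens are computed
# once and stored, so each new token attends over the cache instead of
# re-running attention from scratch. Managing this cache cheaply matters a lot.
class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key: np.ndarray, value: np.ndarray) -> None:
        self.keys.append(key)
        self.values.append(value)

    def attend(self, query: np.ndarray) -> np.ndarray:
        K, V = np.stack(self.keys), np.stack(self.values)
        scores = K @ query / np.sqrt(len(query))   # scaled dot-product attention
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                   # softmax over past tokens
        return weights @ V

# Mixture of experts: a router activates only the top-k expert networks per
# token, so most of the model's parameters sit idle on any given token.
def moe_layer(x, experts, router, top_k=2):
    logits = router @ x
    top = np.argsort(logits)[-top_k:]              # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                           # softmax over chosen experts
    return sum(g * experts[i](x) for g, i in zip(gates, top))

# Tiny demo with random vectors:
rng = np.random.default_rng(0)
cache = KVCache()
for _ in range(4):                                 # "generate" four tokens
    cache.append(rng.standard_normal(8), rng.standard_normal(8))
out = cache.attend(rng.standard_normal(8))
experts = [lambda v, w=rng.standard_normal((8, 8)): w @ v for _ in range(4)]
print(out.shape, moe_layer(out, experts, rng.standard_normal((4, 8))).shape)
```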
However, it's important to look closer:
-
DeepSeek does not "do for $6M5 what cost US AI companies billions". I can only speak for Anthropic, but Claude 3.5 Sonnet is a mid-sized model that cost a few $10M's to train (I won't give an exact number). Also, 3.5 Sonnet was not trained in any way that involved a larger or more expensive model (contrary to some rumors). Sonnet's training was conducted 9-12 months ago, and DeepSeek's model was trained in November/December, while Sonnet remains notably ahead in many internal and external evals. Thus, I think a fair statement is "DeepSeek produced a model close to the performance of US models 7-10 months older, for a good deal less cost (but not anywhere near the ratios people have suggested)".
-
If the historical trend of the cost curve decrease is ~4x per year, that means that in the ordinary course of business — in the normal trends of historical cost decreases like those that happened in 2023 and 2024 — we’d expect a model 3-4x cheaper than 3.5 Sonnet/GPT-4o around now. Since DeepSeek-V3 is worse than those US frontier models — let’s say by ~2x on the scaling curve, which I think is quite generous to DeepSeek-V3 — that means it would be totally normal, totally "on trend", if DeepSeek-V3 training cost ~8x less than the current US models developed a year ago (the short calculation after this list spells out this arithmetic). I’m not going to give a number, but it’s clear from the previous bullet point that even if you take DeepSeek’s training cost at face value, they are on-trend at best and probably not even that. For example, this is less steep than the original GPT-4 to Claude 3.5 Sonnet inference price differential (10x), and 3.5 Sonnet is a better model than GPT-4. All of this is to say that DeepSeek-V3 is not a unique breakthrough or something that fundamentally changes the economics of LLM’s; it’s an expected point on an ongoing cost reduction curve. What’s different this time is that the company that was first to demonstrate the expected cost reductions was Chinese. This has never happened before and is geopolitically significant. However, US companies will soon follow suit — and they won’t do this by copying DeepSeek, but because they too are achieving the usual trend in cost reduction.
-
Both DeepSeek and US AI companies have much more money and many more chips than they used to train their headline models. The extra chips are used for R&D to develop the ideas behind the model, and sometimes to train larger models that are not yet ready (or that needed more than one try to get right). It's been reported — we can't be certain it is true — that DeepSeek actually had 50,000 Hopper generation chips6, which I'd guess is within a factor ~2-3x of what the major US AI companies have (for example, it's 2-3x less than the xAI "Colossus" cluster)7. Those 50,000 Hopper chips cost on the order of ~$1B. Thus, DeepSeek's total spend as a company (as distinct from spend to train an individual model) is not vastly different from US AI labs.
-
It’s worth noting that the "scaling curve" analysis is a bit oversimplified, because models are somewhat differentiated and have different strengths and weaknesses; the scaling curve numbers are a crude average that ignores a lot of details. I can only speak to Anthropic’s models, but as I’ve hinted at above, Claude is extremely good at coding and at having a well-designed style of interaction with people (many people use it for personal advice or support). On these and some additional tasks, there’s just no comparison with DeepSeek. These factors don’t appear in the scaling numbers.
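The trend arithmetic in the second bullet can be spelled out in a few lines. All three inputs are the hedged assumptions stated there, not measured figures:

```python
yearly_cost_decline = 4.0  # assumed ~4x/year shift in the training cost curve
capability_gap      = 2.0  # assumed: V3 sits ~2x below the US frontier on the curve
years_elapsed       = 1.0  # the US models in question were trained ~a year earlier

on_trend_ratio = yearly_cost_decline ** years_elapsed * capability_gap
print(f"on-trend training cost reduction: ~{on_trend_ratio:.0f}x")  # ~8x
```

On those assumptions, training for ~8x less than year-old US frontier models is exactly what the trend predicts; that is the sense in which V3 is on the curve rather than off it.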
R1, which is the model that was released last week and which triggered
an explosion of public attention (including a
~17% decrease
in Nvidia's stock price), is much less interesting from an innovation
or engineering perspective than V3. It adds the second phase of
training — reinforcement learning, described in #3 in the previous
section — and essentially replicates what OpenAI has done with o1
(they appear to be at similar scale with similar results)8. However, because we are on the early part of the scaling curve,
it’s possible for several companies to produce models of this type, as
long as they’re starting from a strong pretrained model. Producing R1
given V3 was probably very cheap. We’re therefore at an interesting
“crossover point”, where it is temporarily the case that several
companies can produce good reasoning models. This will rapidly cease
to be true as everyone moves further up the scaling curve on these
models.
Export Controls
All of this is just a preamble to my main topic of interest: the
export controls on chips to China. In light of the above facts, I see
the situation as follows:
-
There is an ongoing trend where companies spend more and more on training powerful AI models, even as the curve is periodically shifted and the cost of training a given level of model intelligence declines rapidly. It's just that the economic value of training more and more intelligent models is so great that any cost gains are more than eaten up almost immediately — they're poured back into making even smarter models for the same huge cost we were originally planning to spend. To the extent that US labs haven't already discovered them, the efficiency innovations DeepSeek developed will soon be applied by both US and Chinese labs to train multi-billion dollar models. These will perform better than the multi-billion models they were previously planning to train — but they'll still spend multi-billions. That number will continue going up, until we reach AI that is smarter than almost all humans at almost all things.
-
Making AI that is smarter than almost all humans at almost all things will require millions of chips, tens of billions of dollars (at least), and is most likely to happen in 2026-2027. DeepSeek's releases don't change this, because they're roughly on the expected cost reduction curve that has always been factored into these calculations. (A back-of-the-envelope check of these magnitudes, using the chip-cost figures above, follows this list.)
-
This means that in 2026-2027 we could end up in one of two starkly different worlds. In the US, multiple companies will definitely have the required millions of chips (at the cost of tens of billions of dollars). The question is whether China will also be able to get millions of chips9.
-
If they can, we'll live in a bipolar world, where both
the US and China have powerful AI models that will cause
extremely rapid advances in science and technology — what I've
called "countries of geniuses in a datacenter". A bipolar world would not necessarily be balanced
indefinitely. Even if the US and China were at parity in AI
systems, it seems likely that China could direct more talent,
capital, and focus to military applications of the technology.
Combined with its large industrial base and military-strategic
advantages, this could help China take a commanding lead on the
global stage, not just for AI but for everything.
-
If China can't get millions of chips, we'll (at least
temporarily) live in a unipolar world, where only the
US and its allies have these models. It's unclear whether the
unipolar world will last, but there's at least the possibility
that,
because AI systems can eventually help make even smarter AI
systems, a temporary lead could be parlayed into a durable
advantage10. Thus, in this world, the US and its allies might take a
commanding and long-lasting lead on the global stage.
-
Well-enforced export controls11 are the only thing that can prevent China from getting millions of chips, and are therefore the most important determinant of whether we end up in a unipolar or bipolar world.
-
The performance of DeepSeek does not mean the export controls failed. As I stated above, DeepSeek had a moderate-to-large number of chips, so it's not surprising that they were able to develop and then train a powerful model. They were not substantially more resource-constrained than US AI companies, and the export controls were not the main factor causing them to "innovate". They are simply very talented engineers and show why China is a serious competitor to the US.
-
DeepSeek also does not show that China can always obtain the chips it needs via smuggling, or that the controls always have loopholes. I don't believe the export controls were ever designed to prevent China from getting a few tens of thousands of chips. $1B of economic activity can be hidden, but it's hard to hide $100B or even $10B. A million chips may also be physically difficult to smuggle. It's also instructive to look at the chips DeepSeek is currently reported to have. This is a mix of H100's, H800's, and H20's, according to SemiAnalysis, adding up to 50k total. H100's have been banned under the export controls since their release, so if DeepSeek has any they must have been smuggled (note that Nvidia has stated that DeepSeek's advances are "fully export control compliant"). H800's were allowed under the initial round of 2022 export controls, but were banned in Oct 2023 when the controls were updated, so these were probably shipped before the ban. H20's are less efficient for training and more efficient for sampling — and are still allowed, although I think they should be banned. All of that is to say that it appears that a substantial fraction of DeepSeek's AI chip fleet consists of chips that haven't been banned (but should be); chips that were shipped before they were banned; and some that seem very likely to have been smuggled. This shows that the export controls are actually working and adapting: loopholes are being closed; otherwise, they would likely have a full fleet of top-of-the-line H100's. If we can close them fast enough, we may be able to prevent China from getting millions of chips, increasing the likelihood of a unipolar world with the US ahead.
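As a back-of-the-envelope check on those magnitudes, the per-chip cost implied earlier in this piece (50,000 Hopper chips costing on the order of ~$1B) is consistent with "millions of chips, tens of billions of dollars":

```python
fleet_cost_usd = 1e9        # from above: ~$1B for 50,000 Hopper-generation chips
fleet_size     = 50_000
usd_per_chip   = fleet_cost_usd / fleet_size   # ~$20k per chip
chips_needed   = 1_000_000                     # "millions of chips"

cluster_cost = usd_per_chip * chips_needed
print(f"~${usd_per_chip:,.0f}/chip -> ~${cluster_cost / 1e9:.0f}B for a million chips")
```

Chips alone already put a million-chip build in the tens of billions, before counting datacenters, power, and networking, which is why "(at least)" is doing real work in that sentence.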
Given my focus on export controls and US national security, I want to
be clear on one thing. I don't see DeepSeek themselves as adversaries
and the point isn't to target them in particular. In interviews
they've done, they seem like smart, curious researchers who just want
to make useful technology.
But they're beholden to an authoritarian government that has committed
human rights violations, has behaved aggressively on the world stage,
and will be far more unfettered in these actions if they're able to
match the US in AI. Export controls are one of
our most powerful tools
for preventing this, and the idea that the technology getting
more powerful, with more bang for the buck, is a
reason to lift our export controls makes no sense at all.
Footnotes
-
1I’m not taking any position on reports of distillation from Western models in this essay. Here, I’ll just take DeepSeek at their word that they trained it the way they said in the paper. ↩
-
2Incidentally, I think the release of the DeepSeek models is clearly not bad for Nvidia, and that a double-digit (~17%) drop in their stock in reaction to this was baffling. The case for this release not being bad for Nvidia is even clearer than it not being bad for AI companies. But my main goal in this piece is to defend export control policies. ↩
-
3To be completely precise, it was a pretrained model with the tiny amount of RL training typical of models before the reasoning paradigm shift. ↩
-
4It is stronger on some very narrow tasks. ↩
-
5This is the number quoted in DeepSeek's paper — I am taking it at face value, and not doubting this part of it, only the comparison to US company model training costs, and the distinction between the cost to train a specific model (which is the $6M) and the overall cost of R&D (which is much higher). However, we also can't be completely sure of the $6M — model size is verifiable but other aspects like quantity of tokens are not. ↩
-
6In some interviews I said they had "50,000 H100's" which was a subtly incorrect summary of the reporting and which I want to correct here. By far the best known "Hopper chip" is the H100 (which is what I assumed was being referred to), but Hopper also includes H800's, and H20's, and DeepSeek is reported to have a mix of all three, adding up to 50,000. That doesn't change the situation much, but it's worth correcting. I'll discuss the H800 and H20 more when I talk about export controls. ↩
-
7Note: I expect this gap to grow greatly on the next generation of clusters, because of export controls. ↩
-
8I suspect one of the principal reasons R1 gathered so much attention is that it was the first model to show the user the chain-of-thought reasoning that the model exhibits (OpenAI's o1 only shows the final answer). DeepSeek showed that users find this interesting. To be clear this is a user interface choice and is not related to the model itself. ↩
-
9Note that China's own chips won't be able to compete with US-made chips any time soon. As I wrote in my recent op-ed with Matt Pottinger: "China's best AI chips, the Huawei Ascend series, are substantially less capable than the leading chip made by U.S.-based Nvidia. China also may not have the production capacity to keep pace with growing demand. There's not a single noteworthy cluster of Huawei Ascend chips outside China today, suggesting that China is struggling to meet its domestic needs...". ↩
-
10To be clear, the goal here is not to deny China or any other authoritarian country the immense benefits in science, medicine, quality of life, etc. that come from very powerful AI systems. Everyone should be able to benefit from AI. The goal is to prevent them from gaining military dominance. ↩
-
11Several links, as there have been several rounds. To cover some of the major actions: One, two, three, four. ↩