It’s Monday, January 27. Why haven’t you written about DeepSeek yet?
I did! I wrote about R1 last Tuesday.
I totally forgot about that.
I take responsibility. I stand by the post, including the two biggest takeaways that I highlighted (emergent chain-of-thought via pure reinforcement learning, and the power of distillation), and I mentioned the low cost (which I expanded on in Sharp Tech) and chip ban implications, but those observations were too localized to the current state of the art in AI. What I totally failed to anticipate were the broader implications this news would have for the overall meta-discussion, particularly in terms of the U.S. and China.
Is there precedent for such a miss?
There is. In September 2023 Huawei announced the Mate 60 Pro with a SMIC-manufactured 7nm chip. The existence of this chip wasn’t a surprise for those paying close attention: SMIC had made a 7nm chip a year earlier (the existence of which I had noted even earlier than that), and TSMC had shipped 7nm chips in volume using nothing but DUV lithography (later iterations of 7nm were the first to use EUV). Intel had also made 10nm (TSMC 7nm equivalent) chips years earlier using nothing but DUV, but couldn’t do so with profitable yields; the idea that SMIC could ship 7nm chips using their existing equipment, particularly if they didn’t care about yields, wasn’t remotely surprising — to me, anyways.
What I totally failed to anticipate was the overwrought reaction in Washington D.C. The dramatic expansion in the chip ban that culminated in the Biden administration transforming chip sales to a permission-based structure was downstream from people not understanding the intricacies of chip production, and being totally blindsided by the Huawei Mate 60 Pro. I get the sense that something similar has happened over the last 72 hours: the details of what DeepSeek has accomplished — and what they have not — are less important than the reaction and what that reaction says about people’s pre-existing assumptions.
So what did DeepSeek announce?
The most proximate announcement to this weekend’s meltdown was R1, a reasoning model that is similar to OpenAI’s o1. However, many of the revelations that contributed to the meltdown — including DeepSeek’s training costs — actually accompanied the V3 announcement over Christmas. Moreover, many of the breakthroughs that undergirded V3 were actually revealed with the release of the V2 model last January.
Is this model naming convention the greatest crime that OpenAI has committed?
Second greatest; we’ll get to the greatest momentarily.
Let’s work backwards: what was the V2 model, and why was it important?
The DeepSeek-V2 model introduced two important breakthroughs: DeepSeekMoE and DeepSeekMLA. The “MoE” in DeepSeekMoE refers to “mixture of experts”. Some models, like GPT-3.5, activate the entire model during both training and inference; it turns out, however, that not every part of the model is necessary for the topic at hand. MoE splits the model into multiple “experts” and only activates the ones that are necessary; GPT-4 was a MoE model that was believed to have 16 experts with approximately 110 billion parameters each.
DeepSeekMoE, as implemented in V2, introduced important innovations on this concept, including differentiating between more finely-grained specialized experts, and shared experts with more generalized capabilities. Critically, DeepSeekMoE also introduced new approaches to load-balancing and routing during training; traditionally MoE increased communications overhead in training in exchange for efficient inference, but DeepSeek’s approach made training more efficient as well.
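To make the routing idea concrete, here is a minimal sketch of a sparse MoE layer with a couple of always-on shared experts plus top-k routing over many small specialized experts; the layer sizes, expert counts, and top-k value are invented for illustration and are not DeepSeek’s actual configuration:

```python
# Minimal sketch of an MoE layer with shared + fine-grained routed experts.
# All sizes and counts below are illustrative assumptions, not DeepSeek's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))

    def forward(self, x):
        return self.ff(x)

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=128, n_routed=64, n_shared=2, top_k=6):
        super().__init__()
        self.shared = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_shared)])
        self.routed = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_routed)])
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                        # x: (n_tokens, d_model)
        out = sum(e(x) for e in self.shared)     # shared experts see every token
        scores = F.softmax(self.router(x), dim=-1)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        for slot in range(self.top_k):           # only each token's top-k routed experts run
            idx, w = topk_idx[:, slot], topk_scores[:, slot:slot + 1]
            for e_id in idx.unique():
                mask = idx == e_id
                out[mask] += w[mask] * self.routed[int(e_id)](x[mask])
        return out

tokens = torch.randn(8, 512)
print(MoELayer()(tokens).shape)  # torch.Size([8, 512])
```

The point of the structure is that each token only pays for the shared experts plus its top-k routed experts, rather than the full parameter count.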
DeepSeekMLA was an even bigger breakthrough. One of the biggest limitations on inference is the sheer amount of memory required: you both need to load the model into memory and also load the entire context window. Context windows are particularly expensive in terms of memory, as every token requires both a key and corresponding value; DeepSeekMLA, or multi-head latent attention, makes it possible to compress the key-value store, dramatically decreasing memory usage during inference.
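Here is a toy illustration of the caching idea (not DeepSeek’s implementation): cache one small latent vector per token instead of full per-head keys and values, and up-project only when attention is computed. The dimensions below are invented, and the real method also treats rotary position embeddings separately, which is omitted here:

```python
# Toy illustration of the latent KV-cache idea behind multi-head latent attention.
# Dimensions are invented; RoPE handling and the attention computation itself are omitted.
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128

W_down = nn.Linear(d_model, d_latent, bias=False)           # compress hidden state -> latent
W_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # latent -> per-head keys
W_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # latent -> per-head values

hidden = torch.randn(4096, d_model)   # 4096 cached context tokens
latent_cache = W_down(hidden)         # this small latent is all that gets cached per token

# Memory per token: d_latent floats vs. 2 * n_heads * d_head for a plain KV cache.
print("latent cache floats/token:", d_latent)                # 128
print("plain KV cache floats/token:", 2 * n_heads * d_head)  # 2048

# At attention time, reconstruct keys and values from the cached latents.
k = W_up_k(latent_cache).view(-1, n_heads, d_head)
v = W_up_v(latent_cache).view(-1, n_heads, d_head)
print(k.shape, v.shape)  # torch.Size([4096, 16, 64]) each
```

The print statements show the trade: the per-token cache is one small latent instead of a full set of per-head keys and values.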
I’m not sure I understood any of that.
The key implications of these breakthroughs — and the part you need to understand — only became apparent with V3, which added a new approach to load balancing (further reducing communications overhead) and multi-token prediction in training (further densifying each training step, again reducing overhead): V3 was shockingly cheap to train. DeepSeek claimed the model training took 2,788 thousand H800 GPU hours, which, at a cost of $2/GPU hour, comes out to a mere $5.576 million.
That seems impossibly low.
DeepSeek is clear that these costs are only for the final training run, and exclude all other expenses; from the V3 paper:
Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
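As a quick sanity check, the quoted figures are internally consistent; a few lines of arithmetic reproduce the headline numbers (the 14.8 trillion token figure is the paper’s pre-training corpus size):

```python
# Reproduce the training-cost arithmetic quoted above (all inputs from the V3 paper).
tokens_trillions = 14.8        # V3's pre-training corpus
hours_per_trillion = 180_000   # H800 GPU-hours per trillion tokens
price_per_gpu_hour = 2.0       # assumed H800 rental price, $/hour

pretrain = hours_per_trillion * tokens_trillions   # 2,664,000 GPU-hours
total = pretrain + 119_000 + 5_000                 # + context extension + post-training
print(f"{total:,.0f} GPU-hours, ${total * price_per_gpu_hour / 1e6:.3f}M")
# -> 2,788,000 GPU-hours, $5.576M
```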
So no, you can’t replicate DeepSeek the company for $5.576 million.
I still don’t believe that number.
Actually, the burden of proof is on the doubters, at least once you understand the V3 architecture. Remember that bit about DeepSeekMoE: V3 has 671 billion parameters, but only 37 billion parameters in the active expert are computed per token; this equates to 333.3 billion FLOPs of compute per token. Here I should mention another DeepSeek innovation: while parameters were stored with BF16 or FP32 precision, they were reduced to FP8 precision for calculations; 2048 H800 GPUs have a capacity of 3.97 exaflops, i.e. 3.97 billion billion FLOPS. The training set, meanwhile, consisted of 14.8 trillion tokens; once you do all of the math it becomes apparent that 2.8 million H800 hours is sufficient for training V3. Again, this was just the final run, not the total cost, but it’s a plausible number.
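For those who want to do the math themselves, here is the rough back-of-envelope version using the numbers above; the implied utilization figure at the end is my own arithmetic, not something DeepSeek reports:

```python
# Back-of-envelope check of the claim, using the figures in the text above.
flops_per_token = 333.3e9   # active-parameter compute per token
tokens = 14.8e12            # training tokens
cluster_flops = 3.97e18     # 2048 H800s at FP8, FLOPs per second

total_flops = flops_per_token * tokens                   # ~4.9e24 FLOPs
ideal_hours = total_flops / cluster_flops / 3600 * 2048  # GPU-hours at 100% utilization
claimed_hours = 2_788_000

print(f"ideal GPU-hours: {ideal_hours:,.0f}")                      # ~707,000
print(f"implied utilization: {ideal_hours / claimed_hours:.0%}")   # ~25%, a realistic figure
```

In other words, at perfect utilization the run would need only about a quarter of the claimed GPU-hours, so 2.8 million hours implies an entirely plausible level of efficiency rather than an impossibly good one.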
Scale AI CEO Alexandr Wang said they have 50,000 H100s.
I don’t know where Wang got his information; I’m guessing he’s referring to this November 2024 tweet from Dylan Patel, which says that DeepSeek had “over 50k Hopper GPUs”. H800s, however, are Hopper GPUs, they just have much more constrained memory bandwidth than H100s because of U.S. sanctions.
Here’s the thing: a huge number of the innovations I explained above are about overcoming the lack of memory bandwidth implied in using H800s instead of H100s. Moreover, if you actually did the math on the previous question, you would realize that DeepSeek actually had an excess of computing; that’s because DeepSeek actually programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is actually impossible to do in CUDA. DeepSeek engineers had to drop down to PTX, a low-level instruction set for Nvidia GPUs that is basically like assembly language. This is an insane level of optimization that only makes sense if you are using H800s.
Meanwhile, DeepSeek also makes their models available for inference: that requires a whole bunch of GPUs above-and-beyond whatever was used for training.
So was this a violation of the chip ban?
Nope. H100s were prohibited by the chip ban, but not H800s. Everyone assumed that training leading edge models required more interchip memory bandwidth, but that is exactly what DeepSeek optimized both their model structure and infrastructure around.
Again, just to emphasize this point, all of the decisions DeepSeek made in the design of this model only make sense if you are constrained to the H800; if DeepSeek had access to H100s, they probably would have used a larger training cluster with much fewer optimizations specifically focused on overcoming the lack of bandwidth.
So V3 is a leading edge model?
It’s definitely competitive with OpenAI’s 4o and Anthropic’s Sonnet-3.5, and appears to be better than Llama’s biggest model. What does seem likely is that DeepSeek was able to distill those models to give V3 high quality tokens to train on.
What is distillation?
Distillation is a means of extracting understanding from another model; you can send inputs to the teacher model and record the outputs, and use that to train the student model. This is how you get models like GPT-4 Turbo from GPT-4. Distillation is easier for a company to do on its own models, because they have full access, but you can still do distillation in a somewhat more unwieldy way via API, or even, if you get creative, via chat clients.
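Mechanically the idea is simple; here is a minimal sketch with toy stand-in networks, where a small “student” is trained to match a larger “teacher’s” output distribution. A real pipeline would use actual language models, or recorded API outputs as training text; everything below is synthetic and purely illustrative:

```python
# Minimal distillation sketch: query a "teacher", record its output distribution,
# and train a smaller "student" to match it. Toy models and random inputs only.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_in = 1000, 64
teacher = nn.Sequential(nn.Linear(d_in, 512), nn.ReLU(), nn.Linear(512, vocab))
student = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, vocab))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(200):
    x = torch.randn(32, d_in)                         # stand-in for prompts
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x), dim=-1)  # record the teacher's outputs
    log_probs = F.log_softmax(student(x), dim=-1)
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean")  # match the teacher
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final distillation loss: {loss.item():.4f}")
```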
Distillation obviously violates the terms of service of various models, but the only way to stop it is to actually cut off access, via IP banning, rate limiting, etc. It’s assumed to be widespread in terms of model training, and is why there are an ever-increasing number of models converging on GPT-4o quality. This doesn’t mean that we know for a fact that DeepSeek distilled 4o or Claude, but frankly, it would be odd if they didn’t.
Distillation seems terrible for leading edge models.
It is! On the positive side, OpenAI and Anthropic and Google are almost certainly using distillation to optimize the models they use for inference for their consumer-facing apps; on the negative side, they are effectively bearing the entire cost of training the leading edge, while everyone else is free-riding on their investment.
Indeed, this is probably the core economic factor undergirding the slow divorce of Microsoft and OpenAI. Microsoft is interested in providing inference to its customers, but much less enthused about funding $100 billion data centers to train leading edge models that are likely to be commoditized long before that $100 billion is depreciated.
Is this why all of the Big Tech stock prices are down?
In the long run, model commoditization and cheaper inference — which DeepSeek has also demonstrated — is great for Big Tech. A world where Microsoft gets to provide inference to its customers for a fraction of the cost means that Microsoft has to spend less on data centers and GPUs, or, just as likely, sees dramatically higher usage given that inference is so much cheaper. Another big winner is Amazon: AWS has by-and-large failed to make their own quality model, but that doesn’t matter if there are very high quality open source models that they can serve at far lower costs than expected.
Apple is also a big winner. Dramatically decreased memory requirements for inference make edge inference much more viable, and Apple has the best hardware for exactly that. Apple Silicon uses unified memory, which means that the CPU, GPU, and NPU (neural processing unit) have access to a shared pool of memory; this means that Apple’s high-end hardware actually has the best consumer chip for inference (Nvidia gaming GPUs max out at 32GB of VRAM, while Apple’s chips go up to 192 GB of RAM).
Meta, meanwhile, is the biggest winner of all. I already laid out last fall how every aspect of Meta’s business benefits from AI; a big barrier to realizing that vision is the cost of inference, which means that dramatically cheaper inference — and dramatically cheaper training, given the need for Meta to stay on the cutting edge — makes that vision much more achievable.
Google, meanwhile, is probably in worse shape: a world of decreased hardware requirements lessens the relative advantage they have from TPUs. More importantly, a world of zero-cost inference increases the viability and likelihood of products that displace search; granted, Google gets lower costs as well, but any change from the status quo is probably a net negative.
I asked why the stock prices are down; you just painted a positive picture!
My picture is of the long run; today is the short run, and it seems likely the market is working through the shock of R1’s existence.
Wait, you haven’t even talked about R1 yet.
R1 is a reasoning model like OpenAI’s o1. It has the ability to think through a problem, producing much higher quality results, particularly in areas like coding, math, and logic (but I repeat myself).
Is this more impressive than V3?
Actually, the reason why I spent so much time on V3 is that it was the model that actually demonstrated a lot of the dynamics that seem to be generating so much surprise and controversy. R1 is notable, however, because o1 stood alone as the only reasoning model on the market, and the clearest sign that OpenAI was the market leader.
R1 undoes the o1 mythology in a couple of important ways. First, there is the fact that it exists. OpenAI does not have some sort of special sauce that can’t be replicated. Second, R1 — like all of DeepSeek’s models — has open weights (the problem with saying “open source” is that we don’t have the data that went into creating it). This means that instead of paying OpenAI to get reasoning, you can run R1 on the server of your choice, or even locally, at dramatically lower cost.
How did DeepSeek make R1?
DeepSeek actually made two models: R1 and R1-Zero. I actually think that R1-Zero is the bigger deal; as I noted above, it was my biggest focus in last Tuesday’s Update:
R1-Zero, though, is the bigger deal in my mind. From the paper:

In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO as the RL framework to improve model performance in reasoning. During training, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits super performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further improves to 86.7%, matching the performance of OpenAI-o1-0912.

Reinforcement learning is a technique where a machine learning model is given a bunch of data and a reward function. The classic example is AlphaGo, where DeepMind gave the model the rules of Go with the reward function of winning the game, and then let the model figure everything else on its own. This famously ended up working better than other more human-guided techniques.
LLMs to date, however, have relied on reinforcement learning with human feedback; humans are in the loop to help guide the model, navigate difficult choices where rewards aren’t obvious, etc. RLHF was the key innovation in transforming GPT-3 into ChatGPT, with well-formed paragraphs, answers that were concise and didn’t trail off into gibberish, etc.
R1-Zero, however, drops the HF part — it’s just reinforcement learning. DeepSeek gave the model a set of math, code, and logic questions, and set two reward functions: one for the right answer, and one for the right format that utilized a thinking process. Moreover, the technique was a simple one: instead of trying to evaluate step-by-step (process supervision), or doing a search of all possible answers (a la AlphaGo), DeepSeek encouraged the model to try several different answers at a time and then graded them according to the two reward functions.
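To make that recipe concrete, here is a hedged sketch of the grading step: sample several answers per question, score each with an accuracy reward and a format reward, and normalize within the group so better-than-average samples are reinforced. The tag names, reward definitions, and normalization are simplified stand-ins, not DeepSeek’s actual code:

```python
# Sketch of the two-reward grading described above. Reward definitions and the
# group-relative normalization are simplified illustrations, not DeepSeek's code.
import re
import statistics

def format_reward(completion: str) -> float:
    # Reward the "right format": a visible thinking block followed by an answer block.
    ok = re.fullmatch(r"<think>.+</think>\s*<answer>.+</answer>", completion, re.DOTALL)
    return 1.0 if ok else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    # Reward the "right answer": compare the extracted answer against the known solution.
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == ground_truth else 0.0

def group_advantages(completions: list[str], ground_truth: str) -> list[float]:
    # Grade several sampled answers at once and normalize within the group,
    # so better-than-average samples get positive weight in the policy update.
    rewards = [format_reward(c) + accuracy_reward(c, ground_truth) for c in completions]
    mean, std = statistics.mean(rewards), statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

samples = [
    "<think>2+2 is 4</think><answer>4</answer>",  # right answer, right format
    "<answer>4</answer>",                         # right answer, no thinking block
    "<think>maybe 5?</think><answer>5</answer>",  # wrong answer, right format
]
print(group_advantages(samples, "4"))
```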
What emerged is a model that developed reasoning and chains-of-thought on its own, including what DeepSeek called “Aha Moments”:

A particularly intriguing phenomenon observed during the training of DeepSeek-R1-Zero is the occurrence of an “aha moment”. This moment, as illustrated in Table 3, occurs in an intermediate version of the model. During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.

This moment is not only an “aha moment” for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The “aha moment” serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.

This is one of the most powerful affirmations yet of The Bitter Lesson: you don’t need to teach the AI how to reason, you can just give it enough compute and data and it will teach itself!

Well, almost: R1-Zero reasons, but in a way that humans have trouble understanding. Back to the introduction:
However, DeepSeek-R1-Zero encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.

This sounds a lot like what OpenAI did for o1: DeepSeek started the model out with a bunch of examples of chain-of-thought thinking so it could learn the proper format for human consumption, and then did the reinforcement learning to enhance its reasoning, along with a number of editing and refinement steps; the output is a model that appears to be very competitive with o1.
Here again it seems plausible that DeepSeek benefited from distillation, particularly in terms of training R1. That, though, is itself an important takeaway: we have a situation where AI models are teaching AI models, and where AI models are teaching themselves. We are watching the assembly of an AI takeoff scenario in realtime.
So are we close to AGI?
It definitely seems like it. This also explains why Softbank (and whatever investors Masayoshi Son brings together) would provide the funding for OpenAI that Microsoft will not: the belief that we are reaching a takeoff point where there will in fact be real returns towards being first.
But isn’t R1 now in the lead?
I don’t think so; this has been overstated. R1 is competitive with o1, although there do seem to be some holes in its capability that point towards some amount of distillation from o1-Pro. OpenAI, meanwhile, has demonstrated o3, a far more powerful reasoning model. DeepSeek is absolutely the leader in efficiency, but that is different than being the leader overall.
So why is everyone freaking out?
I think there are multiple factors. First, there is the shock that China has caught up to the leading U.S. labs, despite the widespread assumption that China isn’t as good at software as the U.S. This is probably the biggest thing I missed in my surprise over the reaction. The reality is that China has an extremely proficient software industry generally, and a very good track record in AI model building specifically.
Second is the low training cost for V3, and DeepSeek’s low inference costs. This part was a big surprise for me as well, to be sure, but the numbers are plausible. This, by extension, probably has everyone nervous about Nvidia, which obviously has a big impact on the market.
Third is the fact that DeepSeek pulled this off despite the chip ban. Again, though, while there are big loopholes in the chip ban, it seems likely to me that DeepSeek accomplished this with legal chips.
I own Nvidia! Am I screwed?
There are real challenges this news presents to the Nvidia story. Nvidia has two big moats:
- CUDA is the language of choice for anyone programming these models, and CUDA only works on Nvidia chips.
- Nvidia has a massive lead in terms of its ability to combine multiple chips together into one large virtual GPU.
These two moats work together. I noted above that if DeepSeek had access to H100s they probably would have used a larger cluster to train their model, simply because that would have been the easier option; the fact they didn’t, and were bandwidth constrained, drove a lot of their decisions in terms of both model architecture and their training infrastructure. Just look at the U.S. labs: they haven’t spent much time on optimization because Nvidia has been aggressively shipping ever more capable systems that accommodate their needs. The route of least resistance has simply been to pay Nvidia. DeepSeek, however, just demonstrated that another route is available: heavy optimization can produce remarkable results on weaker hardware and with lower memory bandwidth; simply paying Nvidia more isn’t the only way to make better models.
That noted, there are three factors still in Nvidia’s favor. First, how capable might DeepSeek’s approach be if applied to H100s, or upcoming GB100s? Just because they found a more efficient way to use compute doesn’t mean that more compute wouldn’t be useful. Second, lower inference costs should, in the long run, drive greater usage. Microsoft CEO Satya Nadella, in a late night tweet almost assuredly directed at the market, said exactly that:
Third, reasoning models like R1 and o1 derive their superior performance from using more compute. To the extent that increasing the power and capabilities of AI depends on more compute is the extent that Nvidia stands to benefit!
Still, it’s not all rosy. At a minimum DeepSeek’s efficiency and broad availability cast significant doubt on the most optimistic Nvidia growth story, at least in the near term. The payoffs from both model and infrastructure optimization also suggest there are significant gains to be had from exploring alternative approaches to inference in particular. For example, it might be much more plausible to run inference on a standalone AMD GPU, completely sidestepping AMD’s inferior chip-to-chip communications capability. Reasoning models also increase the payoff for inference-only chips that are even more specialized than Nvidia’s GPUs.
In short, Nvidia isn’t going anywhere; the Nvidia stock, however, is suddenly facing a lot more uncertainty that hasn’t been priced in. And that, by extension, is going to drag everyone down.
So what about the chip ban?
The easiest argument to make is that the importance of the chip ban has only been accentuated given the U.S.’s rapidly evaporating lead in software. Software and knowhow can’t be embargoed — we’ve had these debates and realizations before — but chips are physical objects and the U.S. is justified in keeping them away from China.
At the same time, there should be some humility about the fact that earlier iterations of the chip ban seem to have directly led to DeepSeek’s innovations. Those innovations, moreover, would extend to not just smuggled Nvidia chips or nerfed ones like the H800, but to Huawei’s Ascend chips as well. Indeed, you can very much make the case that the primary outcome of the chip ban is today’s crash in Nvidia’s stock price.
What concerns me is the mindset undergirding something like the chip ban: instead of competing through innovation in the future the U.S. is competing through the denial of innovation in the past. Yes, this may help in the short term — again, DeepSeek would be even more effective with more computing — but in the long run it simply sows the seeds for competition in an industry — chips and semiconductor equipment — over which the U.S. has a dominant position.
Like AI models?
AI models are a great example. I mentioned above I would get to OpenAI’s greatest crime, which I consider to be the 2023 Biden Executive Order on AI. I wrote in Attenuating Innovation:
The point is this: if you accept the premise that regulation locks in incumbents, then it sure is notable that the early AI winners seem the most invested in generating alarm in Washington, D.C. about AI. This despite the fact that their concern is apparently not sufficiently high to, you know, stop their work. No, they are the responsible ones, the ones who care enough to call for regulation; all the better if concerns about imagined harms kneecap inevitable competitors.
That paragraph was about OpenAI specifically, and the broader San Francisco AI community generally. For years now we have been subject to hand-wringing about the dangers of AI by the exact same people committed to building it — and controlling it. These alleged dangers were the impetus for OpenAI becoming closed back in 2019 with the release of GPT-2:
Due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale, we are only releasing a much smaller version of GPT-2 along with sampling code. We are not releasing the dataset, training code, or GPT-2 model weights…We are aware that some researchers have the technical capacity to reproduce and open source our results. We believe our release strategy limits the initial set of organizations who may choose to do this, and gives the AI community more time to have a discussion about the implications of such systems.

We also think governments should consider expanding or commencing initiatives to more systematically monitor the societal impact and diffusion of AI technologies, and to measure the progression in the capabilities of such systems. If pursued, these efforts could yield a better evidence base for decisions by AI labs and governments regarding publication decisions and AI policy more broadly.
The arrogance in this statement is only surpassed by the futility: here we are six years later, and the entire world has access to the weights of a dramatically superior model. OpenAI’s gambit for control — enforced by the U.S. government — has utterly failed. In the meantime, how much innovation has been foregone by virtue of leading edge models not having open weights? More generally, how much time and energy has been spent lobbying for a government-enforced moat that DeepSeek just obliterated, that would have been better devoted to actual innovation?
So you’re not worried about AI doom scenarios?
I definitely understand the concern, and just noted above that we are reaching the stage where AIs are training AIs and learning reasoning on their own. I recognize, though, that there is no stopping this train. More than that, this is exactly why openness is so important: we need more AIs in the world, not an unaccountable board ruling all of us.
Wait, why is China open-sourcing their model?
Well DeepSeek is, to be clear; CEO Liang Wenfeng said in a must-read interview that open source is key to attracting talent:
In the face of disruptive technologies, moats created by closed source are temporary. Even OpenAI’s closed source approach can’t prevent others from catching up. So we anchor our value in our team — our colleagues grow through this process, accumulate know-how, and form an organization and culture capable of innovation. That’s our moat.
Open source, publishing papers, in fact, do not cost us anything. For technical talent, having others follow your innovation gives a great sense of accomplishment. In fact, open source is more of a cultural behavior than a commercial one, and contributing to it earns us respect. There is also a cultural attraction for a company to do this.
The interviewer asked if this would change:
DeepSeek, right now, has a kind of idealistic aura reminiscent of the early days of OpenAI, and it’s open source. Will you change to closed source later on? Both OpenAI and Mistral moved from open-source to closed-source.
We will not change to closed source. We believe having a strong technical ecosystem first is more important.
This actually makes sense beyond idealism. If models are commodities — and they are certainly looking that way — then long-term differentiation comes from having a superior cost structure; that is exactly what DeepSeek has delivered, which itself is resonant of how China has come to dominate other industries. This is also contrary to how most U.S. companies think about differentiation, which is through having differentiated products that can sustain larger margins.
So is OpenAI screwed?
Not necessarily. ChatGPT made OpenAI the accidental consumer tech company, which is to say a product company; there is a route to building a sustainable consumer business on commoditizable models through some combination of subscriptions and advertisements. And, of course, there is the bet on winning the race to AI take-off.
Anthropic, on the other hand, is probably the biggest loser of the weekend. DeepSeek made it to number one in the App Store, simply highlighting how Claude, in contrast, hasn’t gotten any traction outside of San Francisco. The API business is doing better, but API businesses in general are the most susceptible to the commoditization trends that seem inevitable (and do note that OpenAI and Anthropic’s inference costs look a lot higher than DeepSeek because they were capturing a lot of margin; that’s going away).
So this is all pretty depressing, then?
Actually, no. I think that DeepSeek has provided a massive gift to nearly everyone. The biggest winners are consumers and businesses who can anticipate a future of effectively-free AI products and services. Jevons Paradox will rule the day in the long run, and everyone who uses AI will be the biggest winners.
Another set of winners are the big consumer tech companies. A world of free AI is a world where product and distribution matters most, and those companies already won that game; The End of the Beginning was right.
China is also a big winner, in ways that I suspect will only become apparent over time. Not only does the country have access to DeepSeek, but I suspect that DeepSeek’s relative success to America’s leading AI labs will result in a further unleashing of Chinese innovation as they realize they can compete.
That leaves America, and a choice we have to make. We could, for very logical reasons, double down on defensive measures, like massively expanding the chip ban and imposing a permission-based regulatory regime on chips and semiconductor equipment that mirrors the E.U.’s approach to tech; alternatively, we could realize that we have real competition, and actually give ourselves permission to compete. Stop wringing our hands, stop campaigning for regulations — indeed, go the other way, and cut out all of the cruft in our companies that has nothing to do with winning. If we choose to compete we can still win, and, if we do, we will have a Chinese company to thank.