It’s Monday, January 27. Why haven’t you written about DeepSeek yet?
I did! I wrote about R1 last Tuesday.
I totally forgot about that.
I take responsibility. I stand by the post, including the two biggest takeaways that I highlighted (emergent chain-of-thought via pure reinforcement learning, and the power of distillation), and I mentioned the low cost (which I expanded on in Sharp Tech) and chip ban implications, but those observations were too localized to the current state of the art in AI. What I totally failed to anticipate were the broader implications this news would have for the overall meta-discussion, particularly in terms of the U.S. and China.
Is there precedent for such a miss?
There is. In September 2023 Huawei announced the Mate 60 Pro with a SMIC-manufactured 7nm chip. The existence of this chip wasn’t a surprise for those paying close attention: SMIC had made a 7nm chip a year earlier (the existence of which I had noted even earlier than that), and TSMC had shipped 7nm chips in volume using nothing but DUV lithography (later iterations of 7nm were the first to use EUV). Intel had also made 10nm (TSMC 7nm equivalent) chips years earlier using nothing but DUV, but couldn’t do so with profitable yields; the idea that SMIC could ship 7nm chips using their existing equipment, particularly if they didn’t care about yields, wasn’t remotely surprising — to me, anyways.
What I totally failed to anticipate was the overwrought reaction in Washington D.C. The dramatic expansion in the chip ban that culminated in the Biden administration transforming chip sales to a permission-based structure was downstream from people not understanding the intricacies of chip production, and being totally blindsided by the Huawei Mate 60 Pro. I get the sense that something similar has happened over the last 72 hours: the details of what DeepSeek has accomplished — and what they have not — are less important than the reaction and what that reaction says about people’s pre-existing assumptions.
So what did DeepSeek announce?
The most proximate announcement to this weekend’s meltdown was R1, a reasoning model that is similar to OpenAI’s o1. However, many of the revelations that contributed to the meltdown — including DeepSeek’s training costs — actually accompanied the V3 announcement over Christmas. Moreover, many of the breakthroughs that undergirded V3 were actually revealed with the release of the V2 model last January.
Is this model naming convention the greatest crime that OpenAI has committed?
Second greatest; we’ll get to the greatest momentarily.
Let’s work backwards: what was the V2 model, and why was it important?
The DeepSeek-V2 model introduced two important breakthroughs: DeepSeekMoE and DeepSeekMLA. The “MoE” in DeepSeekMoE refers to “mixture of experts”. Some models, like GPT-3.5, activate the entire model during both training and inference; it turns out, however, that not every part of the model is necessary for the topic at hand. MoE splits the model into multiple “experts” and only activates the ones that are necessary; GPT-4 was a MoE model that was believed to have 16 experts with approximately 110 billion parameters each.
DeepSeekMoE, as implemented in V2, introduced important innovations on this concept, including differentiating between more finely-grained specialized experts, and shared experts with more generalized capabilities. Critically, DeepSeekMoE also introduced new approaches to load-balancing and routing during training; traditionally MoE increased communications overhead in training in exchange for efficient inference, but DeepSeek’s approach made training more efficient as well.
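To make the mixture-of-experts idea concrete, here is a toy routing sketch in Python (random weights and toy dimensions, purely illustrative of the mechanism rather than DeepSeek’s actual architecture): a router scores the experts for each token, only the top-k specialized experts run, and a shared expert always runs.

```python
# Toy mixture-of-experts routing (illustrative only; not DeepSeek's implementation).
# A router scores every expert per token; only the top-k specialized experts plus a
# small set of always-on shared experts are evaluated, so most parameters stay idle.
import numpy as np

rng = np.random.default_rng(0)

D = 16           # hidden dimension (toy size)
N_EXPERTS = 8    # routed "specialized" experts
N_SHARED = 1     # always-active shared experts
TOP_K = 2        # experts activated per token

# Each "expert" here is just a random matrix standing in for a feed-forward block.
experts = [rng.normal(size=(D, D)) for _ in range(N_EXPERTS)]
shared = [rng.normal(size=(D, D)) for _ in range(N_SHARED)]
router = rng.normal(size=(D, N_EXPERTS))

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector x through its top-k experts plus the shared experts."""
    logits = x @ router
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top = np.argsort(probs)[-TOP_K:]                       # best-scoring experts
    out = sum(probs[i] * (x @ experts[i]) for i in top)    # weighted expert outputs
    out += sum(x @ w for w in shared)                      # shared experts always run
    return out

token = rng.normal(size=D)
print(moe_forward(token).shape)  # (16,); only 2 of the 8 routed experts did any work
```

In a real model each expert is a full feed-forward block with billions of parameters; the savings come from the fact that only the selected experts’ weights are touched for any given token.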
DeepSeekMLA was an even bigger breakthrough. One of the biggest limitations on inference is the sheer amount of memory required: you both need to load the model into memory and also load the entire context window. Context windows are particularly expensive in terms of memory, as every token requires both a key and corresponding value; DeepSeekMLA, or multi-head latent attention, makes it possible to compress the key-value store, dramatically decreasing memory usage during inference.
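To see why this matters, here is some back-of-envelope arithmetic; the layer count and dimensions below are illustrative assumptions, not V3’s actual configuration, but the ratio is the point: caching a full key and value per token per layer adds up fast, while caching a small compressed latent, as MLA does, shrinks the bill dramatically.

```python
# Back-of-envelope KV-cache arithmetic (illustrative numbers, not DeepSeek's config).
# Standard attention caches a key and a value vector per token, per layer; MLA caches
# a much smaller compressed latent per token and reconstructs keys/values from it.

def kv_cache_gb(n_tokens, n_layers, dims_per_layer, bytes_per_value=2):
    """Memory needed to cache per-token state for a context of n_tokens, in GB."""
    return n_tokens * n_layers * dims_per_layer * bytes_per_value / 1e9

ctx = 128_000          # context window, in tokens
layers = 60            # assumed layer count
full_kv = 2 * 8_192    # key + value, assuming 8,192 dims each per layer
latent = 512           # assumed compressed latent per layer (MLA-style)

print(f"standard KV cache:  {kv_cache_gb(ctx, layers, full_kv):6.1f} GB")
print(f"compressed latent:  {kv_cache_gb(ctx, layers, latent):6.1f} GB")
```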
I’m not sure I understood any of that.
The key implications of these breakthroughs — and the part you need to understand — only became apparent with V3, which added a new approach to load balancing (further reducing communications overhead) and multi-token prediction in training (further densifying each training step, again reducing overhead): V3 was shockingly cheap to train. DeepSeek claimed the model training took 2,788 thousand H800 GPU hours, which, at a cost of $2/GPU hour, comes out to a mere $5.576 million.
That seems impossibly low.
DeepSeek is clear that these costs are only for the final training run, and exclude all other expenses; from the V3 paper:
Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
So no, you can’t replicate DeepSeek the company for $5.576 million.
I still don’t believe that number.
Actually, the burden of proof is on the doubters, at least once you understand the V3 architecture. Remember that bit about DeepSeekMoE: V3 has 671 billion parameters, but only 37 billion parameters in the active experts are computed per token; this equates to 333.3 billion FLOPs of compute per token. Here I should mention another DeepSeek innovation: while parameters were stored with BF16 or FP32 precision, they were reduced to FP8 precision for calculations; 2048 H800 GPUs have a capacity of 3.97 exaflops, i.e. 3.97 billion billion FLOPS. The training set, meanwhile, consisted of 14.8 trillion tokens; once you do all of the math it becomes apparent that 2.8 million H800 hours is sufficient for training V3. Again, this was just the final run, not the total cost, but it’s a plausible number.
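Here is that math as a quick sanity check, using only the figures cited above; the one derived quantity is the hardware utilization the claim implies, which comes out to roughly a quarter of peak, aggressive but believable for a heavily optimized run.

```python
# Sanity check on the claimed training budget, using the figures cited above.
flops_per_token = 333.3e9       # compute per token with ~37B active parameters
tokens = 14.8e12                # training set size
cluster_flops = 3.97e18         # 2,048 H800s at FP8, in FLOPS
gpu_hours_claimed = 2_788_000   # DeepSeek's reported total
price_per_gpu_hour = 2.0

required = flops_per_token * tokens                              # ~4.9e24 FLOPs
available = (gpu_hours_claimed / 2048) * 3600 * cluster_flops    # cluster-seconds * FLOPS

print(f"required:  {required:.2e} FLOPs")
print(f"available: {available:.2e} FLOPs at 100% utilization")
print(f"implied utilization: {required / available:.0%}")
print(f"claimed cost: ${gpu_hours_claimed * price_per_gpu_hour / 1e6:.3f}M")
```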
Scale AI CEO Alexandr Wang said they have 50,000 H100s.
I don’t know where Wang got his information; I’m guessing he’s referring to this November 2024 tweet from Dylan Patel, which says that DeepSeek had “over 50k Hopper GPUs”. H800s, however, are Hopper GPUs; they just have much more constrained chip-to-chip interconnect bandwidth than H100s because of U.S. sanctions.
Here’s the thing: a huge number of the innovations I explained above are about overcoming the lack of interconnect bandwidth implied in using H800s instead of H100s. Moreover, if you actually did the math on the previous question, you would realize that DeepSeek actually had an excess of compute; that’s because DeepSeek actually programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is actually impossible to do in CUDA. DeepSeek engineers had to drop down to PTX, a low-level instruction set for Nvidia GPUs that is basically like assembly language. This is an insane level of optimization that only makes sense if you are using H800s.
Meanwhile, DeepSeek also makes their models available for inference: that requires a whole bunch of GPUs above-and-beyond whatever was used for training.
So was this a violation of the chip ban?
Nope. H100s were prohibited by the chip ban, but not H800s. Everyone assumed that training leading edge models required more interchip bandwidth, but that constraint is exactly what DeepSeek optimized both their model structure and infrastructure around.
Again, just to emphasize this point, all of the decisions DeepSeek made in the design of this model only make sense if you are constrained to the H800; if DeepSeek had access to H100s, they probably would have used a larger training cluster with far fewer optimizations specifically focused on overcoming the lack of bandwidth.
So V3 is a leading edge model?
It’s definitely competitive with OpenAI’s 4o and Anthropic’s Sonnet-3.5, and appears to be better than Llama’s biggest model. What does seem likely is that DeepSeek was able to distill those models to give V3 high quality tokens to train on.
What is distillation?
Distillation is a means of extracting understanding from another model; you can send inputs to the teacher model and record the outputs, and use that to train the student model. This is how you get models like GPT-4 Turbo from GPT-4. Distillation is easier for a company to do on its own models, because they have full access, but you can still do distillation in a somewhat more unwieldy way via API, or even, if you get creative, via chat clients.
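Mechanically, it is little more than collecting prompt/output pairs at scale. The sketch below shows the shape of it; query_teacher is a deliberate placeholder for however you reach the teacher model (its own API, a hosted endpoint, or a scripted chat client), not a real library call.

```python
# Minimal distillation data-collection sketch. `query_teacher` is a placeholder for
# whatever access you have to the teacher model; the point is simply that
# (prompt, teacher output) pairs become supervised training data for the student.
import json

def query_teacher(prompt: str) -> str:
    """Placeholder: send `prompt` to the teacher model and return its completion."""
    raise NotImplementedError("wire this up to the teacher model you have access to")

def build_distillation_set(prompts, out_path="distill.jsonl"):
    with open(out_path, "w") as f:
        for prompt in prompts:
            completion = query_teacher(prompt)
            # Each record is one supervised example for fine-tuning the student.
            f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")

# The student model is then fine-tuned on distill.jsonl with ordinary supervised training.
```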
Distillation obviously violates the terms of service of various models, but the only way to stop it is to actually cut off access, via IP banning, rate limiting, etc. It’s assumed to be widespread in terms of model training, and is why there are an ever-increasing number of models converging on GPT-4o quality. This doesn’t mean that we know for a fact that DeepSeek distilled 4o or Claude, but frankly, it would be odd if they didn’t.
Distillation seems terrible for leading edge models.
It is! On the positive side, OpenAI and Anthropic and Google are almost certainly using distillation to optimize the models they use for inference for their consumer-facing apps; on the negative side, they are effectively bearing the entire cost of training the leading edge, while everyone else is free-riding on their investment.
Indeed, this is probably the core economic factor undergirding the slow divorce of Microsoft and OpenAI. Microsoft is interested in providing inference to its customers, but much less enthused about funding $100 billion data centers to train leading edge models that are likely to be commoditized long before that $100 billion is depreciated.
Is this why all of the Big Tech stock prices are down?
In the long run, model commoditization and cheaper inference — which DeepSeek has also demonstrated — is great for Big Tech. A world where Microsoft gets to provide inference to its customers for a fraction of the cost means that Microsoft has to spend less on data centers and GPUs, or, just as likely, sees dramatically higher usage given that inference is so much cheaper. Another big winner is Amazon: AWS has by-and-large failed to make their own quality model, but that doesn’t matter if there are very high quality open source models that they can serve at far lower costs than expected.
Apple is also a big winner. Dramatically decreased memory requirements for inference make edge inference much more viable, and Apple has the best hardware for exactly that. Apple Silicon uses unified memory, which means that the CPU, GPU, and NPU (neural processing unit) have access to a shared pool of memory; this means that Apple’s high-end hardware actually has the best consumer chip for inference (Nvidia gaming GPUs max out at 32GB of VRAM, while Apple’s chips go up to 128 GB of RAM).
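Some rough arithmetic shows why the memory ceiling is the binding constraint; the parameter counts and precisions here are illustrative, and the totals ignore the KV cache and activations, but the comparison holds.

```python
# Rough sizing: what fits in 32 GB of VRAM versus 128 GB of unified memory.
def model_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for params, bits in [(70, 16), (70, 8), (70, 4), (671, 4)]:
    size = model_gb(params, bits)
    print(f"{params}B @ {bits:2d}-bit = {size:6.1f} GB | "
          f"fits in 32 GB: {'yes' if size <= 32 else 'no':3s} | "
          f"fits in 128 GB: {'yes' if size <= 128 else 'no'}")
```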
Meta, meanwhile, is the biggest winner of all. I already laid out last fall how every aspect of Meta’s business benefits from AI; a big barrier to realizing that vision is the cost of inference, which means that dramatically cheaper inference — and dramatically cheaper training, given the need for Meta to stay on the cutting edge — makes that vision much more achievable.
Google, meanwhile, is probably in worse shape: a world of decreased hardware requirements lessens the relative advantage they have from TPUs. More importantly, a world of zero-cost inference increases the viability and likelihood of products that displace search; granted, Google gets lower costs as well, but any change from the status quo is probably a net negative.
I asked why the stock prices are down; you just painted a positive picture!
My picture is of the long run; today is the short run, and it seems likely the market is working through the shock of R1’s existence.
Wait, you haven’t even talked about R1 yet.
R1 is a reasoning model like OpenAI’s o1. It has the ability to think through a problem, producing much higher quality results, particularly in areas like coding, math, and logic (but I repeat myself).
Is this more impressive than V3?
Actually, the reason why I spent so much time on V3 is that that was the model that actually demonstrated a lot of the dynamics that seem to be generating so much surprise and controversy. R1 is notable, however, because o1 stood alone as the only reasoning model on the market, and the clearest sign that OpenAI was the market leader.
R1 undoes the o1 mythology in a couple of important ways. First, there is the fact that it exists. OpenAI does not have some sort of special sauce that can’t be replicated. Second, R1 — like all of DeepSeek’s models — has open weights (the problem with saying “open source” is that we don’t have the data that went into creating it). This means that instead of paying OpenAI to get reasoning, you can run R1 on the server of your choice, or even locally, at dramatically lower cost.
How did DeepSeek make R1?
DeepSeek actually made two models: R1 and R1-Zero. I actually think that R1-Zero is the bigger deal; as I noted above, it was my biggest focus in last Tuesday’s Update:
R1-Zero, though, is the bigger deal in my mind. From the paper:

In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO as the RL framework to improve model performance in reasoning. During training, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits super performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further improves to 86.7%, matching the performance of OpenAI-o1-0912.
Reinforcement learning is a technique where a machine learning model is given a bunch of data and a reward function. The classic example is AlphaGo, where DeepMind gave the model the rules of Go with the reward function of winning the game, and then let the model figure everything else out on its own. This famously ended up working better than other more human-guided techniques.

LLMs to date, however, have relied on reinforcement learning with human feedback; humans are in the loop to help guide the model, navigate difficult choices where rewards aren’t obvious, etc. RLHF was the key innovation in transforming GPT-3 into ChatGPT, with well-formed paragraphs, answers that were concise and didn’t trail off into gibberish, etc.
R1-Zero, however, drops the HF part — it’s just reinforcement learning. DeepSeek gave the model a set of math, code, and logic questions, and set two reward functions: one for the right answer, and one for the right format that utilized a thinking process. Moreover, the technique was a simple one: instead of trying to evaluate step-by-step (process supervision), or doing a search of all possible answers (a la AlphaGo), DeepSeek encouraged the model to try several different answers at a time and then graded them according to the two reward functions.
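Here is a toy sketch of that sampling-and-grading loop; the tags, reward values, and grading rules are illustrative stand-ins, and the actual policy-gradient update that GRPO performs with these scores is omitted.

```python
# Toy sketch of the R1-Zero-style reward scheme described above: sample several answers
# per question, score each with (1) an accuracy reward and (2) a format reward that
# checks the response shows its reasoning. GRPO then uses each score relative to the
# group mean as the advantage for a policy update (omitted here).
import re
import statistics

def accuracy_reward(response: str, reference: str) -> float:
    match = re.search(r"<answer>(.*?)</answer>", response, re.S)
    return 1.0 if match and match.group(1).strip() == reference else 0.0

def format_reward(response: str) -> float:
    # Reward the expected "think first, then answer" layout.
    ok = re.search(r"<think>.+?</think>\s*<answer>.+?</answer>", response, re.S)
    return 0.5 if ok else 0.0

def grade_group(responses, reference):
    scores = [accuracy_reward(r, reference) + format_reward(r) for r in responses]
    mean = statistics.mean(scores)
    return [s - mean for s in scores]   # better-than-average samples get reinforced

samples = [
    "<think>3 * 4 = 12, plus 5 is 17</think><answer>17</answer>",
    "<answer>17</answer>",                          # right answer, skipped the reasoning
    "<think>guessing</think><answer>21</answer>",   # wrong answer
]
print(grade_group(samples, "17"))   # [0.5, 0.0, -0.5]
```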
What emerged is a model that developed reasoning and chains-of-thought on its own, including what DeepSeek called “Aha Moments”:

A particularly intriguing phenomenon observed during the training of DeepSeek-R1-Zero is the occurrence of an “aha moment”. This moment, as illustrated in Table 3, occurs in an intermediate version of the model. During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.

This moment is not only an “aha moment” for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The “aha moment” serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.

This is one of the most powerful affirmations yet of The Bitter Lesson: you don’t need to teach the AI how to reason, you can just give it enough compute and data and it will teach itself!
Well, almost: R1-Zero reasons, but in a way that humans have trouble understanding. Back to the introduction:
However, DeepSeek-R1-Zero encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.
This sounds a lot like what OpenAI did for o1: DeepSeek started the model out with a bunch of examples of chain-of-thought thinking so it could learn the proper format for human consumption, and then did the reinforcement learning to enhance its reasoning, along with a number of editing and refinement steps; the output is a model that appears to be very competitive with o1.
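For orientation, here is the sequencing of that recipe as a schematic outline; every function below is a stub standing in for a full training job, so only the order of operations is meaningful.

```python
# Schematic outline of the multi-stage R1 recipe quoted above (stubs only).

def sft(model: str, data: str) -> str:            # supervised fine-tuning pass
    return f"{model} + SFT({data})"

def reasoning_rl(model: str) -> str:              # R1-Zero-style RL on reasoning prompts
    return f"{model} + RL(reasoning)"

def rejection_sample(model: str) -> str:          # keep only the checkpoint's best outputs
    return f"best outputs of [{model}]"

def all_scenario_rl(model: str) -> str:           # final RL pass over all prompt types
    return f"{model} + RL(all scenarios)"

base = "DeepSeek-V3-Base"
m = sft(base, "cold-start chain-of-thought examples")    # 1. learn a readable format
m = reasoning_rl(m)                                      # 2. reasoning-oriented RL
new_data = rejection_sample(m)                           # 3. harvest new SFT data
m = sft(base, new_data + " + DeepSeek-V3 supervised data")
m = all_scenario_rl(m)                                   # 4. final RL pass -> R1
print(m)
```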
Here again it seems plausible that DeepSeek benefited from distillation, particularly in terms of training R1. That, though, is itself an important takeaway: we have a situation where AI models are teaching AI models, and where AI models are teaching themselves. We are watching the assembly of an AI takeoff scenario in realtime.
So are we close to AGI?
It definitely seems like it. This also explains why Softbank (and whatever investors Masayoshi Son brings together) would provide the funding for OpenAI that Microsoft will not: the belief that we are reaching a takeoff point where there will in fact be real returns towards being first.
But isn’t R1 now in the lead?
I don’t think so; this has been overstated. R1 is competitive with o1, although there do seem to be some holes in its capability that point towards some amount of distillation from o1-Pro. OpenAI, meanwhile, has demonstrated o3, a far more powerful reasoning model. DeepSeek is absolutely the leader in efficiency, but that is different than being the leader overall.
So why is everyone freaking out?
I think there are multiple factors. First, there is the shock that China has caught up to the leading U.S. labs, despite the widespread assumption that China isn’t as good at software as the U.S. This is probably the biggest thing I missed in my surprise over the reaction. The reality is that China has an extremely proficient software industry generally, and a very good track record in AI model building specifically.
Second is the low training cost for V3, and DeepSeek’s low inference costs. This part was a big surprise for me as well, to be sure, but the numbers are plausible. This, by extension, probably has everyone nervous about Nvidia, which obviously has a big impact on the market.
Third is the fact that DeepSeek pulled this off despite the chip ban. Again, though, while there are big loopholes in the chip ban, it seems likely to me that DeepSeek accomplished this with legal chips.
I own Nvidia! Am I screwed?
There are real challenges this news presents to the Nvidia story. Nvidia has two big moats:
- CUDA is the language of choice for anyone programming these models, and CUDA only works on Nvidia chips.
- Nvidia has a massive lead in terms of its ability to combine multiple chips together into one large virtual GPU.
These two moats work together. I noted above that if DeepSeek had access to H100s they probably would have used a larger cluster to train their model, simply because that would have been the easier option; the fact they didn’t, and were bandwidth constrained, drove a lot of their decisions in terms of both model architecture and their training infrastructure. Just look at the U.S. labs: they haven’t spent much time on optimization because Nvidia has been aggressively shipping ever more capable systems that accommodate their needs. The route of least resistance has simply been to pay Nvidia. DeepSeek, however, just demonstrated that another route is available: heavy optimization can produce remarkable results on weaker hardware and with lower interconnect bandwidth; simply paying Nvidia more isn’t the only way to make better models.
That noted, there are three factors still in Nvidia’s favor. First, how capable might DeepSeek’s approach be if applied to H100s, or upcoming GB100s? Just because they found a more efficient way to use compute doesn’t mean that more compute wouldn’t be useful. Second, lower inference costs should, in the long run, drive greater usage. Microsoft CEO Satya Nadella, in a late night tweet almost assuredly directed at the market, said exactly that:
Third, reasoning models like R1 and o1 derive their superior performance from using more compute. To the extent that increasing the power and capabilities of AI depends on more compute is the extent that Nvidia stands to benefit!
Still, it’s not all rosy. At a minimum DeepSeek’s efficiency and broad availability cast significant doubt on the most optimistic Nvidia growth story, at least in the near term. The payoffs from both model and infrastructure optimization also suggest there are significant gains to be had from exploring alternative approaches to inference in particular. For example, it might be much more plausible to run inference on a standalone AMD GPU, completely sidestepping AMD’s inferior chip-to-chip communications capability. Reasoning models also increase the payoff for inference-only chips that are even more specialized than Nvidia’s GPUs.
In short, Nvidia isn’t going anywhere; the Nvidia stock, however, is suddenly facing a lot more uncertainty that hasn’t been priced in. And that, by extension, is going to drag everyone down.
So what about the chip ban?
The easiest argument to make is that the importance of the chip ban has only been accentuated given the U.S.’s rapidly evaporating lead in software. Software and knowhow can’t be embargoed — we’ve had these debates and realizations before — but chips are physical objects and the U.S. is justified in keeping them away from China.
At the same time, there should be some humility about the fact that earlier iterations of the chip ban seem to have directly led to DeepSeek’s innovations. Those innovations, moreover, would extend to not just smuggled Nvidia chips or nerfed ones like the H800, but to Huawei’s Ascend chips as well. Indeed, you can very much make the case that the primary outcome of the chip ban is today’s crash in Nvidia’s stock price.
What concerns me is the mindset undergirding something like the chip ban: instead of competing through innovation in the future the U.S. is competing through the denial of innovation in the past. Yes, this may help in the short term — again, DeepSeek would be even more effective with more computing — but in the long run it simply sows the seeds for competition in an industry — chips and semiconductor equipment — over which the U.S. has a dominant position.
Like AI models?
AI models are a great example. I mentioned above I would get to OpenAI’s greatest crime, which I consider to be the 2023 Biden Executive Order on AI. I wrote in Attenuating Innovation:
The point is this: if you accept the premise that regulation locks in incumbents, then it sure is notable that the early AI winners seem the most invested in generating alarm in Washington, D.C. about AI. This despite the fact that their concern is apparently not sufficiently high to, you know, stop their work. No, they are the responsible ones, the ones who care enough to call for regulation; all the better if concerns about imagined harms kneecap inevitable competitors.
That paragraph was about OpenAI specifically, and the broader San Francisco AI community generally. For years now we have been subject to hand-wringing about the dangers of AI by the exact same people committed to building it — and controlling it. These alleged dangers were the impetus for OpenAI becoming closed back in 2019 with the release of GPT-2:
Due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale, we are only releasing a much smaller version of GPT-2 along with sampling code. We are not releasing the dataset, training code, or GPT-2 model weights…We are aware that some researchers have the technical capacity to reproduce and open source our results. We believe our release strategy limits the initial set of organizations who may choose to do this, and gives the AI community more time to have a discussion about the implications of such systems.
We also think governments should consider expanding or commencing initiatives to more systematically monitor the societal impact and diffusion of AI technologies, and to measure the progression in the capabilities of such systems. If pursued, these efforts could yield a better evidence base for decisions by AI labs and governments regarding publication decisions and AI policy more broadly.
The arrogance in this statement is only surpassed by the futility: here we are six years later, and the entire world has access to the weights of a dramatically superior model. OpenAI’s gambit for control — enforced by the U.S. government — has utterly failed. In the meantime, how much innovation has been foregone by virtue of leading edge models not having open weights? More generally, how much time and energy has been spent lobbying for a government-enforced moat that DeepSeek just obliterated, that would have been better devoted to actual innovation?
So you’re not worried about AI doom scenarios?
I definitely understand the concern, and just noted above that we are reaching the stage where AIs are training AIs and learning reasoning on their own. I recognize, though, that there is no stopping this train. More than that, this is exactly why openness is so important: we need more AIs in the world, not an unaccountable board ruling all of us.
Wait, why is China open-sourcing their model?
Well DeepSeek is, to be clear; CEO Liang Wenfeng said in a must-read interview that open source is key to attracting talent:
In the face of disruptive technologies, moats created by closed source are temporary. Even OpenAI’s closed source approach can’t prevent others from catching up. So we anchor our value in our team — our colleagues grow through this process, accumulate know-how, and form an organization and culture capable of innovation. That’s our moat.
Open source, publishing papers, in fact, do not cost us anything. For technical talent, having others follow your innovation gives a great sense of accomplishment. In fact, open source is more of a cultural behavior than a commercial one, and contributing to it earns us respect. There is also a cultural attraction for a company to do this.
The interviewer asked if this would change:
DeepSeek, right now, has a kind of idealistic aura reminiscent of the early days of OpenAI, and it’s open source. Will you change to closed source later on? Both OpenAI and Mistral moved from open-source to closed-source.
We will not change to closed source. We believe having a strong technical ecosystem first is more important.
This actually makes sense beyond idealism. If models are commodities — and they are certainly looking that way — then long-term differentiation comes from having a superior cost structure; that is exactly what DeepSeek has delivered, which itself is resonant of how China has come to dominate other industries. This is also contrary to how most U.S. companies think about differentiation, which is through having differentiated products that can sustain larger margins.
So is OpenAI screwed?
Not necessarily. ChatGPT made OpenAI the accidental consumer tech company, which is to say a product company; there is a route to building a sustainable consumer business on commoditizable models through some combination of subscriptions and advertisements. And, of course, there is the bet on winning the race to AI take-off.
Anthropic, on the other hand, is probably the biggest loser of the weekend. DeepSeek made it to number one in the App Store, simply highlighting how Claude, in contrast, hasn’t gotten any traction outside of San Francisco. The API business is doing better, but API businesses in general are the most susceptible to the commoditization trends that seem inevitable (and do note that OpenAI and Anthropic’s inference costs look a lot higher than DeepSeek because they were capturing a lot of margin; that’s going away).
So this is all pretty depressing, then?
Actually, no. I think that DeepSeek has provided a massive gift to nearly everyone. The biggest winners are consumers and businesses who can anticipate a future of effectively-free AI products and services. Jevons Paradox will rule the day in the long run, and everyone who uses AI will be the biggest winners.
Another set of winners are the big consumer tech companies. A world of free AI is a world where product and distribution matters most, and those companies already won that game; The End of the Beginning was right.
China is also a big winner, in ways that I suspect will only become apparent over time. Not only does the country have access to DeepSeek, but I suspect that DeepSeek’s relative success to America’s leading AI labs will result in a further unleashing of Chinese innovation as they realize they can compete.
That leaves America, and a choice we have to make. We could, for very logical reasons, double down on defensive measures, like massively expanding the chip ban and imposing a permission-based regulatory regime on chips and semiconductor equipment that mirrors the E.U.’s approach to tech; alternatively, we could realize that we have real competition, and actually give ourselves permission to compete. Stop wringing our hands, stop campaigning for regulations — indeed, go the other way, and cut out all of the cruft in our companies that has nothing to do with winning. If we choose to compete we can still win, and, if we do, we will have a Chinese company to thank.