Good morning,
On Monday’s Sharp Tech we discussed the TikTok situation, including a digression about how the nature of governing may be poised to change under the current administration (the flood of executive orders very much fit within the thesis I put forward), and how tech companies have become quasi-government entities in their own right.
On to the Update:
Stratechery Updates
A few notes about some Stratechery changes that happened over the weekend:
- First, the Stratechery website has been re-designed to better accommodate the additional content in the Stratechery bundle. Articles and Updates now include a podcast player (and a YouTube link, if one exists), and the latest podcasts in the Stratechery Plus bundle are in the sidebar. There is also a new visual distinction between quotes from external sources and quotes from past Stratechery posts (this has not yet been added to the emails).
- Second, there were two bugs that I want to apologize for. First, feeds across the Stratechery bundle contained podcasts from other shows early Monday morning. That was rectified fairly quickly, and you can refresh your feeds if they are still incorrect. Second, and more seriously, the SMS notification for yesterday’s Interview with Jon Yu was sent to every phone number, when it should have only been sent to those who have opted in to SMS notifications in their account settings. This is totally unacceptable on my part, and I am very sorry about it.
This was, as you might have guessed, a very substantive update to nearly every piece of Stratechery infrastructure, and frankly, we didn’t handle it as well as we should have (I haven’t even mentioned the countless updates over the last few years, because they usually go much more smoothly!). We did learn a lot, though, and fixed some very obscure bugs that will make Passport better in the long run.
With that out of the way, I am very excited about the content additions: I think that Jon Yu’s Asianometry is a tremendous resource; check out the introductory post if you missed it, as well as the Interview I did with Jon.
That brings me to the final point: if you want to receive emails from Jon or add the Asianometry podcast, you do need to go to the Asianometry Passport page to opt in and add the podcast. My final mistake in yesterday’s post was to include automatic “Add this podcast” links to the podcast show notes, but for reasons that are complicated to explain — and which will be fixed — you actually need to have signed in first. I’m sorry for the confusion on this point.
I know this weekend was a bit messy, but my goal is to give you even more compelling content for your subscription at no additional cost, and I thank you for your patience as we make this happen.
DeepSeek-R1
From VentureBeat
Chinese AI startup DeepSeek, known for challenging leading AI vendors with open-source technologies, just dropped another bombshell: a new open reasoning LLM called DeepSeek-R1. Based on the recently introduced DeepSeek V3 mixture-of-experts model, DeepSeek-R1 matches the performance of o1, OpenAI’s frontier reasoning LLM, across math, coding and reasoning tasks. The best part? It does this at a much more tempting cost, proving to be 90-95% more affordable than the latter.
The release marks a major leap forward in the open-source arena. It showcases that open models are further closing the gap with closed commercial models in the race to artificial general intelligence (AGI). To show the prowess of its work, DeepSeek also used R1 to distill six Llama and Qwen models, taking their performance to new levels. In one case, the distilled version of Qwen-1.5B outperformed much bigger models, GPT-4o and Claude 3.5 Sonnet, in select math benchmarks. These distilled models, along with the main R1, have been open-sourced and are available on Hugging Face under an MIT license.
As a quick aside, the distilled Llama models are almost certainly violating the Llama license; DeepSeek doesn’t have the right to unilaterally change the Llama license, which is “open”, but not unencumbered like the MIT license.
That noted, this model is a big deal, even if it’s not the first we’ve heard of it. Over Christmas break DeepSeek released a GPT-4-level model called V3 that was notable for how efficiently it was trained, using only 2.788 million H800 GPU hours, which cost around $5.6 million, a shockingly low figure (and easily covered through smuggled chips). V3 was also fine-tuned using a then-unreleased reasoning model called R1 to enhance its capabilities in areas like coding, mathematics, and logic; from the V3 Technical Report:
To establish our methodology, we begin by developing an expert model tailored to a specific domain, such as code, mathematics, or general reasoning, using a combined Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training pipeline. This expert model serves as a data generator for the final model. The training process involves generating two distinct types of SFT samples for each instance: the first couples the problem with its original response in the format of <problem, original response>, while the second incorporates a system prompt alongside the problem and the R1 response in the format of <system prompt, problem, R1 response>.
The system prompt is meticulously designed to include instructions that guide the model toward producing responses enriched with mechanisms for reflection and verification. During the RL phase, the model leverages high-temperature sampling to generate responses that integrate patterns from both the R1-generated and original data, even in the absence of explicit system prompts. After hundreds of RL steps, the intermediate RL model learns to incorporate R1 patterns, thereby enhancing overall performance strategically.
The key part here is that R1 was used to generate synthetic data to make V3 better; in other words, one AI was training another AI. This is a critical capability for the continued progress of these models.
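To make the “one AI training another AI” idea concrete, here is a minimal sketch in Python of how an expert reasoning model’s outputs could be packaged into the two kinds of SFT samples the report describes; the reasoning_model() stub, the system prompt, and the data format are my own placeholders, not DeepSeek’s actual pipeline.

```python
# Sketch: using a reasoning "expert" model to generate SFT data for another model.
# reasoning_model() is a stand-in for an actual inference call (e.g. to R1); the
# system prompt and the sample format are illustrative, not DeepSeek's exact ones.

from dataclasses import dataclass


@dataclass
class SFTSample:
    prompt: str       # what the student model sees
    completion: str   # what the student model is trained to produce


REFLECTION_SYSTEM_PROMPT = (
    "Think step by step, reflect on your reasoning, and verify your answer "
    "before giving a final response."
)


def reasoning_model(system: str, problem: str) -> str:
    """Placeholder for a call to the expert/reasoning model."""
    return "<long chain of thought>... therefore the answer is 42."


def build_sft_samples(problem: str, original_response: str) -> list[SFTSample]:
    """Build the two sample types from the V3 report:
    <problem, original response> and <system prompt, problem, R1 response>."""
    r1_response = reasoning_model(REFLECTION_SYSTEM_PROMPT, problem)
    return [
        SFTSample(prompt=problem, completion=original_response),
        SFTSample(
            prompt=f"{REFLECTION_SYSTEM_PROMPT}\n\n{problem}",
            completion=r1_response,
        ),
    ]


if __name__ == "__main__":
    for sample in build_sft_samples("What is 6 * 7?", "42"):
        print(sample.prompt[:60], "->", sample.completion[:60])
```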
What is even more interesting, though, is how R1 developed its reasoning in the first place. To that end, the most interesting revelation in the R1 Technical Report was that DeepSeek actually developed two R1 models: R1 and R1-Zero. R1 is the model that is publicly available; R1-Zero, though, is the bigger deal in my mind. From the paper:
In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO as the RL framework to improve model performance in reasoning. During training, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits super performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further improves to 86.7%, matching the performance of OpenAI-o1-0912.
Reinforcement learning is a technique where a machine learning model is given a bunch of data and a reward function. The classic example is AlphaGo, where DeepMind gave the model the rules of Go along with the reward function of winning the game, and then let the model figure everything else out on its own. This famously ended up working better than other, more human-guided techniques.
LLMs to date, however, have relied on reinforcement learning with human feedback; humans are in the loop to help guide the model, navigate difficult choices where rewards aren’t obvious, etc. RLHF was the key innovation in transforming GPT-3 into ChatGPT, with well-formed paragraphs, answers that were concise and didn’t trail off into gibberish, etc.
R1-Zero, however, drops the HF part — it’s just reinforcement learning. DeepSeek gave the model a set of math, code, and logic questions, and set two reward functions: one for the right answer, and one for the right format that utilized a thinking process. Moreover, the technique was a simple one: instead of trying to evaluate step-by-step (process supervision), or doing a search of all possible answers (a la AlphaGo), DeepSeek encouraged the model to try several different answers at a time and then graded them according to the two reward functions.
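As an illustration, here is a minimal sketch in Python of what those two rule-based rewards and group-relative grading might look like; the <think>/<answer> tag format follows the paper’s prompt template, but the functions and the normalization are my own simplification, not DeepSeek’s actual GRPO implementation.

```python
# Sketch: rule-based rewards in the spirit of R1-Zero's pure-RL setup.
# The accuracy check, format check, and group-relative scoring below are
# illustrative simplifications, not DeepSeek's actual implementation.

import re
import statistics

THINK_FORMAT = re.compile(r"<think>.+</think>\s*<answer>(.+)</answer>", re.DOTALL)


def format_reward(completion: str) -> float:
    """Reward 1.0 if the model wrapped its reasoning and answer in the expected tags."""
    return 1.0 if THINK_FORMAT.search(completion) else 0.0


def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Reward 1.0 if the extracted final answer matches the known answer."""
    match = THINK_FORMAT.search(completion)
    if not match:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0


def group_relative_advantages(completions: list[str], ground_truth: str) -> list[float]:
    """Score a group of sampled answers to the same question, then normalize
    within the group (the group-relative idea behind GRPO)."""
    rewards = [
        accuracy_reward(c, ground_truth) + format_reward(c) for c in completions
    ]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    samples = [
        "<think>3*4=12, plus 2 is 14</think> <answer>14</answer>",
        "the answer is 14",  # right idea, wrong format
        "<think>3*4=11, plus 2 is 13</think> <answer>13</answer>",
    ]
    print(group_relative_advantages(samples, ground_truth="14"))
```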
What emerged is a model that developed reasoning and chains-of-thought on its own, including what DeepSeek called “Aha Moments”:
A particularly intriguing phenomenon observed during the training of DeepSeek-R1-Zero is the occurrence of an “aha moment”. This moment, as illustrated in Table 3, occurs in an intermediate version of the model. During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.
This moment is not only an “aha moment” for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The “aha moment” serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.
This is one of the most powerful affirmations yet of The Bitter Lesson: you don’t need to teach the AI how to reason, you can just give it enough compute and data and it will teach itself!
Well, almost: R1-Zero reasons, but in a way that humans have trouble understanding. Back to the introduction:
However, DeepSeek-R1-Zero encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.
This sounds a lot like what OpenAI did for o1: DeepSeek started the model out with a bunch of examples of chain-of-thought thinking so it could learn the proper format for human consumption, and then did the reinforcement learning to enhance its reasoning, along with a number of editing and refinement steps; the output is a model that appears to be very competitive with o1.
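Laid out as code, the multi-stage recipe the paper describes looks roughly like the following; the function names and stubs are my own shorthand for the stages, not DeepSeek’s implementation.

```python
# Sketch: the four-stage R1 training pipeline as described in the paper,
# expressed as a sequence of checkpoint transformations. The function names
# and signatures are my own shorthand, not DeepSeek's code.

def sft(checkpoint: str, data: str) -> str:
    """Supervised fine-tuning of a checkpoint on a dataset (stub)."""
    return f"{checkpoint} + SFT({data})"


def reasoning_rl(checkpoint: str) -> str:
    """Reasoning-oriented RL, as in R1-Zero (stub)."""
    return f"{checkpoint} + RL(reasoning)"


def rejection_sample(checkpoint: str) -> str:
    """Generate candidates from the RL checkpoint and keep only the good ones (stub)."""
    return f"rejection-sampled data from {checkpoint}"


def all_scenarios_rl(checkpoint: str) -> str:
    """Final RL pass over prompts from all scenarios, not just reasoning (stub)."""
    return f"{checkpoint} + RL(all scenarios)"


base = "DeepSeek-V3-Base"
step1 = sft(base, "thousands of cold-start chain-of-thought examples")  # readability
step2 = reasoning_rl(step1)                                             # R1-Zero-style RL
step3 = sft(base, rejection_sample(step2) + " + V3 supervised data")    # retrain the base
r1 = all_scenarios_rl(step3)                                            # final RL pass
print(r1)
```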
The third thing DeepSeek did was fine-tune various sized models from Qwen (Alibaba’s family of models) and Llama with R1; this imbued those models with R1’s reasoning capability, dramatically increasing their performance. Moreover, this increase in performance exceeded the gains from simply training the smaller models directly to reason; in other words, effectively having a large model train a small model provides better results than improving the smaller model directly.
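Here is a minimal sketch of that distillation recipe: sample reasoning traces from the teacher, then run ordinary supervised fine-tuning on the student. The teacher_generate() and fine_tune() stubs, the data format, and the model names are placeholders, not DeepSeek’s actual setup.

```python
# Sketch: distilling a reasoning teacher into a smaller student by supervised
# fine-tuning on the teacher's outputs. teacher_generate() and fine_tune() are
# stand-ins for real inference and training calls; model names are placeholders.

import json
from pathlib import Path


def teacher_generate(problem: str) -> str:
    """Placeholder for sampling a full reasoning trace + answer from the teacher."""
    return "<think>reasoning steps...</think> <answer>42</answer>"


def build_distillation_set(problems: list[str], out_path: Path) -> None:
    """Write prompt/completion pairs where the completion is the teacher's trace.
    The student is then trained with ordinary SFT on these pairs; no RL needed."""
    with out_path.open("w") as f:
        for problem in problems:
            record = {"prompt": problem, "completion": teacher_generate(problem)}
            f.write(json.dumps(record) + "\n")


def fine_tune(student_model: str, dataset_path: Path) -> None:
    """Placeholder for a standard SFT run (e.g. with a Hugging Face trainer)."""
    print(f"fine-tuning {student_model} on {dataset_path}")


if __name__ == "__main__":
    path = Path("distill.jsonl")
    build_distillation_set(["What is 6 * 7?"], path)
    fine_tune("Qwen-1.5B (placeholder)", path)
```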
DeepSeek Implications
All of this is quite remarkable, even if it might not be entirely new; after all, the o1 and o3 models exist. We don’t, however, know anything about those models, given that OpenAI is anything but; the fact these technical papers exist — and the fact that R1 actually writes out its thinking process — is particularly exciting.
What is confounding about this reality is the provenance of these models: China. Somehow we’ve ended up in a situation where the leading U.S. labs, including OpenAI and Anthropic (which hasn’t yet released a reasoning model), are keeping everything close to the vest, while this Chinese lab — itself a spin-off from a quantitative hedge fund called High-Flyer — is probably the world leader in open-source models, and certainly in open-source reasoning models.
They also have no choice but to lead in efficiency, thanks to the challenge of acquiring chips; that efficiency, moreover, means that DeepSeek has set off a price war amongst LLM inference providers in China. Notice, however, the long-term path that points to: all U.S. AI model providers would like more chips, but at the end of the day access to chips is not their number one constraint, which means it is unlikely they will achieve similar levels of efficiency. That is a problem in a world where models are a commodity: absent differentiation, the only way to have sustainable profits is through a sustainable cost advantage, and DeepSeek appears to be further down the road towards delivering exactly that, even as the U.S. model makers basically benefit from protectionism, which invariably leads to less competitive entities in a commoditized world.
In other words, you could make the case that the China Chip Ban hasn’t just failed, but actually backfired; Tyler Cowen suggested as much in Bloomberg:
I have in the past supported these trade restrictions, as AI technology is a vital matter of national security. But I now think the ban was too ambitious to work. It may have delayed Chinese progress in AI by a few years, but it also induced a major Chinese innovation — namely, DeepSeek.
Now the world knows that a very high-quality AI system can be trained for a relatively small sum of money. That could bring comparable AI systems into realistic purview for nations such as Russia, Iran, Pakistan and others. It is possible to imagine a foreign billionaire initiating a similar program, although personnel would be a constraint. Whatever the dangers of the Chinese system and its potential uses, DeepSeek-inspired offshoots in other nations could be more worrying yet.
Finding cheaper ways to build AI systems was almost certainly going to happen anyway. But consider the tradeoff here: US policy succeeded in hampering China’s ability to deploy high-quality chips in AI systems, with the accompanying national-security benefits, but it also accelerated the development of effective AI systems that do not rely on the highest-quality chips.
Of course the argument will now be that the chip ban is even more important given that China has its own leading models; one wonders, however, if at some point it might be worth asking (1) whether AI capability can truly be contained and (2) whether we might be overly confident in our ability to predict exactly how the landscape will evolve, which ought to increase the discount rate applied to the perceived future benefits of controls as compared to the very real losses today.
This Update will be available as a podcast later today. To receive it in your podcast player, visit Stratechery.
The Stratechery Update is intended for a single recipient, but occasional forwarding is totally fine! If you would like to order multiple subscriptions for your team with a group discount (minimum 5), please contact me directly.
Thanks for being a subscriber, and have a great day!