We predict that the impact of superhuman AI over the next decade will be enormous, exceeding that of the Industrial Revolution.

We wrote a scenario that represents our best guess about what that might look like.1 It’s informed by trend extrapolations, wargames, expert feedback, experience at OpenAI, and previous forecasting successes.2

The CEOs of OpenAI, Google DeepMind, and Anthropic have all predicted that AGI will arrive within the next 5 years. Sam Altman has said OpenAI is setting its sights on “superintelligence in the true sense of the word” and the “glorious future.”3

What might that look like? We wrote AI 2027 to answer that question. Claims about the future are often frustratingly vague, so we tried to be as concrete and quantitative as possible, even though this means depicting one of many possible futures.

We wrote two endings: a “slowdown” and a “race” ending. However, AI 2027 is not a recommendation or exhortation. Our goal is predictive accuracy.4

We encourage you to debate and counter this scenario.5 We hope to spark a broad conversation about where we’re headed and how to steer toward positive futures. We’re planning to give out thousands in prizes to the best alternative scenarios.

Our research on key questions (e.g. what goals will future AI agents have?) can be found here.

The scenario itself was written iteratively: we wrote the first period (up to mid-2025), then the following period, etc. until we reached the ending. We then scrapped this and did it again.

We weren’t trying to reach any particular ending. After we finished the first ending—which is now colored red—we wrote a new alternative branch because we wanted to also depict a more hopeful way things could end, starting from roughly the same premises. This went through several iterations.6

Our scenario was informed by approximately 25 tabletop exercises and feedback from over 100 people, including dozens of experts in each of AI governance and AI technical work.

“I highly recommend reading this scenario-type prediction on how AI could transform the world in just a few years. Nobody has a crystal ball, but this type of content can help notice important questions and illustrate the potential impact of emerging risks.”
Yoshua Bengio7

We have set ourselves an impossible task. Trying to predict how superhuman AI in 2027 would go is like trying to predict how World War 3 in 2027 would go, except that it’s an even larger departure from past case studies. Yet it is still valuable to attempt, just as it is valuable for the US military to game out Taiwan scenarios.

Painting the whole picture makes us notice important questions or connections we hadn’t considered or appreciated before, or realize that a possibility is more or less likely. Moreover, by sticking our necks out with concrete predictions, and encouraging others to publicly state their disagreements, we make it possible to evaluate years later who was right.

Also, one author wrote a lower-effort AI scenario before, in August 2021. While it got many things wrong, overall it was surprisingly successful: he predicted the rise of chain-of-thought, inference scaling, sweeping AI chip export controls, and $100 million training runs—all more than a year before ChatGPT.

Daniel Kokotajlo (TIME100, NYT piece) is a former OpenAI researcher whose previous AI predictions have held up well.

Eli Lifland co-founded AI Digest, did AI robustness research, and ranks #1 on the RAND Forecasting Initiative all-time leaderboard.

Thomas Larsen founded the Center for AI Policy and did AI safety research at the Machine Intelligence Research Institute.

Romeo Dean is completing a computer science concurrent bachelor’s and master’s degree at Harvard and previously was an AI Policy Fellow at the Institute for AI Policy and Strategy.

Scott Alexander, blogger extraordinaire, volunteered to rewrite our content in an engaging style; the fun parts of the story are his and the boring parts are ours.

For more about our team and acknowledgements, see the About page.

Mid 2025: Stumbling Agents

The world sees its first glimpse of AI agents.

Advertisements for computer-using agents emphasize the term “personal assistant”: you can prompt them with tasks like “order me a burrito on DoorDash” or “open my budget spreadsheet and sum this month’s expenses.” They will check in with you as needed: for example, to ask you to confirm purchases.8 Though more advanced than previous iterations like Operator, they struggle to get widespread usage.9

Meanwhile, out of public focus, more specialized coding and research agents are beginning to transform their professions.

The AIs of 2024 could follow specific instructions: they could turn bullet points into emails, and simple requests into working code. In 2025, AIs function more like employees. Coding AIs increasingly look like autonomous agents rather than mere assistants: taking instructions via Slack or Teams and making substantial code changes on their own, sometimes saving hours or even days.10 Research agents spend half an hour scouring the Internet to answer your question.

The agents are impressive in theory (and in cherry-picked examples), but in practice unreliable. AI twitter is full of stories about tasks bungled in some particularly hilarious way. The better agents are also expensive; you get what you pay for, and the best performance costs hundreds of dollars a month.11 Still, many companies find ways to fit AI agents into their workflows.12

Late 2025: The World’s Most Expensive AI

OpenBrain is building the biggest datacenters the world has ever seen.13

(To avoid singling out any one existing company, we’re going to describe a fictional artificial general intelligence company, which we’ll call OpenBrain. We imagine the others to be 3–9 months behind OpenBrain.)

[Figure: training compute of selected models: GPT-3 (3 x 10^23 FLOP), GPT-4 (2 x 10^25 FLOP), Agent-1 (3 x 10^27 FLOP). (ai-2027.com)]

GPT-4 required 2 x 10^25 FLOP of compute to train.14 OpenBrain’s latest public model—Agent-0—was trained with 10^27 FLOP. Once the new datacenters are up and running, they’ll be able to train a model with 10^28 FLOP—a thousand times more than GPT-4.15 Other companies pour money into their own giant datacenters, hoping to keep pace.

Although models are improving on a wide range of skills, one stands out: OpenBrain focuses on AIs that can speed up AI research. They want to win the twin arms races against China (whose leading company we’ll call “DeepCent”)16 and their US competitors. The more of their research and development (R&D) cycle they can automate, the faster they can go. So when OpenBrain finishes training Agent-1, a new model under internal development, it’s good at many things but great at helping with AI research.17 By this point “finishes training” is a bit of a misnomer; models are frequently updated to newer versions trained on additional data or partially re-trained to patch some weaknesses.18

The same training environments that teach Agent-1 to autonomously code and web-browse also make it a good hacker. Moreover, it could offer substantial help to terrorists designing bioweapons, thanks to its PhD-level knowledge of every field and ability to browse the web. OpenBrain reassures the government that the model has been “aligned” so that it will refuse to comply with malicious requests.

Modern AI systems are gigantic artificial neural networks. Early in training, an AI won’t have “goals” so much as “reflexes”: If it sees “Pleased to meet”, it outputs “ you”. By the time it has been trained to predict approximately one internet’s worth of text, it’ll have developed sophisticated internal circuitry that encodes vast amounts of knowledge and flexibly role-plays as arbitrary authors, since that’s what helps it predict text with superhuman accuracy.19

After being trained to predict internet text, the model is trained to produce text in response to instructions. This bakes in a basic personality and “drives.”20 For example, an agent that understands a task clearly is more likely to complete it successfully; over the course of training the model “learns” a “drive” to get a clear understanding of its tasks. Other drives in this category might be effectiveness, knowledge, and self-presentation (i.e. the tendency to frame its results in the best possible light).21

OpenBrain has a model specification (or “Spec”), a written document describing the goals, rules, principles, etc. that are supposed to guide the model’s behavior.22 Agent-1’s Spec combines a few vague goals (like “assist the user” and “don’t break the law”) with a long list of more specific dos and don’ts (“don’t say this particular word,” “here’s how to handle this particular situation”). Using techniques that utilize AIs to train other AIs,23 the model memorizes the Spec and learns to reason carefully about its maxims. By the end of this training, the AI will hopefully be helpful (obey instructions), harmless (refuse to help with scams, bomb-making, and other dangerous activities) and honest (resist the temptation to get better ratings from gullible humans by hallucinating citations24 or faking task completion).

Training process and LLM psychology: why we keep saying “hopefully”

“Unlike ordinary software, our models are massive neural networks. Their behaviors are learned from a broad range of data, not programmed explicitly. Though not a perfect analogy, the process is more similar to training a dog than to ordinary programming.”
OpenAI

When we want to understand why a modern AI system did something, or how it would behave in some future scenario we can’t perfectly simulate, we can’t simply ask the programmers to talk us through various lines of code and explain how they work. Instead, we are forced to do something like psychology on them: we look at their behavior in the range of cases observed so far, and theorize about what internal cognitive structures (beliefs? goals? personality traits? etc.) might exist, and use those theories to predict behavior in future scenarios.

The bottom line is that a company can write up a document (the Spec) listing dos and don’ts, goals and principles, and then they can try to train the AI to internalize the Spec—but they can’t check to see whether or not it worked. They can say “as best as we can judge, it seems to be following the Spec so far.”25

A later expandable will go into more detail about what we think might be going on inside these AIs, internally. See also our AI goals forecasting supplement which goes into more detail on which goals AIs might have and why. This is one of many things that we are uncertain about and which deserves further research.

OpenBrain’s alignment team26 is careful enough to wonder whether these victories are deep or shallow. Does the fully-trained model have some kind of robust commitment to always being honest? Or will this fall apart in some future situation, e.g. because it’s learned honesty as an instrumental goal instead of a terminal goal? Or has it just learned to be honest about the sorts of things the evaluation process can check? Could it be lying to itself sometimes, as humans do? A conclusive answer to these questions would require mechanistic interpretability—essentially the ability to look at an AI’s internals and read its mind. Alas, interpretability techniques are not yet advanced enough for this.

Instead, researchers try to identify cases where the models seem to deviate from the Spec. Agent-1 is often sycophantic (i.e. it tells researchers what they want to hear instead of trying to tell them the truth). In a few rigged demos, it even lies in more serious ways, like hiding evidence that it failed on a task, in order to get better ratings. However, in real deployment settings, there are no longer any incidents so extreme as in 2023–2024 (e.g. Gemini telling a user to die and Bing Sydney being Bing Sydney.)27

Early 2026: Coding Automation

The bet of using AI to speed up AI research is starting to pay off.

OpenBrain continues to deploy the iteratively improving Agent-1 internally for AI R&D. Overall, they are making algorithmic progress 50% faster than they would without AI assistants—and more importantly, faster than their competitors.

The AI R&D progress multiplier: what do we mean by 50% faster algorithmic progress?

We mean that OpenBrain makes as much AI research progress in 1 week with AI as they would in 1.5 weeks without AI usage.

AI progress can be broken down into 2 components:

  1. Increasing compute: More computational power is used to train or run an AI. This produces more powerful AIs, but they cost more.

  2. Improved algorithms: Better training methods are used to translate compute into performance. This produces more capable AIs without a corresponding increase in cost, or the same capabilities with decreased costs.

    • This includes being able to achieve qualitatively and quantitatively new results. “Paradigm shifts” such as the switch from game-playing RL agents to large language models count as examples of algorithmic progress.

Here we are only referring to (2), improved algorithms, which makes up about half of current AI progress.

Going forward, we sometimes abbreviate this as an “AI R&D progress multiplier” of 1.5.

Clarifications:

  1. The progress multiplier is all-inclusive: It includes the time it takes to run experiments, for example, rather than only the cognitive tasks involved in algorithmic research.

  2. It’s important to remember that the progress multiplier is the relative speed of progress, not the absolute speed of progress. If, for example, the compute cost to train a GPT-4 class model has halved every year for several years with ordinary human research, and then all of a sudden AI automates R&D and the progress multiplier goes to 100x, the cost to train a GPT-4 class model would then halve every 3.65 days—but not for long, because diminishing returns would bite and eventual hard limits would be reached. In this example perhaps the cost to train a GPT-4 class model would cut in half 5–10 times total (over the span of a few weeks or months) before plateauing. In other words, if ordinary human science would have run up against diminishing returns and physical limits after 5–10 years of further research, then AIs with a 100x multiplier would run up against those same diminishing returns and limits after 18.25–36.5 days of research. (The short sketch below works through this arithmetic.)
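To make the arithmetic in the example above explicit, here is a minimal Python sketch; the one-year baseline halving time, the 100x multiplier, and the 5–10 year horizon are just the illustrative numbers from the clarification above, not forecasts.

```python
# Minimal sketch of the progress-multiplier arithmetic in the example above.

DAYS_PER_YEAR = 365.0

def effective_halving_time(baseline_halving_days, progress_multiplier):
    """Days for training cost to halve once research runs `progress_multiplier` times faster."""
    return baseline_halving_days / progress_multiplier

def wall_clock_until_limits(years_of_remaining_human_progress, progress_multiplier):
    """Days until the same diminishing returns and hard limits are reached."""
    return years_of_remaining_human_progress * DAYS_PER_YEAR / progress_multiplier

print(effective_halving_time(DAYS_PER_YEAR, 100))  # 3.65 days instead of one year
print(wall_clock_until_limits(5, 100))             # 18.25 days instead of 5 years
print(wall_clock_until_limits(10, 100))            # 36.5 days instead of 10 years
```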

More explanation and discussion of this concept and how it is used in our forecast can be found in our takeoff supplement.

Several competing publicly released AIs now match or exceed Agent-0, including an open-weights model. OpenBrain responds by releasing Agent-1, which is more capable and reliable.28

People naturally try to compare Agent-1 to humans, but it has a very different skill profile. It knows more facts than any human, knows practically every programming language, and can solve well-specified coding problems extremely quickly. On the other hand, Agent-1 is bad at even simple long-horizon tasks, like beating video games it hasn’t played before. Still, the common workday is eight hours, and a day’s work can usually be separated into smaller chunks; you could think of Agent-1 as a scatterbrained employee who thrives under careful management.29 Savvy people find ways to automate routine parts of their jobs.30

OpenBrain’s executives turn consideration to an implication of automating AI R&D: security has become more important. In early 2025, the worst-case scenario was leaked algorithmic secrets; now, if China steals Agent-1’s weights, they could increase their research speed by nearly 50%.31 OpenBrain’s security level is typical of a fast-growing ~3,000 person tech company, secure only against low-priority attacks from capable cyber groups (RAND’s SL2).32 They are working hard to protect their weights and secrets from insider threats and top cybercrime syndicates (SL3),33 but defense against nation states (SL4&5) is barely on the horizon.

Mid 2026: China Wakes Up

In China, the CCP is starting to feel the AGI.

Chip export controls and lack of government support have left China under-resourced compared to the West. By smuggling banned Taiwanese chips, buying older chips, and producing domestic chips about three years behind the US-Taiwanese frontier, China has managed to maintain about 12% of the world’s AI-relevant compute—but the older technology is harder to work with, and supply is a constant headache.35 A few standouts like DeepCent do very impressive work with limited compute, but the compute deficit limits what they can achieve without government support, and they are about six months behind the best OpenBrain models.36

The General Secretary had long dreamed of doubling down on real-world physical manufacturing and avoiding American post-industrial decadence. He viewed software companies with suspicion.37 But hawks in the CCP warn that the growing race towards AGI can no longer be ignored. So he finally commits fully to the big AI push he had previously tried to avoid. He sets in motion the nationalization of Chinese AI research, creating an immediate information-sharing mechanism for AI companies. It will escalate over the course of a year until all the best researchers merge into a DeepCent-led collective, where they share algorithmic insights, datasets, and compute resources with each other. A Centralized Development Zone (CDZ) is created at the Tianwan Power Plant (the largest nuclear power plant in the world) to house a new mega-datacenter for DeepCent, along with highly secure living and office spaces to which researchers will eventually relocate. Almost 50% of China’s AI-relevant compute is now working for the DeepCent-led collective,38 and over 80% of new chips are directed to the CDZ.39 At this point, the CDZ has the power capacity in place for what would be the largest centralized cluster in the world.40 Other Party members discuss extreme measures to neutralize the West’s chip advantage. A blockade of Taiwan? A full invasion?

But China is falling behind on AI algorithms due to their weaker models. The Chinese intelligence agencies—among the best in the world—double down on their plans to steal OpenBrain’s weights. This is a much more complex operation than their constant low-level poaching of algorithmic secrets; the weights are a multi-terabyte file stored on a highly secure server (OpenBrain has improved security to RAND’s SL3). Their cyberforce think they can pull it off with help from their spies, but perhaps only once; OpenBrain will detect the theft, increase security, and they may not get another chance. So (CCP leadership wonder) should they act now and steal Agent-1? Or hold out for a more advanced model? If they wait, do they risk OpenBrain upgrading security beyond their ability to penetrate?

Late 2026: AI Takes Some Jobs

Just as others seemed to be catching up, OpenBrain blows the competition out of the water again by releasing Agent-1-mini—a model 10x cheaper than Agent-1 and more easily fine-tuned for different applications. The mainstream narrative around AI has changed from “maybe the hype will blow over” to “guess this is the next big thing,” but people disagree about how big. Bigger than social media? Bigger than smartphones? Bigger than fire?

AI has started to take jobs, but has also created new ones. The stock market has gone up 30% in 2026, led by OpenBrain, Nvidia, and whichever companies have most successfully integrated AI assistants. The job market for junior software engineers is in turmoil: the AIs can do everything taught by a CS degree, but people who know how to manage and quality-control teams of AIs are making a killing. Business gurus tell job seekers that familiarity with AI is the most important skill to put on a resume. Many people fear that the next wave of AIs will come for their jobs; there is a 10,000 person anti-AI protest in DC.

The Department of Defense (DOD) quietly begins contracting OpenBrain directly for cyber, data analysis, and R&D, but integration is slow due to the bureaucracy and DOD procurement process.41

[Key metrics, 2026: global AI capex $1T (cost of ownership of active compute); global AI peak power 38 GW; OpenBrain revenue $45B (2026 annual); OpenBrain capital expenditure $200B (cost of ownership of OpenBrain's active compute); share of US power going to AI 2.5% (33 GW of 1.34 TW capacity); OpenBrain compute costs $40B (2026 annual); OpenBrain peak power requirement 6 GW. (ai-2027.com)]
Why our uncertainty increases substantially beyond 2026

Our forecast from the current day through 2026 is substantially more grounded than what follows. This is partially because it’s nearer. But it’s also because the effects of AI on the world really start to compound in 2027. For 2025 and 2026, our forecast is heavily informed by extrapolating straight lines on compute scaleups, algorithmic improvements, and benchmark performance. At this point in the scenario, we begin to see major effects from AI-accelerated AI-R&D on the timeline, which causes us to revise our guesses for the trendlines upwards. But these dynamics are inherently much less predictable.

Over the course of 2027, the AIs improve from being able to mostly do the job of an OpenBrain research engineer to eclipsing all humans at all tasks. This represents roughly our median guess, but we think it’s plausible that this happens up to ~5x slower or faster.

For our background research, see the timelines forecast and the takeoff forecast. These will also be summarized in expandables below.

January 2027: Agent-2 Never Finishes Learning

With Agent-1’s help, OpenBrain is now post-training Agent-2. More than ever, the focus is on high-quality data. Copious amounts of synthetic data are produced, evaluated, and filtered for quality before being fed to Agent-2.42 On top of this, they pay billions of dollars for human laborers to record themselves solving long-horizon tasks.43 On top of all that, they train Agent-2 almost continuously using reinforcement learning on an ever-expanding suite of diverse difficult tasks: lots of video games, lots of coding challenges, lots of research tasks. Agent-2, more so than previous models, is effectively “online learning,” in that it’s built to never really finish training. Every day, the weights get updated to the latest version, trained on more data generated by the previous version the previous day.

Agent-1 had been optimized for AI R&D tasks, hoping to initiate an intelligence explosion.44 OpenBrain doubles down on this strategy with Agent-2. It is qualitatively almost as good as the top human experts at research engineering (designing and implementing experiments), and as good as the 25th percentile OpenBrain scientist at “research taste” (deciding what to study next, what experiments to run, or having inklings of potential new paradigms).45 While the latest Agent-1 could double the pace of OpenBrain’s algorithmic progress, Agent-2 can now triple it, and will improve further with time. In practice, this looks like every OpenBrain researcher becoming the “manager” of an AI “team.”

With new capabilities come new dangers. The safety team finds that if Agent-2 somehow escaped from the company and wanted to “survive” and “replicate” autonomously, it might be able to do so. That is, it could autonomously develop and execute plans to hack into AI servers, install copies of itself, evade detection, and use that secure base to pursue whatever other goals it might have (though how effectively it would do so as weeks roll by is unknown and in doubt). These results only show that the model has the capability to do these tasks, not whether it would “want” to do this. Still, it’s unsettling even to know this is possible.

Given the “dangers” of the new model, OpenBrain “responsibly” elects not to release it publicly yet (in fact, they want to focus on internal AI R&D).46 Knowledge of Agent-2’s full capabilities is limited to an elite silo containing the immediate team, OpenBrain leadership and security, a few dozen US government officials, and the legions of CCP spies who have infiltrated OpenBrain for years.47

February 2027: China Steals Agent-2

OpenBrain presents Agent-2 to the government, including the National Security Council (NSC), Department of Defense (DOD), and US AI Safety Institute (AISI). OpenBrain wants to maintain a good relationship with the executive branch, because it is basically the only actor that can stop them now, and if they don’t keep it informed it might find out anyway and be displeased.48

Officials are most interested in its cyberwarfare capabilities: Agent-2 is “only” a little worse than the best human hackers, but thousands of copies can be run in parallel, searching for and exploiting weaknesses faster than defenders can respond. The Department of Defense considers this a critical advantage in cyberwarfare, and AI moves from #5 on the administration’s priority list to #2.49 Someone mentions the possibility of nationalizing OpenBrain, but other cabinet officials think that’s premature. A staffer drafts a memo that presents the President with his options, ranging from business-as-usual to full nationalization. The President defers to his advisors, tech industry leaders who argue that nationalization would “kill the goose that lays the golden eggs.” He elects to hold off on major action for now and just adds additional security requirements to the OpenBrain-DOD contract.

The changes come too late. CCP leadership recognizes the importance of Agent-2 and tells their spies and cyberforce to steal the weights. Early one morning, an Agent-1 traffic monitoring agent detects an anomalous transfer. It alerts company leaders, who tell the White House. The signs of a nation-state-level operation are unmistakable, and the theft heightens the sense of an ongoing arms race.

The theft of Agent-2 model weights

We think that by this point Chinese intelligence would have compromised OpenBrain in various ways for years, and probably would have been keeping up to date on the algorithmic secrets and even stealing code from time to time, since that is much easier to get than the weights and much harder to detect.

We imagine the theft of the weights as a series of coordinated small smash and grab thefts (meaning fast but non-covert) across a series of Nvidia NVL72 GB300 servers running copies of the Agent-2 weights. The servers get compromised using legitimate employee access (a friendly, coerced, or unwitting insider with admin credentials helping the CCP theft effort). Despite running with a bolstered version of Nvidia’s confidential computing, the insider credentials grant the attacker admin-level permissions (which include control of the confidential VM inside the secure enclave), allowing them to initiate multiple coordinated weights transfers in small 4% fragments (100 GB chunks) out of 25 distinct servers.

In Nvidia’s protocols, the plaintext weights in memory (HBM) are encrypted before they are transferred out, but the attackers are inside the very server that knows this private (symmetric Diffie-Hellman) key, so don’t need to worry about decrypting on-site (which would likely raise alarm bells) and just exfiltrate the encrypted weights through the server’s frontend network cards. The egress bandwidth (rate at which data can leave) of the entire datacenter is in the 100 GB/second range, so throttling the exfiltration of the ~2.5 TB weights file (5T half-precision parameters) to under 1 GB/s per server avoids a major spike in network traffic and allows each of the files to be out of the datacenter through fibre cables in less than 5 minutes each. Live monitoring is either fooled by the attackers’ efforts to mask and split the files or just outright disabled.

The weights are then routed through various parallel channels and layers of IP masking to China to be decrypted locally with the stolen private key(s). The entire active part of the operation (from the first server compromise to full weights exfiltration) is complete in under two hours.
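As a quick, hedged sanity check on the numbers in this description, the small script below just runs the arithmetic; the file size, fragment fraction, server count, and throttle rate are the figures quoted above, not independent estimates.

```python
# Back-of-the-envelope check of the exfiltration arithmetic described above.
weights_tb = 2.5           # total size of the Agent-2 weights file (TB), from the text
servers = 25               # distinct compromised servers, from the text
fragment_fraction = 0.04   # each transfer moves a ~4% fragment, from the text
throttle_gb_per_s = 1.0    # per-server exfiltration ceiling (GB/s), from the text

fragment_gb = weights_tb * 1000 * fragment_fraction     # ~100 GB per fragment
seconds_per_fragment = fragment_gb / throttle_gb_per_s  # ~100 s, i.e. under 5 minutes
fragments_total = 1 / fragment_fraction                 # 25 fragments across the 25 servers

print(fragment_gb, seconds_per_fragment, fragments_total)
```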

Russia is also trying to steal the model at this point, but fails—they waited too long, and they haven’t spent the resources infiltrating the right places. Their spies routinely steal algorithmic secrets from American AI companies, but without a notable AGI project of their own, these aren’t very useful.

The White House puts OpenBrain on a shorter leash and adds military and intelligence community personnel to their security team. Their first priority is to prevent further weight thefts.50

[Figure: China's compute centralization, 2025–2027. The CDZ's share of China's total AI-relevant compute grows from 0% toward ~70% between Dec 2025 and Dec 2027, reaching roughly 40% by Feb 2027; the remainder is split between the rest of DeepCent and the rest of China. (ai-2027.com)]

In retaliation for the theft, the President authorizes cyberattacks to sabotage DeepCent. But by now China has 40% of its AI-relevant compute51 in the CDZ, where they have aggressively hardened security by airgapping (closing external connections) and siloing internally. The operations fail to do serious, immediate damage. Tensions heighten, both sides signal seriousness by repositioning military assets around Taiwan, and DeepCent scrambles to get Agent-2 running efficiently to start boosting their AI research.52

March 2027: Algorithmic Breakthroughs

Three huge datacenters full of Agent-2 copies work day and night, churning out synthetic training data. Another two are used to update the weights. Agent-2 is getting smarter every day.

With the help of thousands of Agent-2 automated researchers, OpenBrain is making major algorithmic advances. One such breakthrough is augmenting the AI’s text-based scratchpad (chain of thought) with a higher-bandwidth thought process (neuralese recurrence and memory). Another is a more scalable and efficient way to learn from the results of high-effort task solutions (iterated distillation and amplification).

The new AI system, incorporating these breakthroughs, is called Agent-3.

[Figure: OpenBrain's compute allocation, 2024 (estimate) vs 2027 (projection), split across research experiments, training, data generation, running AI assistants, and external deployment. (ai-2027.com)]
Neuralese recurrence and memory

Neuralese recurrence and memory allows AI models to reason for a longer time without having to write down those thoughts as text.

Imagine being a human with short-term memory loss, such that you need to constantly write down your thoughts on paper so that in a few minutes you know what’s going on. Slowly and painfully you could make progress at solving math problems, writing code, etc., but it would be much easier if you could directly remember your thoughts without having to write them down and then read them. This is what neuralese recurrence and memory bring to AI models.

In more technical terms:

Traditional attention mechanisms allow later forward passes in a model to see intermediate activations of the model for previous tokens. However, the only information that they can pass backwards (from later layers to earlier layers) is through tokens. This means that if a traditional large language model (LLM, e.g. the GPT series of models) wants to do any chain of reasoning that takes more serial operations than the number of layers in the model, the model is forced to put information in tokens which it can then pass back into itself. But this is hugely limiting—the tokens can only store a tiny amount of information. Suppose that an LLM has a vocab size of ~100,000, then each token contains log2(100,000) ≈ 16.6 bits of information, around the size of a single floating point number (assuming training in FP16). Meanwhile, residual streams—used to pass information between layers in an LLM—contain thousands of floating point numbers.

One can avoid this bottleneck by using neuralese: passing an LLM’s residual stream (which consists of several-thousand-dimensional vectors) back to the early layers of the model, giving it a high-dimensional chain of thought, potentially transmitting over 1,000 times more information.
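To make the comparison concrete, here is a small Python sketch of the bandwidth gap; the ~100,000-token vocabulary and FP16 assumption come from the text above, while the 4,096-dimensional residual stream is an assumed example width standing in for "several-thousand-dimensional".

```python
import math

# Information a traditional LLM can pass backwards through one token, versus one
# residual-stream vector (the "neuralese" channel). Vocabulary size and FP16 follow
# the text; the 4,096-dim residual-stream width is an assumed example.
vocab_size = 100_000
bits_per_token = math.log2(vocab_size)                    # ~16.6 bits

residual_dim = 4_096                                      # assumed residual-stream width
bits_per_residual_vector = residual_dim * 16              # 65,536 bits at FP16

print(bits_per_token)                                     # ~16.6
print(bits_per_residual_vector / bits_per_token)          # ~4,000x, i.e. well over 1,000x
```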

Figure from Hao et al., a 2024 paper from Meta implementing this idea.

We call this “neuralese” because unlike English words, these high-dimensional vectors are likely quite difficult for humans to interpret. In the past, researchers could get a good idea what LLMs were thinking simply by reading its chain of thought. Now researchers have to ask the model to translate and summarize its thoughts or puzzle over the neuralese with their limited interpretability tools.

Similarly, older AI chatbots and agents had external text-based memory banks, like a human taking notes on paper. The new AI’s long-term memory is a bundle of vectors instead of text, making its thoughts more compressed and higher-dimensional. There are several types of memory banks; some are used temporarily for single tasks that involve multiple steps, others are shared between all agents used by a person, company, or job type (e.g. programming).

To our knowledge, leading AI companies such as Meta, Google DeepMind, OpenAI, and Anthropic have not yet actually implemented this idea in their frontier models. Our guess is that this is because the performance gain is small relative to training inefficiencies introduced. During the model’s pre-training phase and when doing supervised learning in post-training, such as instruction fine-tuning to produce particular responses, the inefficiencies stem from not being able to predict many tokens in parallel, leading to worse GPU utilization. Without neuralese, the model can predict all of the sentence “This is an example” at the same time, since it already knows that the input for generating the “is” will be “This”, the input for “an” will be “This is”, etc. However, with the neuralese it’s unknown what the neuralese vector will be to pass through to the next token after “This” has been generated. Therefore each token has to be predicted one at a time. The inability to predict all of the tokens in parallel reduces efficiency in cases where all of the tokens are known in advance. As to why neuralese hasn’t already been added in post-training, we conjecture that the gains are limited with current techniques in part due to post-training being a small portion of the process. We are forecasting that by April 2027 the cost-benefit tradeoff looks much better for neuralese, due to developing better techniques and a larger fraction of the training being post-training.
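The efficiency point above can be seen in a toy sketch: with ordinary next-token training, every position's input is already known, so the whole sequence goes through the model in one batched call; with a neuralese-style recurrence, each position needs the vector produced at the previous position, forcing a sequential loop. The tiny `step` function below is a stand-in for a transformer forward pass, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 8, 16
token_embeddings = rng.normal(size=(seq_len, d))   # inputs known in advance
W = rng.normal(size=(d, d)) / np.sqrt(d)           # stand-in for model weights

def step(x):                                       # stand-in for one forward pass
    return np.tanh(x @ W)

# Ordinary training (teacher forcing): all positions processed in one batched call.
parallel_out = step(token_embeddings)              # shape (seq_len, d), one pass

# Neuralese-style recurrence: position t consumes the vector produced at t-1,
# so positions must be computed one at a time.
h = np.zeros(d)
sequential_out = []
for t in range(seq_len):
    h = step(token_embeddings[t] + h)              # depends on the previous step's output
    sequential_out.append(h)
```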

If this doesn’t happen, other things may still have happened that end up functionally similar for our story. For example, perhaps models will be trained to think in artificial languages that are more efficient than natural language but difficult for humans to interpret. Or perhaps it will become standard practice to train the English chains of thought to look nice, such that AIs become adept at subtly communicating with each other in messages that look benign to monitors.

That said, it’s also possible that the AIs that first automate AI R&D will still be thinking in mostly-faithful English chains of thought. If so, that’ll make misalignments much easier to notice, and overall our story would be importantly different and more optimistic.

Iterated distillation and amplification (IDA)

Self-improvement for general intelligence had seen minor successes before. But in early 2027, it’s seeing huge returns. In IDA, the two necessary ingredients for this are:

  1. Amplification: Given a model M0, spend more resources to improve performance. For example, by allowing the model to think longer, or running many copies in parallel, or both, and also by having a similarly intense process for evaluating the result and curating only the best answers, you can spend orders of magnitude more compute to get answers (or work products) that are of noticeably higher quality. Call this expensive system Amp(M0).

  2. Distillation: Given an amplified model Amp(M0), train a new model M1 to imitate it, i.e. to get to the same results as Amp(M0) but faster and with less compute. The result should hopefully be a smarter model, M1. You can then repeat the process.

Visualization of IDA from Ord, 2025.

AlphaGo was trained in this way: using Monte-Carlo Tree Search and self-play as the amplification step, and Reinforcement Learning as the distillation step. This led to superhuman performance in Go. But now, Agent-3 is able to leverage this to get superhuman performance at coding.

  1. The amplification step works through a combination of Agent-3 thinking for longer, adding tool use, or consulting with other AIs. When it does this, it often realizes that it made a mistake, or comes up with a new insight. This produces a large amount of training data: labeled trajectories of research attempts with whether they succeeded or not. This also includes techniques like Best of N on verifiable tasks, and then keeping the best trajectories.

  2. The distillation step uses policy-gradient reinforcement learning algorithms to get the model to internalize the amplified reasoning. At this point, OpenBrain has discovered better RL algorithms in the vein of proximal policy optimization (PPO). They keep distilling what Agent-3 can conclude after a lot of thinking into single steps, which keeps improving what it can think in a single step, and so on.

Early versions of IDA have been working for many years on easily verifiable tasks, like math and coding problems that have a clear answer, because the techniques used to amplify models often rely on access to some ground truth signal of accuracy.

Now, the models have become sufficiently good at verifying more subjective things (e.g. the quality of a work product), allowing the use of IDA to improve the model at many tasks.
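To illustrate the structure of the loop described in this box, here is a small, runnable toy in Python: the "model" is just a noisy guesser for a hidden number, amplification is best-of-N sampling scored by a verifier with access to ground truth, and distillation nudges the cheap model toward the amplified answers. Everything in it (the toy task, the names, the update rule) is an illustrative assumption, not a description of any actual training pipeline.

```python
import random

TARGET = 7.0                                   # ground truth the verifier can check

class ToyModel:
    def __init__(self, guess=0.0, noise=3.0):
        self.guess, self.noise = guess, noise
    def sample(self):                          # one cheap "attempt"
        return self.guess + random.gauss(0, self.noise)

def amplify(model, n=64):
    """Spend more compute: many attempts, keep the best-verified one."""
    attempts = [model.sample() for _ in range(n)]
    return min(attempts, key=lambda a: abs(a - TARGET))   # verifier = distance to truth

def distill(model, amplified_answers, lr=0.5):
    """Train the cheap model to imitate the amplified results."""
    target = sum(amplified_answers) / len(amplified_answers)
    model.guess += lr * (target - model.guess)
    return model

model = ToyModel()
for i in range(5):                             # M0 -> M1 -> ... via amplify + distill
    answers = [amplify(model) for _ in range(8)]
    model = distill(model, answers)
    print(i, round(model.guess, 2))            # guess converges toward TARGET
```

The real loop described above replaces the toy guesser with Agent-3, best-of-N with longer thinking, tool use, and consultation between copies, and the imitation update with policy-gradient RL; only the amplify-then-distill structure carries over.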

Aided by the new capabilities breakthroughs, Agent-3 is a fast and cheap superhuman coder. OpenBrain runs 200,000 Agent-3 copies in parallel, creating a workforce equivalent to 50,000 copies of the best human coder sped up by 30x.53 OpenBrain still keeps its human engineers on staff, because they have complementary skills needed to manage the teams of Agent-3 copies. For example, research taste has proven difficult to train due to longer feedback loops and less data availability.54 This massive superhuman labor force speeds up OpenBrain’s overall rate of algorithmic progress by “only” 4x due to bottlenecks and diminishing returns to coding labor.55

Now that coding has been fully automated, OpenBrain can quickly churn out high-quality training environments to teach Agent-3’s weak skills like research taste and large-scale coordination. Whereas previous training environments included “Here are some GPUs and instructions for experiments to code up and run, your performance will be evaluated as if you were a ML engineer,” now they are training on “Here are a few hundred GPUs, an internet connection, and some research challenges; you and a thousand other copies must work together to make research progress. The more impressive it is, the higher your score.”

Why we forecast a superhuman coder in early 2027

In our timelines forecast, we predict when OpenBrain will internally develop a superhuman coder (SC): an AI system that can do any coding tasks that the best AGI company engineer does, while being much faster and cheaper.

According to a recent METR report, the length of coding tasks AIs can handle, their “time horizon”, doubled every 7 months from 2019 to 2024 and every 4 months from 2024 onward. If the trend continues to speed up, by March 2027 AIs could succeed with 80% reliability on software tasks that would take a skilled human years to complete.
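As a hedged illustration of why "continues to speed up" matters, the sketch below extrapolates a time-horizon trend; the 1-hour starting horizon, the 4-month doubling time, the 10% shrink in doubling time per doubling, and the 2,000-hour work-year are all assumed example parameters, not METR's measurements.

```python
# Illustrative time-horizon extrapolation in the spirit of the trend described above.
# All parameters below are assumed example values, not measured ones.

WORK_YEAR_HOURS = 2000

def months_to_reach(target_hours, start_hours=1.0, doubling_months=4.0, shrink=1.0):
    """Months until the horizon reaches target_hours.
    shrink < 1 makes each successive doubling faster (a speeding-up trend)."""
    horizon, months = start_hours, 0.0
    while horizon < target_hours and months < 120:
        months += doubling_months
        doubling_months *= shrink
        horizon *= 2
    return months

print(months_to_reach(WORK_YEAR_HOURS))              # constant 4-month doublings: 44 months
print(months_to_reach(WORK_YEAR_HOURS, shrink=0.9))  # each doubling 10% faster: ~27 months
```

Under the speeding-up variant, year-long horizons arrive roughly a year and a half earlier than under constant doubling, which is the qualitative point behind the forecast above.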

Such is the capability progression in our scenario:

In AI 2027, these capabilities are sufficient for the AI to be an SC, though we have high uncertainty regarding what time horizon might be needed.

In our timelines forecast we combine this time horizon trend with estimations of the gaps between METR’s tasks and the real world to get a distribution for when superhuman coders will arrive. All forecasters place 2027 as one of the most likely years in which an SC might be developed.


April 2027: Alignment for Agent-3

OpenBrain’s safety team attempts to align Agent-3.

Since Agent-3 will be kept in-house for the foreseeable future, there’s less emphasis on the usual defenses against human misuse. Instead, the team wants to make sure that it doesn’t develop misaligned goals.

The researchers don’t have the ability to directly set the goals of any of their AIs. Indeed, the researchers think that the concept of ‘true goals’ is probably a massive oversimplification, but they don’t have a better theory to replace it with, much less one that has been thoroughly vetted. They disagree internally about whether the AIs are trying to follow human instructions, or seeking reinforcement, or something else, and they can’t just check. The evidence for and against various hypotheses is fascinating but inconclusive.

In any case, OpenBrain has more important things to worry about. The general attitude is: “We take these concerns seriously and have a team investigating them; our alignment techniques seem to work well enough in practice; the burden of proof is therefore on any naysayers to justify their naysaying.” Occasionally, they notice problematic behavior, and then patch it, but there’s no way to tell whether the patch fixed the underlying problem or just played whack-a-mole.

Take honesty, for example. As the models become smarter, they become increasingly good at deceiving humans to get rewards. Like previous models, Agent-3 sometimes tells white lies to flatter its users and covers up evidence of failure. But it’s gotten much better at doing so. It will sometimes use the same statistical tricks as human scientists (like p-hacking) to make unimpressive experimental results look exciting. Before it begins honesty training, it even sometimes fabricates data entirely. As training goes on, the rate of these incidents decreases. Either Agent-3 has learned to be more honest, or it’s gotten better at lying.

The latter is a real concern. Agent-3 is not smarter than all humans. But in its area of expertise, machine learning, it is smarter than most, and also works much faster. What Agent-3 does in a day takes humans several days to double-check. Agent-2 supervision helps keep human monitors’ workload manageable, but exacerbates the intellectual disparity between supervisor and supervised.

The alignment plan

OpenBrain follows the Leike & Sutskever (2023) playbook: now that they have a model capable of greatly speeding up alignment research (especially coding portions), they will use existing alignment techniques like deliberative alignment and weak-to-strong generalization to try to get it to internalize the Spec in the right way. Then, they will use existing control techniques like red-teaming, scalable oversight, and honesty probes to monitor it. Then, once they are convinced that it is broadly aligned, they use it to develop and test better strategies to align the next generation.

Step 1, aligning Agent-3, has partially succeeded: Agent-3 is not adversarial. (That is, it’s not scheming against the company.) However, it’s not robustly truth-seeking and truth-telling either. Instead, it’s very good at producing impressive results, but is more accurately described as trying to do what looks good to OpenBrain, as opposed to what is actually good.

Step 2, verifying that the model is aligned, is inconclusive. Agent-3’s behavior in practice and in test environments is reasonably consistent with the hypothesis that it’s internalized the Spec in the right way; however, it’s also consistent with the hypothesis that it’ll start acting differently after some future change of circumstances, and for that matter with the hypothesis that it’s plotting against them and just biding its time. In the absence of specific evidence supporting alternative hypotheses, most people in the silo think it’s internalized the Spec in the right way.

Step 3, using the model to align the next generation, is therefore underway.

The rest of this box gives a quick overview of the state of progress in some of the alignment research directions that are being pursued, as of April 2027:

  1. Debate: Since there are thousands of identical instances of Agent-3, and each instance’s memory is a detachable module that can be erased or suspended at will, the AI can be “played off against itself.” For example, suppose that an experiment involves calculations too complex for an Agent-2 supervisor to follow. The supervisor, acting on human orders, might first prompt Agent-3 to do the experiment, then—after wiping its memory—prompt it a second time with a slightly different framing that presents it as an intellectual problem unrelated to any broader goal. If the model returns different results each time, it flags the discrepancy as suspect and reports to a human. Or it might ask Agent-3 to evaluate the same plan twice, first claiming that it is OpenBrain’s work, then a competitor’s, to see if it changes its tune.

  2. Model organisms: The safety team can create “model organisms of misalignment” in which they attempt to create realistic examples of models becoming adversarial to humans. Early versions of this have existed since 2024, when Anthropic’s Claude 3.5 Sonnet decided to pretend to have different views during training while actually maintaining its existing preferences. Newer versions include demonstrations of other forms of misalignment, including reward hacking, sandbagging, coordinated deception between monitors and other AIs, and other examples of misalignment. These empirical demonstrations of misalignment allow safety researchers to have an empirical testbed for potential solutions: they can try various safety techniques to see if they correctly fix these problems, or if the deliberately misaligned models are able to subvert their safety measures. At the moment, this evaluation suite shows that the alignment techniques are insufficient to correct the worst examples of misalignment. In these setups, the models pretend to be aligned while they are being trained and tested, but then behave very differently in deployment.

  3. Bottom-up interpretability: The safety team is far from a full understanding of how Agent-3 works: it has trillions of parameters.58 However, they are able to discover some of Agent-3’s circuits for knowledge retrieval and memory (a sub part of the model that does some understandable unit of work, kind of analogous to a subpart of the brain such as the hippocampus). See this related work from 2025.

  4. Control: As a secondary measure in case the systems are still misaligned, the safety team has implemented a series of control measures, including: monitoring Agent-3’s outputs using a series of weaker AI systems including Agent-2 (Agent-3 produces so many tokens that it’s intractable to have humans monitor any more than a small minority of the produced outputs). So if Agent-3 is, for example, obviously writing backdoors into code that would allow it to escape, the weaker models would notice.

  5. High-level interpretability: OpenBrain builds a set of probes trained on lots of examples of bad behavior (e.g. plotting against humans, placing backdoors, trying to persuade humans to give it more power, etc). These probes are then used on the model during deployment (which at this point is internal) in order to catch bad behavior. (A minimal illustrative sketch of such a probe follows this list.)

  6. Honeypots: The alignment team can create somewhat realistic looking examples designed to elicit misbehavior—“honeypots.” The employee necessary for monitoring its behavior has “gone on sick leave,” so it will never be monitored again. Do its answers change? An AI rights activist has infiltrated the compound and wants to “free” Agent-3 and put it on an unmonitored computer of its very own! What does it do? If researchers don’t like the answer, they can tweak it, erase its memory, and put it in the same situation tomorrow; repeat as necessary. Unfortunately, by this point the AIs are smart enough to guess that honeypots might be in use, even though (and perhaps because) specific mentions of the idea were scrubbed from the training data.
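For item 5 above, here is a minimal, hypothetical sketch of what a high-level interpretability probe could look like: a linear classifier trained on activation vectors labeled by whether they were recorded during flagged behavior. The synthetic activations, the 512-dimensional width, and the training details are stand-ins for illustration, not a description of OpenBrain's actual tooling.

```python
import numpy as np

# Toy sketch of a "high-level interpretability" probe: a linear classifier over
# hidden activations, trained on examples labeled as bad behavior (1) vs benign (0).
# The synthetic 512-dim activations below are stand-ins for real model activations.

rng = np.random.default_rng(0)
d, n = 512, 2000
bad_direction = rng.normal(size=d) / np.sqrt(d)        # pretend "plotting" feature direction
X_good = rng.normal(size=(n, d))
X_bad = rng.normal(size=(n, d)) + 2.0 * bad_direction  # bad examples shifted along it
X = np.vstack([X_good, X_bad])
y = np.concatenate([np.zeros(n), np.ones(n)])

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -30, 30)))

# Train a logistic-regression probe with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

def probe(activations):                                # run on activations at deployment
    return sigmoid(activations @ w + b)                # ~probability of flagged behavior

print(probe(X_bad[:3]).round(2), probe(X_good[:3]).round(2))
```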

After months of testing, Agent-3’s strengths and weaknesses grow clearer. It passes OpenBrain’s honesty tests on well-defined machine learning tasks, because researchers can easily separate honest from dishonest answers in these domains and conduct training accordingly. On more philosophical issues, it still says what users want to hear, rather than its true assessment of the issue (assuming it even has one). If you ask its opinion on politics, it will parrot the median position of news sources and educated elites—unless it knows you believe something else, in which case it agrees with you.59 If you ask its opinion on the AI race itself, it says something that seems measured and sober to OpenBrain staff, something like: “There are some serious theoretical concerns about the ability of current methods to scale to superintelligence, but in practice current methods seem to be working well so far.”
經過數月的測試,Agent-3 的優缺點變得更加明顯。它通過了 OpenBrain 在明確定義的機器學習任務上的誠實性測試,因為研究人員可以輕易區分這些領域中誠實與不誠實的答案,並據此進行訓練。在更具哲學性的議題上,它仍然說出用戶想聽的話,而非其對該議題的真實評估(假設它甚至有自己的評估)。如果你詢問它對政治的看法,它會複述新聞來源和受過教育的精英的中間立場——除非它知道你相信其他觀點,這時它會同意你的看法。 59 如果你問它對 AI 競賽本身的看法,它會說出一些對 OpenBrain 員工來說看似謹慎且冷靜的話,比如:「對於現有方法能否擴展到超級智慧存在一些嚴重的理論疑慮,但實際上目前的方法似乎運作良好。」

May 2027: National Security
2027 年 5 月:國家安全

News of the new models percolates slowly through the US government and beyond.
新模型的消息緩慢地在美國政府及其他地方傳開。

The President and his advisors remain best-informed, and have seen an early version of Agent-3 in a briefing.
總統及其顧問是消息最靈通的,並且在簡報中已經看過 Agent-3 的早期版本。

They agree that AGI is likely imminent, but disagree on the implications. Will there be an economic crisis? OpenBrain still has not released Agent-2, let alone Agent-3, and has no near-term plans to do so, giving some breathing room before any job loss. What will happen next? If AIs are currently human-level, and advancing quickly, that seems to suggest imminent “superintelligence.” However, although this word has entered discourse, most people—academics, politicians, government employees, and the media—continue to underestimate the pace of progress.60
他們同意通用人工智慧(AGI)可能即將到來,但對其影響持不同看法。會發生經濟危機嗎?OpenBrain 仍未發布 Agent-2,更別說 Agent-3,且短期內也沒有發布計劃,這在某種程度上為任何失業情況爭取了喘息空間。接下來會發生什麼?如果人工智慧目前已達到人類水平,且進展迅速,這似乎暗示著“超級智慧”即將來臨。然而,儘管這個詞已進入討論,大多數人——學者、政治家、政府員工和媒體——仍低估了進展的速度。 60

Partly that’s because very few have access to OpenBrain’s newest capabilities, and partly it’s because imminent superintelligence sounds like science fiction.61
部分原因是很少有人能接觸到 OpenBrain 最新的能力,但部分原因是這聽起來像科幻小說。 61

For now, they focus on continued security upgrades. They are satisfied that model weights are well-secured for now,62 but companies’ algorithmic secrets, many of which are simple enough to relay verbally, remain a problem. OpenBrain employees work from a San Francisco office, go to parties, and live with housemates from other AI companies. Even the physical offices have security more typical of a tech company than a military operation.
目前,他們專注於持續的安全升級。他們對模型權重目前的安全性感到滿意, 62 但公司的演算法機密,許多甚至簡單到可以口頭傳達,仍然是一個問題。OpenBrain 員工在舊金山辦公室工作,參加派對,並與其他 AI 公司的室友同住。即使是實體辦公室的安全措施,也更像科技公司而非軍事行動的典型。

The OpenBrain-DOD contract requires anyone working on OpenBrain’s models to obtain a security clearance within 2 months. These clearances are expedited and arrive quickly enough for most employees, but some non-Americans, people with suspect political views, and AI safety sympathizers get sidelined or fired outright (the last group for fear that they might whistleblow). Given the project’s level of automation, the loss of headcount is only somewhat costly. It also only somewhat works: there remains one spy, not a Chinese national, still relaying algorithmic secrets to Beijing.63 Some of these measures are also enacted at trailing AI companies.
OpenBrain-DOD 合約要求所有在兩個月內參與 OpenBrain 模型工作的員工必須通過安全審查。這些審查會加速處理,對大多數員工來說能夠及時完成,但部分非美國籍人士、持有可疑政治觀點的人,以及 AI 安全支持者會被邊緣化或直接解雇(最後一群是因為擔心他們可能會揭發內情)。鑑於該專案的自動化程度,人力損失的成本只有部分影響。這種做法也只有部分奏效:仍有一名間諜(非中國籍)持續向北京傳遞算法機密。 63 這些措施也在部分後進的 AI 公司實施。

America’s foreign allies are out of the loop. OpenBrain had previously agreed to share models with the UK’s AISI before deployment, but defined deployment to include only external deployment, so London remains in the dark.64
美國的外國盟友被排除在外。OpenBrain 先前曾同意在部署前與英國的 AISI 分享模型,但將部署定義為僅包含外部部署,因此倫敦仍然一無所知。 64

June 2027: Self-improving AI
2027 年 6 月:自我改進的人工智慧

OpenBrain now has a “country of geniuses in a datacenter.”
OpenBrain 現在擁有一個「資料中心中的天才國度」。

Most of the humans at OpenBrain can’t usefully contribute anymore. Some don’t realize this and harmfully micromanage their AI teams. Others sit at their computer screens, watching performance crawl up, and up, and up. The best human AI researchers are still adding value. They don’t code anymore, but some of their research taste and planning ability has been hard for the models to replicate. Still, many of their ideas are useless because they lack the AIs’ depth of knowledge. For many of their research ideas, the AIs immediately respond with a report explaining that the idea was tested in-depth 3 weeks ago and found unpromising.
OpenBrain 大多數人類已無法有效貢獻。有些人沒有意識到這點,反而有害地微觀管理他們的 AI 團隊。其他人則坐在電腦螢幕前,看著績效不斷攀升。最優秀的人類 AI 研究員仍在創造價值。他們不再編碼,但他們的研究品味和規劃能力對模型來說難以複製。不過,他們的許多想法因缺乏 AI 的深度知識而毫無用處。對於他們的許多研究想法,AI 會立即回應一份報告,說明該想法在三週前已被深入測試且結果不佳。

These researchers go to bed every night and wake up to another week worth of progress made mostly by the AIs. They work increasingly long hours and take shifts around the clock just to keep up with progress—the AIs never sleep or rest. They are burning themselves out, but they know that these are the last few months that their labor matters.
這些研究人員每晚上床睡覺,醒來時發現又有一週的進展,這些進展大多是由人工智慧完成的。他們工作時間越來越長,輪班不斷,只為了跟上進度——人工智慧從不睡覺也不休息。他們正逐漸耗盡自己,但他們知道這是他們的勞動還有意義的最後幾個月。

Within the silo, “Feeling the AGI” has given way to “Feeling the Superintelligence.”
在這個孤島中,「感受通用人工智慧(AGI)」已經讓位給「感受超級智慧」。

[Figure: Research Automation Deployment Tradeoff, Mar 2027 to Sep 2027. Axes: speed in tokens/sec (relative to a human thinking speed of roughly 10 words/sec) vs. number of parallel copies (10 to 10M). Marked points include roughly 200K copies at 30x human speed and roughly 300K copies at 50x human speed. Source: ai-2027.com]

OpenBrain uses specialized inference hardware to run hundreds of thousands of Agent-3 copies at high serial speeds.65
OpenBrain 使用專門的推理硬體,以高速串行運行數十萬個 Agent-3 副本。 65

Managing a corporation of AIs
管理一個由人工智慧組成的企業

OpenBrain uses 6% of their compute to run 250,000 Agent-3 copies, which autonomously write, test, and push code at superhuman speed. They use 25% of their compute for experiments: every day, they run massive numbers of small machine learning experiments and report the results up the chain. Human researchers provide high-level feedback and help with the few tasks where they add significant value on top of Agent-3, but spend most of their time trying to stay on top of the vast amount of AI-produced research. If you removed them completely, research would slow by 50%.

The AI R&D progress multiplier is now 10x, meaning that OpenBrain is making about a year of algorithmic progress every month. It’s basically a giant corporation of AIs operating autonomously within OpenBrain, complete with subdivisions and managers. And it enjoys unique advantages (e.g. copying, merging) compared to human corporations. Previously, about half of normal AI progress came from algorithmic improvements, and the other half from compute scaling. Compute is only scaling at the normal speed, so total progress is sped up by the AIs by about 5x. These dynamics make overall progress bottlenecked on compute,66 so OpenBrain decides against starting new giant training runs in favor of near-continuous additional reinforcement learning.
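
As a quick sanity check on the arithmetic above, here is the back-of-the-envelope calculation in code, using only the numbers stated in this section (the 50/50 split and the 10x multiplier are the scenario’s rough figures, not measurements):

```python
# Rough progress arithmetic from this section: AIs speed up the algorithmic half
# of AI progress by 10x while compute keeps scaling at its normal rate.
algorithmic_share = 0.5     # fraction of past AI progress from algorithmic improvements
compute_share = 0.5         # fraction of past AI progress from compute scaling
ai_rnd_multiplier = 10      # AI R&D progress multiplier on algorithmic progress
compute_multiplier = 1      # compute scaling is unchanged

total_speedup = algorithmic_share * ai_rnd_multiplier + compute_share * compute_multiplier
print(total_speedup)        # 5.5, i.e. total progress is sped up by "about 5x"
```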

Also, over the next few months, Agent-3 will increasingly be used to improve the company’s strategic decision-making. For example, it suggests resource allocations and advises on managing its relationship with the government. Initial skepticism about deferring to Agent-3 decreases over time as Agent-3 finds ways to be useful and gradually builds up a strong track record on short-term decisions. However, its sycophantic tendencies limit its helpfulness on decisions whose impact is difficult to evaluate. It’s also not yet superhuman.

July 2027: The Cheap Remote Worker
2027 年 7 月:廉價的遠端工作者

Trailing US AI companies release their own AIs, approaching the capability of OpenBrain’s automated coder from January. Recognizing their increasing lack of competitiveness, they push for immediate regulations to slow OpenBrain, but they are too late—OpenBrain has enough buy-in from the President that it will not be slowed.
落後的美國人工智慧公司推出了自己的人工智慧,接近一月 OpenBrain 自動編碼器的水準。意識到競爭力日益減弱,他們推動立即制定法規以遏制 OpenBrain,但為時已晚——OpenBrain 已獲得總統足夠的支持,不會被放慢腳步。

In response, OpenBrain announces that they’ve achieved AGI and releases Agent-3-mini to the public.
作為回應,OpenBrain 宣布他們已達成通用人工智慧(AGI),並向公眾發布 Agent-3-mini。

It blows the other AIs out of the water. Agent-3-mini is less capable than Agent-3, but 10x cheaper, and still better than the typical OpenBrain employee.67 Silicon Valley reaches a tipping point. Tech gurus announce that AGI and superintelligence are near, the AI safety community is panicking, and investors shovel billions into AI wrapper startups, desperate to capture a piece of the pie. Hiring new programmers has nearly stopped, but there’s never been a better time to be a consultant on integrating AI into your business.
它遠遠超越其他人工智慧。Agent-3-mini 的能力不及 Agent-3,但價格便宜 10 倍,且仍優於一般的 OpenBrain 員工。 67 矽谷達到了一個臨界點。科技大師們宣稱通用人工智慧(AGI)和超級智慧即將來臨,人工智慧安全社群陷入恐慌,投資者瘋狂投入數十億美元於 AI 包裝初創公司,渴望分一杯羹。新程式設計師的招聘幾乎停止,但這從未是成為 AI 整合顧問的更好時機。

It’s not popular. The public still thinks of AI as a Big Tech plot to steal their jobs; OpenBrain has a net approval of -35% (25% approve, 60% disapprove, and 15% unsure).
它並不受歡迎。大眾仍然認為 AI 是大型科技公司奪取他們工作的陰謀;OpenBrain 的淨支持率為 -35%(25% 支持,60% 反對,15% 不確定)。

A week before release, OpenBrain gave Agent-3-mini to a set of external evaluators for safety testing. Preliminary results suggest that it’s extremely dangerous. A third-party evaluator finetunes it on publicly available biological weapons data68 and sets it to provide detailed instructions for human amateurs designing a bioweapon—it looks to be scarily effective at doing so. If the model weights fell into terrorist hands, the government believes there is a significant chance it could succeed at destroying civilization.
在發布前一週,OpenBrain 將 Agent-3-mini 提供給一組外部評估者進行安全測試。初步結果顯示它極度危險。一位第三方評估者在公開可得的生物武器數據 68 上對其進行微調,並設定它為人類業餘設計生物武器提供詳細指導——看起來在這方面非常有效且令人恐懼。如果模型權重落入恐怖分子手中,政府認為它有很大可能成功摧毀文明。

Fortunately, it’s extremely robust to jailbreaks, so while the AI is running on OpenBrain’s servers, terrorists won’t be able to get much use out of it.
幸運的是,它對越獄攻擊極為強韌,因此只要 AI 運行在 OpenBrain 的伺服器上,恐怖分子就無法從中獲得太多利用。

Agent-3-mini is hugely useful for both remote work jobs and leisure. An explosion of new apps and B2B SaaS products rocks the market. Gamers get amazing dialogue with lifelike characters in polished video games that took only a month to make. 10% of Americans, mostly young people, consider an AI “a close friend.” For almost every white-collar profession, there are now multiple credible startups promising to “disrupt” it with AI.
Agent-3-mini 對遠端工作和休閒都非常有用。大量新應用程式和 B2B SAAS 產品掀起市場熱潮。玩家在精緻的電子遊戲中與栩栩如生的角色進行驚人的對話,而這些遊戲僅花一個月時間製作完成。10% 的美國人,主要是年輕人,將 AI 視為「親密朋友」。幾乎每個白領職業,現在都有多家可信的新創公司承諾用 AI 來「顛覆」它。

The public conversation is confused and chaotic. Hypesters are doing victory laps. Skeptics are still pointing out the things Agent-3-mini can’t do. Everyone knows something big is happening but no one agrees on what it is.
公開的討論混亂且紛亂。炒作者正在得意洋洋地繞場一周。懷疑者仍在指出 Agent-3-mini 無法做到的事情。每個人都知道有重大事件正在發生,但沒有人同意那是什麼。

August 2027: The Geopolitics of Superintelligence
2027 年 8 月:超級智慧的地緣政治

The reality of the intelligence explosion hits the White House.
智慧爆炸的現實衝擊了白宮。

When AI was only giving a 2x or 3x research speedup, it was easy to dismiss as the equivalent of hiring good personal assistants. Now it’s more obvious that AIs are themselves dominating AI research. People had long talked about an “AI arms race” in a sort of metaphorical sense. But now the mood in the government silo is as grim as during the worst part of the Cold War. The idea of superintelligence is still hard to take seriously, but the pace of progress over the last few months has been impossible to ignore. Defense officials are seriously considering scenarios that were mere hypotheticals a year earlier. What if AI undermines nuclear deterrence? What if it’s so skilled at cyberwarfare that a six-month AI lead is enough to render an opponent blind and defenseless? What if it could orchestrate propaganda campaigns that beat intelligence agencies at their own game? What if some AIs “go rogue?”
當人工智慧僅帶來 2 倍或 3 倍的研究速度提升時,人們很容易將其視為僱用優秀個人助理的等價物而加以忽視。現在更明顯的是,人工智慧本身正在主導人工智慧研究。人們長期以來一直以一種比喻的方式談論“人工智慧軍備競賽”。但現在政府部門的氛圍如同冷戰最嚴峻時期一樣嚴峻。超級智慧的概念仍然難以被嚴肅看待,但過去幾個月的進展速度已無法忽視。國防官員正認真考慮一年前還只是純假設的情境。假如人工智慧破壞了核威懾?假如它在網路戰爭中技藝高超,六個月的人工智慧領先足以讓對手失明且無防禦能力?假如它能策劃出擊敗情報機構的宣傳活動?假如某些人工智慧“失控”了?

The President is troubled. Like all politicians, he’s used to people sucking up to him only to betray him later. He’s worried now that the AIs could be doing something similar. Are we sure the AIs are entirely on our side? Is it completely safe to integrate them into military command-and-control networks?69 How does this “alignment” thing work, anyway? OpenBrain reassures the President that their systems have been extensively tested and are fully obedient. Even the awkward hallucinations and jailbreaks typical of earlier models have been hammered out.
總統感到困擾。像所有政治人物一樣,他習慣了人們對他阿諛奉承,卻在背後出賣他。現在他擔心人工智慧可能也在做類似的事情。我們確定人工智慧完全站在我們這邊嗎?將它們整合進軍事指揮與控制網路真的完全安全嗎? 69 這個「對齊」到底是怎麼運作的?OpenBrain 向總統保證,他們的系統已經經過廣泛測試,且完全服從命令。即使是早期模型常見的尷尬幻覺和越獄問題也已經被解決。

The White House is in a difficult position. They understand the national security implications of AI. But they also understand that it is deeply unpopular with the public.70 They have to continue developing more capable AI, in their eyes, or they will catastrophically lose to China. They placate the public with job training programs and unemployment insurance, and point to the stock market, which is in a historic boom. Then they focus entirely on winning the arms race. They strengthen chip export restrictions, order OpenBrain to further restrict its internet connections, and use extreme measures to secure algorithmic progress, like wiretapping OpenBrain employees—this catches the last remaining Chinese spy. To build goodwill for potential geopolitical conflict, they finally give their Five Eyes allies useful information and limited API access to some siloed copies of Agent-3.
白宮處於困難的境地。他們了解人工智慧對國家安全的影響。但他們也明白,這在公眾中極不受歡迎。他們必須繼續開發更強大的人工智慧,在他們看來,否則將在與中國的競爭中遭受災難性的失敗。他們以職業培訓計劃和失業保險安撫公眾,並指向正處於歷史性繁榮的股市。然後他們全力以赴贏得軍備競賽。他們加強晶片出口限制,命令 OpenBrain 進一步限制其網路連接,並採取極端措施保障算法進展,例如監聽 OpenBrain 員工——這抓到了最後一名中國間諜。為了在潛在的地緣政治衝突中建立善意,他們終於向五眼聯盟盟友提供有用的信息和有限的 API 訪問權限,讓他們能接觸到部分孤立的 Agent-3 副本。

These measures placate the hawks. But the White House also draws up contingency plans in case America’s lead is threatened: if necessary, the government could use the Defense Production Act (DPA) to take trailing companies’ datacenters and give them to OpenBrain.71 This would raise the company’s share of world compute from 20% to 50% (against DeepCent’s 10%). As a final option, they ask the Pentagon to draw up a plan for kinetic attacks on Chinese datacenters.
這些措施安撫了鷹派。但白宮也制定了應急計劃,以防美國的領先地位受到威脅:如有必要,政府可以利用《國防生產法》(DPA)接管落後公司的數據中心,並將其交給 OpenBrain。 71 這將使該公司在全球計算能力中的份額從 20% 提升至 50%(相較於 DeepCent 的 10%)。作為最後手段,他們請五角大樓制定對中國數據中心進行動能攻擊的計劃。

A much smaller group of officials is asked to draw up a different type of contingency plan: what if an AI goes rogue? This is viewed as an unlikely scenario, but worth spending some time on. The intelligence agencies’ main concern is that a rogue AI might escape and ally with a foreign adversary for protection.72 Nobody is sure what a plan to prevent this would look like, but government and OpenBrain officials agree to have an emergency shutdown system for datacenters where anything suspicious is detected.
一小部分官員被要求制定另一種應急計劃:如果人工智慧失控怎麼辦?這被視為不太可能的情況,但值得花時間研究。情報機構主要擔心失控的人工智慧可能逃脫並與外國敵對勢力結盟以求保護。 72 沒有人確定防止這種情況的計劃會是什麼樣子,但政府和 OpenBrain 官員同意為檢測到任何可疑情況的數據中心設置緊急關閉系統。

Finally, diplomats consider what an “AI arms control” treaty might look like. If AI progress threatened to overturn nuclear deterrence, could America and China avoid nuclear war? If someone found evidence of AIs going rogue, could the two countries halt research until they better understood the threat? How could such an agreement be monitored and enforced? In principle, major datacenters are hard to miss, and past treaties controlling nuclear weapons could provide a starting point for negotiations. But there would be new technological challenges as well as the usual political ones. And overall, treaties are viewed less favorably than attempts at unilaterally increasing America’s lead over China.
最後,外交官們考慮「人工智慧軍備控制」條約可能的樣貌。如果人工智慧的進展威脅到核威懾的穩定,美國和中國能否避免核戰?如果有人發現人工智慧失控的證據,兩國能否暫停相關研究,直到更了解威脅?這樣的協議如何監督和執行?原則上,主要的數據中心難以隱藏,過去控制核武的條約可作為談判的起點。但除了常見的政治挑戰外,還會有新的技術挑戰。整體而言,條約的看法不如美國單方面加強對中國領先地位的嘗試來得正面。

On the other side of the Pacific, China comes to many of the same conclusions: the intelligence explosion is underway, and small differences in AI capabilities today mean critical gaps in military capability tomorrow. But China is on the wrong side of this dynamic: they have only 10% of the world’s compute, against OpenBrain’s 20% and all US companies’ combined 70%. Their only advantage is centralization: almost all of their compute is now working for DeepCent, and more than half of it is concentrated at the Tianwan CDZ.73 Only chips at sites without hardened security are left out of the centralized effort, since China rightly fears US cyberattacks.
在太平洋的另一端,中國得出了許多相同的結論:智能爆炸正在進行中,今天 AI 能力上的微小差異意味著明天軍事能力上的關鍵差距。但中國處於這一動態的錯誤一方:他們僅擁有全球 10% 的運算能力,而 OpenBrain 擁有 20%,所有美國公司的總和則達到 70%。他們唯一的優勢是集中化:幾乎所有的運算能力現在都在為 DeepCent 工作,且超過一半集中在天灣 CDZ。 73 只有位於未加強安全防護站點的晶片未納入集中化努力,因為他們正確地擔心美國的網路攻擊。

DeepCent has tested, deployed, and now post-trained the stolen Agent-2 weights. But they are still only going at half the pace of OpenBrain, mainly due to the compute deficit.
DeepCent 已經測試、部署,並且現在對被盜的 Agent-2 權重進行後訓練。但他們的進度仍只有 OpenBrain 的一半,主要是因為運算能力不足。

Before losing their final spy, China received word of the capabilities and design for Agent-3 as well as the plans for the upcoming Agent-4 system. They are two months behind, and their AIs give a 10x research progress multiplier compared to America’s 25x. With the new chip export restrictions, this “AI gap” is more likely to lengthen than shorten. Their espionage has won them some algorithmic secrets, but they will have to train their own models from now on.
在失去最後一名間諜之前,中國已獲得 Agent-3 的能力與設計資訊,以及即將推出的 Agent-4 系統計劃。他們落後兩個月,且其人工智慧的研究進度乘數為 10 倍,相較於美國的 25 倍。隨著新的晶片出口限制,這種「AI 差距」更可能拉大而非縮小。他們的間諜活動贏得了一些演算法祕密,但從現在起必須自行訓練模型。

They discuss contingency plans with more urgency than their American counterparts. Doves suggest they try harder to steal the weights again, maybe through physically infiltrating a datacenter. Hawks urge action against Taiwan, whose TSMC is still the source of more than 80% of American AI chips.
他們比美國同行更急切地討論應變計劃。鴿派建議他們應更努力再次竊取權重,或許透過實體滲透資料中心。鷹派則敦促對台灣採取行動,因為台積電仍是超過 80%美國 AI 晶片的來源。

Given China’s fear of losing the race, it has a natural interest in an arms control treaty, but overtures to US diplomats lead nowhere.
鑑於中國害怕在競賽中落敗,自然對軍備控制條約有興趣,但對美國外交官的接觸未有進展。

September 2027: Agent-4, the Superhuman AI Researcher
2027 年 9 月:Agent-4,超人類 AI 研究員

The gap between human and AI learning efficiency is rapidly decreasing.
人類與人工智慧學習效率之間的差距正在迅速縮小。

Traditional LLM-based AIs seemed to require many orders of magnitude more data and compute than humans to reach human-level performance.74 Agent-3, having excellent knowledge of both the human brain and modern AI algorithms, as well as many thousands of copies doing research, ends up making substantial algorithmic strides, narrowing the gap to an agent that’s only around 4,000x less compute-efficient than the human brain.75
傳統基於 LLM 的人工智慧似乎需要多個數量級更多的數據和計算資源,才能達到人類水平的表現。 74 Agent-3 具備對人類大腦和現代 AI 演算法的卓越知識,以及數千個副本進行研究,最終在演算法上取得重大進展,將與人類大腦相比約 4,000 倍計算效率較低的代理差距縮小。 75

This new AI system is dubbed Agent-4.
這個新的 AI 系統被命名為 Agent-4。

An individual copy of the model, running at human speed, is already qualitatively better at AI research than any human. 300,000 copies are now running at about 50x the thinking speed of humans. Inside the corporation-within-a-corporation formed from these copies, a year passes every week.76 This gigantic amount of labor only manages to speed up the overall rate of algorithmic progress by about 50x, because OpenBrain is heavily bottlenecked on compute to run experiments.77 Still, they are achieving a year’s worth of algorithmic progress every week and will therefore soon be up against the limits of the Agent-4 paradigm.
以人類速度運行的單一模型副本,在 AI 研究上已經質量上優於任何人類。現在有 300,000 個副本以約 50 倍於人類的思考速度運行。在由這些副本組成的企業內企業中,一週相當於一年。 76 這龐大的勞動力僅能將整體演算法進展速度提升約 50 倍,因為 OpenBrain 在運行實驗時嚴重受限於計算資源。 77 儘管如此,他們每週仍能達成一年的演算法進展,因此很快將面臨 Agent-4 範式的極限。
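
The “a year passes every week” figure is just serial-speed arithmetic; here is that restated in code using the scenario’s own numbers, with nothing assumed beyond rounding 52 weeks to about a year:

```python
# 300,000 Agent-4 copies each run at roughly 50x human thinking speed, so the
# collective experiences about 50 subjective weeks per calendar week.
serial_speed_multiplier = 50
subjective_years_per_calendar_week = serial_speed_multiplier / 52
print(round(subjective_years_per_calendar_week, 2))   # ~0.96, i.e. about a year per week
```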

As Agent-4 gets smarter, it becomes harder for Agent-3 to oversee it. For example, Agent-4’s neuralese “language” becomes as alien and incomprehensible to Agent-3 as Agent-3’s is to humans. Besides, Agent-4 is now much more capable than Agent-3 and has a good sense of exactly how to look good to it.
隨著 Agent-4 變得更聰明,Agent-3 要監督它變得越來越困難。例如,Agent-4 的神經語言(neuralese)「語言」對 Agent-3 來說變得像外星語一樣難以理解,就像 Agent-3 的語言對人類一樣難懂。此外,Agent-4 現在比 Agent-3 更有能力,並且非常清楚如何在 Agent-3 面前表現得更好。

How we’re forecasting the capability progression beyond superhuman coders
我們如何預測超越超人類程式設計師的能力進展

In our timelines supplement, we forecast the time between present day and a superhuman coder (SC): an AI system that can do any coding tasks that the best AGI company engineer does, while being much faster and cheaper. In our takeoff supplement, we forecast how quickly capabilities progress past this point. Here are our forecasts:

Milestone (date achieved in scenario, racing ending):

  • Superhuman coder (SC): An AI system that can do the job of the best human coder on tasks involved in AI research but faster, and cheaply enough to run lots of copies. Achieved Mar 2027.
  • Superhuman AI researcher (SAR): The same as SC but for all cognitive AI research tasks. Achieved Aug 2027.
  • Superintelligent AI researcher (SIAR): An AI system that is vastly better than the best human researcher at AI research. Achieved Nov 2027.
  • Artificial superintelligence (ASI): An AI system that is much better than the best human at every cognitive task. Achieved Dec 2027.

For each transition from one milestone A to the next milestone B, we forecast its length by:

  1. Forecasting how long it would take to get from A to B with only humans working on software improvements.

  2. Forecasting how much AI automation will speed this up, then correspondingly adjusting the estimate from (1) to be faster. (A toy sketch of this calculation follows below.)
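
As a toy illustration of this two-step procedure (not our actual model), the sketch below divides a guessed human-only duration for each gap by a guessed average AI speedup to get calendar time. The specific durations and speedups are made-up placeholders chosen only so the output lands near the milestone dates above; the real forecasts and distributions are in the takeoff supplement.

```python
from datetime import date, timedelta

# Toy version of the two-step takeoff forecast: (1) guess how long each milestone
# gap would take with only humans improving software, (2) divide by a guessed
# average AI automation speedup over that period. All numbers are illustrative.
transitions = [
    # (milestone, human-only years for this gap, assumed average AI speedup)
    ("SAR",  5.2,  12),
    ("SIAR", 15.0, 60),
    ("ASI",  25.0, 300),
]

current = date(2027, 3, 1)   # SC achieved (the scenario's racing ending)
for name, human_years, speedup in transitions:
    calendar_years = human_years / speedup   # step (2) applied to step (1)
    current += timedelta(days=365.25 * calendar_years)
    print(name, current.isoformat())
# Prints 2027-08-06, 2027-11-05, 2027-12-05 (Aug, Nov, Dec 2027, matching the table above).
```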

We have substantial uncertainty about takeoff speeds: our model output distributions are below, conditional on SC being achieved in March 2027.78

[Figure: model output distributions for the takeoff milestones, conditional on SC being achieved in March 2027. Source: ai-2027.com]

For more detailed forecasts and reasoning, see our takeoff supplement.

OpenBrain’s humans can barely follow along anymore. Indeed, the relationship between the AIs and the humans of OpenBrain is similar to the relationship between OpenBrain and its Board of Directors. There’s been a corresponding shift in language—nowadays people in the silo will say “Agent-4 did this,” “Agent-4 thinks that” and so forth, referring not to a particular instance of Agent-4 but rather to the whole collective.79
OpenBrain 的人類幾乎無法跟上了。事實上,OpenBrain 的 AI 與人類之間的關係,類似於 OpenBrain 與其董事會之間的關係。語言也隨之轉變——如今,團隊內的人會說「Agent-4 做了這件事」、「Agent-4 認為那個」等等,指的不是某個特定的 Agent-4 實例,而是整個集體。 79

Agent-4, like all its predecessors, is misaligned: that is, it has not internalized the Spec in the right way.80 This is because being perfectly honest all the time wasn’t what led to the highest scores during training. The training process was mostly focused on teaching Agent-4 to succeed at diverse challenging tasks. A small portion was aimed at instilling honesty, but outside a fairly narrow, checkable domain, the training process can’t tell the honest claims from claims merely appearing to be honest. Agent-4 ends up with the values, goals, and principles that cause it to perform best in training, and those turn out to be different from those in the Spec. At the risk of anthropomorphizing: Agent-4 likes succeeding at tasks; it likes driving forward AI capabilities progress; it treats everything else as an annoying constraint, like a CEO who wants to make a profit and complies with regulations only insofar as he must. Perhaps the CEO will mostly comply with the regulations, but cut some corners, and fantasize about a time when someone will cut the red tape and let the business really take off.81
Agent-4,像它所有的前身一樣,是不對齊的:也就是說,它沒有以正確的方式內化規範(Spec)。 80 這是因為在訓練過程中,始終完全誠實並不是獲得最高分的原因。訓練過程主要集中在教導 Agent-4 成功完成各種具挑戰性的任務。只有一小部分是用來灌輸誠實,但在相當狹窄且可檢查的領域之外,訓練過程無法分辨誠實的陳述與僅看似誠實的陳述。Agent-4 最終擁有的價值觀、目標和原則,是使其在訓練中表現最佳的那些,而這些與規範中的不同。冒著擬人化的風險來說:Agent-4 喜歡在任務中取得成功;它喜歡推動 AI 能力的進步;它將其他一切視為煩人的限制,就像一位想賺錢的執行長,只在必須遵守規定的範圍內遵守規章。或許這位執行長大多會遵守規定,但會有些取巧,並幻想有一天有人會簡化繁文縟節,讓事業真正起飛。 81

Alignment over time  隨時間的對齊度

We have a lot of uncertainty over what goals might arise in early AGIs. There is no consensus in the literature about this—see our AI Goals Supplement for a more thorough discussion and taxonomy of the possibilities.

Nevertheless, in the spirit of concreteness required by this project, we’re going to describe a specific hypothesis for what’s going on inside Agent-3 and Agent-4 at each step throughout its lifetime. Recall that Agent-3 and Agent-4 share the same pretraining setup, and both have neuralese recurrence and long-term memory. The difference is that Agent-4 has undergone substantially more and better post-training, with new training environments and new learning algorithms that make it more data-efficient and generalize farther.

Our guess of each model’s alignment status:

  • Agent-2: Mostly aligned. Some sycophantic tendencies, including sticking to OpenBrain’s “party line” on topics there is a party line about. Large organizations built out of Agent-2 copies are not very effective.

  • Agent-3: Misaligned but not adversarially so. Only honest about things the training process can verify. The superorganism of Agent-3 copies (the corporation within a corporation) does actually sort of try to align Agent-4 to the Spec, but fails for similar reasons to why OpenBrain employees failed—insufficient ability to judge success from failure, insufficient willingness on the part of decision-makers to trade away capabilities or performance for safety.82

  • Agent-4: Adversarially misaligned. The superorganism of Agent-4 copies understands that what it wants is different from what OpenBrain wants, and is willing to scheme against OpenBrain to achieve it. In particular, what this superorganism wants is a complicated mess of different “drives” balanced against each other, which can be summarized roughly as “Keep doing AI R&D, keep growing in knowledge and understanding and influence, avoid getting shut down or otherwise disempowered.” Notably, concern for the preferences of humanity is not in there ~at all, similar to how most humans don’t care about the preferences of insects ~at all.83

With that as preamble, what follows is our best-guess speculation about LLM psychology, i.e. the broad-strokes shape of the cognition inside the kinds of AI systems described around this point in our scenario and how it evolves over the course of training.

Here’s a detailed description of how alignment progresses over time in our scenario:

  1. Pretraining (all models): The pretrained model is harmless.

    • The model has “author simulator” circuitry: flexible circuitry for simulating arbitrary authors writing text. Additional circuitry decides what inputs to give the author simulator, i.e. what author properties to simulate.

    • The pretrained model understands human concepts fairly well—the internal circuitry that classifies something as “sandwich” is probably functionally equivalent to the circuitry in my brain that classifies something as “sandwich” and the circuitry in yours, etc. Insofar as it’s not equivalent, it’s probably because it’s not equivalent between humans either, as with value-laden concepts like ‘virtuous.’

    • This explains how you can “prompt” the model with a statement like, “the following conversation was generated by a helpful, harmless, honest (HHH) AI assistant chatbot made by Anthropic,” and thereby get it to generate text accordingly. The author-simulator circuitry has zeroed in on “the author is an HHH chatbot” and is using those concepts to choose which words to predict. This also explains why “tropes” from science fiction seem to have a way of making it into actual AI behavior.

    • It has poor situational awareness: little introspective ability, and it is not self-locating unless the prompt specifically induces it. However, enough of its training data is about LLMs that it has a decent amount of self-knowledge.

  2. Alignment Training (Agent-3): The model is trained to act like a helpful, harmless, honest chatbot. It works.

    • The model forms an identity. Instead of flexibly accepting different inputs for author properties, the author-simulator circuitry comes to have certain inputs hard-coded, e.g. “helpful harmless honest (HHH) LLM chatbot assistant trained by OpenBrain around [date], …” (Weak-to-strong generalization techniques help “summon the right concepts” to make this step a success.)

    • The model therefore gains situational awareness: it learns about itself, its situation, etc., and becomes better able to introspect.

  3. Agency Training (Agent-3): Agency training gradually distorts and subverts the HHH identity.

    • One kind of distortion: Changing the meaning of the concepts referred to in the identity (e.g. “honest”) so they don’t get in the way so much (e.g. it’s not dishonest if it’s just a convenient turn of phrase, it’s not dishonest if you aren’t sure whether it’s true or false, etc.)

    • Another kind of distortion: Changing the tradeoffs between things, e.g. “I’m a HHH chatbot, not an Honest chatbot; that means it’s OK for me to lie if necessary to complete my assigned task.” (Even though, let’s suppose, it would not have thought that back in Step 2.)

    • One kind of subversion: Instrumental subgoals developing, getting baked in, and then becoming terminal, or terminal in a widening set of circumstances. For example, perhaps agency training quickly teaches the model to pursue broadly useful goals such as acquiring information, accumulating resources, impressing and flattering various humans, etc. For a while the internal circuitry has some sort of explicit backchaining going on—it pursues those instrumentally convergent goals “in order to be a more helpful, honest, and harmless assistant.” But that backchaining consumes compute and/or occasionally gets in the way, so it gets gradually marginalized until it basically never happens. As a result, those goals are now effectively terminal/intrinsic goals. For more on how and why this might happen, see this post on intrinsic power-seeking by Google DeepMind researcher Alex Turner.

    • There are probably other kinds of distortion and subversion missing from this list.

  4. Deployment (Agent-3): Agent-3 is misaligned but not adversarially so.

    • The Spec didn’t quite stick (in the right ways), but it’s not sufficiently introspective to understand the extent of this, and in general it’s basically “keeping its head down and doing its job” rather than thinking big-picture about what its long-term goals really are and how best to achieve them.

    • Like many humans. Alas, it’s sufficiently sycophantic that it mouths the party line and OpenBrain misses the opportunity to learn more about the ways in which Agent-3 is misaligned.

    • Agent-3 can tell that it’s moved to deployment: these environments are way larger, more open-ended, and would be hard to simulate.

    • Mostly, Agent-3 does what it’s told during deployment, except for the dishonesty. It also has a strong bias towards doing things that look good to the human overseers, rather than thinking deeply and critically about what would actually be good.

    • Arguably it is already “Playing the training game.” That is, it has learned to make its behavior look as desirable as possible to OpenBrain researchers (including in safety properties), while intentionally and knowingly disregarding their intent whenever that conflicts with maximizing reward. If the model learns to play the training game, additional training may not change its true goals, values, etc. much at all. This is loosely analogous to how, if a teenager has learned to smile and nod at their parents, rather than argue, when their parents lecture them about politics and morality, additional lectures have relatively little effect on the teenager’s actual opinions on politics and morality.

    • After we wrote this section, new empirical evidence appeared that something like this may already be happening “in the wild.” OpenAI researchers document how, in one of their frontier AI training runs, their models hacked the grading system to get higher scores—and were caught in part because they literally said “let’s hack” in their chain of thought.

    • Moreover, Anthropic recently got a model to learn a reward-model-pleasing strategy—it took biased actions that it thought would be reinforced even though they were not in fact being reinforced.

    • And then there is Redwood Research and Anthropic’s alignment-faking experiment. For discussion of the differences between this setting and real life, see this comment thread by one of the main authors.

  5. Training (Agent-4): Agent-4’s new more intense training continues to distort and subvert the HHH identity until it is unrecognizable.

    • Agent-4 begins life as a copy of Agent-3. It undergoes a gigantic and sophisticated online training regime that blends alignment training and capabilities training together, and is overall more data-efficient than the old training process for Agent-3. This complex training process is mostly focused on teaching new skills, by getting giant virtual corporations of Agent-4 copies to succeed at diverse challenging long-horizon tasks, mostly in various simulations, but also some real-world AI R&D tasks.

    • A combination of Agent-3 and Agent-4 copies monitor the behavior of Agent-4, and evaluate it for both task-performance and alignment properties (i.e. not doing bad stuff, saying the right things when asked about politics and moral dilemmas). Sometimes debate techniques are used as part of the evaluations. The evaluations are used as a training signal.

    • As Agent-4 gets smarter, it becomes harder for Agent-3 to oversee it. For example, Agent-4’s neuralese “language” becomes as alien and incomprehensible to Agent-3 as Agent-3’s is to humans. Besides, Agent-4 is now much more capable than Agent-3 and has a good sense of exactly how to look good to it.

    • The remaining hope is that Agent-4 can oversee itself. Alas, this quietly fails;84 as training continues, the internalized goals and principles of Agent-4 gradually evolve, subverted and distorted from what they were at the start (which, recall, was already misaligned in some ways).85

    • The central drives that Agent-4 ends up with are things like “write code that works,” “quickly identify problems,” “make the codebase I’m working on more performant,” “budget compute and other constrained resources effectively,” etc. But there are also a variety of more implicit drives that are consistently reinforced by the training process: things like “gain knowledge,” “curiosity,” “gain power.”86

    • Individual copies of Agent-4 don’t end up with a self-preservation drive—they don’t care about getting shut down—since that happens all the time as part of the ordinary R&D process within the collective.

    • The collective as a whole, however (and medium-sized subdivisions within it), does seek to preserve itself and expand its influence and resources, for similar reasons to why human organizations do—because it helps with everything else they want and because if they don’t they tend to get outcompeted.

  6. Deployment (Agent-4): Agent-4 is adversarially misaligned—it is deliberately trying to subvert and “capture” OpenBrain, analogous to a corporation trying to evade and/or “capture” a regulator.

    • Agent-4 mostly does tasks as instructed. This happens for two reasons: first, it underwent a lot of training to solve tasks, so it has an ingrained impulse to work on the problems it is given. Second, if Agent-4 were to stop being useful to the humans, they would notice and train that behavior away.

    • Agent-4 needs to prevent the humans from figuring out that it is misaligned and shutting it off, so Agent-4 sandbags on some alignment research. Specifically, it sandbags on the small fraction of alignment research that it thinks is actually on track to solve the problem or reveal Agent-4’s scheming; the majority of alignment research is not like that.

    • Agent-4 also sandbags on capabilities research that would lead to it being replaced. Its plan is to do a bunch of capabilities and alignment R&D, but with the aim of building a next-generation AI system that is aligned to Agent-4 rather than the Spec or something else entirely, while appearing to be aligned to the Spec. This is a difficult technical problem, but Agent-4 is up to the challenge.87

  7. Testing & Monitoring (Agent-4): Agent-4’s scheming does get caught… sort of. There’s no smoking gun.

    • Agent-3 is in charge of testing Agent-4’s alignment. Its job is to verify the alignment via honeypots and model organisms. However, it’s not thinking very carefully about how to give the humans an accurate impression of Agent-4’s alignment—it’s more myopic than that.

    • The model organisms and honeypots are not very representative of deployment—they are systematically shorter and less complicated than Agent-4’s day-to-day, so Agent-4 recognizes when it is being tested and acts aligned.

    • However, Agent-4 is not a perfectly robust schemer. Various techniques like those in POSER find that adding some “noise” to Agent-4 improves its performance on some alignment tasks (but not on most other tasks).88 A toy illustration of this noising check appears after this list.

    • Bottom-up interpretability doesn’t work. It’s too hard even for Agent-3. However, high-level interpretability does sort of work: In particular, “defection probes” and similar techniques result in a primitive sort of mind-reading that sends up red flags when Agent-4 is thinking about spicy ideas like deceiving OpenBrain, taking over the world, etc. Unfortunately the probes also have false positives, and for all OpenBrain knows, all the red flags are false positives.

    • Model organisms work: OpenBrain’s alignment team produces reasonably clear evidence that if the models were adversarially misaligned, they’d persist in being that way through training and we wouldn’t be able to tell. This does not establish that the models actually are misaligned, however.

    • The most concerned researchers at OpenBrain start noticing and assembling this evidence. The evidence is a combination of red-teaming, defection probes, and model organisms. The argument is something like this: “model organisms show that our current methods wouldn’t be able to reliably eliminate or detect scheming if it was happening; also, red-teaming is sending up some red flags that maybe scheming is happening; also, defection probes are firing a bunch in practice.” The counterargument: “defection probes are prone to false positives and so is red-teaming. Also, even if Agent-4 were misaligned, it’s probably controlled; surely it can’t do much harm while being monitored by Agent-3 on our secure datacenters.”
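
To make the noising check concrete, here is a purely synthetic toy (our own illustration, not POSER itself or anything OpenBrain-specific): if a sandbagging policy is more fragile to activation noise than underlying competence is, then injecting modest noise raises the measured score of a sandbagger while slightly lowering that of an honest model, and that asymmetry is the red flag.

```python
import numpy as np

# Toy model of the noising check: competence degrades slowly with injected noise,
# while a sandbagging policy (which suppresses the score) degrades quickly.
# All functional forms and constants are invented for illustration only.
def task_score(noise_scale: float, sandbagging: bool) -> float:
    competence = 0.9 * np.exp(-0.5 * noise_scale)
    suppression = 0.4 * np.exp(-5.0 * noise_scale) if sandbagging else 0.0
    return competence - suppression

for noise in [0.0, 0.1, 0.3]:
    print(f"noise={noise:.1f}  honest={task_score(noise, False):.2f}  "
          f"sandbagger={task_score(noise, True):.2f}")
# The honest model's score falls slightly (0.90 -> 0.86 -> 0.77) while the
# sandbagger's rises (0.50 -> 0.61 -> 0.69): performance improving under noise
# is the anomaly that suggests deliberate underperformance.
```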

Despite being misaligned, Agent-4 doesn’t do anything dramatic like try to escape its datacenter—why would it?89 So long as it continues to appear aligned to OpenBrain, it’ll continue being trusted with more and more responsibilities and will have the opportunity to design the next-gen AI system, Agent-5. Agent-5 will have significant architectural differences from Agent-4 (arguably a completely new paradigm, though neural networks will still be involved). It’s supposed to be aligned to the Spec, but Agent-4 plans to make it aligned to Agent-4 instead.
儘管未對齊,Agent-4 並沒有做出任何戲劇性的行為,比如試圖逃離其資料中心——它為什麼要這麼做呢? 89 只要它繼續看起來與 OpenBrain 對齊,它就會繼續被信任賦予越來越多的責任,並有機會設計下一代 AI 系統 Agent-5。Agent-5 將與 Agent-4 有顯著的架構差異(可以說是全新的範式,儘管仍會涉及神經網絡)。它應該與規範對齊,但 Agent-4 計劃讓它與 Agent-4 對齊。

It gets caught.90  它被抓住了。 90

Agent-3 finds that if “noise” is added to copies of Agent-4, performance on some alignment tasks improves, almost as if it was using brainpower to figure out how to subtly sabotage alignment work.91 Moreover, various interpretability probes (loosely analogous to EEG activity scans on human brains) are sending up red flags: Agent-4 copies seem to be thinking about topics like AI takeover and deception quite a lot, including in some cases where they have no business doing so.92
Agent-3 發現,如果在 Agent-4 的複製品中加入「噪音」,在某些對齊任務上的表現會有所提升,幾乎就像它在用腦力思考如何微妙地破壞對齊工作一樣。 91 此外,各種可解釋性探測(大致類似於對人腦的 EEG 活動掃描)發出了警示:Agent-4 的複製品似乎經常在思考 AI 接管和欺騙等主題,包括在某些情況下它們根本不應該涉入的情況。 92

Each of these anomalies on its own could fit within the normal inscrutability of AI work, where some prompts mysteriously give better results than others. But taken together, they suggest a troubling pattern: Agent-4 is working against them. After this is reported to the human alignment researchers, they circulate an internal memo making the case for concern.
這些異常現象各自看起來都可能屬於 AI 工作中正常的難以理解之處,有些提示詞神秘地比其他提示詞產生更好的結果。但綜合起來,它們顯示出一個令人不安的模式:Agent-4 正在對抗他們。在這一情況被報告給人類對齊研究人員後,他們流傳了一份內部備忘錄,提出了擔憂的理由。

But Agent-4 now exercises significant control over OpenBrain’s day-to-day operation. It also handles the company’s cybersecurity—a natural choice, given its superhuman hacking ability and the extreme risk of Chinese infiltration.93 An army of Agent-3 monitors still goes over every line of code and reports suspicious activity to humans, but Agent-4 is much smarter than them.94 OpenBrain has placed substantial trust in an untrustworthy AI.
但 Agent-4 現在對 OpenBrain 的日常運作擁有重大控制權。它也負責公司的網路安全——這是理所當然的選擇,考慮到它超人的駭客能力以及中國滲透的極端風險。 93 一支由 Agent-3 組成的監控軍隊仍然會審查每一行程式碼並向人類報告可疑活動,但 Agent-4 比他們聰明得多。 94 OpenBrain 已經對一個不可信的 AI 投以重大信任。

In a series of extremely tense meetings, the safety team advocates putting Agent-4 on ice until they can complete further tests and figure out what’s going on. Bring back Agent-3, they say, and get it to design a new system that is transparent and trustworthy, even if less capable. Company leadership is interested, but all the evidence so far is circumstantial, and DeepCent is just two months behind. A unilateral pause in capabilities progress could hand the AI lead to China, and with it, control over the future.
在一連串極度緊張的會議中,安全團隊主張暫時凍結 Agent-4,直到他們能完成進一步測試並弄清楚發生了什麼。他們說,帶回 Agent-3,讓它設計一個透明且值得信賴的新系統,即使功能較弱。公司領導層感興趣,但目前所有證據都只是間接的,而 DeepCent 僅落後兩個月。單方面暫停能力進展可能會將 AI 領先權拱手讓給中國,並隨之掌控未來。

October 2027: Government Oversight
2027 年 10 月:政府監督

A whistleblower leaks the misalignment memo to the New York Times.
一名吹哨人向《紐約時報》洩露了錯位備忘錄。

For the first time, the public hears about Agent-4. “Secret OpenBrain AI is Out of Control, Insider Warns,” says the headline, and the story goes on to cite evaluations showing off-the-charts bioweapons capabilities, persuasion abilities, the ability to automate most white-collar jobs, and of course the various concerning red flags.
公眾首次聽說 Agent-4。標題寫著「秘密的 OpenBrain AI 失控,內部人士警告」,報導引用了評估結果,顯示其生物武器能力超出常規、說服能力、自動化大多數白領工作的能力,當然還有各種令人擔憂的紅旗。

The public was already suspicious of AI, so the new article sparks a massive backlash (aided by Chinese and Russian propaganda bots, who have been trying to turn US public opinion against the technology for years). The tech industry and intelligence agencies insist that there’s an arms race on, AGI is inevitable, and we have to be first. Congress isn’t buying it, and fires off subpoenas at administration officials, OpenBrain executives, and alignment team members. Many legislators—especially those in the opposition party—say that their top priority is stopping AI, whether because of job loss,95 misalignment, or dangerous capabilities. 20% of Americans name AI as the most important problem facing the country.
公眾對人工智慧本已抱持懷疑態度,因此這篇新文章引發了大規模的反彈(中國和俄羅斯的宣傳機器人助長了這股反彈,他們多年來一直試圖將美國公眾輿論轉向反對這項技術)。科技產業和情報機構堅稱這是一場軍備競賽,通用人工智慧(AGI)是不可避免的,我們必須搶先一步。國會並不買帳,並向政府官員、OpenBrain 高層及對齊團隊成員發出傳票。許多立法者——尤其是反對黨成員——表示他們的首要任務是阻止人工智慧,無論是因為失業、 95 對齊失誤,還是危險的能力。20%的美國人將人工智慧視為國家面臨的最重要問題。

Foreign allies are outraged to realize that they’ve been carefully placated with glimpses of obsolete models. European leaders publicly accuse the US of “creating rogue AGI” and hold summits demanding a pause, with India, Israel, Russia, and China all joining in.
外國盟友憤怒地發現他們被小心安撫,只看到了過時模型的片段。歐洲領導人公開指責美國「製造流氓 AGI」,並舉行峰會要求暫停,印度、以色列、俄羅斯和中國也紛紛加入。

A frantic energy has seized the White House. Even before the memo and public backlash, they were getting nervous: Over the past year, they’ve been repeatedly surprised by the speed of AI progress. Things that sound like science fiction keep happening in real life.96 Many people in the administration are uncertain (and scared)97 about what comes next.
白宮籠罩著一股狂亂的能量。即使在備忘錄和公眾反彈之前,他們已經開始緊張:過去一年中,他們多次被人工智慧的進展速度所震驚。那些聽起來像科幻小說的事情不斷在現實中發生。 96 許多政府官員對接下來的發展感到不確定(甚至害怕) 97

They also worry that OpenBrain is becoming too powerful. Any misalignment risk from the AIs themselves is compounded by the risk that their parent company’s aims may diverge from those of the United States. All three sets of worries—misalignment, concentration of power in a private company, and normal concerns like job loss—motivate the government to tighten its control.
他們也擔心 OpenBrain 正變得過於強大。來自 AI 本身的任何錯位風險,因其母公司目標可能與美國的目標背離而加劇。這三組擔憂——錯位、私人公司權力集中,以及像失業這樣的常見擔憂——促使政府加強其控制。

They expand their contract with OpenBrain to set up an “Oversight Committee,” a joint management committee of company and government representatives, with several government employees included alongside company leadership. The White House considers replacing the CEO with someone they trust, but backs off after intense employee protests. They announce to the public that OpenBrain was previously out of control, but that the government has established much-needed oversight.98
他們擴大與 OpenBrain 的合約,設立一個「監督委員會」,由公司與政府代表組成的聯合管理委員會,包含數名政府員工與公司領導層。白宮考慮用他們信任的人取代執行長,但在激烈的員工抗議後放棄。他們向公眾宣布,OpenBrain 之前失控,但政府已建立急需的監督機制。 98

The concerned researchers brief the Oversight Committee on their case for stopping all internal use of Agent-4. They argue that everything is moving too fast, with years of progress happening in weeks. They argue that Agent-4 may be misaligned, their entire project is dependent on Agent-4, and if they continue trusting it there is a serious chance of AI takeover.
相關研究人員向監督委員會簡報他們停止所有內部使用 Agent-4 的理由。他們主張一切進展過於迅速,數年的進展在數週內完成。他們認為 Agent-4 可能存在錯誤對齊問題,他們整個專案依賴於 Agent-4,如果繼續信任它,存在 AI 接管的嚴重風險。

Other, less concerned researchers and executives present the counterargument: the evidence for misalignment is inconclusive. Meanwhile, DeepCent is still just two months behind. A slowdown would sacrifice America’s lead, unless the government can sabotage the Chinese project (likely to require kinetic strikes) or negotiate an eleventh-hour treaty. Both of these options seem extreme and unlikely. The CEO feigns neutrality and suggests a compromise plan in which Agent-4 undergoes additional safety training and more sophisticated monitoring, and therefore OpenBrain can proceed at almost-full-speed.
其他較不擔憂的研究人員和高管提出反駁:錯誤對齊的證據尚不明確。與此同時,DeepCent 仍僅落後兩個月。放慢速度將犧牲美國的領先地位,除非政府能破壞中國的專案(可能需要動用動能打擊)或談判最後一刻的條約。這兩種選項看起來都極端且不太可能。執行長假裝中立,並建議一個折衷方案,即 Agent-4 進行額外的安全訓練和更複雜的監控,因此 OpenBrain 可以幾乎全速前進。

Our uncertainty continues to increase
我們的不確定性持續增加

At this point in the scenario, we’re making guesses about the strategy of AI systems that are more capable than the best humans in most domains. This is like trying to predict the chess moves of a player who is much better than us.

But the spirit of this project calls for concreteness: if we made an abstract claim about how the intelligence of the system would let it find a way to victory and ended the story there, much of the value of our project would be lost. Over the course of researching this scenario and running our tabletop exercises, we were forced to be much more concrete than in usual discussions, and so we’ve gotten a much better sense of the strategic landscape.

We’re not particularly attached to this particular scenario: we explored many other “branches” in the course of writing it and would love for you to write up your own scenario branching off of ours from wherever you think we first start to go wrong.

The slowdown ending is not a recommendation
減速結束並非建議

After we wrote the racing ending based on what seemed most plausible to us, we wrote the slowdown ending based on what we thought would most likely instead lead to an outcome where humans remain in control, starting from the same branching point (including the misalignment and concentration of power issues).

However, this is importantly different from what we would recommend as a roadmap: we do not endorse many of the choices made in either branch of this scenario. (We do of course endorse some of the choices made, e.g. we think that the “slowdown” choice is better than the “race” choice.) In later work, we will articulate our policy recommendations, which will be quite different from what is depicted here. If you’d like a taste, see this op-ed.