Move Over, Mathematicians, Here Comes AlphaProof
A.I. is getting good at math — and might soon make a worthy collaborator for humans.
At the headquarters of Google DeepMind, an artificial intelligence laboratory in London, researchers have a longstanding ritual for announcing momentous results: They bang a big ceremonial gong.
In 2016, the gong sounded for AlphaGo, an A.I. system that excelled at the game Go. In 2017, the gong reverberated when AlphaZero conquered chess. On each occasion the algorithm had beaten human world champions.
Last week the DeepMind researchers got out the gong again to celebrate what Alex Davies, a lead of Google DeepMind’s mathematics initiative, described as a “massive breakthrough” in mathematical reasoning by an A.I. system. A pair of Google DeepMind models tried their luck with the problem set in the 2024 International Mathematical Olympiad, or I.M.O., held from July 11 to July 22 about 100 miles west of London at the University of Bath. The event is said to be the premier math competition for the world’s “brightest mathletes,” according to a promotional post on social media.
The human problem-solvers — 609 high school students from 108 countries — won 58 gold medals, 123 silver and 145 bronze. The A.I. performed at the level of a silver medalist, solving four out of six problems for a total of 28 points. It was the first time that an A.I. had achieved a medal-worthy performance on an Olympiad's problems.
“It’s not perfect, we didn’t solve everything,” Pushmeet Kohli, Google DeepMind’s vice president of research, said in an interview. “We want to be perfect.”
Nonetheless, Dr. Kohli described the result as a “phase transition” — a transformative change — “in the use of A.I. in mathematics and the ability of A.I. systems to do mathematics.”
The lab asked two independent experts to adjudicate the A.I.'s performance: Timothy Gowers, a mathematician at the University of Cambridge in England and a Fields medalist, who has been interested in the math-A.I. interplay for 25 years; and Joseph Myers, a software developer in Cambridge. Both won I.M.O. gold in their day. Dr. Myers was chair of this year's problem selection committee and at previous Olympiads served as a coordinator, judging human solutions. "I endeavored to assess the A.I. attempts consistently with how human attempts were judged this year," he said.
Dr. Gowers added in an email: “I was definitely impressed.” The lab had discussed its Olympiad ambitions with him a couple of weeks beforehand, so “my expectations were quite high,” he said. “But the program met them, and in one or two instances significantly surpassed them.” The program found the “magic keys” that unlocked the problems, he said.
Hitting the gong
After months of rigorous training, the students sat for two exams, three problems per day — in algebra, combinatorics, geometry and number theory.
The A.I. counterpart beavered away roughly in tandem at the lab in London. (The students were not aware that Google DeepMind was competing, in part because the researchers did not want to steal the spotlight.) Researchers moved the gong into the room where they had gathered to watch the system work. “Every time the system solved a problem, we hit the gong to celebrate,” David Silver, a research scientist, said.
Haojia Shi, a student from China, ranked No. 1 and was the only competitor to earn a perfect score — 42 points for six problems; each problem is worth seven points for a full solution. The U.S. team won first place with 192 points; China placed second with 190.
The Google system earned its 28 points for fully solving four problems — two in algebra, one in geometry and one in number theory. (It flopped at two combinatorics problems.) The system was allowed unlimited time; for some problems it took up to three days. The students were allotted only 4.5 hours per exam.
For the Google DeepMind team, speed is secondary to overall success, as it “is really just a matter of how much compute power you’re prepared to put into these things,” Dr. Silver said.
“The fact that we’ve reached this threshold, where it’s even possible to tackle these problems at all, is what represents a step-change in the history of mathematics,” he added. “And hopefully it’s not just a step-change in the I.M.O., but also represents the point at which we went from computers only being able to prove very, very simple things toward computers being able to prove things that humans can’t.”
Algorithmic ingredients
Applying A.I. to mathematics has been part of DeepMind’s mission for several years, often in collaboration with world-class research mathematicians.
“Mathematics requires this interesting combination of abstract, precise and creative reasoning,” Dr. Davies said. In part, he noted, this repertoire of abilities is what makes math a good litmus test for the ultimate goal: reaching so-called artificial general intelligence, or A.G.I., a system with capabilities ranging from emerging to competent to virtuoso to superhuman. Companies such as OpenAI, Meta AI and xAI are tracking similar goals.
Olympiad math problems have come to be considered a benchmark.
In January, a Google DeepMind system named AlphaGeometry solved a sampling of Olympiad geometry problems at nearly the level of a human gold medalist. “AlphaGeometry 2 has now surpassed the gold medalists in solving I.M.O. problems,” Thang Luong, the principal investigator, said in an email.
Riding that momentum, Google DeepMind intensified its multidisciplinary Olympiad effort, with two teams: one led by Thomas Hubert, a research engineer in London, and another led by Dr. Luong and Quoc Le in Mountain View, each with some 20 researchers. For his “superhuman reasoning team,” Dr. Luong said he recruited a dozen I.M.O. medalists — “by far the highest concentration of I.M.O. medalists at Google!”
The lab’s strike at this year’s Olympiad deployed the improved version of AlphaGeometry. Not surprisingly, the model fared rather well on the geometry problem, polishing it off in 19 seconds.
Dr. Hubert’s team developed a new model that is comparable but more generalized. Named AlphaProof, it is designed to engage with a broad range of mathematical subjects. All told, AlphaGeometry and AlphaProof made use of a number of different A.I. technologies.
One approach was an informal reasoning system, expressed in natural language. This system leveraged Gemini, Google’s large language model. It used the English corpus of published problems and proofs and the like as training data.
The informal system excels at identifying patterns and suggesting what comes next; it is creative and talks about ideas in an understandable way. Of course, large language models are inclined to make things up — which may (or may not) fly for poetry, but definitely not for math. But in this context, the L.L.M. seems to have displayed restraint; it wasn't immune to hallucination, but the frequency was reduced.
Another approach was a formal reasoning system, based on logic and expressed in code. It used theorem prover and proof-assistant software called Lean, which guarantees that if the system says a proof is correct, then it is indeed correct. “We can exactly check that the proof is correct or not,” Dr. Hubert said. “Every step is guaranteed to be logically sound.”
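As a minimal illustration of what "guaranteed logically sound" means in practice (a toy standard-library example, not output from AlphaProof), here is a machine-checked statement in Lean 4; once the compiler accepts it, every step of the proof has been mechanically verified:

```lean
-- A toy, self-contained example: if Lean accepts this file,
-- the proof is certified correct -- there is nothing left to trust.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Olympiad proofs are vastly longer than this one-liner, but the guarantee is the same: the proof assistant either certifies every step or rejects the attempt.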
Another crucial component was a reinforcement learning algorithm in the AlphaGo and AlphaZero lineage. This type of A.I. learns by itself and can scale indefinitely, said Dr. Silver, who is Google DeepMind’s vice-president of reinforcement learning. Since the algorithm doesn’t require a human teacher, it can “learn and keep learning and keep learning until ultimately it can solve the hardest problems that humans can solve,” he said. “And then maybe even one day go beyond those.”
Dr. Hubert added, “The system can rediscover knowledge for itself.” That’s what happened with AlphaZero: It started with zero knowledge, Dr. Hubert said, “and by just playing games, and seeing who wins and who loses, it could rediscover all the knowledge of chess. It took us less than a day to rediscover all the knowledge of chess, and about a week to rediscover all the knowledge of Go. So we thought, Let’s apply this to mathematics.”
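The core idea — learning from a success/failure signal alone, with no human teacher — can be sketched in a few lines. This is my own toy illustration, not DeepMind's algorithm: an epsilon-greedy bandit learner that discovers which hypothetical "tactic" succeeds most often, loosely analogous to a prover learning from whether Lean accepts its proof attempts.

```python
import random

# Toy sketch (not DeepMind's system): a learner that improves purely from
# a binary reward -- "did the proof check out?" -- with no labeled examples.

def train(success_rates, steps=5000, eps=0.1, seed=0):
    """Epsilon-greedy bandit: estimate the value of each tactic from rewards."""
    rng = random.Random(seed)
    counts = [0] * len(success_rates)
    values = [0.0] * len(success_rates)
    for _ in range(steps):
        # Mostly exploit the best-known tactic; occasionally explore others.
        if rng.random() < eps:
            a = rng.randrange(len(success_rates))
        else:
            a = max(range(len(success_rates)), key=lambda i: values[i])
        # Reward 1.0 when the attempt succeeds (e.g. the checker accepts it).
        reward = 1.0 if rng.random() < success_rates[a] else 0.0
        counts[a] += 1
        values[a] += (reward - values[a]) / counts[a]  # incremental mean
    return values

# With hypothetical per-tactic success rates, the learner's value estimates
# converge toward the true rates, so the best tactic rises to the top.
vals = train([0.1, 0.3, 0.8])
```

The point of the sketch is the feedback loop: because the reward comes from an automatic check rather than a human, the loop can run as long as compute allows — the property Dr. Silver describes as scaling "indefinitely."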
Dr. Gowers doesn’t worry — too much — about the long-term consequences. “It is possible to imagine a state of affairs where mathematicians are basically left with nothing to do,” he said. “That would be the case if computers became better, and far faster, at everything that mathematicians currently do.”
“There still seems to be quite a long way to go before computers will be able to do research-level mathematics,” he added. “It’s a fairly safe bet that if Google DeepMind can solve at least some hard I.M.O. problems, then a useful research tool can’t be all that far away.”
A really adept tool might make mathematics accessible to more people, speed up the research process, nudge mathematicians outside the box. Eventually it might even pose novel ideas that resonate.