On Thursday, Google DeepMind announced that its AI systems, AlphaProof and AlphaGeometry 2, solved four of the six problems from this year’s International Mathematical Olympiad (IMO).
Google DeepMind’s AlphaProof uses reinforcement learning to prove mathematical statements in Lean, a formal proof language. The system trains itself by generating and verifying millions of proofs, progressively taking on harder problems. AlphaGeometry 2, an upgraded version of DeepMind’s earlier geometry-solving AI, now uses a Gemini-based language model trained on a larger dataset.
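For readers unfamiliar with Lean, the toy example below gives a rough sense of what a formal statement and a machine-checkable proof look like. It is an illustrative sketch only, not DeepMind code: the theorem name is made up, and it leans on the community Mathlib library rather than anything AlphaProof produced.

```lean
-- Illustrative sketch (not DeepMind code): a formal statement in Lean 4,
-- using the Mathlib library, plus a proof the Lean kernel checks mechanically.
import Mathlib

-- "There are infinitely many primes": for every n there is a prime p ≥ n.
theorem primes_unbounded (n : ℕ) : ∃ p, n ≤ p ∧ Nat.Prime p :=
  Nat.exists_infinite_primes n  -- reuse Mathlib's existing proof of this fact

-- A candidate proof that does not actually establish the statement fails to
-- compile, which is what lets a search-based system verify millions of
-- generated proofs automatically.
```

That last point is the key property AlphaProof exploits: because Lean rejects invalid proofs outright, the system can generate candidate proofs at scale and trust the verifier to separate the correct ones from the rest.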
Prominent mathematicians Sir Timothy Gowers and Dr. Joseph Myers evaluated the AI’s solutions using official IMO rules. The combined system scored 28 out of 42 possible points, narrowly missing the 29-point gold medal threshold. Notably, the AI achieved a perfect score on the competition’s hardest problem, a feat matched by only five human contestants this year.
The IMO, established in 1959, challenges elite pre-college mathematicians with problems in algebra, combinatorics, geometry, and number theory. Success on IMO problems is a recognized benchmark for evaluating an AI system’s mathematical reasoning capabilities. Google reports that AlphaProof solved two algebra problems and one number theory problem, while AlphaGeometry 2 handled the geometry problem. The AI was unable to solve the two combinatorics problems, and for the four problems it did solve, solution times ranged from minutes to as long as three days.
Before the AI engaged with the problems, Google translated them into formal mathematical language for it to process. This step differs from the official IMO, where human contestants work directly with the problem statements during two 4.5-hour sessions. Google highlights that AlphaGeometry 2 could solve 83% of historical IMO geometry problems from the past 25 years, a significant improvement over its predecessor’s 53% success rate. AlphaGeometry 2 solved this year’s geometry problem in just 19 seconds after receiving the formalized version.
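To give a rough sense of what that translation step involves, here is a hedged sketch of an English claim rendered as a precise Lean statement. The example is a simple stand-in, not one of this year’s IMO problems, and the theorem name and phrasing are invented for illustration.

```lean
-- Illustrative only: "formalization" turns an English claim into a precise
-- Lean statement a machine can work with. This is a toy stand-in, not an
-- actual IMO problem.
import Mathlib

-- English: "The square of every even natural number is even."
theorem sq_of_even_is_even (n : ℕ) (h : ∃ k, n = 2 * k) :
    ∃ m, n * n = 2 * m := by
  obtain ⟨k, rfl⟩ := h           -- unpack the evenness witness n = 2k
  exact ⟨2 * k * k, by ring⟩     -- (2k) * (2k) = 2 * (2k * k)
```

Real IMO statements are far harder to pin down this precisely, which is why the translation into Lean was done by humans rather than by the AI itself.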
Despite the impressive results, Sir Timothy Gowers provided a nuanced perspective on Google DeepMind’s achievement in a thread on X. He acknowledged the AI’s capabilities as “well beyond what automatic theorem provers could do before” but noted several important qualifications.
“The main qualification is that the program needed a lot longer than the human competitors—for some of the problems over 60 hours—and of course much faster processing speed than the poor old human brain,” Gowers wrote. He emphasized that if human competitors had been given the same amount of time per problem, they would likely have scored higher.
Gowers also pointed out that the problems were manually translated into Lean by humans before the AI began its work. The AI handled the core mathematical reasoning, but the initial formalization step (the “autoformalization” a fully autonomous system would need) was still performed by people. Regarding the broader implications for mathematical research, Gowers expressed caution.
“Are we close to the point where mathematicians are redundant? It’s hard to say. I would guess that we’re still a breakthrough or two short of that,” he wrote.
While the AI’s long processing times suggest it hasn’t “solved mathematics,” Gowers acknowledged that “something interesting is going on when it operates.”
Despite these limitations, Gowers speculated that such AI systems could become valuable research tools.
“So we might be close to having a program that would enable mathematicians to get answers to a wide range of questions, provided those questions weren’t too difficult—the kind of thing one can do in a couple of hours. That would be massively useful as a research tool, even if it wasn’t itself capable of solving open problems.”