DeepMind’s AI co-mathematician cracks a 60-year-old open problem
An agentic system built on Gemini 3.1 set a new high on FrontierMath Tier 4, and handed an Oxford professor the missing idea inside a proof its own reviewers had rejected.
Google DeepMind has built an AI system that does not just answer math questions — it works alongside professional mathematicians on problems no human has solved. Called the “AI co-mathematician” and powered by Gemini 3.1, it is not a single chatbot but a workbench: a team of specialised AI agents that propose proofs, hunt for counter-examples, search the literature, and critique each other’s work, while the human researcher steers. On a benchmark of research-level problems used to stress-test the frontier, it set a new high. More importantly, it helped an Oxford professor close an open question in group theory that had sat unsolved for decades. The signal for industry: agentic research tools are no longer toys for benchmarks. They are starting to do something that looks like science.
Marc Lackenby, a topologist at the University of Oxford, typed Problem 21.10 from the Kourovka Notebook (a famously stubborn list of open questions in group theory, first compiled in 1965) into a private DeepMind workbench. The system did not answer directly. Instead, it spawned two parallel workstreams: one trying to prove the conjecture, another trying to break it. Within minutes the “prover” returned a candidate argument. Minutes later, the workbench’s own reviewer agent flagged a hole in the logic and threw the proof out. That should have been the end of it. But Lackenby, scrolling through the rejected draft, stopped at one of the discarded lemmas. The reviewer was right that the proof did not work as written, but the strategy was, in his words, “really, really clever.” He realised he knew how to fill the gap. A problem that had defeated specialists for 60 years closed that afternoon.
DeepMind published the system, internally known as Aletheia, in a paper led by Pushmeet Kohli, VP of Research at DeepMind, who framed it not as a replacement for mathematicians but as a “collaborator that takes intellectual risks the user can audit.” The architecture matters as much as the result. Aletheia is a stateful, asynchronous environment in which a top-level project-coordinator agent breaks an open problem into sub-tasks, dispatches them to specialised sub-agents (provers, refuters, literature scouts, computational explorers) and routes every output through a reviewer that can reject, revise, or escalate. Failed hypotheses are not deleted; they are logged, annotated, and offered back to the human.
On Epoch AI’s FrontierMath Tier 4, a 50-problem set the benchmark’s authors described as “designed to remain unsolved by AI for decades,” the system scored 48%. The previous public state of the art, set by GPT-5.4 Pro in March, was 38%. A year earlier it was 19%. To put that in perspective, FrontierMath Tier 4 is calibrated against early-postdoc mathematics; the IMO gold medals that grabbed headlines in 2025 were, by comparison, high-school problems.
The pivot the Lackenby episode forces is uncomfortable. The most valuable output of the system, on the day it broke a 60-year-old problem, was a proof its own quality control had thrown in the bin.
Aletheia is striking less for any single capability than for what it stops doing. It does not try to one-shot answers. It does not optimise for a single chain of thought. It treats a research question the way a research group does: with parallel attempts, internal disagreement, and an editor in the loop.
The base model is Gemini 3.1. On an internal benchmark of 100 research-level problems with code-checkable answers, Gemini 3.1 Pro alone scores 57%. Gemini 3.1 Deep Think, the longer-horizon reasoning variant, reaches 70%. Wrapped in the Aletheia agent harness, the score jumps to 87%. The lift from the harness, in other words, is comparable to the lift from a full generation of base-model improvement. That is the engineering claim DeepMind is making, and it is the one enterprise buyers should read most carefully: the next leg of capability gains may come less from bigger models than from how those models are orchestrated.
The workbench’s mechanics are concrete. A project coordinator decomposes the user’s problem into a directed graph of sub-goals. Worker agents (instances of Gemini 3.1 with specialised system prompts and tool access) attempt each sub-goal in parallel, sometimes pursuing contradictory strategies on purpose. A reviewer agent, fine-tuned to find errors rather than to produce arguments, gates every output. Outputs that survive are composed into a LaTeX draft, with provenance notes and margin annotations showing which agent produced which step and what was rejected along the way. Failed branches are preserved and surfaced. That last detail is what made Lackenby’s discovery possible: the rejected proof was still there, with the reviewer’s critique attached.
On FrontierMath Tier 4, Aletheia solved 23 of 48 non-public sample problems (about 48%) in fully autonomous mode, with no human nudging mid-run.
The contrast with an earlier era of automated theorem proving is instructive. In 1996, the Robbins conjecture, an algebra problem open since 1933, was settled by EQP, a custom prover at Argonne National Lab, after about eight days of compute on a single, narrowly scoped problem. Aletheia is a general-purpose research environment that the user can point at almost any pure-maths question and expect non-trivial engagement.
The catch is that, like any peer reviewer, Aletheia is sometimes wrong, and so are its reviewers. Lackenby’s experience is a feature only if the human in the loop is good enough to overrule the machine’s verdict.
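The workbench pattern described above (workers attempting each sub-goal, a reviewer gating every output, rejected branches kept with their critique) can be sketched in a few lines of Python. This is an illustrative toy, not DeepMind’s implementation: the names Workbench, Attempt, prover, refuter and reviewer are hypothetical, the agents are plain stub functions rather than model calls, and the loop runs sequentially where the real system is described as parallel and asynchronous.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Attempt:
    """One worker output plus its provenance and the reviewer's verdict."""
    agent: str
    subgoal: str
    content: str
    accepted: bool
    critique: str  # kept even when the attempt is rejected


@dataclass
class Workbench:
    """Hypothetical coordinator: runs every worker on every sub-goal."""
    workers: Dict[str, Callable[[str], str]]
    reviewer: Callable[[str], Tuple[bool, str]]
    log: List[Attempt] = field(default_factory=list)

    def run(self, subgoals: List[str]) -> List[Attempt]:
        for goal in subgoals:
            for name, worker in self.workers.items():
                content = worker(goal)
                ok, critique = self.reviewer(content)
                # Nothing is deleted: rejected attempts stay in the log.
                self.log.append(Attempt(name, goal, content, ok, critique))
        return [a for a in self.log if a.accepted]

    def rejected(self) -> List[Attempt]:
        """Failed branches, surfaced for the human to inspect."""
        return [a for a in self.log if not a.accepted]


# Stub agents standing in for model calls (assumptions for illustration).
def prover(goal: str) -> str:
    return f"Lemma towards {goal}: clever strategy, but with a gap"


def refuter(goal: str) -> str:
    return f"No counter-example to {goal} found in small cases"


def reviewer(content: str) -> Tuple[bool, str]:
    if "gap" in content:
        return False, "Step is unjustified as written"
    return True, ""


bench = Workbench(workers={"prover": prover, "refuter": refuter}, reviewer=reviewer)
accepted = bench.run(["sub-goal A"])

for attempt in bench.rejected():
    print(f"[rejected] {attempt.agent}: {attempt.content} ({attempt.critique})")
```

The design choice worth noticing is that the reviewer only labels attempts and never removes them, so a human can still read a rejected draft alongside the critique, which is the situation the Lackenby episode turned on.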
For corporate R&D leaders, the Lackenby story is the template. The system did not replace a domain expert; it gave one a strategy he would not have generated alone, and that strategy sat inside a draft the system’s own quality gate had rejected. That is the realistic shape of agentic research inside a pharma discovery team, a materials lab, or a quant desk: parallel hypotheses, internal critique, and provenance trails that let a senior scientist audit the reasoning, not just the answer. The 30-point lift from base model to agent harness (57% to 87%) is the headline number for procurement. It implies that the buyers who win the next two years will be the ones who invest in orchestration (agent topologies, reviewer models, verifier loops, tool integrations) rather than the ones who simply queue up for the next Gemini or Claude. Expect Accenture, McKinsey QuantumBlack and Big Four AI practices to repackage the Aletheia pattern as a consulting offer within the quarter.
The EU AI Office’s full enforcement powers under the AI Act activate on 2 August 2026, including the regime for general-purpose AI models with systemic risk — a threshold Gemini 3.1 comfortably crosses. The Commission’s February 2026 guidelines on high-risk classification say little about scientific-research deployments, which sit in an awkward middle ground: not consumer-facing, not safety-critical in the AI Act’s narrow sense, yet capable of producing outputs that downstream products will rely on. A proof generated by an agent, verified by another agent, and published under a human author’s name raises authorship, liability and disclosure questions that current rules barely touch. Brussels has signalled that the Scientific Panel inside the AI Office will start issuing opinions on research-use GPAI in the second half of the year. Enterprise legal teams should assume that provenance logs, of the kind Aletheia already produces, will become evidentiary defaults.
The venture thesis around agentic research is now testable rather than speculative. FutureHouse, the Eric Schmidt-backed non-profit, has shipped a public platform of literature and chemistry agents — Crow, Falcon, Owl, Phoenix — and is the closest open analogue to Aletheia’s design. Sakana AI in Tokyo, fresh off a Series B, claims the first peer-reviewed paper written end-to-end by its AI Scientist-v2. Lila Sciences, the Flagship Pioneering spinout, is raising on a closed-loop wet-lab autonomy pitch. The DeepMind paper is bullish for all of them in one sense — it validates the multi-agent thesis — and bearish in another: it sets a quality bar that frontier-lab orchestration plus a frontier base model will keep raising. Startups without privileged model access will compete on domain-specific verifiers, tool integrations, and trust UX. Expect the next funding round in this category to be priced on retention inside research orgs, not on demo videos.