DeepMind's AlphaProof Nexus Industrialises AI Mathematics
A Gemini-plus-Lean agent autonomously cracked nine open Erdős problems for a few hundred dollars each — and quietly handed enterprises a template for hallucination-proof AI..
Paul Erdős was a 20th-century Hungarian mathematician who posed thousands of problems, many of which remain open decades after his death — small, deceptively simple questions that have resisted entire careers of effort. Google DeepMind has now published a paper describing a system, AlphaProof Nexus, that solved nine of them on its own. The trick is not raw model power. It is a pairing: a large language model (Gemini 3.1 Pro) writes candidate proofs in a language called Lean, and Lean — a formal proof assistant — mechanically checks every logical step. If the proof is wrong, Lean rejects it and the model tries again. The model can hallucinate; the verifier cannot. That feedback loop is the same architectural pattern enterprises are starting to adopt anywhere a wrong answer is unacceptable.
On a Thursday morning in late May, George Tsoukalas and twenty co-authors at Google DeepMind posted arXiv preprint 2605.22763 with a title that sounded modest — “Advancing Mathematics Research with AI-Driven Formal Proof Search” — and a result that was not. Their system, AlphaProof Nexus, had taken 353 open Erdős problems formalised in the Lean proof language and, running unattended, produced verified solutions to nine of them. Two had been open for fifty-six years. The bill for each successful proof was, in the authors’ phrase, “a few hundred dollars.” The timing was not accidental. Exactly one week earlier, on May 20, OpenAI had announced that an internal reasoning model disproved the planar unit distance conjecture, an 80-year-old Erdős problem in discrete geometry. Fields medallist Terence Tao called that result “perhaps the most unambiguous instance” of an LLM solving an open mathematical problem. It was a single proof, hand-shepherded by humans, celebrated as a one-off. DeepMind’s response, prepared in parallel rather than in reaction, was to publish what reads less like a triumphal announcement and more like a manufacturing report: 9 solves out of 353 attempts, 44 confirmations out of 492 OEIS conjectures, with the unit economics worked out. The system architecture explains the difference. AlphaProof Nexus is not a single model. It is a swarm: subagents each propose Lean proof attempts, an evolutionary algorithm coordinates which avenues to pursue, and a focused proof tool — the original AlphaProof, descended from the silver-medal IMO 2024 system — handles olympiad-level subgoals. Every candidate proof goes to the Lean compiler. If Lean accepts it, the result is, by construction, mathematically correct. There is no need for a human to check the chain-of-thought, because the chain-of-thought is a machine-verified formal object. The more striking finding, buried in the post-hoc ablations, is that the bare-bones version of the agent — Gemini 3.1 Pro plus Lean, no evolution, no AlphaProof subroutine — also solved all nine problems given enough budget. The full-featured stack was simply two-to-five times cheaper at the same solve rate. The implication is uncomfortable for the established “AI-for-science” narrative: most of the value came from giving a strong general model a strict referee, not from bespoke mathematical machinery. Demis Hassabis, asked about the result on the Big Technology Podcast, deflated the inevitable AGI talk. Current systems, he said, are “nowhere near” true general intelligence; genuine AGI would need to be “original, creative and have a wide range of skills across a multitude of areas, not just a specific one.” The careful framing matters. DeepMind is not claiming a thinking mathematician. It is claiming an industrialised, verifiable search procedure that happens to work in a domain where wrongness is cheap to detect.
Most enterprise AI deployments today have the same shape: a frontier LLM generates a response, a human or a downstream system uses it, and somebody — eventually — discovers whether it was right. The cost of being wrong is borne after the fact, and the dominant mitigation is the expensive ritual of human review. AlphaProof Nexus inverts that arrangement. Wrongness is detected before the output leaves the system, by a tool that cannot itself be fooled. Lean is the key. It is a programming language and a proof assistant, descended from the type theories of the 1970s, in which mathematical statements compile only if their proofs are logically sound. A Lean proof of the Pythagorean theorem is not prose; it is code that the compiler accepts or rejects. There is no judgement, no probability, no “seems right.” This is the same property that makes formal methods attractive in microprocessor verification (Intel has used them since the Pentium FDIV bug), aerospace control software, and cryptographic protocols. What was missing was a generator powerful enough to produce candidate proofs at scale. Gemini 3.1 Pro, judging by DeepMind’s numbers, now clears that bar. The economic structure that emerges is worth describing precisely. The LLM is the expensive component per token, but its outputs are cheap to discard. The verifier is nearly free to run but accepts almost nothing. Together they form a generate-and-test loop whose end product — a verified proof — has a hard guarantee attached. The cost per successful proof reported in the paper, a few hundred dollars, is essentially the cost of throwing enough LLM attempts at Lean until one sticks. This pattern generalises well beyond mathematics. In financial modelling, a similar loop can pair an LLM that proposes hedging strategies with a constraint solver that rejects any portfolio violating regulatory or risk limits. In code review, an LLM can propose patches that a type checker, test suite, and symbolic execution engine accept or reject — the architecture Anthropic’s recent Claude-based vulnerability-finding work appears to use at scale. In legal reasoning, a model can draft a clause that a contract-checking system validates against a corpus of binding constraints. The recipe is the same: cheap generator, strict verifier, no human in the inner loop. The more remarkable point is what the paper does not claim. AlphaProof Nexus does not produce more correct natural-language proofs; it produces more proofs that can be checked without natural language at all. The trust model has shifted from “the model is reliable” to “the model is allowed to be unreliable because the verifier is not.” For any enterprise serious about deploying generative AI in regulated or high-stakes settings, that is the architectural lesson, and it is independent of which model wins the next benchmark.
In July 2024, DeepMind announced that AlphaProof and AlphaGeometry 2, working together, had solved four of the six International Mathematical Olympiad problems — a silver-medal performance, the first time an AI had reached the level of the world’s top high-school mathematicians. That system was bespoke, slow, and required problems to be hand-translated into Lean by human formalisers. It was, even by its authors’ reckoning, a demonstration. Less than two years later, the same lineage has produced a system that ingests an entire repository of formalised conjectures, runs unattended, and returns verified solutions at commodity prices. The acceleration is the story. Olympiad problems are hard but bounded; their solutions exist and are known to humans. The Erdős problems in the test set are, by construction, ones nobody had solved. The two cases where the system cracked questions that had been open for fifty-six years — predating not just modern AI but most of computer science as an industry — make the historical comparison stark. The field of combinatorics produced perhaps a few hundred Erdős-problem resolutions across its entire history; an automated agent has now contributed nine in a single batch. The published Lean proofs sit in a public GitHub repository under the google-deepmind organisation, and the formal-conjectures repo against which the system was evaluated — the source of those 353 problems — remains open. Anyone can attempt to extend the result with a different generator. That is unusual for a result of this commercial weight, and it removes the most common skeptical objection (that the headline numbers are cherry-picked or unreproducible). A quiet critical thread remains. No prominent mathematician has yet matched Tao’s endorsement of the OpenAI result for the DeepMind batch; reactions in the community have ranged from impressed to wary, with concerns focused on whether the solved problems were the “easy” end of the Erdős corpus and whether the formalisation pipeline introduces subtle biases. Gary Marcus, the field’s most prominent AI skeptic, has not publicly engaged with the paper as of writing. The absence of strong critical framing is itself notable — and worth watching as the proofs receive peer scrutiny in the coming weeks. For the enterprise reader, the historical pattern is the one to internalise. AI capability in narrow, verifiable domains compounds quickly once the verifier exists. Chess collapsed to superhuman performance over a decade once the rules became explicit; Go took six years from AlphaGo to obsolescence-grade tools. Formal mathematics looks to be on a similar curve, and the same logic will apply to any domain that can be reduced to a generator-plus-verifier loop. The question for any incumbent is not whether the loop will be built for their industry, but whether they will build it or buy it.
The takeaway for CIOs is architectural, not mathematical. AlphaProof Nexus is a working production-scale demonstration that the way to deploy LLMs in high-stakes contexts is to pair them with a domain-specific verifier and let the model fail loudly and cheaply. Banks running model-risk-management functions, pharma companies validating regulatory submissions, and law firms reviewing contracts at scale should be asking the same question: what is our Lean? Where is the symbolic, deterministic checker that can reject 99% of generated content and accept only the 1% that is provably right? The companies that build that infrastructure will deploy generative AI in places competitors cannot.
Formal verification has been a regulatory aspiration for decades — in aviation software (DO-178C), in cryptographic standards (FIPS), in nuclear control systems. The bottleneck was always the cost of producing the proofs. If AlphaProof Nexus’s economics generalise even partially, regulators acquire a tool they have lacked: the ability to demand machine-verifiable evidence that an AI system’s outputs satisfy specified constraints, at a cost low enough to mandate. The EU AI Act’s high-risk category, which currently leans on documentation and human oversight, becomes considerably more enforceable when “prove it” can mean exactly that. Expect the formal-verification clause to migrate from aspirational annex to operative requirement.
A new layer is opening between the foundation-model labs and the application layer: the verifier-and-orchestration layer. Lean is open source; so is Coq; so are Z3 and CVC5. The interesting startups are those wrapping these tools in agent infrastructure for specific verticals — formal contracts, financial compliance, cryptographic proof generation, smart-contract auditing. The DeepMind paper is, in venture terms, a category-creation event: it proves the architecture works and publishes the playbook. Expect a wave of “Lean for X” companies in the next eighteen months, with the defensible ones being those that own the verifier and the domain-specific schema, not those that simply wrap a frontier model.
Sources 10 references
- [1]Advancing Mathematics Research with AI-Driven Formal Proof Search (arXiv 2605.22763v1)
- [2]Advancing Mathematics Research with AI-Driven Formal Proof Search (HTML)
- [3]google-deepmind/formal-conjectures (GitHub)
- [4]google-deepmind/alphaproof-nexus-results (GitHub)
- [5]Google Deepmind's AlphaProof Nexus solves decades-old math problems for a few hundred dollars (The Decoder)
- [6]Olympiad-level formal mathematical reasoning with reinforcement learning (Nature)
- [7]An OpenAI model has disproved a central conjecture in discrete geometry (OpenAI)
- [8]AI just solved an 80-year-old 'Erdős problem' (Scientific American)
- [9]Google DeepMind AI solved 56-year-old math problems but Demis Hassabis says this is not AGI (Techlusive)
- [10]AI achieves silver-medal standard solving International Mathematical Olympiad problems (DeepMind blog)