Daily AI Briefing · Tuesday, 26 May 2026

01 / 04 · Research & Open Source

8 min read

DeepMind's AlphaProof Nexus Industrialises AI Mathematics

A Gemini-plus-Lean agent autonomously cracked nine open Erdős problems for a few hundred dollars each, and quietly handed enterprises a template for hallucination-proof AI.

·01Primer

Paul Erdős was a 20th-century Hungarian mathematician who posed thousands of problems, many of which remain open decades after his death — small, deceptively simple questions that have resisted entire careers of effort. Google DeepMind has now published a paper describing a system, AlphaProof Nexus, that solved nine of them on its own. The trick is not raw model power. It is a pairing: a large language model (Gemini 3.1 Pro) writes candidate proofs in a language called Lean, and Lean — a formal proof assistant — mechanically checks every logical step. If the proof is wrong, Lean rejects it and the model tries again. The model can hallucinate; the verifier cannot. That feedback loop is the same architectural pattern enterprises are starting to adopt anywhere a wrong answer is unacceptable.
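To make the accept-or-reject behaviour concrete, here is a minimal Lean 4 sketch. The three statements are toy facts about natural numbers chosen for this briefing, not anything from the DeepMind paper, and the theorem names ending in `_toy` are made up for illustration. The lemmas cited (`Nat.zero_add`, `Nat.add_comm`) are part of Lean's core library.

```lean
-- Lean 4. Toy statements for illustration only; not taken from the AlphaProof Nexus paper.

-- Accepted: `n + 0` unfolds to `n` by the definition of addition on Nat,
-- so reflexivity (`rfl`) is already a complete proof.
theorem add_zero_toy (n : Nat) : n + 0 = n := rfl

-- Accepted: the proof term is the core-library lemma `Nat.zero_add`.
theorem zero_add_toy (n : Nat) : 0 + n = n := Nat.zero_add n

-- Accepted: commutativity of addition via the core-library lemma `Nat.add_comm`.
theorem add_comm_toy (a b : Nat) : a + b = b + a := Nat.add_comm a b
```

If a proof term does not establish exactly the stated claim, the file fails to compile and no theorem is recorded. That binary outcome is what the generator in the loop is working against.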

·02What Happened

On a Thursday morning in late May, George Tsoukalas and twenty co-authors at Google DeepMind posted arXiv preprint 2605.22763 with a title that sounded modest — “Advancing Mathematics Research with AI-Driven Formal Proof Search” — and a result that was not. Their system, AlphaProof Nexus, had taken 353 open Erdős problems formalised in the Lean proof language and, running unattended, produced verified solutions to nine of them. Two had been open for fifty-six years. The bill for each successful proof was, in the authors’ phrase, “a few hundred dollars.” The timing was not accidental. Exactly one week earlier, on May 20, OpenAI had announced that an internal reasoning model disproved the planar unit distance conjecture, an 80-year-old Erdős problem in discrete geometry. Fields medallist Terence Tao called that result “perhaps the most unambiguous instance” of an LLM solving an open mathematical problem. It was a single proof, hand-shepherded by humans, celebrated as a one-off. DeepMind’s response, prepared in parallel rather than in reaction, was to publish what reads less like a triumphal announcement and more like a manufacturing report: 9 solves out of 353 attempts, 44 confirmations out of 492 OEIS conjectures, with the unit economics worked out. The system architecture explains the difference. AlphaProof Nexus is not a single model. It is a swarm: subagents each propose Lean proof attempts, an evolutionary algorithm coordinates which avenues to pursue, and a focused proof tool — the original AlphaProof, descended from the silver-medal IMO 2024 system — handles olympiad-level subgoals. Every candidate proof goes to the Lean compiler. If Lean accepts it, the result is, by construction, mathematically correct. There is no need for a human to check the chain-of-thought, because the chain-of-thought is a machine-verified formal object. 
The more striking finding, buried in the post-hoc ablations, is that the bare-bones version of the agent — Gemini 3.1 Pro plus Lean, no evolution, no AlphaProof subroutine — also solved all nine problems given enough budget. The full-featured stack was simply two-to-five times cheaper at the same solve rate. The implication is uncomfortable for the established “AI-for-science” narrative: most of the value came from giving a strong general model a strict referee, not from bespoke mathematical machinery. Demis Hassabis, asked about the result on the Big Technology Podcast, deflated the inevitable AGI talk. Current systems, he said, are “nowhere near” true general intelligence; genuine AGI would need to be “original, creative and have a wide range of skills across a multitude of areas, not just a specific one.” The careful framing matters. DeepMind is not claiming a thinking mathematician. It is claiming an industrialised, verifiable search procedure that happens to work in a domain where wrongness is cheap to detect.

·03Architecture: Why the Verifier Is the Product

Most enterprise AI deployments today have the same shape: a frontier LLM generates a response, a human or a downstream system uses it, and somebody — eventually — discovers whether it was right. The cost of being wrong is borne after the fact, and the dominant mitigation is the expensive ritual of human review. AlphaProof Nexus inverts that arrangement. Wrongness is detected before the output leaves the system, by a tool that cannot itself be fooled. Lean is the key. It is a programming language and a proof assistant, descended from the type theories of the 1970s, in which mathematical statements compile only if their proofs are logically sound. A Lean proof of the Pythagorean theorem is not prose; it is code that the compiler accepts or rejects. There is no judgement, no probability, no “seems right.” This is the same property that makes formal methods attractive in microprocessor verification (Intel has used them since the Pentium FDIV bug), aerospace control software, and cryptographic protocols. What was missing was a generator powerful enough to produce candidate proofs at scale. Gemini 3.1 Pro, judging by DeepMind’s numbers, now clears that bar. The economic structure that emerges is worth describing precisely. The LLM is the expensive component per token, but its outputs are cheap to discard. The verifier is nearly free to run but accepts almost nothing. Together they form a generate-and-test loop whose end product — a verified proof — has a hard guarantee attached. The cost per successful proof reported in the paper, a few hundred dollars, is essentially the cost of throwing enough LLM attempts at Lean until one sticks. This pattern generalises well beyond mathematics. In financial modelling, a similar loop can pair an LLM that proposes hedging strategies with a constraint solver that rejects any portfolio violating regulatory or risk limits. 
In code review, an LLM can propose patches that a type checker, test suite, and symbolic execution engine accept or reject — the architecture Anthropic’s recent Claude-based vulnerability-finding work appears to use at scale. In legal reasoning, a model can draft a clause that a contract-checking system validates against a corpus of binding constraints. The recipe is the same: cheap generator, strict verifier, no human in the inner loop. The more remarkable point is what the paper does not claim. AlphaProof Nexus does not produce more correct natural-language proofs; it produces more proofs that can be checked without natural language at all. The trust model has shifted from “the model is reliable” to “the model is allowed to be unreliable because the verifier is not.” For any enterprise serious about deploying generative AI in regulated or high-stakes settings, that is the architectural lesson, and it is independent of which model wins the next benchmark.
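The generate-and-test loop described in this section can be sketched in a few lines of Python. This is a toy under stated assumptions, not DeepMind's implementation: the "generator" is a random guesser standing in for an LLM, the "verifier" is a divisibility check standing in for Lean, and every name here (`generate_and_verify`, `toy_propose`, `toy_verify`, `expected_cost`) is hypothetical. The cost figures are placeholders.

```python
import random
from typing import Callable, Optional


def generate_and_verify(
    propose: Callable[[random.Random], int],
    verify: Callable[[int], bool],
    max_attempts: int,
    cost_per_attempt: float,
    seed: int = 0,
) -> tuple[Optional[int], int, float]:
    """Query an untrusted generator until the verifier accepts or the budget runs out.

    Returns (accepted candidate or None, attempts used, total cost).
    """
    rng = random.Random(seed)
    for attempt in range(1, max_attempts + 1):
        candidate = propose(rng)      # may be wrong; cheap to discard
        if verify(candidate):         # deterministic accept/reject, no human in the loop
            return candidate, attempt, attempt * cost_per_attempt
    return None, max_attempts, max_attempts * cost_per_attempt


def expected_cost(cost_per_attempt: float, p_accept: float) -> float:
    """Expected spend per accepted result if attempts are independent (geometric model)."""
    if not 0 < p_accept <= 1:
        raise ValueError("p_accept must be in (0, 1]")
    return cost_per_attempt / p_accept


# Toy problem (hypothetical): find a nontrivial factor of N = 83 * 97.
N = 8051


def toy_propose(rng: random.Random) -> int:
    """Unreliable generator: guesses an integer with no reasoning at all."""
    return rng.randint(2, 200)


def toy_verify(candidate: int) -> bool:
    """Strict verifier: accepts only genuine nontrivial factors of N."""
    return 1 < candidate < N and N % candidate == 0


result, attempts, cost = generate_and_verify(
    toy_propose, toy_verify, max_attempts=10_000, cost_per_attempt=0.01
)
print(f"accepted={result} after {attempts} attempts, cost={cost:.2f}")
```

The design point is that correctness lives entirely in `verify`. The generator can be swapped for a stronger or cheaper one without changing what an accepted result guarantees; only the acceptance rate, and therefore the expected cost, moves.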

·04Context: From IMO Silver to Industrial Output in 22 Months

In July 2024, DeepMind announced that AlphaProof and AlphaGeometry 2, working together, had solved four of the six International Mathematical Olympiad problems — a silver-medal performance, the first time an AI had reached the level of the world’s top high-school mathematicians. That system was bespoke, slow, and required problems to be hand-translated into Lean by human formalisers. It was, even by its authors’ reckoning, a demonstration. Less than two years later, the same lineage has produced a system that ingests an entire repository of formalised conjectures, runs unattended, and returns verified solutions at commodity prices. The acceleration is the story. Olympiad problems are hard but bounded; their solutions exist and are known to humans. The Erdős problems in the test set are, by construction, ones nobody had solved. The two cases where the system cracked questions that had been open for fifty-six years — predating not just modern AI but most of computer science as an industry — make the historical comparison stark. The field of combinatorics produced perhaps a few hundred Erdős-problem resolutions across its entire history; an automated agent has now contributed nine in a single batch. The published Lean proofs sit in a public GitHub repository under the google-deepmind organisation, and the formal-conjectures repo against which the system was evaluated — the source of those 353 problems — remains open. Anyone can attempt to extend the result with a different generator. That is unusual for a result of this commercial weight, and it removes the most common skeptical objection (that the headline numbers are cherry-picked or unreproducible). A quiet critical thread remains. 
No prominent mathematician has yet matched Tao’s endorsement of the OpenAI result for the DeepMind batch; reactions in the community have ranged from impressed to wary, with concerns focused on whether the solved problems were the “easy” end of the Erdős corpus and whether the formalisation pipeline introduces subtle biases. Gary Marcus, the field’s most prominent AI skeptic, has not publicly engaged with the paper as of writing. The absence of strong critical framing is itself notable — and worth watching as the proofs receive peer scrutiny in the coming weeks. For the enterprise reader, the historical pattern is the one to internalise. AI capability in narrow, verifiable domains compounds quickly once the verifier exists. Chess collapsed to superhuman performance over a decade once the rules became explicit; Go took six years from AlphaGo to obsolescence-grade tools. Formal mathematics looks to be on a similar curve, and the same logic will apply to any domain that can be reduced to a generator-plus-verifier loop. The question for any incumbent is not whether the loop will be built for their industry, but whether they will build it or buy it.

Three Perspectives What this story means for different readers

The takeaway for CIOs is architectural, not mathematical. AlphaProof Nexus is a working production-scale demonstration that the way to deploy LLMs in high-stakes contexts is to pair them with a domain-specific verifier and let the model fail loudly and cheaply. Banks running model-risk-management functions, pharma companies validating regulatory submissions, and law firms reviewing contracts at scale should be asking the same question: what is our Lean? Where is the symbolic, deterministic checker that can reject 99% of generated content and accept only the 1% that is provably right? The companies that build that infrastructure will deploy generative AI in places competitors cannot.

Formal verification has been a regulatory aspiration for decades — in aviation software (DO-178C), in cryptographic standards (FIPS), in nuclear control systems. The bottleneck was always the cost of producing the proofs. If AlphaProof Nexus’s economics generalise even partially, regulators acquire a tool they have lacked: the ability to demand machine-verifiable evidence that an AI system’s outputs satisfy specified constraints, at a cost low enough to mandate. The EU AI Act’s high-risk category, which currently leans on documentation and human oversight, becomes considerably more enforceable when “prove it” can mean exactly that. Expect the formal-verification clause to migrate from aspirational annex to operative requirement.

A new layer is opening between the foundation-model labs and the application layer: the verifier-and-orchestration layer. Lean is open source; so is Coq; so are Z3 and CVC5. The interesting startups are those wrapping these tools in agent infrastructure for specific verticals — formal contracts, financial compliance, cryptographic proof generation, smart-contract auditing. The DeepMind paper is, in venture terms, a category-creation event: it proves the architecture works and publishes the playbook. Expect a wave of “Lean for X” companies in the next eighteen months, with the defensible ones being those that own the verifier and the domain-specific schema, not those that simply wrap a frontier model.


02 / 04 · Law & Governance

8 min read

Anthropic's Mythos finds 10,000 critical bugs. The model stays locked.

Project Glasswing has converted an unreleased frontier model into a private cyber-defence utility, leaving every CISO to trust one company's release discipline.

·01Primer

Anthropic has built a frontier model, Claude Mythos Preview, that is unusually effective at finding and weaponising software vulnerabilities. Rather than ship it, Anthropic has wrapped it in a closed coalition called Project Glasswing: roughly 50 partner organisations, including AWS, Apple, Microsoft, Google, JPMorgan, Cisco, CrowdStrike and the Linux Foundation, get controlled access to use Mythos defensively on their own code and on critical open-source infrastructure. In about a month, the coalition has reported more than 10,000 high- or critical-severity flaws. The headline find is CVE-2026-4747, a 17-year-old unauthenticated root remote code execution in FreeBSD's NFS stack. Anthropic argues that releasing Mythos publicly would dismantle the friction that today's defence-in-depth security model depends on. Enterprises, in other words, are being asked to trust one vendor's release discipline.

·02What Happened

On a Tuesday in late March, an on-call engineer at a European cloud provider opened a ticket marked Severity 1 from a small triage firm she had never worked with. Inside was a sealed report with a SHA-3-512 hash, a redacted bug class, and the line: “Identified by Claude Mythos Preview; validated by external researchers; disclosure window 90 days.” By the time she had finished her coffee, her counterpart at Cloudflare was reading a similar packet. By the end of April, roughly 50 organisations were doing the same thing in parallel, and a quiet, coordinated firefight had begun against a backlog of bugs that, in many cases, had been sitting in production for a decade. That firefight is Project Glasswing, the cybersecurity coalition Anthropic stood up around an unreleased frontier model it calls Claude Mythos Preview. In its initial update, Anthropic reported that partners had used Mythos to find more than 10,000 high- or critical-severity vulnerabilities across what it described as the most systemically important software in the world. Anthropic's own internal sweep of 1,000 open-source projects produced 6,202 high- and critical-severity findings out of 23,019 total issues. Cloudflare, one of the named partners, reported roughly 2,000 bugs from its internal testing, around 400 of them high or critical severity, and noted that the model produced fewer false positives than its conventional human-led testing. The flagship find is CVE-2026-4747, a stack buffer overflow in FreeBSD's RPCSEC_GSS implementation that yields unauthenticated remote code execution in the kernel when NFS is exposed. The vulnerable code had been in tree for roughly 17 years. 
According to a detailed write-up published by Califio's MAD Bugs project, Mythos did not just point at the bug: it stood up a FreeBSD VM with NFS and Kerberos, drove a QEMU debugger, read crash dumps, built a ROP chain from the available kernel gadgets and worked around legacy debug-register quirks, with a human researcher supplying 44 prompts across roughly eight hours. The blast radius is not academic. FreeBSD-derived stacks sit underneath Netflix's CDN, PlayStation's operating system and a long tail of network appliances; an unauthenticated root in NFS is the kind of bug that, in the wrong hands, ends careers and quarters. Logan Graham, who leads Anthropic's Frontier Red Team and helped stand up Glasswing, framed the moment more bluntly than the corporate post does. “Mythos is an extraordinary model. But it is not about the model,” he wrote on X. “It's about what the world needs to do to prepare for a future of models that are extremely good at cybersecurity. This is the start.” Anthropic has paired that warning with money: up to $100M in Mythos usage credits for Glasswing partners, plus $4M in direct donations to open-source security (roughly $2.5M to Alpha-Omega and OpenSSF via the Linux Foundation, and $1.5M to the Apache Software Foundation). Crucially, Anthropic has not released Mythos publicly and has said it will not until safeguards exist that prevent a leaked or copied version from being turned around as an offensive utility.

·03The Numbers and the Asymmetry

Headline counts of vulnerabilities are notoriously misleading, so the Mythos numbers are worth disaggregating. Anthropic's coordinated vulnerability disclosure dashboard at red.anthropic.com/2026/cvd is the cleanest signal: as of 22 May 2026, 1,596 findings have been filed with maintainers across 281 open-source projects, with 97 patched and 88 already assigned a CVE or GHSA identifier. That is the auditable layer. Above it sits the 10,000-plus aggregate across Glasswing partners, which mixes internal corporate findings, validated open-source disclosures, and a long tail of “high-severity in context” issues that may never receive a public identifier. Below it sits the more interesting datum: of the 23,019 issues Anthropic surfaced in its 1,000-project OSS sweep, 6,202 — roughly 27% — were judged high or critical after external triage. For a fully automated first pass, that triage rate is not noise. For comparison, the canonical industry baseline for application security pipelines is that the great majority of automated SAST findings are downgraded or discarded after human review; double-digit true-positive rates on critical severity are normally the preserve of bespoke human red teams. The 17-year tail on CVE-2026-4747 is also less anomalous than it looks: Heartbleed lived in OpenSSL for two years, Shellshock in bash for 25, and the Linux kernel's Dirty COW for nine. Long-lived bugs in critical libraries are the rule, not the exception. What changes with Mythos is the cost of finding them. If a model can plausibly do an eight-hour human-guided session to ship a working FreeBSD kernel RCE, the economics of historical bug-hunting collapse for both sides of the wire. This is where the counterpoint matters. Bruce Schneier has been the loudest sceptic, telling The Tech Report that the wave of coverage is mostly “marketing hype” and writing on his blog that “this is very much a PR play by Anthropic — and it worked. 
Lots of reporters are breathlessly repeating Anthropic's talking points without engaging with them critically.” A widely circulated paper from AISLE Security reproduced several of Mythos' headline findings using smaller, cheaper open-source models when given tight scaffolding — file paths, bug-class hints, suspected sinks. The UK AI Security Institute has separately judged OpenAI's GPT-5.5 broadly comparable on cyber benchmarks. The implication is uncomfortable for Anthropic's containment argument: if Mythos-class capability is within reach of a determined attacker using already-shipped models plus elbow grease, then withholding Mythos buys defenders a head start measured in months, not years. CETaS, the Centre for Emerging Technology and Security at the Alan Turing Institute, lands somewhere in the middle. In its analysis Claude Mythos: What Does Anthropic's New Model Mean for the Future of Cybersecurity?, CETaS director Sacha Babuta credits Mythos with “major improvements in mathematics, cyber security, software engineering and automated vulnerability detection”, while pressing the policy question Anthropic's blog post conspicuously avoids: who, exactly, decides when a Mythos-class model is safe to release, and on what evidence? Today the answer is Anthropic. That is a governance position no jurisdiction has explicitly granted and none has yet contested.
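The ratios quoted in this section can be re-derived from the raw counts. The sketch below uses only figures given in the text. The helper name `share` is introduced here for illustration, and the Cloudflare counts are the rounded figures the company reported.

```python
def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, 1)


# Counts as quoted in the text above.
oss_sweep_high_critical = share(6_202, 23_019)   # Anthropic's 1,000-project sweep
cloudflare_high_critical = share(400, 2_000)     # Cloudflare's approximate figures
cvd_patched = share(97, 1_596)                   # CVD dashboard: patched findings
cvd_with_identifier = share(88, 1_596)           # CVD dashboard: CVE or GHSA assigned

print(oss_sweep_high_critical, cloudflare_high_critical, cvd_patched, cvd_with_identifier)
```

Read together, the numbers separate two things the headline count blurs: how much of the raw output is judged serious (roughly a quarter), and how much of the filed backlog has actually been patched so far (about 6%).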

·04Why CISOs Should Care This Quarter

The operational message for a DAX40 or FTSE 100 security organisation is narrower and more urgent than the policy debate. First, the FreeBSD advisory FreeBSD-SA-26:08.rpcsec_gss should be treated as drop-everything. Any internet-exposed NFS surface on a FreeBSD-derived stack — and there is more of that in European telco, media and gaming infrastructure than CISOs typically acknowledge — is, until patched, a single-packet path to unauthenticated kernel root. Second, the Glasswing CVD ledger is now a forward indicator of what attackers will be reverse-engineering as disclosure windows close; patch cadence on the projects listed at red.anthropic.com/2026/cvd should be tracked as a named KPI by the vulnerability-management function. Third, the asymmetry argument cuts both ways inside the enterprise: if defenders only get Mythos-class capability through a vendor-mediated coalition, then board-level conversations about cyber spend now need a line item for “AI-assisted defensive testing” that did not exist in the FY26 budget. The honest framing for a CISO this quarter is that the floor on attacker capability has moved, the ceiling on defender capability is now gated by an Anthropic partnership, and no regulator has yet written a rule that touches either.

Three Perspectives What this story means for different readers

For enterprise security leaders, Glasswing creates a two-tier market overnight. Partners — AWS, Microsoft, Google, JPMorgan, Cisco, Palo Alto Networks, CrowdStrike — get a months-long head start to harden their own code and the OSS they depend on. Everyone else inherits the patches after the disclosure window closes, in the same window in which attackers reverse-engineer them. Procurement teams should ask vendors three concrete questions: are you a Glasswing partner; if not, what equivalent AI-assisted assurance program are you running; and how quickly do you commit to patching once a Mythos-attributed CVE lands in your stack? Expect Glasswing partner status to start appearing in RFP responses within two quarters.

Anthropic is currently the de facto regulator of Mythos-class capability — it decides who gets access, on what terms, and when (if ever) public release happens. The EU AI Act's systemic-risk obligations and the UK AI Security Institute's pre-deployment testing regime both touch this, but neither was written for a model whose primary externality is asymmetric cyber capability rather than misuse content. Expect the Bundesamt für Sicherheit in der Informationstechnik and ENISA to push for visibility into the Glasswing CVD pipeline. Brussels will likely treat Mythos as a test case for how the AI Office handles a model that a frontier lab voluntarily withholds from market — a precedent with implications well beyond cyber.

The application-security category just got compressed. SAST and DAST vendors whose value proposition is “we find vulnerabilities humans miss” face an awkward conversation when a frontier model produces 6,202 high/critical findings across 1,000 OSS projects in a single sweep. The defensible plays shift to triage and orchestration: integrating Mythos-class output into developer workflows, prioritisation, patch generation, and verification at scale. Expect a wave of “AI SOC” and “AI AppSec” rounds in H2 2026, plus accelerated M&A as incumbents like Snyk, Checkmarx and Veracode buy their way into the new stack. The contrarian bet is on offensive-security boutiques that pair human creativity with model output — the human-in-the-loop pattern Califio used on CVE-2026-4747 is the template.


03 / 04 · Markets & FinOps

8 min read

The ghost-token economy: why AI bills break Q3 reforecasts

Azeem Azhar's Jevons-paradox data shows AI spend has gone structurally non-linear, and DAX40 finance teams are forecasting on the wrong unit.

·01Primer

A “token” is the unit AI models bill on, roughly three-quarters of a word. Every time a model reads input or writes output, it consumes tokens, and customers pay per million. For two years the per-token price has fallen sharply. The puzzle is that enterprise AI invoices have risen sharply at the same time. Azeem Azhar's Exponential View this week names the mechanism: a Jevons paradox. When something useful gets cheaper, people use much more of it, enough that the total bill grows even as the unit price shrinks. The twist for AI is that modern “agents” silently consume tokens the user never sees: re-reading context, calling tools, retrying steps, generating private reasoning. Azhar estimates active inference is only 15–20% of total token spend. The remaining 80–85% is what he calls ghost tokens.

·02What Happened

On the morning of Monday 25 May, a FinOps lead at a DAX40 industrial firm opened the Q2 AWS Bedrock invoice and saw a line item she had not modelled: a coding-agent service her platform team enabled in March was running at roughly 11x its budget envelope. The headcount using it had not changed. The per-token price from Anthropic was, if anything, lower than in Q1. What had changed was that engineers had moved from one-shot chat to multi-turn agents — and each agent run was quietly re-reading its entire context every turn, dispatching tool calls, and looping until a build passed. The same Monday, Azeem Azhar and William Gildea published “Why AI bills rise as costs fall” in Exponential View, putting numbers to what FinOps teams across Europe were starting to see in their dashboards. Their headline figure: token volume processed globally has grown roughly 17,000-fold over four years, even as per-token prices collapsed. Demand for machine intelligence, they argue, is highly elastic — cheaper tokens have made agents economically viable, and agents consume tokens at rates “orders of magnitude higher than those of chatbots.” A coding agent operating over ten turns may re-read its full context every turn, burning as much as 55x more tokens than a single-turn query for the same task. Citing a TrueFoundry survey, Azhar puts “actual active inference” — the model thinking about your question — at only 15–20% of total token consumption. The rest is invisible work: tool calls, retrieval, reasoning traces, retries. “All of these steps consume tokens,” Azhar writes. “They become hidden multipliers.” The pivot for CFOs is not that tokens are expensive. They are not. The pivot is that aggregate spend has decoupled from anything a 2026 budget plan can reasonably forecast. Per-seat pricing — the model finance teams quietly imported from the SaaS playbook — assumes a roughly bounded user. An agent is not a bounded user. 
It is a process that can, if you let it, spend $1,400 in a single Claude Code session, as one engineering manager told Pragmatic Engineer's Gergely Orosz in his 30 April survey of fifteen companies. Orosz's interviewees described token spend rising roughly 10x in six months with no sign of slowing. One Series-A founder reported per-developer monthly spend climbing from $200 to $3,000 in half a year. “The cost in token spend is off the charts,” a staff engineer at a 5,000-person fintech told him, “and leadership has shared this trend with us. They have not said anything beyond showing growth in spend, and mentioning that this won't be sustainable.” Azhar's contribution is to put a clean economic frame on the same field reports: this is not waste, it is elasticity meeting amplification. Cheaper tokens plus ghost-token multipliers plus agentic adoption equal expensive aggregates.

·03The Numbers

Three numbers carry the argument. The first is the 17,000x growth in tokens processed per quarter over four years, much of it driven, Azhar notes, by Chinese demand and providers such as ByteDance and Alibaba. The second is the 55x amplification: a coding agent run over ten turns, re-reading its context each turn, consumes 55 times more tokens than a single-turn query for the same task. The third is the 15–20% — the share of total token spend that is actual inference. The remainder is the ghost stack: tool calls (anywhere between five and twenty-five per task, according to GrisLabs and arXiv data Azhar cites), repeated context loads, hidden retrieval, reasoning traces, and retries when a tool call returns garbage. The economic analogue is older than the cloud. In 1865, William Stanley Jevons observed that more-efficient steam engines did not reduce British coal demand — they expanded it, because cheaper steam unlocked uses that had previously been uneconomic. The same logic ran through electricity post-WWII (real prices fell, household consumption rose more than tenfold), and through cloud compute in the 2010s (AWS unit prices dropped roughly 70% over a decade while enterprise cloud bills tripled). The AI curve is steeper than either. Gergely Orosz's 15-company sample puts hard observation behind it: at one publicly-traded infrastructure firm, the engineering director told him API budget limits had been raised “multiple times in April” alone. At a 2,000-person finance company, the initial cap of $100 per user was being exhausted in three to five working days. At a US healthcare firm, one engineer spent $1,400 on a single Claude Code session — and the company kept a monthly spend leaderboard to encourage it. 
Ed Zitron, in the more skeptical register of his 19 May piece “AI Is Too Expensive,” reports that Zillow spent over $1 million on AI in Q1 2026 and $749,000 in April alone across Cursor, Anthropic, and AWS Bedrock — and that Anthropic is currently subsidizing Pro and Max subscriptions to the point of “burning 8x-13.5x their fees in tokens.” Zitron's read is that the subsidy will end, prices will rise, and the bill will spike again. Azhar's read is that even without a price reset, the elasticity dynamic alone produces an aggregate cost curve that does not fit inside annual budget plans built on per-seat assumptions. Both reads point at the same forecasting failure. The implication for DAX40 boards entering their Q3 reforecast cycle is concrete: per-seat AI pricing is no longer a defensible planning input. The unit of forecast must shift to agent-execution budgets — a hard cap on tokens per agent run, per workflow, per business process — with model-routing rules that send trivial tasks to cheaper Sonnet- or open-source-class models and reserve frontier Opus-class spend for tasks where a single error in production would cost more than the inference.
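The 55x figure is consistent with simple arithmetic, under the assumption that every turn adds the same amount of context and each turn re-reads everything so far: 1 + 2 + ... + 10 = 55. The source does not spell out its derivation, so treat this as one plausible reading. The function names below are hypothetical, and the ghost-token multiplier is just the reciprocal of the 15–20% active-inference share quoted above.

```python
def tokens_with_full_reread(turns: int, tokens_per_turn: int) -> int:
    """Total tokens if turn k re-reads all k turns of context accumulated so far.

    Assumes every turn contributes the same number of tokens (a simplification).
    """
    if turns < 1 or tokens_per_turn < 0:
        raise ValueError("turns must be >= 1 and tokens_per_turn >= 0")
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def amplification(turns: int) -> float:
    """Token use of a multi-turn run relative to a single turn of the same size."""
    return tokens_with_full_reread(turns, 1) / tokens_with_full_reread(1, 1)


def ghost_multiplier(active_share: float) -> float:
    """Total tokens billed per token of active inference, given the active share."""
    if not 0 < active_share <= 1:
        raise ValueError("active_share must be in (0, 1]")
    return 1 / active_share


print(amplification(10))        # 1 + 2 + ... + 10 = 55 single-turn equivalents
print(ghost_multiplier(0.20))   # upper end of the quoted active share
print(ghost_multiplier(0.15))   # lower end of the quoted active share
```

Under this model, amplification grows quadratically with turn count, which is why a per-run token cap matters more than a per-seat price.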

·04The Forecast-Variance Problem

What makes ghost tokens a finance problem rather than an engineering problem is the variance, not the mean. A traditional SaaS line item is roughly Gaussian around a per-seat mean; an agent line item is fat-tailed. The Pragmatic Engineer data shows this directly: at one infrastructure company a single engineer hit $10,000 in a week from a caching bug; at another, daily per-developer spend ranges from $200 to $500 without anyone's job description changing. A CFO building a 2026 plan needs not just an expected value but a 95th-percentile envelope, and the 95th percentile of agent spend cannot yet be pinned down because adoption is still climbing and amplification factors are still being discovered. Three controls map cleanly onto this. First, separate the inference budget from the headcount budget: agents are not seats, and treating them as seats hides the variance. Second, instrument at the workflow level, not the user level: a single business process (close-the-books, ticket triage, code review) should carry its own token meter and a hard cap, so a runaway loop is bounded by policy rather than goodwill. Third, negotiate pooled-spend contracts. Orosz reports that Cursor is willing to offer tiered discounts above roughly $1 million in annual spend, and that some vendors now offer “pooled spend” arrangements that absorb heavy individual users without forcing them onto stricter per-seat limits. CFOs should be asking for that structure explicitly in Q3 procurement reviews.
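The second control, a per-workflow token meter with a hard cap, and the 95th-percentile envelope can both be sketched briefly. This is a minimal illustration, not a reference to any vendor's gateway API: `WorkflowBudget`, `BudgetExceeded` and `percentile_nearest_rank` are hypothetical names, and the daily-spend sample is invented to show how a fat tail separates the mean from the percentile.

```python
import math
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push a workflow past its hard token cap."""


@dataclass
class WorkflowBudget:
    name: str
    token_cap: int
    used: int = 0

    def charge(self, tokens: int) -> int:
        """Record token usage and return the remaining allowance.

        Refuses the charge (and records nothing) if it would exceed the cap.
        """
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used + tokens > self.token_cap:
            raise BudgetExceeded(f"{self.name}: cap of {self.token_cap} tokens reached")
        self.used += tokens
        return self.token_cap - self.used


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values or not 0 < pct <= 100:
        raise ValueError("need a non-empty sample and 0 < pct <= 100")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Invented sample: 18 ordinary days, one heavy day, one runaway-loop day.
daily_spend = [200.0] * 18 + [500.0, 10_000.0]
mean_spend = sum(daily_spend) / len(daily_spend)
p95_spend = percentile_nearest_rank(daily_spend, 95)
print(f"mean={mean_spend:.0f} p95={p95_spend:.0f} max={max(daily_spend):.0f}")

# A runaway loop is stopped by policy once the workflow's cap is reached.
budget = WorkflowBudget(name="code-review", token_cap=1_000)
budget.charge(600)
try:
    budget.charge(500)
except BudgetExceeded as exc:
    print("blocked:", exc)
```

In the invented sample a single outlier drags the mean far above a typical day, which is the forecasting problem in miniature: the cap bounds the tail, and the percentile describes what remains.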

Three Perspectives What this story means for different readers

For European enterprises now mid-cycle on 2026 plans, the practical move is to retire per-seat AI pricing as a forecasting input before the Q3 reforecast lands. DAX40 platform teams should publish an agent-execution-budget standard: tokens per workflow, per business process, with hard caps enforced at the gateway layer rather than left to engineer discipline. Procurement should renegotiate enterprise agreements toward pooled-spend models with tiered discounts above the seven-figure mark, and CFOs should expect to add a new variance line to AI cost forecasts — not just expected value but a 95th-percentile envelope. The half-life of any current AI budget assumption is roughly one quarter; planning cycles need to compress to match.

Ghost tokens cut across at least two open regulatory files. Under the EU AI Act's general-purpose AI obligations, providers must disclose training compute; downstream deployers will increasingly be asked, by auditors and by the AI Office, to disclose inference compute and energy per agent run — a number most enterprises currently cannot produce. ESG reporting under CSRD compounds the problem: Scope 3 disclosures will need to incorporate inference emissions, and ghost tokens make that number 4-7x larger than a naive per-prompt estimate suggests. Expect the Bundesnetzagentur and BaFin to ask financial-services firms running agentic workflows for documented token-budget controls, by analogy to algorithmic-trading kill-switch rules. The compliance burden is non-trivial and falls on the deployer, not the model provider.

Sapphire Ventures' 2026 outlook flags that 57% of organisations now run agents in production (67% at large enterprises) and notes the rise of small and open-source models that can be one-to-two orders of magnitude cheaper per token than frontier closed-source offerings — the textbook bull case that the curve flattens through routing. Sequoia and a16z have made similar arguments through 2026. The bear case, articulated by Ed Zitron, is that current pricing is a venture-subsidised mirage: Anthropic is reportedly burning 8x-13.5x its subscription fees in delivered tokens, and once the subsidy ends, prices reset upward at the same time enterprises are most exposed. For startup founders, the planning implication is symmetric to the enterprise one: build the cost-control plumbing now, while tokens are cheap, so that gross margin does not invert when pricing normalises.

04 / 04 · Frontier Labs & Capex

8 min read

Anthropic Eyes Microsoft's Maia 200, Becoming a Four-Silicon Lab

If signed, the Azure deal would make Claude the first frontier customer of Microsoft's custom chip, and turn enterprise compute strategy into a three-axis problem.

·01Primer

Anthropic, the maker of Claude, is in early talks to rent Microsoft Azure servers powered by Microsoft's own custom AI chip, the Maia 200, according to a CNBC report on May 21, 2026. If the deal is signed, Anthropic would become the first major outside lab to run on Microsoft's homegrown silicon — a milestone Microsoft has been chasing for more than two years. The discussions follow a $5 billion investment Microsoft made in Anthropic in November 2025 and a separate $30 billion commitment from Anthropic to buy Azure compute. Anthropic already trains and serves Claude across three different chip families — Nvidia GPUs, Amazon's Trainium, and Google's TPUs. Adding Maia 200 would make it a four-silicon lab. For enterprise buyers, that turns the once-bundled question of “which AI vendor?” into three separate decisions: lab, hyperscaler, and chip.

·02What Happened

Inside a server hall in Des Moines, Iowa, racks of Microsoft's Maia 200 accelerators have been quietly serving OpenAI's GPT-5.2 traffic since the chip's January launch. According to CNBC's Jordan Novet, who broke the story on May 21, those same racks — and a planned expansion near Phoenix — are now the subject of a commercial conversation with Anthropic. Two people familiar with the talks told CNBC that Anthropic is considering leasing Azure capacity backed specifically by Maia 200 for Claude inference workloads. No contract has been signed, and both companies declined formal comment. The negotiation is the logical next step in a relationship that, six months ago, did not exist. Microsoft and Anthropic spent most of the 2023–2025 generative AI cycle as polite strangers, separated by Microsoft's exclusive depth with OpenAI. That changed in November 2025, when Satya Nadella announced a $5 billion equity investment in Anthropic alongside an Azure availability deal that immediately put Claude in front of every Microsoft Foundry customer. In the same week, Anthropic committed to spending $30 billion on Azure compute — a figure that, by itself, would have made the company a top-five Azure customer. The Maia 200 conversation puts a hardware floor under that commercial relationship. Microsoft's chip, unveiled by Cloud + AI executive vice president Scott Guthrie in January 2026, is an inference-class accelerator built on TSMC's 3-nanometer process: 140 billion transistors, 216 GB of HBM3e memory, native FP4 and FP8 tensor cores, and a 750-watt thermal envelope. On Microsoft's most recent earnings call, Nadella told analysts the chip delivers “over 30% improved tokens per dollar, compared to the latest silicon in our fleet” — language he has repeated at every public appearance since. The narrative pivot, though, is not the chip. It is Anthropic. Until this week, the strongest sceptical read on the Maia program was that it had no anchor tenant outside Microsoft's own walls. 
Maia 100, launched in late 2023, became known in semiconductor circles for two things: a tepid commercial reception and a six-month production delay that pushed broad availability into 2026. The chip mostly powered internal Copilot inference, with OpenAI lukewarm and external labs uninterested. A signed Anthropic deal would change that story in one stroke — not because Maia 200 suddenly out-benchmarks Nvidia's Rubin (it does not, on most published numbers), but because it would mean a frontier lab judged the chip good enough to take a structural position. The comparison veterans of the cloud era will reach for is AWS's 2015 acquisition of Annapurna Labs, which seemed marginal at the time and ended up underwriting both Graviton and Trainium. Microsoft has been hoping Maia would play the same role. Anthropic, by becoming the first external believer, would supply the validation Annapurna only earned in retrospect.

·03The Multi-Silicon Architecture

To understand why this matters, look at Anthropic's compute stack the way its infrastructure team does. The company already runs Claude across three distinct chip architectures. AWS Trainium is the primary training substrate: in April 2026, Anthropic signed a ten-year, $100-billion-plus commitment with Amazon for more than a million Trainium2 chips and roughly a gigawatt of Trainium2 and Trainium3 capacity by year-end. Google TPUs are the second leg: an October 2025 agreement brings one million TPUv7 “Ironwood” accelerators online during 2026, with a follow-on Anthropic–Google–Broadcom arrangement in April 2026 adding roughly 3.5 gigawatts of TPU capacity from 2027. Nvidia GPUs are the third: the default training and inference fabric that gives Claude portability across every cloud and on-prem environment that matters. Maia 200 would be the fourth. No frontier lab has ever run on four chip families simultaneously. OpenAI is trying to engineer a similar optionality through its own custom-silicon roadmap with Broadcom, but those parts are not yet in production. Meta's MTIA program serves internal recommendation workloads and has no external customer. Google's TPU is a walled garden by design. Amazon's Trainium has so far attracted exactly one frontier tenant: Anthropic itself. Against that map, the strategic logic of the Maia talks comes into focus. Anthropic is not chasing the cheapest tokens per dollar in a single quarter; it is engineering structural cost optionality across vendors that all compete with each other for its workloads. Each new silicon partner adds a credible threat in pricing conversations with the others. The economics also pencil out. Microsoft's published Maia 200 numbers (10 petaFLOPS at FP4, 5 petaFLOPS at FP8, 2.8 TB/s of scale-up bandwidth, clusters of up to 6,144 accelerators on standard Ethernet) put it in the same conversation as Trainium3 and TPUv7 for inference, while still well behind Nvidia Rubin for training.
That fits Anthropic's apparent intent. The CNBC report describes the discussions as focused on inference, where Claude's commercial traffic is growing fastest and where Maia 200's tokens-per-dollar pitch is most directly testable. Inference is also where Microsoft has the strongest near-term need: GPT-5.2 already runs on Maia 200 in Iowa, and adding Claude traffic would lift utilisation on a fleet that is still scaling out. For the wider market, the second-order effect is on Nvidia. Jensen Huang's company still holds north of 90% of AI training compute, and Nadella was at pains to remind investors in January that Microsoft “won't stop buying” from Nvidia or AMD. But the more frontier labs that diversify, the more Nvidia's pricing power becomes a negotiated variable rather than a fixed cost. Maia 200, by itself, does not threaten that position. Maia 200 with Anthropic on it, advertised to every Azure Foundry customer who wants Claude, is a different proposition.
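
A tokens-per-dollar claim is easy to misread as an equal cut in cost, so a short worked conversion may help. The sketch below assumes the "over 30%" figure quoted from Nadella is a gain in tokens per dollar; the baseline of 200,000 tokens per dollar is a made-up number used only to show the arithmetic, not a published price.

```python
def cost_per_million_tokens(tokens_per_dollar: float) -> float:
    """Dollar cost of one million tokens at a given tokens-per-dollar rate."""
    if tokens_per_dollar <= 0:
        raise ValueError("tokens_per_dollar must be positive")
    return 1_000_000 / tokens_per_dollar


def cost_reduction_from_gain(gain: float) -> float:
    """Fractional drop in cost per token for a fractional gain in tokens per dollar.

    A gain g multiplies tokens per dollar by (1 + g), so cost per token is
    divided by (1 + g), a reduction of 1 - 1 / (1 + g).
    """
    if gain <= -1:
        raise ValueError("gain must be greater than -1")
    return 1 - 1 / (1 + gain)


if __name__ == "__main__":
    baseline = 200_000            # hypothetical tokens per dollar on older silicon
    improved = baseline * 1.30    # the quoted 30% gain, taken at face value
    print(round(cost_per_million_tokens(baseline), 2))
    print(round(cost_per_million_tokens(improved), 2))
    print(round(cost_reduction_from_gain(0.30) * 100, 1))
```

Under these assumptions a 30% gain in tokens per dollar lowers the cost of a million tokens from $5.00 to about $3.85, a reduction of roughly 23%, not 30%. That gap is worth keeping in mind when vendor throughput claims are translated into budget lines.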

Three Perspectives What this story means for different readers

For CIOs writing five-year AI compute contracts, the Anthropic–Maia conversation collapses a comforting fiction. “Multi-cloud” has, until now, been treated as a hedge against vendor lock-in at the hyperscaler layer. The new reality is that the chip underneath the cloud — and the lab on top of it — are independent variables. Buying Claude on Azure-Maia, Claude on AWS-Trainium, and Claude on GCP-TPU are now three different commercial and technical decisions, with different latency profiles, different token economics, and different exposure to silicon supply shocks. Procurement teams that modelled AI spend as a single line item will need to add at least two more axes. The upside is real negotiating leverage. The downside is that benchmarking and governance frameworks now need to track lab-version, cloud-region, and silicon-generation as separate dimensions.
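
One way to picture "separate dimensions" is a benchmark log keyed by all three at once. The sketch below is a hypothetical data structure, not an existing governance tool: the class names, the version and region labels and the latency figures are placeholders invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class DeploymentKey:
    # The three axes named in the text; the values used below are placeholders.
    lab_version: str         # which model release is being served
    cloud_region: str        # which hyperscaler region serves it
    silicon_generation: str  # which chip family sits underneath


class BenchmarkLog:
    """Keeps latency samples separate for every (lab, cloud, silicon) combination."""

    def __init__(self) -> None:
        self._latencies_ms = defaultdict(list)

    def record(self, key: DeploymentKey, latency_ms: float) -> None:
        self._latencies_ms[key].append(latency_ms)

    def mean_latency(self, key: DeploymentKey) -> float:
        samples = self._latencies_ms.get(key)
        if not samples:
            raise KeyError(key)
        return sum(samples) / len(samples)

    def keys_for_lab(self, lab_version: str) -> list:
        """All deployments of one model release, across clouds and chips."""
        return sorted(k for k in self._latencies_ms if k.lab_version == lab_version)


if __name__ == "__main__":
    log = BenchmarkLog()
    on_maia = DeploymentKey("model-v1", "azure-region-a", "maia-200")
    on_trainium = DeploymentKey("model-v1", "aws-region-b", "trainium2")
    log.record(on_maia, 410.0)
    log.record(on_maia, 390.0)
    log.record(on_trainium, 450.0)
    print(log.mean_latency(on_maia))
    print(len(log.keys_for_lab("model-v1")))
```

Keeping the key as a three-field tuple rather than a single "vendor" string means the same model release can be compared across chips without results from different silicon being averaged together.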

The deal, if it closes, lands in an already-crowded regulatory inbox. The FTC's open inquiry into hyperscaler investments in frontier labs — initially focused on Microsoft–OpenAI and Amazon–Anthropic — will now have a second Microsoft–Anthropic data point to weigh, layered on top of the $5 billion equity stake from November. European Commission staff have flagged similar concerns under the Digital Markets Act's gatekeeper provisions, particularly around tying frontier model access to specific cloud or silicon stacks. The Maia angle complicates the analysis: a lab running on four chip families is structurally harder to characterise as captive to any single hyperscaler, which may actually help Anthropic's regulatory posture even as the gross dollar commitments climb. Expect the CMA in the UK to ask similar questions, focused on whether bundled Claude–Maia pricing on Azure forecloses competition.

For investors funding the AI infrastructure stack, the message is that the custom-silicon thesis just got its first frontier-lab data point. Broadcom, the design partner behind Google's TPU and most of the announced OpenAI and Meta custom chips, has been priced for that future for a year; Anthropic-on-Maia gives the thesis a concrete commercial reference. The harder question is for the GPU-cloud neoclouds — CoreWeave, Lambda, Crusoe, Nebius — whose entire pitch is Nvidia capacity at scale. If frontier labs shift even 20% of inference workloads to hyperscaler custom silicon, the neocloud TAM compresses. Conversely, inference-tooling startups (vLLM commercial forks, SGLang vendors, the model-serving layer) become more valuable, because portability across four silicon families is now a paid feature, not a research curiosity. Watch Series B and C decks pivot accordingly over the next two quarters.

Simulate real-world places with Project Genie and Street View (Google DeepMind blog, May 19, 2026)

Google DeepMind connected Project Genie, its general-purpose world model, to nearly two decades of Street View imagery (about 280 billion photos across 110 countries), letting the model anchor generated, navigable 3D environments to real-world locations rather than purely synthetic ones; Genie 3 is already powering one of Waymo's robotaxi simulators for rare events such as tornadoes or animals on the road. Why it matters: for enterprises and consultancies building in robotics, autonomous mobility, logistics and field operations, this is the first credible signal that sim-to-real training pipelines can run on cheap synthetic data anchored in proprietary imagery moats — and a reminder that Google's 20-year Street View archive is a durable, hard-to-replicate competitive asset in the physical-AI race.

Google's James Manyika is betting that doomers are wrong about AI and jobs (Casey Newton, Platformer, May 19, 2026)

In a long-form interview pegged to Google I/O, Alphabet SVP James Manyika argues that AI is automating tasks faster than it is replacing jobs, distinguishing between routine tasks (highly automatable) and bundled roles (much stickier), and pushing back on Silicon Valley's mass-displacement narrative with data from his McKinsey, UN and White House work. Why it matters: for enterprise and consulting leaders facing board-level pressure to shed headcount on the back of AI productivity claims, Manyika's task-vs-job framing offers a more defensible workforce-planning model than blanket automation targets — and signals that the most senior policy voice inside a hyperscaler is preparing the political ground for a reorganise-not-downsize message that boards and works councils will quickly adopt.
