Daily AI Briefing · Tuesday, 12 May 2026

01 / 05 · Research & Open Source

8 min read

DeepMind’s AI co-mathematician cracks a 60-year-old open problem

An agentic system built on Gemini 3.1 set a new high on FrontierMath Tier 4 — and handed an Oxford professor the missing idea inside a proof its own reviewers had rejected.

·01Primer

Google DeepMind has built an AI system that does not just answer math questions — it works alongside professional mathematicians on problems no human has solved. Called the “AI co-mathematician” and powered by Gemini 3.1, it is not a single chatbot but a workbench: a team of specialised AI agents that propose proofs, hunt for counter-examples, search the literature, and critique each other’s work, while the human researcher steers. On a benchmark of research-level problems used to stress-test the frontier, it set a new high. More importantly, it helped an Oxford professor close an open question in group theory that had sat unsolved for decades. The signal for industry: agentic research tools are no longer toys for benchmarks. They are starting to do something that looks like science.

·02What Happened

Marc Lackenby, a topologist at the University of Oxford, typed Problem 21.10 from the Kourovka Notebook — a famously stubborn list of open questions in group theory, first compiled in 1965 — into a private DeepMind workbench. The system did not answer. Instead, it spawned two parallel workstreams: one trying to prove the conjecture, another trying to break it. Within minutes the “prover” returned a candidate argument. Within minutes more, the workbench’s own reviewer agent flagged a hole in the logic and threw the proof out. That should have been the end of it. But Lackenby, scrolling through the rejected draft, stopped at one of the discarded lemmas. The reviewer was right that the proof did not work as written, but the strategy was, in his words, “really, really clever.” He realised he knew how to fill the gap. A problem that had defeated specialists for 60 years closed that afternoon. DeepMind published the system, internally known as Aletheia, in a paper led by Pushmeet Kohli, VP of Research at DeepMind, who framed it not as a replacement for mathematicians but as a “collaborator that takes intellectual risks the user can audit.” The architecture matters as much as the result. Aletheia is a stateful, asynchronous environment in which a top-level project-coordinator agent breaks an open problem into sub-tasks, dispatches them to specialised sub-agents — provers, refuters, literature scouts, computational explorers — and routes every output through a reviewer that can reject, revise, or escalate. Failed hypotheses are not deleted; they are logged, annotated, and offered back to the human. On Epoch AI’s FrontierMath Tier 4, a 50-problem set the benchmark’s authors described as “designed to remain unsolved by AI for decades,” the system scored 48%. The previous public state of the art, set by GPT-5.4 Pro in March, was 38%. A year earlier it was 19%. 
To put that in perspective, FrontierMath Tier 4 is calibrated against early-postdoc mathematics; the IMO gold-medal results that grabbed headlines in 2025 were, by comparison, earned on high-school problems. The pivot the Lackenby episode forces is uncomfortable. The most valuable output of the system, on the day it broke a 60-year-old problem, was a proof its own quality control had thrown in the bin.

·03Architecture

Aletheia is striking less for any single capability than for what it stops doing. It does not try to one-shot answers. It does not optimise for a single chain of thought. It treats a research question the way a research group does — with parallel attempts, internal disagreement, and an editor in the loop.

The base model is Gemini 3.1. On an internal benchmark of 100 research-level problems with code-checkable answers, Gemini 3.1 Pro alone scores 57%. Gemini 3.1 Deep Think, the longer-horizon reasoning variant, reaches 70%. Wrapped in the Aletheia agent harness, the score jumps to 87%. The lift from the harness, in other words, is comparable to the lift from a full generation of base-model improvement. That is the engineering claim DeepMind is making, and it is the one enterprise buyers should read most carefully: the next leg of capability gains may come less from bigger models than from how those models are orchestrated.

The workbench’s mechanics are concrete. A project coordinator decomposes the user’s problem into a directed graph of sub-goals. Worker agents — instances of Gemini 3.1 with specialised system prompts and tool access — attempt each sub-goal in parallel, sometimes pursuing contradictory strategies on purpose. A reviewer agent, fine-tuned to find errors rather than produce them, gates every output. Outputs that survive get composed into a LaTeX draft, with provenance notes and margin annotations showing which agent produced which step and what was rejected along the way. Failed branches are preserved and surfaced. That last detail is what made Lackenby’s discovery possible: the rejected proof was still there, with the reviewer’s critique attached. On FrontierMath Tier 4, Aletheia solved 23 of 48 non-public sample problems in fully autonomous mode — no human nudging mid-run.

The contrast with the earlier era of automated theorem proving is instructive. In 1996, the Robbins conjecture, an algebra problem open since 1933, was settled by EQP, a custom prover at Argonne National Lab, after weeks of compute on a single, narrowly-scoped problem. Aletheia is a general-purpose research environment that the user can point at almost any pure-maths question and expect non-trivial engagement. The catch is that, like any peer reviewer, Aletheia is wrong sometimes — and so are its reviewers. Lackenby’s experience is a feature only if the human in the loop is good enough to overrule the machine’s verdict.
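The coordinator, worker and reviewer loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not DeepMind's implementation: every name here (`Attempt`, `Workbench`, `coordinator`, `prover`, `refuter`, `reviewer`, `run`) is hypothetical, the workers return canned strings instead of calling a model, and the reviewer's acceptance rule is an arbitrary stand-in. What it tries to show is the design choice the article credits for Lackenby's discovery: rejected attempts are stored with their critique rather than deleted.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Attempt:
    sub_goal: str
    agent: str            # provenance: which worker produced this attempt
    text: str
    accepted: bool = False
    critique: str = ""


@dataclass
class Workbench:
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)  # failed branches are kept

    def record(self, attempt: Attempt) -> None:
        target = self.accepted if attempt.accepted else self.rejected
        target.append(attempt)


def coordinator(problem: str) -> list:
    # Hypothetical decomposition; a real coordinator would build a sub-goal graph.
    return [f"{problem}: lemma A", f"{problem}: lemma B"]


def prover(sub_goal: str) -> Attempt:
    return Attempt(sub_goal, "prover", f"candidate proof of {sub_goal}")


def refuter(sub_goal: str) -> Attempt:
    return Attempt(sub_goal, "refuter", f"counter-example search for {sub_goal}")


def reviewer(attempt: Attempt) -> Attempt:
    # Stand-in rule (assumption): only the prover's work on lemma A passes.
    if attempt.agent == "prover" and attempt.sub_goal.endswith("lemma A"):
        attempt.accepted = True
    else:
        attempt.critique = "gap found; kept for human review"
    return attempt


def run(problem: str) -> Workbench:
    bench = Workbench()
    # Prover and refuter attack every sub-goal in parallel, on purpose.
    jobs = [(worker, goal) for goal in coordinator(problem)
            for worker in (prover, refuter)]
    with ThreadPoolExecutor() as pool:
        attempts = list(pool.map(lambda job: job[0](job[1]), jobs))
    for attempt in attempts:
        bench.record(reviewer(attempt))  # every output passes the gate
    return bench


if __name__ == "__main__":
    bench = run("Problem X")
    for attempt in bench.rejected:
        print(attempt.agent, "|", attempt.sub_goal, "|", attempt.critique)
```

The human-facing part is the `rejected` list: in this sketch a researcher can still read each discarded attempt next to the reviewer's reason for discarding it.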

Three Perspectives What this story means for different readers

01

For corporate R&D leaders, the Lackenby story is the template. The system did not replace a domain expert; it gave one a strategy he would not have generated alone, inside a draft the system’s own quality gate had rejected. That is the realistic shape of agentic research inside a pharma discovery team, a materials lab, or a quant desk: parallel hypotheses, internal critique, and provenance trails that let a senior scientist audit the reasoning, not just the answer. The 30-point lift from base model to agent harness is the headline number for procurement. It implies that the buyers who win the next two years will be the ones who invest in orchestration — agent topologies, reviewer models, verifier loops, tool integrations — rather than the ones who simply queue up for the next Gemini or Claude. Expect Accenture, McKinsey QuantumBlack and Big Four AI practices to repackage the Aletheia pattern as a consulting offer within the quarter.

02

The EU AI Office’s full enforcement powers under the AI Act activate on 2 August 2026, including the regime for general-purpose AI models with systemic risk — a threshold Gemini 3.1 comfortably crosses. The Commission’s February 2026 guidelines on high-risk classification say little about scientific-research deployments, which sit in an awkward middle ground: not consumer-facing, not safety-critical in the AI Act’s narrow sense, yet capable of producing outputs that downstream products will rely on. A proof generated by an agent, verified by another agent, and published under a human author’s name raises authorship, liability and disclosure questions that current rules barely touch. Brussels has signalled that the Scientific Panel inside the AI Office will start issuing opinions on research-use GPAI in the second half of the year. Enterprise legal teams should assume that provenance logs, of the kind Aletheia already produces, will become evidentiary defaults.

03

The venture thesis around agentic research is now testable rather than speculative. FutureHouse, the Eric Schmidt-backed non-profit, has shipped a public platform of literature and chemistry agents — Crow, Falcon, Owl, Phoenix — and is the closest open analogue to Aletheia’s design. Sakana AI in Tokyo, fresh off a Series B, claims the first peer-reviewed paper written end-to-end by its AI Scientist-v2. Lila Sciences, the Flagship Pioneering spinout, is raising on a closed-loop wet-lab autonomy pitch. The DeepMind paper is bullish for all of them in one sense — it validates the multi-agent thesis — and bearish in another: it sets a quality bar that frontier-lab orchestration plus a frontier base model will keep raising. Startups without privileged model access will compete on domain-specific verifiers, tool integrations, and trust UX. Expect the next funding round in this category to be priced on retention inside research orgs, not on demo videos.


02 / 05 · Compute Economics

9 min read

The Broken Bargain: TSMC Says No to ASML’s $380M Machine

For the first time in six decades, the world’s most advanced chipmaker is refusing the most advanced tool — because the math no longer works.

·01Primer

For sixty years, chipmaking has run on a simple bargain: each new generation of equipment costs more, but it prints transistors so much smaller and so much more densely that the cost per transistor falls. That is Moore’s Law expressed not as a count of transistors but as an economic promise. ASML, the Dutch monopolist that builds the lithography machines used to pattern every advanced chip on Earth, now sells a new tool called High-NA EUV — the TWINSCAN EXE:5200 — for roughly $380 million each. Its customer of record is TSMC, the Taiwanese foundry that makes the silicon for Apple, Nvidia and most of the AI industry. This week TSMC told Bloomberg it will not deploy High-NA in production through 2029. The reason: the machine costs more per transistor than the tool it replaces. The bargain has stopped working.

·02What Happened

In a quiet conference room on the eighth floor of TSMC’s headquarters in Hsinchu, chief executive C.C. Wei sat across from a small group of reporters last week and said the words ASML’s leadership had spent two years hoping no one would say out loud. “We see no economic case for High-NA in this decade,” Wei said, according to a Bloomberg account published May 9. “The productivity is not yet there. The cost is not yet there. We will continue with our current EUV fleet and multi-patterning where we need it.” The remark, delivered without flourish, was the most consequential sentence spoken in the semiconductor industry this year. ASML’s High-NA EUV machine is the largest, most complex piece of industrial equipment ever sold commercially. A single unit weighs 165 tonnes, ships in 250 crates, and requires its own three-story clean-room enclosure. It uses a 0.55-numerical-aperture optical column from Zeiss to shrink the printable feature size from 13 nanometres to about 8 nanometres in a single exposure, eliminating the double- and triple-patterning workarounds that have made sub-3nm production so expensive. Intel took delivery of the first High-NA tool at its Oregon fab in December 2023 — chief executive Pat Gelsinger called it “the most strategic capital purchase Intel has ever made” — and a second was installed in late 2024. Samsung Foundry, which had publicly committed to High-NA for its 1.4nm node, is now understood to be slipping that timeline to 2028. The pivot, however, belongs to TSMC. The Hsinchu foundry produces roughly 90 per cent of the world’s leading-edge logic chips, including every Nvidia GB200 and every Apple M-series die. If TSMC will not buy High-NA, ASML cannot recoup the estimated €12 billion it has spent developing the platform on its current volume forecasts. 
Christophe Fouquet, ASML’s chief executive, told analysts on the company’s April 16 earnings call that “High-NA adoption will follow the customer’s economic readiness, not our shipping schedule,” a sentence that read, on close inspection, as the first formal acknowledgement that the schedule had slipped. ASML shares fell 7.8 per cent in Amsterdam the next day. They have not recovered. The deeper signal was picked up by Azeem Azhar, whose Exponential View essay “The broken bargain of Moore’s Law,” published May 11, argued that TSMC’s refusal is “the first time in the history of semiconductors that the most advanced fab on Earth has looked at the most advanced tool on Earth and concluded that the arithmetic does not work.” Azhar’s piece, which drew on cost-per-transistor data from TechInsights and IBS, framed the moment as a phase change rather than a delay. For two generations the EUV transition delivered a clean cost-down. High-NA, by his numbers, delivers a cost-up.

·03The Numbers

Strip away the optics and the geopolitics, and the case against High-NA reduces to a single ratio: transistors printed per dollar of wafer cost. On TSMC’s N3 node, produced with low-NA EUV tools that cost roughly $180 million each, a 300mm wafer carries approximately 815 million transistors per dollar of fab-loaded wafer cost, according to IBS’s 2025 cost model. On the High-NA roadmap, that figure falls to 562 million per dollar — a 31 per cent regression. The reason is brutally simple. A High-NA tool costs about 2.1 times what a low-NA tool costs, but its throughput is roughly 185 wafers per hour versus 220 for the current NXE:3800E, and its single-exposure productivity gains apply to only a fraction of the layers on a finished chip. The capex per wafer-pass more than doubles, because the higher price is spread over fewer wafers per hour; the area printed per pass rises by perhaps 40 per cent.

The historical contrast is sharp. In 2004, when ASML introduced its first 193nm immersion tool, the TWINSCAN XT:1400, the machine sold for around $25 million. Adjusted for inflation, that is $40 million in 2026 dollars. The High-NA EUV machine that TSMC just declined costs 9.5 times as much in real terms — and, for the first time, delivers fewer transistors per dollar than its predecessor. From 1971 to roughly 2014, transistor cost fell at a compound annual rate of about 28 per cent. From 2014 to 2022, with the move to FinFET and the difficulty of sub-20nm patterning, the rate slowed to about 8 per cent. Since 2022 it has been roughly flat. High-NA, on the published economics, would push it negative.

The implications cascade into the hyperscaler capex stack. A 2026 frontier training run on a 100,000-GPU Nvidia GB300 cluster consumes roughly $14 of silicon depreciation per million training tokens, by Bernstein’s reckoning. If transistor cost is flat rather than falling at the historical 28 per cent rate, the silicon line of the AI training cost curve flattens with it. Bain projected last December that hyperscaler capex would compound at 19 per cent through 2030 on the assumption of continued Moore’s-Law cost-down; without it, either capex must rise faster or unit economics for inference deteriorate. Morgan Stanley’s Joseph Moore put the point bluntly in a May 7 note to clients: “The debate is no longer when AI inference becomes cheap. The debate is whether it ever does.”

There is a counter-argument, and it deserves a hearing. ASML and Intel both insist that High-NA’s economics improve sharply at the 1.4nm and 1.0nm nodes, where low-NA’s multi-patterning overhead becomes prohibitive. ASML’s investor deck, dated April 2026, models a cross-over point at the A14 node in late 2028, after which High-NA delivers a 22 per cent cost-per-transistor advantage. Intel’s foundry chief Naga Chandrasekaran has argued publicly that TSMC is making “a strategic mistake of the same kind it made on EUV in 2014,” when TSMC was a late adopter and Samsung briefly led. That history, however, cuts both ways: TSMC was wrong on EUV timing for eighteen months, then right for a decade.
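The ratios quoted in this section can be reproduced from the figures in the text. The sketch below is a back-of-envelope check, not IBS's or ASML's cost model: the variable names are illustrative, and the capex-per-wafer-pass line assumes both tools share the same depreciation life and utilisation, which the text does not state.

```python
# Figures as quoted in the text (USD millions, wafers per hour,
# millions of transistors per dollar of wafer cost).
LOW_NA_TOOL_PRICE = 180.0
HIGH_NA_TOOL_PRICE = 380.0
LOW_NA_WAFERS_PER_HOUR = 220.0
HIGH_NA_WAFERS_PER_HOUR = 185.0
N3_TRANSISTORS_PER_DOLLAR = 815.0
HIGH_NA_TRANSISTORS_PER_DOLLAR = 562.0
IMMERSION_2004_PRICE_IN_2026_DOLLARS = 40.0

# Tool price ratio: the "about 2.1 times" in the text.
tool_price_ratio = HIGH_NA_TOOL_PRICE / LOW_NA_TOOL_PRICE

# Throughput ratio: High-NA prints fewer wafers per hour.
throughput_ratio = HIGH_NA_WAFERS_PER_HOUR / LOW_NA_WAFERS_PER_HOUR

# Capex per wafer pass scales with price divided by throughput
# (assumption: identical depreciation life and utilisation for both tools).
capex_per_pass_ratio = tool_price_ratio / throughput_ratio

# Drop in transistors per dollar: the "31 per cent regression".
transistor_regression = 1.0 - (
    HIGH_NA_TRANSISTORS_PER_DOLLAR / N3_TRANSISTORS_PER_DOLLAR
)

# Real-terms price multiple versus the 2004 immersion tool.
real_price_multiple = HIGH_NA_TOOL_PRICE / IMMERSION_2004_PRICE_IN_2026_DOLLARS

if __name__ == "__main__":
    print(f"tool price ratio:       {tool_price_ratio:.2f}x")
    print(f"throughput ratio:       {throughput_ratio:.2f}x")
    print(f"capex per wafer pass:   {capex_per_pass_ratio:.2f}x")
    print(f"transistors per dollar: -{transistor_regression:.0%}")
    print(f"real price vs 2004:     {real_price_multiple:.1f}x")
```

Under that simplifying assumption, the capex per wafer pass comes out at about 2.5 times the low-NA figure, since the 2.1-fold price is divided across roughly 16 per cent fewer wafers per hour.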

·04Strategy & Transition

The pivot from a six-decade cost-down to a flat or rising cost-per-transistor curve, if it holds, rewires the strategic logic of every company that buys, builds or runs chips. For TSMC, the choice is rational: defend margins on N2 and A16 with the current EUV fleet, push customers toward chiplet architectures and advanced packaging — the company’s CoWoS capacity is sold out through 2027 — and let Intel absorb the cost of being the High-NA pioneer. For ASML, the choice is harder. The company has guided to €44–60 billion in 2030 revenue on an installed base assumption that includes High-NA at TSMC. Without it, the high end of that range is implausible.

The European angle compounds the problem. ASML, headquartered in Veldhoven, is the EU’s most strategically important industrial company by a wide margin — roughly 6 per cent of the Dutch stock market and the de facto crown jewel of the EU Chips Act, which committed €43 billion in 2023 to double Europe’s share of global semiconductor production to 20 per cent by 2030. If High-NA stalls, the Chips Act’s centrepiece tool stalls with it. DAX40 exposure is direct: Infineon and Siemens both consume advanced foundry capacity for automotive and industrial silicon, and both have been planning around the assumption that 2nm-class chips would arrive on the historical cost curve. The German automotive sector’s transition to software-defined vehicles assumes inference silicon that follows Moore’s-Law economics. It may not.

The narrative pivot here is uncomfortable. For thirty years the semiconductor industry’s worst case was that Moore’s Law would slow. The actual case, if Azhar and the IBS numbers are right, is that it has reversed at the bleeding edge while continuing modestly at trailing nodes. That is a different world. It rewards advanced packaging over advanced lithography, system architecture over transistor count, and software efficiency over hardware brute force. It also, quietly, vindicates the skeptics.

Three Perspectives What this story means for different readers

01

For CIOs and heads of AI platform, the practical takeaway is to stop modelling inference unit economics on a 25-to-30 per cent annual cost-down. The silicon line of the cost curve is flattening, and the gains that have made GPT-class models progressively cheaper to run since 2023 have come largely from algorithmic efficiency, quantisation and better serving infrastructure — not from cheaper transistors. Procurement should model a flat-silicon scenario through 2028 and price three-year AI contracts accordingly. Workloads that assume falling inference cost — agentic systems running continuously, always-on copilots, real-time video generation at consumer scale — need a second financial model in which transistor cost is flat. Enterprises with strong on-prem inference needs should accelerate Blackwell and MI400 procurement now, before the supply-demand balance tightens further around a slower-growing leading-edge wafer pool.

02

Brussels has a problem it has not yet named. The EU Chips Act was designed around a model in which ASML’s roadmap pulled European semiconductor sovereignty forward automatically; subsidies merely accelerated an inevitable cost-down. If TSMC’s refusal of High-NA marks a genuine break in the cost curve, the Chips Act’s central assumption — that throwing capital at advanced fabs in Dresden and Magdeburg eventually yields competitive cost-per-transistor — needs rewriting. The Dutch government’s export controls on High-NA to China, agreed under US pressure in 2024, also look different in a world where the tool is commercially marginal even for its intended customers. Expect calls in the European Parliament this summer for a Chips Act 2.0 that funds advanced packaging, chiplet ecosystems and silicon photonics rather than betting the bloc’s industrial future on a single piece of lithography equipment whose economics no longer compound.

03

The venture implications cut in two directions. On the bearish side, Cory Doctorow and Ed Zitron’s running argument — that AI economics are propped up by an implicit subsidy from Moore’s Law that is quietly expiring — gains a concrete data point this week. Funds modelling SaaS-style gross margins for AI-native companies on the assumption of falling inference cost should stress-test the flat-silicon case. On the bullish side, the opportunity set in compute efficiency expands sharply: model distillation, sparse architectures, custom inference silicon (Groq, Cerebras, Tenstorrent, Etched), advanced packaging startups, and photonic interconnect plays all become structurally more valuable if the leading-edge transistor cost-down is over. The single best venture bet of the next five years may be the company that makes a 3nm chip behave like a 1.4nm chip through architecture rather than lithography.
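The flat-silicon stress test recommended to procurement teams and funds above comes down to one compounding formula. The sketch below is a minimal illustration, assuming a constant annual rate and covering only the silicon line of the cost curve; it ignores the algorithmic and serving-efficiency gains the text also mentions. The function and scenario names are hypothetical.

```python
def cost_index(annual_decline: float, years: int) -> float:
    """Unit silicon cost after `years`, relative to 1.0 today."""
    return (1.0 - annual_decline) ** years


# Annual cost-down rates taken from the text; 0.0 is the flat-silicon case.
SCENARIOS = {
    "30% cost-down": 0.30,
    "25% cost-down": 0.25,
    "flat silicon": 0.0,
}


def compare(years: int = 3) -> dict:
    """Cost index per scenario over a contract of `years` years."""
    return {name: cost_index(rate, years) for name, rate in SCENARIOS.items()}


if __name__ == "__main__":
    for name, index in compare().items():
        print(f"{name}: {index:.3f}")
```

On these assumptions, a three-year contract priced on a 25 to 30 per cent annual cost-down expects silicon to cost roughly 0.34 to 0.42 of today's level by year three, against 1.0 in the flat case. That gap is what a second financial model would need to absorb.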


03 / 05 · Defense

8 min read

a16z’s autonomy manifesto reframes the moral case for armed AI

Andreessen and Horowitz argue American values demand autonomous weapons — just as Helsing, Anduril and the Bundeswehr race to scale them.

·01Primer

On May 11, 2026, Andreessen Horowitz published a manifesto titled “No Man Left Behind: American Technology Ships with Our Values.” The argument flips a familiar ethics frame: autonomous warfare is not a regrettable necessity, the authors write, but the morally preferable option because it embeds American values into the targeting loop. The essay anchors a16z’s American Dynamism portfolio — Anduril, Shield AI, Saronic and Hadrian — as the commercial vehicle for that position. It lands the same week Munich-based Helsing closed a $1.2 billion round at an $18 billion valuation, and as Berlin’s €108 billion 2026 defense package moves into procurement. Enterprise technology buyers, regulators and venture investors are now being asked, in effect, to take a side on whether software-defined lethality is a civic virtue or an unbounded risk.

·02What Happened

In a Manhattan auditorium last week, a recovered F-15E weapons systems officer — call sign DUDE44 Alpha — was introduced to a defense-tech audience as the moral hinge of an argument. His Strike Eagle had been shot down 200 miles inside Iran on April 3, and a 155-aircraft rescue package had pulled him from behind enemy lines in what one Air Force colonel called “the most extraordinary combat rescue in U.S. history.” The story, retold this week on a16z’s own platform, sets the scene for the firm’s new essay: every American serviceman recovered, the implicit argument runs, is a future autonomous system not yet built. The manifesto, published May 11 by Marc Andreessen and Ben Horowitz under the title “No Man Left Behind: American Technology Ships with Our Values,” makes the case directly. “America must now build autonomous systems at both the quality and quantity required to win,” the partners write. “We have the talent advantage. We are losing the production race.” The essay positions a16z’s American Dynamism practice — a $1.2 billion vehicle co-founded by general partner Katherine Boyle — as the commercial answer. The portfolio numbers have moved fast. Anduril, founded by Palmer Luckey, is closing a $4 billion round at a $60 billion valuation, double the $30 billion mark set in mid-2025, and is in talks on a U.S. Army contract worth up to $20 billion. Shield AI’s Hivemind software now flies autonomous F-16 surrogate missions. Saronic, valued at $4 billion after a $600 million Series C, builds unmanned surface vessels with a 1,000-nautical-mile range — enough, the company likes to note, to cross the Taiwan Strait ten times. Hadrian, a robotics-driven precision-parts factory, is the supply-chain pillar. 
The pitch is that autonomy is not the abandonment of American values but their delivery mechanism: machines that apply rules of engagement consistently, that do not get tired, that do not seek revenge, and — critically — that keep American pilots out of Iranian airspace in the first place. “The choice is not between human warfare and machine warfare,” the manifesto argues. “It is between machines built by free societies and machines built by their adversaries.” That is a deliberate reframing of the standard ethicist objection, which holds that delegating lethal decisions to software erodes accountability regardless of who writes the code. The firm is betting that, in a procurement cycle defined by Ukraine drone economics and Pacific deterrence math, the reframing will hold.

·03Timeline & Context

The manifesto is the latest move in a four-year escalation. American Dynamism launched in January 2022 as a thematic practice; the inaugural $600 million fund became a $1.2 billion vehicle, and by 2025 a16z was steering capital into seven defense-tech companies that had each raised more than $500 million — what one observer dubbed the “Capital Cannons Club.” The Pentagon’s Replicator initiative, announced by then-Deputy Secretary Kathleen Hicks in August 2023 to field “multiple thousands” of attritable autonomous systems, was the policy bookend. Replicator missed its August 2025 fielding target — only hundreds, not thousands, of systems were delivered — and was transitioned in late 2025 from the Defense Innovation Unit to a new Defense Autonomous Warfare Group inside SOCOM. The a16z essay is, in part, a venture-side critique of that shortfall: the production race, the partners argue, is being lost on capacity, not concept.

The European parallel sharpens the moment. On May 9, Helsing AG confirmed a $1.2 billion round led by Dragoneer at an $18 billion valuation — the largest funding round in German startup history, surpassing every prior DACH technology benchmark. Helsing’s pitch, like Anduril’s, is software-first lethality: AI for battlefield data fusion, now expanded into uncrewed strike platforms and the Eurofighter sensor stack. Munich-based Quantum Systems crossed a €3 billion valuation last November. Lisbon’s Tekever raised £400 million at over £1 billion. Airbus Defence’s €50 million DGA framework with France, covered in last week’s briefing, formalizes a European loyal-wingman track aimed at fielding by 2029.

The historical comparison venture investors keep reaching for is the early Cold War — RAND, Lockheed Skunk Works, the Whiz Kids — when private engineering culture and federal procurement co-evolved. The more uncomfortable analogy is the interwar period, when European defense industries scaled faster than their political institutions could regulate them. Berlin’s Zeitenwende, declared by Chancellor Olaf Scholz in February 2022, has by 2026 produced a €108 billion defense outlay (including the Sondervermögen), a draft Bundeswehr Planning and Procurement Acceleration Act, and an explicit Bundeswehr commitment to fielding Collaborative Combat Aircraft by 2029. ZITiS, Germany’s cyber agency, and the Bundeswehr’s drone-defense forum have moved autonomy from working-paper to line-item.

The narrative pivot is that until this month, the moral debate and the procurement debate ran on separate tracks. The a16z essay collapses them. By naming values as the deliverable — not just capability — Andreessen and Horowitz are inviting regulators, journalists and the International Committee of the Red Cross to argue ethics on a terrain VCs have already mapped. Whether that gambit holds depends on whether DACH parliaments, the EU AI Act’s military carve-outs, and the next U.S. defense authorization treat “American values” as a procurement criterion or a marketing slogan.

Three Perspectives What this story means for different readers

01

For enterprise technology buyers — particularly in DACH industrials, logistics and critical infrastructure — the a16z essay is less a defense story than a supply-chain signal. Anduril’s Arsenal-1 in Ohio, Hadrian’s precision-parts automation and Saronic’s hull production are the same robotics, MES and digital-twin stacks that Siemens, Bosch and ABB sell into civilian factories. A $60 billion Anduril and an $18 billion Helsing now set wage benchmarks, GPU-allocation priorities and security-clearance norms that flow into commercial engineering hiring. CIOs at Tier 1 suppliers should expect dual-use export-control reviews to widen, ITAR-equivalent EU controls on AI model weights to tighten, and customer due-diligence questionnaires to start asking whether their software touches defense workloads. The practical near-term effect: longer procurement cycles for any AI vendor whose model card cannot cleanly separate civilian from military applicability.

02

Regulators are now squeezed between three live processes. The UN Convention on Certain Conventional Weapons continues to debate a legally binding instrument on lethal autonomous weapons, with the ICRC arguing that unpredictable systems and anti-personnel autonomy should be prohibited outright. The EU AI Act exempts military and national-security uses, but member-state implementations — Germany’s in particular — are testing how far that carve-out reaches into dual-use AI infrastructure. And the U.S. Department of Defense Directive 3000.09, governing autonomy in weapon systems, is being read against the Replicator-to-DAWG transition. A manifesto framing autonomy as a values export pressures Berlin and Brussels: if Washington codifies “American values” into munitions, European regulators must decide whether “European values” mean stricter human-in-the-loop requirements, or whether competitive parity with Helsing-class vendors overrides them.

03

For European founders, the message in the a16z essay is unsubtle: the moral high ground and the capital pool are both being claimed by U.S. investors, and the window to build a sovereign alternative is narrow. Helsing’s $18 billion mark, Quantum Systems at €3 billion, Tekever past £1 billion and ARX Robotics scaling ground autonomy show the DACH and broader European ecosystem can mint defense unicorns — but the LP base remains thin, and most growth rounds still draw on U.S. or U.K. crossover capital. The strategic question for early-stage founders is whether to position as a Helsing supplier, an Anduril-compatible component vendor, or a clean-sheet sovereign play. Expect more European LPs — particularly sovereign wealth funds and family offices previously allergic to defense — to revisit mandates in the next two quarters, and expect dual-use seed rounds to price up sharply.


04 / 05 · Frontier Labs & Capex

8 min read

Clark’s 60% Bet: The Capex Tell on Recursive Self-Improvement

Anthropic’s co-founder gives recursive self-improvement a 60% probability by 2028 — and the hyperscalers’ 2026 spending suggests they agree.

·01Primer

On May 4, Anthropic co-founder Jack Clark published a long essay in his Import AI newsletter putting a 60% probability on recursive self-improvement — an AI system capable of autonomously training its own successor — arriving before the end of 2028. Six days later, Azeem Azhar built Exponential View #573 around a sharper question: if the labs really believe their own forecast, what should we see in their capex, hiring and procurement behavior? The answer, both writers argue, is already on the tape. Hyperscalers have guided to roughly $700 billion of 2026 capital expenditure. Anthropic is in talks at a $900 billion valuation. For CIOs evaluating multi-year AI contracts, the prediction is less interesting than the revealed preference behind it.

·02What Happened

The essay landed quietly on a Monday morning. Clark, who founded Import AI in 2016 and now runs policy at Anthropic, opened with an admission: he had spent weeks combing public benchmarks, internal lab signals and his own private reading, and had updated. “I now believe that recursive self-improvement has a 60% chance of happening by the end of 2028,” he wrote on X, posting alongside the long-form analysis. The number was deliberate. A year ago Clark sat closer to 25%. The revision was driven by what he framed as four converging curves: SWE-Bench scores, where the leading model has moved from 2% to 93.9% real-world coding success between 2023 and Q1 2026; the METR time-horizon benchmark, where autonomous task length has stretched from thirty seconds in 2022 to roughly twelve hours today; the appearance of usable AI contributions to alignment research and CUDA kernel optimization inside Anthropic itself; and what Clark called a tightening ‘feedback loop’ between agentic coders, automated post-training and multi-agent orchestration. By Wednesday, Axios had run a ‘Behind the Curtain’ column on the prediction; by Friday, Anthropic had quietly added two new pages to its careers site titled ‘Research Engineering Multipliers’ and ‘Automated R&D.’ Azeem Azhar’s Exponential View #573, sent on Sunday May 10, framed the question that mattered more than Clark’s number: are the AI labs building for an intelligence explosion? Azhar’s verdict was unambiguous. The capex pattern, the procurement contracts, and the hires all say yes. He pointed to Anthropic’s deal to consume the entire Colossus 1 site in Memphis — first reported May 8 — adding 300 megawatts and an estimated 220,000 Nvidia GPUs to Anthropic’s footprint, on top of 5 gigawatts already committed via Amazon, 5 gigawatts via the Google-Broadcom partnership, and the more than $100 billion Anthropic has agreed to pay AWS over ten years. 
“You don’t sign for a hundred billion of compute,” Azhar wrote, “if you think your timeline is 2032.” Behind the prose lay a more uncomfortable observation. The labs are not asking the market to believe RSI is a 60% event. They are asking it to fund the 60% scenario, while pricing in the 100% one. Anthropic’s $30 billion Series G closed in February at a $380 billion post-money valuation. Three months later, TechCrunch and CNBC reported the company was in talks for a $50 billion round at a $900 billion valuation — higher than OpenAI’s. The narrative pivot is not Clark’s essay. It is the gap between the public probability and the private bet.
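The METR figures Clark cites imply a doubling time that is easy to check. A minimal sketch, assuming the stretch from 2022 to today spans roughly 48 months (the essay gives only the year) and that growth was smoothly exponential; the function name is illustrative and does not come from Clark or METR:

```python
import math

def doubling_time_months(start_seconds, end_seconds, elapsed_months):
    """Months per doubling, assuming smooth exponential growth."""
    doublings = math.log2(end_seconds / start_seconds)
    return elapsed_months / doublings

START_SECONDS = 30            # autonomous task length in 2022, per the essay
END_SECONDS = 12 * 3600       # roughly twelve hours today
ELAPSED_MONTHS = 48           # assumption: about four years between the two points

doublings = math.log2(END_SECONDS / START_SECONDS)
months_per_doubling = doubling_time_months(START_SECONDS, END_SECONDS, ELAPSED_MONTHS)

print(f"{doublings:.1f} doublings, one every {months_per_doubling:.1f} months")
```

Under those assumptions the task horizon doubled a little over ten times, or about once every four and a half months. Shifting the assumed start date by six months moves the result by only a few weeks.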

·03The Numbers

Four hyperscalers told investors in late April that combined 2026 capex would reach roughly $700 billion, nearly double 2025. Microsoft’s cash capex moved from $17 billion to $31 billion quarter-on-quarter; Alphabet’s from $17 billion to $36 billion, with full-year guidance raised to $180–190 billion; Amazon’s from $25 billion to $44 billion, on track for roughly $200 billion in 2026; Meta’s from $13 billion to $20 billion, with a full-year band of $125–145 billion. Meta shares fell about 6% on the announcement. Alphabet rose. The bifurcation matters: it tells you the market still cannot decide whether AI capex is the 1996 internet build-out — capacity that later got absorbed — or the 1999–2000 fiber overhang. The historical parallel cuts both ways. WorldCom and Global Crossing also believed in a thesis that turned out to be correct on a longer horizon. Their investors were not made whole. What is different in 2026 is the second-order signal: the hiring shift. Anthropic has grown from roughly 1,100 employees in mid-2025 to about 4,585 by February 2026, with retention of recent hires at 80% — eight points above DeepMind, thirteen above OpenAI. Fortune reported that engineers leaving OpenAI for Anthropic outnumbered the reverse flow eight to one; the DeepMind ratio was closer to eleven to one. Inside the labs, the named search is no longer for ML researchers as individual contributors. It is for what Google DeepMind’s spring 2026 job listings label ‘research multipliers’ — engineers who can build the harnesses, evaluators and data pipelines that let one researcher run ten experiments in parallel through an agentic loop. This is the revealed-preference signal Azhar and Clark both circle. A lab that thinks RSI is a 2032 problem hires more researchers. A lab that thinks it is a 2028 problem hires multipliers and locks in compute. Anthropic’s $86 billion training-cost commitment through 2029, first reported by the Wall Street Journal, is consistent with the latter.
So is its willingness to project $11 billion of losses in both 2026 and 2027 against $18 billion of revenue this year — figures Anthropic shared with investors and Ed Zitron has flagged as evidence of ‘circular psychosis’ in the compute market. Two readings of the same data. The CIO question is which one prices a five-year enterprise license correctly.
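The capex moves quoted above are easier to compare as growth rates. A small sketch in Python using only the four before-and-after figures from the text; the dictionary layout and function name are illustrative choices, and the figures are the rounded ones reported here:

```python
# Cash capex in $ billions, (before, after), as quoted in the text.
QUARTERLY_CAPEX_BN = {
    "Microsoft": (17, 31),
    "Alphabet": (17, 36),
    "Amazon": (25, 44),
    "Meta": (13, 20),
}

def growth_pct(before, after):
    """Percentage change from before to after."""
    return (after - before) / before * 100

growth = {name: growth_pct(b, a) for name, (b, a) in QUARTERLY_CAPEX_BN.items()}
total_before = sum(b for b, _ in QUARTERLY_CAPEX_BN.values())
total_after = sum(a for _, a in QUARTERLY_CAPEX_BN.values())
combined_growth = growth_pct(total_before, total_after)

for name, pct in growth.items():
    print(f"{name}: +{pct:.0f}%")
print(f"Combined: ${total_before}bn to ${total_after}bn (+{combined_growth:.0f}%)")
```

On these figures Alphabet's spend more than doubled while Meta's rose by just over half, and the combined total for the four went from $72 billion to $131 billion.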

Three Perspectives

What this story means for different readers

01

For procurement boards, Clark’s 60% number is the wrong unit of analysis. The relevant question is contract duration. If RSI lands inside the term of a three-year enterprise agreement signed in 2026, the per-token price negotiated today is irrelevant — the model under the contract will be a generation behind the one Anthropic uses internally to build its successor. Several Fortune 500 CIOs have already started inserting ‘capability-tier’ clauses that re-baseline pricing when a vendor releases a successor model. Others are doing the opposite: signing longer to lock in compute access. Both bets are coherent. What is incoherent is treating 2026 AI procurement as a normal SaaS cycle. The reasonable enterprise posture: shorter terms, harder portability clauses, and a written assumption about which RSI year the contract is priced against.

02

Brussels has been preparing for this scenario without naming it. The EU AI Act’s general-purpose AI obligations become enforceable on August 2, 2026, with a 48-hour serious-incident reporting window for frontier models and mandatory adversarial testing for systemic-risk systems. The AI Office is staffing its scientific panel; the BSI in Bonn has rolled out QUAIDAL, a quality and security testbed for GPAI. Germany’s KI-Bundesverband has argued publicly that Clark’s timeline, if even directionally right, makes the current code-of-practice voluntary regime inadequate. The political pressure point is straightforward: a model that trains its successor is, definitionally, outside the supervised learning loop the Act was drafted to govern. The next eighteen months will determine whether Europe regulates the training run itself, the procurement contract, or the data center — and the answer will reshape where the next gigawatt of compute lands.

03

Andreessen Horowitz’s 2026 fund earmarks $3.4 billion specifically for AI apps and infrastructure, and its ‘Big Ideas 2026’ memo openly states the firm is pricing for agentic, recursive workloads rather than copilots. Sequoia has been quieter but is reportedly leading a tranche of the Anthropic $900 billion round. The European picture is thinner. Index has stayed late-stage and selective; Atomico’s recent letters lean toward applied AI in regulated verticals rather than frontier compute; HV Capital’s public commentary remains cautious on lab valuations. The structural question for VCs is whether the lab oligopoly tightens or fractures. If Clark is right, the marginal dollar belongs at the lab level — and only the firms already on the cap table benefit. If he is wrong, or off by four years, the application layer wins. Few funds can afford to be wrong on both.

05 / 05 · Enterprise & Architecture

8 min read

BCG finds the agent ceiling: four tools per worker breaks the brain

A Harvard Business Review study of 1,488 US workers puts the first hard empirical limit on agent-multiplicity — and it lands at three.

·01Primer

Boston Consulting Group and a team of co-authors published a study in Harvard Business Review describing a new occupational condition they call “AI brain fry” — mental fatigue from supervising AI tools beyond a worker’s cognitive capacity. Surveying 1,488 full-time US employees at large firms, the researchers found a clean inflection point: workers using three or fewer AI tools reported genuine productivity gains, while those using four or more reported brain fog, slower decisions, more errors and sharply higher intent to quit. The finding directly contradicts the marketing arc of Salesforce Agentforce, Microsoft 365 Copilot, ServiceNow and SAP Joule, all of which sell stacks of five to ten task-specific agents per knowledge worker. For CIOs running 2026 agent rollouts, BCG has just supplied the first defensible ceiling.

·02What Happened

On a Thursday in early March 2026, six BCG consultants — Julie Bedard, Matthew Kropp, Megan Hsu, Olivia Karaman, Jason Hawes and Gabriella Kellerman — published a piece in Harvard Business Review with a phrase no enterprise software vendor wanted to read in print: “AI brain fry.” By early May the term had migrated from HBR into Marketplace, CNN Business, Fortune, Axios, Futurism and a sober Algorithmic Bridge guide titled “How to Get More From AI by Using Fewer Tools,” where Alberto Romero argued that the workers extracting real gains use three tools deeply rather than ten tools shallowly. The thesis is simple, and uncomfortable for the agent-platform incumbents: oversight is the cost, not the work. The scene the paper paints is a familiar one to anyone who has watched an analyst at a German insurer or a London bank attempt a Tuesday afternoon. The worker has a chat assistant open on the left monitor, a coding copilot in the IDE, a marketing agent in the CRM, a separate research agent surfacing competitor moves, a fifth agent reconciling expenses in the background, and an inbox triage bot flagging anything that needs a human signature. Each tool produces output. Each output needs review. Each review is a small, costly act of judgement. At four agents, BCG’s data says, that judgement load tips over. “We want fewer errors, we want better decisions, and we want our best people to stay,” Bedard told reporters in the days after publication. “Those are all real costs.” The paper does not tell employers to delete AI; it tells them to stop conflating tool count with productivity. The mechanism the authors isolate is specific and, for once, falsifiable: it is the supervision of AI outputs — not the delegation of work to AI — that exhausts people. Workers who trusted a single agent end-to-end reported gains. 
Workers who were forced into the role of human orchestrator across many agents reported what one respondent called a “buzzing” they had to walk away from the desk to clear. The historical analogue is the early-2000s ERP wave, when enterprises bought one suite from SAP or Oracle and then bolted on best-of-breed modules until the human cost of integration consumed the productivity case. That cycle ended with consolidation. The agentic wave is now beginning the same arc, two decades faster. Gartner expects 40% of enterprise apps to embed task-specific agents by year-end; Salesforce Agentforce already runs at 18,500 customers; GitHub Copilot sits inside roughly 90% of the Fortune 100. The narrative pivot in BCG’s paper is that the bottleneck has moved. It used to be model quality. Then it was data plumbing. Now it is the carbon-based reviewer at the keyboard.

·03The Numbers

The methodology is unflashy and that is its strength. BCG surveyed 1,488 full-time US workers at large companies across marketing, human resources, operations, software engineering, legal and compliance. Respondents reported the number of AI tools used at work, the depth of oversight required, and a battery of cognitive, affective and behavioural outcomes ranging from headaches and mental fog to decision fatigue, error rates and intent to quit. The study is correlational, not causal, and the authors are explicit about that. But the cliff at four tools is sharp enough that it cannot be waved away as noise. The headline number: 14% of all AI-using workers reported full-blown brain fry — fog, headaches, slower decisions, the urge to physically leave the screen. Within that group the secondary effects are large. Workers who experienced higher levels of AI oversight expended 14% more mental effort, reported 12% greater mental fatigue and 19% greater information overload than workers with lighter oversight loads. Workers reporting brain fry showed 33% more decision fatigue, 11% more minor errors and 39% more major errors. Intent to quit ran at 34% among the brain-fry group versus 25% among the unaffected — a 39% relative increase in attrition risk, which on a 50,000-person workforce is the difference between a manageable churn line item and an HR emergency. The sectoral distribution is the part CIOs should pin to the wall. Brain fry rates were highest in marketing (26%) and human resources (19%), with software engineering and development at 18%. Legal and compliance, where AI adoption is more cautious and oversight is already a professional habit, came in lower. 
The Pragmatic Engineer’s late-April Pulse on token-spend, which documented engineering teams whose API budgets jumped roughly tenfold in six months and Salesforce’s internal $175-per-month minimum spend quota, supplies the cost-side companion: the same teams that are now token-maxxing are also the teams BCG flags as most brain-fried. The counterpoint research is thinner than vendors would like. MIT Sloan and BCG’s own earlier consultant study found a 38% performance lift from a single well-deployed model on creative tasks; three field experiments with 4,867 software developers found a 26% lift in completed tasks from a single coding assistant. None of those studies tested the multi-agent regime. The a16z and Gartner forecasts that 40% of enterprise apps will embed agents are forecasts of supply, not productivity. PwC’s 2026 AI Performance Study notes that three-quarters of measurable AI gains accrue to just 20% of companies, and those companies are not the ones stacking the most agents — they are the ones building the most focused workflows. For DAX40 buyers the calibration is immediate. Allianz, which has 30,000 internal agents in its GenAI Lab and is preparing AllianzGPT 2.0 for general workforce rollout in 2026, sits squarely in the zone BCG warns about. Anthropic-backed claims-orchestration agents land on top of an existing AllianzGPT stack; before the rollout the average analyst is already at three. ZEW’s digitalisation tracker and Fraunhofer IAIS’s production studies have, for two years, asked whether German firms can convert AI access into productivity at all. The BCG paper suggests a depressing answer: many of them will get the access, deploy four-plus agents per worker, and never see the curve bend.
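The intent-to-quit figures quoted above can be turned into headcount with a few lines of arithmetic. A rough sketch in Python, assuming every one of the 50,000 employees uses AI tools and that the 14% brain-fry rate applies uniformly (both are simplifications, not findings from the study); the variable names are illustrative:

```python
def relative_increase_pct(baseline_pct, observed_pct):
    """Relative change of observed over baseline, in percent."""
    return (observed_pct - baseline_pct) / baseline_pct * 100

WORKFORCE = 50_000
BRAIN_FRY_PCT = 14        # share of AI-using workers reporting brain fry
QUIT_UNAFFECTED_PCT = 25  # intent to quit, unaffected group
QUIT_BRAIN_FRY_PCT = 34   # intent to quit, brain-fry group

# Assumption: the whole workforce uses AI tools.
affected_workers = WORKFORCE * BRAIN_FRY_PCT // 100
extra_quit_intent = affected_workers * (QUIT_BRAIN_FRY_PCT - QUIT_UNAFFECTED_PCT) / 100
relative_lift = relative_increase_pct(QUIT_UNAFFECTED_PCT, QUIT_BRAIN_FRY_PCT)

print(f"Affected workers: {affected_workers}")
print(f"Additional workers intending to quit: {extra_quit_intent:.0f}")
print(f"Relative increase on rounded figures: {relative_lift:.0f}%")
```

On the rounded 34% and 25% figures the relative increase works out to about 36%, slightly below the 39% quoted; the gap presumably reflects unrounded survey data, which is an assumption rather than something the text confirms. Under the stated simplifications, the nine-point difference is about 630 additional employees with intent to quit.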

Three Perspectives

What this story means for different readers

01

For CIOs the BCG study is the first usable design constraint in an architecture conversation that has so far been dominated by vendor roadmaps. The implication is that an agent strategy needs a per-worker budget, not a per-process one. Three deep agents — a thinking partner, a task executor, a domain specialist — outperform a sprawling marketplace of ten task bots. That reframes procurement: Agentforce, Copilot, Joule and ServiceNow are not additive purchases, they are competing claims on a single scarce resource called the employee’s attention. Boards at Allianz, BMW and Deutsche Bank should ask their CIOs the new question — not how many agents are live, but how many land on the average desk on the average Tuesday — and treat anything above three as a redesign trigger rather than a milestone.
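A per-worker budget of this kind is simple to audit once tool assignments are exported. A minimal sketch in Python, assuming a hypothetical list of (worker, agent) pairs pulled from whatever licence or identity system a firm uses; the names, the data and the threshold constant are illustrative, with the ceiling of three taken from the BCG finding described above:

```python
from collections import defaultdict

AGENT_BUDGET = 3  # per-worker ceiling suggested by the BCG study

def agents_per_worker(assignments):
    """Group (worker, agent) pairs into a set of distinct agents per worker."""
    per_worker = defaultdict(set)
    for worker, agent in assignments:
        per_worker[worker].add(agent)
    return per_worker

def over_budget(assignments, budget=AGENT_BUDGET):
    """Return workers whose distinct agent count exceeds the budget."""
    return {
        worker: sorted(agents)
        for worker, agents in agents_per_worker(assignments).items()
        if len(agents) > budget
    }

# Hypothetical export: one analyst with six tools, one engineer with two.
sample_assignments = [
    ("analyst", "chat-assistant"),
    ("analyst", "coding-copilot"),
    ("analyst", "marketing-agent"),
    ("analyst", "research-agent"),
    ("analyst", "expense-agent"),
    ("analyst", "inbox-triage"),
    ("engineer", "coding-copilot"),
    ("engineer", "chat-assistant"),
]

flagged = over_budget(sample_assignments)
for worker, agents in flagged.items():
    print(f"{worker}: {len(agents)} agents, redesign trigger")
```

Counting distinct agents per person, rather than agents per process, is the design choice that matters here: it measures the load on one employee's attention, which is the resource the study says runs out.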

02

European regulators have so far framed AI-at-work through the EU AI Act and the works-council provisions of the German Betriebsverfassungsgesetz, both of which focus on transparency and consent. The brain-fry finding opens a second front: occupational health. If 14% of AI-using workers report a measurable cognitive harm tied directly to tool count, German Berufsgenossenschaften and the EU’s EU-OSHA agency have a credible mandate to treat agent-multiplicity as an ergonomic hazard, comparable to screen-time and posture rules. Works councils, already empowered under §87 BetrVG to co-determine the introduction of technical monitoring systems, will read this paper and reach for a hard cap. Expect the first DAX40 IT-Betriebsvereinbarung specifying a maximum number of concurrent agents per workstation before year-end.

03

The agent-orchestration thesis that powered 2025’s funding cycle — every employee gets ten specialised agents, each from a different startup, all coordinated by a meta-agent — just acquired an empirical headwind. If three is the ceiling and the bottleneck is human oversight, the winning startup is not the one shipping the eleventh vertical agent but the one collapsing five agents into one trusted surface. That favours full-stack incumbents (Salesforce, Microsoft) and a small set of horizontal challengers building general-purpose work assistants over the long tail of single-task agent startups currently raising Series A rounds on Agentforce-marketplace metrics. Expect a16z and Sequoia to quietly reposition portfolio pitches from “agent for X” to “workflow consolidation,” and expect at least one prominent agent-marketplace down-round in the next two quarters.

Investigating the Consequences of Accidentally Grading CoT During RL (OpenAI Alignment Research, May 8, 2026)

The Inference Shift (Stratechery / Ben Thompson, May 11, 2026)