Mira Murati’s Second Act: An AI That Talks Over You
Thinking Machines Lab’s first model bets that conversation, not autonomy, is the next interface frontier.
On 11 May 2026, Thinking Machines Lab, the year-old startup founded by former OpenAI chief technology officer Mira Murati, released a research preview of its first model. The system, TML-Interaction-Small, is built to listen and speak at the same time. Today’s voice assistants are turn-based: you talk, they wait, they reply, you wait. Thinking Machines argues that humans do not converse this way, and that wrapping a chat model in a microphone and a clock is the wrong way to fix it. Their answer is a single neural network that treats audio, video and text as parallel streams, slicing time into 200-millisecond chunks. The model responds in 0.40 seconds on average, against 1.18 seconds for OpenAI’s GPT-Realtime-2.0. It will reach selected partners in coming months, with a wider release later in 2026.
The clip lasts under a minute. In a video posted to the Thinking Machines Lab site on a Monday evening in San Francisco, a researcher begins describing a chemistry diagram on screen. Halfway through the sentence, the model says “mhm.” A beat later, before the human finishes the question, it begins answering, then pauses politely when the researcher cuts back in. It is an unremarkable exchange between two people. Coming from software, it is the entire point.

With that demo, Mira Murati, founder and chief executive of Thinking Machines Lab, made her first public product reveal since leaving OpenAI in September 2024. In a blog post accompanying the preview, the company described its goal as moving beyond the “turn-based” pattern that has defined every chatbot from the original ChatGPT onward. “People have learned to phrase their questions like emails,” the post argues, because today’s assistants cannot tolerate interruption, backchannelling or silence. Thinking Machines wants to retire that habit.

The model in question, TML-Interaction-Small, is a 276-billion-parameter mixture-of-experts network with 12 billion active parameters. It is paired with a second, slower “Background Model” that handles reasoning, web search and tool calls behind the scenes while the conversational front-end keeps the line warm. On FD-bench v1.5, the company’s own interaction-quality benchmark, the system scored 77.8, against 54.3 for Google’s Gemini-3.1-flash-live and 46.8 for OpenAI’s GPT-Realtime-2.0. End-to-end latency clocked in at 0.40 seconds, roughly the cadence of natural human turn-taking; Google’s system was measured at 0.57 seconds, OpenAI’s at 1.18.
The comparison most often reached for is not another AI release at all, but the introduction of duplex telephony itself: for nearly a century, “half-duplex” walkie-talkie etiquette (say “over” and wait) was how voice travelled over wires, until full-duplex circuits let both parties speak at once and conversation began to feel like conversation. Murati is making the same argument about machines. The catch: this is a research preview, not a product. There is no public API, no consumer app, no enterprise SLA, and the benchmarks are Thinking Machines’ own. Connie Loizos at TechCrunch put it bluntly: the numbers are impressive and the underlying idea is interesting, but “whether the real-world experience lives up to the technical claims is something we won’t know until people can actually use it.” Murati, for her part, is not selling polish. She is selling a thesis: that the right unit of AI capability is not the prompt and not the agent, but the live exchange, and that almost everyone else, including her former employer, has been optimising the wrong variable.
Under the hood, TML-Interaction-Small breaks from the standard generative recipe in three places. First, it abandons the alternating input/output token sequence that defines every transformer-based chatbot. Instead, the model runs on what Thinking Machines calls multi-stream micro-turns: every 200 milliseconds, the network ingests whatever audio, video and text have arrived across separate input streams, and emits whatever is appropriate on its output streams — which may be silence, a single phoneme, a backchannel “uh-huh,” a verbal interjection, or the continuation of a longer response begun several micro-turns earlier. There is no end-of-turn signal, because there are no turns. The model is always listening and, when it should be, always speaking. Second, it drops the heavy external encoders that today’s speech-and-vision systems use to translate raw audio and video into model-readable embeddings. The blog post calls the alternative “encoder-free early fusion”: raw signals are passed through a lightweight embedding layer directly into the transformer, where reasoning happens on the unified stream. This is what buys the latency. A conventional realtime stack — voice activity detection, automatic speech recognition, large language model, text-to-speech, plus interruption logic — accumulates dozens to hundreds of milliseconds at each hop. Collapse it into one network and the budget shrinks to the network’s own forward pass. Third, the company splits cognition across two models running in parallel. The 276B/12B-active Interaction Model handles presence, dialogue management and immediate follow-ups. A separate Background Model — left undescribed in size — handles long-horizon reasoning, retrieval and tool use, returning results into the live stream when they are ready. It is a deliberate inversion of the agentic-loop architecture pursued by Anthropic, OpenAI and Google, in which a single large model plans, acts and reports back over minutes or hours. 
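The micro-turn loop described above can be made concrete with a toy sketch. Everything in it is an assumption for illustration: `Chunk`, `ToyInteractionModel` and `run` are hypothetical names, the "model" is a hand-written rule table rather than a neural network, and strings stand in for raw audio and video. Only the 200 ms cadence, silence as a valid output, and a background result arriving mid-stream are taken from the article.

```python
from dataclasses import dataclass, field
from typing import Optional

MICRO_TURN_MS = 200  # micro-turn length reported in the article


@dataclass
class Chunk:
    """Whatever arrived on each input stream during one 200 ms window."""
    audio: Optional[str] = None  # stand-in for raw audio samples
    video: Optional[str] = None  # stand-in for raw video frames
    text: Optional[str] = None


@dataclass
class ToyInteractionModel:
    """Hypothetical stand-in for the interaction model (rules, not a network)."""
    pending: list = field(default_factory=list)  # results from the background model

    def step(self, chunk: Chunk) -> Optional[str]:
        # A background result that has become ready is spoken first.
        if self.pending:
            return self.pending.pop(0)
        # Nothing heard in this window: emitting silence is a valid output.
        if chunk.audio is None:
            return None
        # A finished question gets an immediate holding reply.
        if chunk.audio.endswith("?"):
            return "let me check"
        # Otherwise the user is mid-sentence: backchannel.
        return "mhm"


def run(chunks, background_results=None):
    """Step the toy model once per micro-turn; return (time_ms, output) pairs.

    background_results maps a micro-turn index to the result that the
    (hypothetical) background model delivers at that index.
    """
    background_results = background_results or {}
    model = ToyInteractionModel()
    timeline = []
    for i, chunk in enumerate(chunks):
        if i in background_results:
            model.pending.append(background_results[i])
        timeline.append((i * MICRO_TURN_MS, model.step(chunk)))
    return timeline


demo = run(
    [Chunk(audio="so this diagram"), Chunk(audio="what is it?"), Chunk(), Chunk()],
    {3: "It shows a benzene ring."},
)
for time_ms, output in demo:
    print(time_ms, output)
```

The loop has no end-of-turn check: the model is stepped on every chunk whether or not anyone is speaking, and the background answer is spliced into the stream at whichever micro-turn it becomes available.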
Thinking Machines is betting that for a wide class of work (calls, meetings, supervision, tutoring, triage) the binding constraint is not horizon length but conversational bandwidth. The design forces hard tradeoffs, and not by accident. Mixture-of-experts gating at 12B active parameters keeps the per-token compute cheap enough to clear the 200ms budget on commodity inference hardware, but it caps the depth of reasoning the front-end model can do alone. Long sessions, the post concedes, will need more work on context management. There is no published video benchmark and no third-party replication of the latency numbers. Scaling the architecture to a larger pretrained base, Thinking Machines says, remains a 2026 project. The bet, in other words, is that solving turn-taking is a real research problem worth a 276B-parameter answer, and that the rest of the industry, having spent the past two years racing toward autonomous agents that can be left alone for an hour, has been looking through the wrong end of the telescope.
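The arithmetic behind the mixture-of-experts tradeoff can be sketched in a few lines. The parameter counts and the 200 ms budget come from the article; the function names are hypothetical, and the forward-pass timings in the usage lines are made-up inputs, not measurements. Treating per-token compute as roughly proportional to active parameters is a simplifying assumption.

```python
TOTAL_PARAMS_B = 276   # total parameters, in billions (from the article)
ACTIVE_PARAMS_B = 12   # parameters active per token, in billions (from the article)
MICRO_TURN_MS = 200    # micro-turn length in milliseconds (from the article)


def active_fraction(active_b=ACTIVE_PARAMS_B, total_b=TOTAL_PARAMS_B):
    """Share of the network's parameters that participate in one token."""
    return active_b / total_b


def sparsity_ratio(active_b=ACTIVE_PARAMS_B, total_b=TOTAL_PARAMS_B):
    """How many times larger the full network is than its active slice."""
    return total_b / active_b


def fits_micro_turn(forward_pass_ms, budget_ms=MICRO_TURN_MS):
    """True if a forward pass of the given duration fits inside one micro-turn."""
    return forward_pass_ms <= budget_ms


print(f"active fraction: {active_fraction():.1%}")
print(f"sparsity ratio: {sparsity_ratio():.0f}x")
print(fits_micro_turn(150), fits_micro_turn(250))  # hypothetical timings
```

On these figures only about one parameter in 23 is exercised per token, which is the lever that keeps each step cheap and also what limits how much reasoning the front-end model can do by itself.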
What makes the launch interesting is less the model than the positioning. Thinking Machines Lab was founded in February 2025 by Murati along with a clutch of former OpenAI colleagues — among them John Schulman, Barret Zoph and Luke Metz. In July 2025, the company closed a $2 billion seed round led by Andreessen Horowitz, with Nvidia, AMD, Cisco, Accel, ServiceNow and Jane Street on the cap table, valuing the lab at $12 billion before any product existed. That is the largest seed round in venture history by an order of magnitude, and it has hung over the company for ten months as an unanswered question: what, exactly, is the thesis? The answer, on the evidence of this week’s release, is a deliberate second-mover bet. Murati left OpenAI months after the launch of GPT-4o’s voice mode, the demo that more than any other re-set expectations for natural conversation with AI. The Interaction Model release is, in effect, an argument that GPT-4o pointed at the right destination and then took the wrong road — bolting a realtime harness onto a turn-based model rather than rebuilding the model around interaction from the start. Sam Altman’s OpenAI, meanwhile, has spent 2025 and early 2026 funnelling resources into long-running agentic systems and into the joint enterprise venture with Anthropic announced earlier in May. Murati is staking the $12B on the wager that whichever lab owns the live interface — the layer that mediates every contact-centre call, every voice agent, every embedded copilot — owns the substrate beneath the agents. Enterprises do not have to choose, but capital will. The strategic read for European buyers is narrower still: a US-based lab founded by an OpenAI alumna, financed by US infrastructure capital, signalling its first product into voice — the most regulated AI surface in the EU. 
The question Brussels will ask in the next twelve months is not whether the technology works, but where the data sits, who keeps the transcripts, and which competent authority signs off when a model is always listening.
For the buyers of voice AI — contact centres, field-service operators, healthcare triage, in-car assistants — latency is not a benchmark, it is a churn metric. Industry studies have long shown that response delays above roughly 600 milliseconds cause callers to talk over the agent, ask whether anyone is there, or hang up. A model that lands at 0.40 seconds with native interruption handling collapses the gap between an AI voicebot and a competent junior agent. The catch for CIOs is that TML-Interaction-Small is not yet a production system: no SLA, no regional deployment story, no integration with the dominant CCaaS platforms, and benchmark numbers that have not been independently reproduced. Procurement teams should treat the announcement as a signal that the voice-agent latency floor is about to drop, not as an immediate vendor decision.
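A quick way to read the reported latencies against the roughly 600-millisecond threshold is to compute each system's headroom. The latency figures are the ones Thinking Machines published (not independently reproduced), the threshold is the approximate figure cited above, and `headroom_ms` is a hypothetical helper name.

```python
# Reported end-to-end latencies in milliseconds (vendor-published figures).
LATENCY_MS = {
    "TML-Interaction-Small": 400,
    "Gemini-3.1-flash-live": 570,
    "GPT-Realtime-2.0": 1180,
}

# Approximate delay above which callers start talking over the agent.
THRESHOLD_MS = 600


def headroom_ms(latency_ms, threshold_ms=THRESHOLD_MS):
    """Milliseconds to spare before the threshold; negative means over it."""
    return threshold_ms - latency_ms


for name, latency in LATENCY_MS.items():
    margin = headroom_ms(latency)
    status = "under" if margin >= 0 else "over"
    print(f"{name}: {latency} ms, {abs(margin)} ms {status} threshold")
```

Read this way, the reported figures leave 200 ms of margin for TML-Interaction-Small and 30 ms for Google's system, while OpenAI's sits 580 ms over the line.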
A model that listens continuously, processes video in parallel, and reacts inside 400 milliseconds runs straight into the EU AI Act’s rules on real-time biometric processing and emotion recognition, both of which carry tighter obligations as the Act’s general-purpose model and high-risk provisions phase in through 2026. Always-on audio-video capture also engages GDPR purpose-limitation and consent requirements that turn-based assistants largely sidestep. In the United States, the FCC’s 2024 ruling that AI-generated voices are covered by the Telephone Consumer Protection Act puts outbound voice-agent use cases on notice. Thinking Machines has said little publicly about safety tooling for the model; partners taking the research preview into regulated workflows in Europe will need their own answers on consent capture, transcript handling and biometric-template avoidance before the wider release later this year.
The seed-round arithmetic is the story underneath the story. A $12 billion valuation on a $2 billion cheque — Andreessen Horowitz leading, Nvidia and AMD strategic — sets the floor for what a credible OpenAI-alumnus second-mover can command. The Interaction Models launch validates the thesis enough to justify a priced Series A in 2026 at, plausibly, two to three times the seed mark. The pressure flows downstream. Pure-play voice-AI startups — Hume, Sesame, Cartesia, ElevenLabs’ Conversational stack, the half-dozen call-centre wrappers — now face a frontier-funded competitor with a fundamentally different architecture and the implicit Nvidia-AMD supply line. Expect consolidation among the wrappers and a flight to differentiation (vertical data, compliance, on-prem) among the rest. The voice stack is about to get a lot more crowded at the top and a lot thinner in the middle.
Sources
- [1] Interaction Models: A Scalable Approach to Human-AI Collaboration (Thinking Machines Lab)
- [2] Thinking Machines wants to build an AI that actually listens while it talks (TechCrunch)
- [3] Thinking Machines drops a new, highly responsive model designed for humanlike interactions in real time (SiliconANGLE)
- [4] Mira Murati’s Thinking Machines Lab Unveils Full-Duplex AI That Responds in 0.4 Seconds (The AI Insider)
- [5] Mira Murati’s Thinking Machines Lab is worth $12B in seed round (TechCrunch)