Thinking Machines Unveils Interaction Models for Real-Time AI

Krasa AI

2026-05-16

Mira Murati's Thinking Machines Lab has finally shown what it's been building. On May 12, the company unveiled what it calls "interaction models" — a new class of AI architecture designed to listen, watch, and speak in continuous 200-millisecond micro-turns, replacing the request-response loop that has defined chatbots since ChatGPT. The first model, TML-Interaction-Small, is a 276-billion-parameter mixture-of-experts system with 12 billion active parameters per token, available now to a small group of research partners.

The bet is straightforward. As Thinking Machines tells it, every voice and video AI shipped so far, including OpenAI's Realtime API and Google's Gemini Live, is a patchwork of speech-to-text, language models, and text-to-speech glued together with voice-activity detection. The company argues this stitched architecture is the reason today's voice AI still feels stilted. Its solution is to start over with a model that's natively multimodal from the first training token.

What's Actually New

The architecture splits AI into two parts that run side by side. An "interaction model" stays live with the user, consuming 200ms chunks of audio and video and producing 200ms chunks of speech in return. A "background model" handles slower reasoning and tool use asynchronously, sharing the full conversation context the whole time. The interaction model can interrupt itself, change tone mid-sentence, or pause when you start talking, all without waiting for the turn boundary that traditional voice systems rely on.

That's the technical break. Existing voice AI uses voice-activity detection (VAD — a small classifier that decides when you've finished speaking) to mark turn boundaries. The model waits, transcribes, generates a response, then speaks. Each step adds latency, and the whole pipeline assumes humans take turns. Real conversation doesn't work that way. We overlap, interrupt, finish each other's sentences, and react to what we're seeing in real time.
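To make the turn-boundary cost concrete, here is a minimal sketch of energy-based endpointing, the simplest form of VAD. It is an illustration, not the detector any of the products named here actually ships: the 20 ms frame length, the energy threshold, and the 500 ms silence window are all assumed values, and `find_turn_end` is a hypothetical helper.

```python
FRAME_MS = 20            # assumed frame length
ENERGY_THRESHOLD = 0.01  # assumed speech/silence cutoff
HANGOVER_MS = 500        # assumed silence required before the turn is declared over


def frame_energy(frame):
    """Mean squared amplitude of one frame of audio samples."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)


def find_turn_end(frames, threshold=ENERGY_THRESHOLD,
                  hangover_ms=HANGOVER_MS, frame_ms=FRAME_MS):
    """Return the index of the frame where the turn is declared over, or None.

    The detector only fires after `hangover_ms` of continuous silence
    following speech, so that wait is added to every response.
    """
    needed = hangover_ms // frame_ms  # consecutive silent frames required
    silent_run = 0
    heard_speech = False
    for i, frame in enumerate(frames):
        if frame_energy(frame) >= threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None
```

With ten frames of speech followed by silence, the turn is declared over at frame 34, which is 25 frames (500 ms) after the last speech frame. Under these assumed settings, that half second passes before transcription even begins.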

Thinking Machines built a model that does the same. There's no VAD, no turn boundary, no stitched pipeline. The model produces and consumes streams continuously, and it can call tools mid-response without breaking the conversation.
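The two-part design described above can be sketched as a pair of concurrent loops. This is a toy under stated assumptions, not Thinking Machines' implementation: the function names, the `tool:` prefix convention, the queue-based handoff, and the string "replies" are all hypothetical stand-ins. Only the 200 ms chunk size and the idea of a shared context come from the announcement.

```python
import asyncio

CHUNK_MS = 200  # micro-turn length reported in the announcement


async def background_model(context, tasks, results, work_s=0.05):
    """Slow path: stands in for reasoning or tool use (hypothetical)."""
    while True:
        request = await tasks.get()
        if request is None:  # sentinel: conversation is over
            break
        await asyncio.sleep(work_s)  # pretend work that spans several ticks
        # The background side reads the same context list the live loop writes.
        results.put_nowait(f"{request} (saw {len(context)} context entries)")


async def interaction_model(chunks, context, tasks, results, tick_s):
    """Fast path: one input chunk in, one output chunk out, every tick."""
    outputs = []
    for chunk in chunks:
        context.append(("user", chunk))
        if chunk.startswith("tool:"):
            tasks.put_nowait(chunk[5:])  # hand off without waiting for the answer
        while not results.empty():  # fold in any finished background work
            context.append(("background", results.get_nowait()))
        reply = f"ack {chunk}"  # placeholder for a generated speech chunk
        context.append(("model", reply))
        outputs.append(reply)
        await asyncio.sleep(tick_s)
    tasks.put_nowait(None)
    return outputs


async def run(chunks, tick_s=CHUNK_MS / 1000):
    context = []  # shared by both loops
    tasks, results = asyncio.Queue(), asyncio.Queue()
    bg = asyncio.create_task(background_model(context, tasks, results))
    outputs = await interaction_model(chunks, context, tasks, results, tick_s)
    await bg
    while not results.empty():  # pick up results that finished after the last tick
        context.append(("background", results.get_nowait()))
    return outputs, context
```

Calling `asyncio.run(run(["hi", "tool:weather", "ok", "bye"]))` produces a reply for every chunk, including the one that triggered the tool call. The point of the sketch is that the live loop keeps its cadence while slower work completes on the side.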

The Numbers

On FD-bench v1.5, the benchmark Thinking Machines published alongside the release, TML-Interaction-Small scored 77.8 on overall interaction quality, compared with 54.3 for Gemini 3.1 Flash Live and 46.8 for GPT-Realtime-2.0. End-to-end response latency, the wall-clock time from when you stop speaking to when the model starts replying, was 0.40 seconds, versus 0.57 seconds for Gemini 3.1 Flash Live and 1.18 seconds for GPT-Realtime-2.0.
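For readers who want the gaps rather than the raw figures, the short script below works them out from the numbers reported above. The `compare` helper and the dictionary layout are just a convenience for this article; the inputs are the company's self-reported results.

```python
# Figures as reported by Thinking Machines on FD-bench v1.5.
scores = {
    "TML-Interaction-Small": 77.8,
    "Gemini 3.1 Flash Live": 54.3,
    "GPT-Realtime-2.0": 46.8,
}
latency_s = {
    "TML-Interaction-Small": 0.40,
    "Gemini 3.1 Flash Live": 0.57,
    "GPT-Realtime-2.0": 1.18,
}


def compare(baseline="TML-Interaction-Small"):
    """Score gap (points) and latency ratio of each model against the baseline."""
    rows = {}
    for name in scores:
        if name == baseline:
            continue
        rows[name] = {
            "score_gap": scores[baseline] - scores[name],
            "latency_ratio": latency_s[name] / latency_s[baseline],
        }
    return rows


for name, row in compare().items():
    print(f"{name}: {row['score_gap']:.1f} points behind, "
          f"{row['latency_ratio']:.2f}x the latency")
```

On these numbers, Gemini 3.1 Flash Live trails by 23.5 points and takes roughly 1.4 times as long to respond; GPT-Realtime-2.0 trails by 31.0 points and takes nearly three times as long.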

The 276B/12B MoE architecture is also notable. By activating only 12 billion of its 276 billion parameters per token (about 4%), Thinking Machines can run the model fast enough for 200ms micro-turns on reasonable hardware while keeping the total parameter count high enough to be competitive on quality. The same trick is what made DeepSeek V3 and Mistral's earlier MoE models efficient; Thinking Machines is applying it to the realtime constraint.
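A rough sketch shows where the savings come from. The parameter counts are from the announcement; everything else is assumed. Thinking Machines has not disclosed its expert count, its router, or its top-k value, so `top_k_route` below illustrates one common MoE routing variant (pick the k highest-scoring experts, then softmax over just those), not the company's design.

```python
import math

TOTAL_PARAMS_B = 276   # from the announcement
ACTIVE_PARAMS_B = 12   # from the announcement


def active_fraction(total=TOTAL_PARAMS_B, active=ACTIVE_PARAMS_B):
    """Share of the model's parameters that run for any one token."""
    return active / total


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and normalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    Experts outside the top k are skipped entirely, which is where the
    compute savings come from.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


print(f"{active_fraction():.1%} of parameters active per token")
print(top_k_route([0.1, 2.0, -1.0, 1.5], k=2))  # hypothetical 4-expert router
```

With the reported figures, about 4.3% of the network does work on any given token. In the toy routing call, only experts 1 and 3 are selected and the other two are never evaluated.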

Why This Matters Now

Voice and video are the next frontier for AI products, and every major lab has shipped something in the category. OpenAI's Realtime API powers ChatGPT's Advanced Voice Mode. Google's Gemini Live runs on Pixel phones and inside the Gemini app. Anthropic has been quieter on voice but is expected to ship something by year-end. The category has been growing fast, but the products still feel like prototypes — laggy turns, interruption handling that breaks easily, and the unmistakable sense you're talking to something rather than with it.

Thinking Machines is the first lab to publicly argue that the entire architectural approach is wrong. If interaction models prove out, every voice AI system in production today is built on an outdated foundation. That's a big claim, and the FD-bench numbers are the first piece of evidence supporting it.

For Thinking Machines specifically, this is also the company's first real product disclosure. Murati raised $2 billion at a $12 billion valuation in 2025, and the round was widely characterized as a bet on her team rather than any specific technical roadmap. Interaction models are now the roadmap. The company isn't going after the same coding-agent and enterprise-search markets where Anthropic and OpenAI are competing. It's going after the experience of talking to a computer.

Who This Affects

In the short term: every developer building voice AI products. By Thinking Machines' account, the OpenAI Realtime API, Gemini Live, and various open-source pipelines (LiveKit, Pipecat, OpenAI Whisper + TTS) all use the stitched architecture it is arguing against. If interaction models become available through a similar API, developers will face a real choice about which stack to build on.

In the longer term: anyone whose product roadmap includes natural voice or video AI. Customer support, language tutoring, telehealth, gaming NPCs, and consumer assistants all hit the wall of today's voice AI quickly. A model that can hold an actual conversation (interrupting, reacting, watching the user) unlocks product categories that are technically possible today but commercially unviable because the experience is just bad enough to drive users away.

What Industry Insiders Are Saying

Early reaction has been mixed but engaged. Several researchers on X have pointed out that 200ms micro-turns are a significant engineering challenge: each turn requires a small prefill and decode on the GPU, within a strict latency budget. The fact that Thinking Machines pulled it off at 276B parameters suggests serious infrastructure work, not just a model trick.

Critics have pushed back on the benchmark choice. FD-bench v1.5 is new and published by Thinking Machines itself, so the comparison numbers should be read with that in mind. Until other labs run their own evaluations, the 77.8 vs. 54.3 gap is best understood as Thinking Machines' measurement of its own model: directionally useful, not yet independently verified.

The architectural argument has landed harder than the benchmarks. Several voice AI startup founders posted versions of "this is obviously the right direction" within hours of the announcement, even as they cautioned that getting from a research preview to a production API at competitive cost is the real test.

What's Next

The model is available now to a limited research-partner group. Thinking Machines hasn't disclosed a broader API release date but said a wider rollout is planned for later in 2026. The company's announcement notes that the architecture is designed to scale, suggesting TML-Interaction-Medium and TML-Interaction-Large are in development.

For developers, the immediate watch is whether Thinking Machines opens applications for research access or skips straight to a paid API. For competitors, the question is how quickly OpenAI and Google can ship native multimodal models of their own. Both labs have hinted at moving beyond stitched pipelines, but neither has shipped a production system that does it.

Google I/O is May 19-20, and any Google response will likely show up then. OpenAI's release cadence is less predictable, but a native realtime model has been rumored for months.

The Bottom Line

Mira Murati's first big public reveal at Thinking Machines is a credible technical bet against the way every existing voice AI is built. The benchmarks need independent confirmation, the API isn't public yet, and the product implications will take months to play out. But if 200ms micro-turn interaction models work as advertised, the next generation of AI products won't sound like today's chatbots — they'll sound like conversations. That's a meaningful shift, and Thinking Machines just took the lead in defining what it looks like.
