Thinking Machines Lab ships its first model and argues interactivity is what OpenAI gets wrong about voice
Maximilian Schreiner
May 12, 2026
Key Points
Thinking Machines Lab, founded by ex-OpenAI CTO Mira Murati, has released its first AI model that processes audio, video, and text in 200-millisecond chunks, replacing rigid turn-taking with fluid, real-time conversation.
The model outperforms OpenAI's GPT-Realtime-2 and Google's Gemini Live on interaction quality and latency benchmarks, pairing a fast interaction model with a background reasoning model.
Despite the technical promise, the startup still faces pressure, as several key employees have recently left the company.
Thinking Machines Lab has released a research preview of its first AI model, designed to break voice AI out of the traditional question-and-answer pattern. The model processes audio, video, and text in parallel 200-millisecond chunks, and the startup claims it beats OpenAI's GPT-Realtime-2 and Google's Gemini Live on interaction quality.
Thinking Machines Lab has published a research preview of what it calls Interaction Models, AI models that handle interaction natively rather than through external scaffolding. The core idea is that interactivity should scale alongside intelligence, not get treated as an afterthought.
Current voice AI systems still feel robotic
Today's real-time systems like GPT-Realtime or Gemini Live continuously take in audio, but the actual language model never sees it directly. According to Thinking Machines, a "harness" of separate components sits in front of the model, including things like a voice activity detector that decides when a speaker's turn is over. Only then does the finished utterance get handed to the model, which generates a complete response. While the model is talking, its perception freezes: it receives no new information until it finishes or gets interrupted.
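To make the pattern concrete, here is a minimal Python sketch of such a harness. Everything in it is a stand-in: the VAD, the transcription step, and the model call are trivial stubs invented for illustration, not any vendor's actual components.

```python
# Hypothetical sketch of the "harness" pattern Thinking Machines describes:
# a voice activity detector (VAD) gates the model, which only ever sees
# finished utterances. All components are trivial stand-ins.

def vad_turn_ended(buffer):
    # Stand-in for a real VAD: treat a pause marker as end of turn.
    return bool(buffer) and buffer[-1] == "<pause>"

def transcribe(buffer):
    return " ".join(chunk for chunk in buffer if chunk != "<pause>")

def generate_reply(utterance):
    # Stand-in for the language model. In this design it produces one
    # complete response and perceives nothing while it is "speaking".
    return f"Model reply to: {utterance}"

def harness_loop(audio_stream):
    buffer = []
    for chunk in audio_stream:
        buffer.append(chunk)        # audio streams in continuously...
        if vad_turn_ended(buffer):  # ...but the model is only invoked
            print(generate_reply(transcribe(buffer)))  # once the VAD
            buffer = []             # declares the user's turn over

harness_loop(["hello", "there", "<pause>", "what", "is", "new", "<pause>"])
```

The gate is the point: however smart generate_reply becomes, it can only act once vad_turn_ended fires, which is why proactive or overlapping speech is structurally impossible in this setup.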
These components are far less intelligent than the model itself. That means behaviors that define real conversation simply don't work, according to Thinking Machines: proactively jumping in ("interrupt me if I say something wrong"), reacting to visual cues ("tell me when I've written a bug"), or speaking simultaneously, which would be useful for something like live translation. Citing Sutton's "Bitter Lesson," the lab argues that these hand-crafted systems will eventually be outpaced by the advance of general capabilities.
Thinking Machines' Interaction Models replace the harness with a model that processes the audio and video stream directly rather than receiving pre-segmented utterances. The approach resembles full-duplex models like Moshi or Nemotron VoiceChat, which work in a similarly interleaved fashion but are smaller-scale models focused on latency rather than intelligence benchmarks.
A 200-millisecond clock replaces artificial turn boundaries
The real break from existing architectures is what the team calls time-aligned micro-turns. The model continuously processes 200 milliseconds of input and generates 200 milliseconds of output, with both token streams running in an interleaved fashion. Input and output no longer happen sequentially. Instead, they share the same clock cycle.
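As a rough illustration, the following Python sketch mimics that shared clock. The chunk format and the per-tick decision function are invented for the example, not taken from the release.

```python
# Hypothetical sketch of time-aligned micro-turns: on every 200 ms tick
# the model consumes one input chunk and emits one output chunk, which
# may be silence. Illustrative only; not Thinking Machines' actual code.

SILENCE = "<200ms of silence>"

def per_tick_decision(history, chunk_in):
    # Stand-in for the model's per-tick choice: stay silent, interject,
    # or speak over the user. Here it simply interjects on a keyword.
    if "bug" in chunk_in:
        return "<interject: that looks like a bug>"
    return SILENCE

def interaction_loop(input_chunks):
    history = []
    for tick, chunk_in in enumerate(input_chunks):
        chunk_out = per_tick_decision(history, chunk_in)
        history.append((chunk_in, chunk_out))  # input and output share the
        print(f"t={tick * 200:4d}ms  in={chunk_in!r}  out={chunk_out!r}")
        # same clock cycle: no turn boundary ever freezes perception.

interaction_loop(["let me", "write this loop", "oops, a bug", "fixed now"])
```

Because an output chunk is produced on every tick, saying nothing becomes a decision the model itself makes rather than a state enforced by an external detector.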
This eliminates artificial turn boundaries, letting the model decide on its own whether to stay silent, interject, or speak alongside the user. Audio and images aren't run through large, standalone encoders but are fed into the transformer with minimal preprocessing. That saves latency, though it could also limit the model's ability to pick up fine visual details like text.
The real-time model has another challenge, though. If you need to respond every 200 milliseconds, you can't simultaneously spend minutes reasoning or searching the web. Thinking Machines solves this by pairing the interaction model with a second, asynchronous background model that handles longer tasks like reasoning, tool use, and research.
Both models share the same conversation context. The interaction model delegates tasks while keeping the conversation going, then weaves results from the background model into the conversation as they arrive, at a moment appropriate to what the user is currently doing rather than as an abrupt context switch. The goal is to combine the response speed of a fast model with the depth of a reasoning model.
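Here is a minimal sketch of that delegation pattern, assuming a thread for the background model and a queue standing in for the shared context; none of these names come from Thinking Machines.

```python
# Hypothetical sketch of the two-model split: the fast interaction model
# keeps its 200 ms loop going while a background model works on a slow
# task and posts its result into a shared store when it is done.
import queue
import threading
import time

finished_results = queue.Queue()  # stand-in for the shared context

def background_model(task):
    time.sleep(0.5)               # stand-in for slow reasoning or tool use
    finished_results.put(f"result for {task!r}")

def interaction_loop(chunks):
    for chunk in chunks:
        if chunk.startswith("research:"):
            # Delegate without blocking the conversation.
            threading.Thread(target=background_model, args=(chunk,)).start()
            print("out: on it, keep talking")
        elif not finished_results.empty():
            # Weave the finished result in at a natural moment.
            print(f"out: by the way, {finished_results.get()}")
        else:
            print(f"out: <ack {chunk!r}>")
        time.sleep(0.2)           # one 200 ms micro-turn per chunk

interaction_loop(["research: check that claim", "so anyway",
                  "as I was saying", "right", "exactly"])
```

The interaction loop never blocks: both the handoff and the later weaving-in are just more per-tick decisions, which is what lets the fast model stay responsive while the slow one thinks.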
Benchmarks suggest the approach works
The model is called TML-Interaction-Small, a 276-billion-parameter mixture-of-experts model with 12 billion active parameters. On FD-bench v1.5, which measures interaction quality across scenarios like user interruptions, backchanneling, and background speech, it significantly outperforms both OpenAI's GPT-Realtime-2 and Google's Gemini-3.1-flash-live. Response latency comes in at 0.40 seconds, compared to a minimum of 1.18 seconds for GPT-Realtime-2 and 0.57 seconds for Gemini.
On Audio MultiChallenge, which tracks intelligence and instruction following, the model scores 43.4 percent, above the fast variants of its competitors but below GPT-Realtime-2 in "xhigh" thinking mode, which hits 48.5 percent. On the lab's own benchmarks for time awareness (TimeSpeak, CueSpeak) and visual proactivity (RepCount-A, ProactiveVideoQA, Charades), Thinking Machines reports that no existing model can meaningfully perform any of these tasks. Tested competitors either stay silent or give incorrect answers.
A $2 billion startup with something to prove
Thinking Machines Lab was founded in February 2025 by Mira Murati and other former OpenAI researchers. In July 2025, the company closed a $2 billion seed round at a $12 billion valuation, all without a product. A follow-on round reportedly in the works at around $50 billion didn't come together by the end of 2025, and several key employees have since left the company. The Interaction Model is the first in-house AI model backing Murati's claim that she can build a real competitor to OpenAI, Anthropic, and Google DeepMind.
Before this, the company had released Tinker, a tool designed to let developers efficiently fine-tune open models using LoRAs without having to deal with distributed training.
Source: ThinkingMachines