Most AI systems today work in turns. You type or speak, the model waits, processes your input, and then responds. That's the entire interaction loop. Thinking Machines Lab argues that this style of interaction is a fundamental bottleneck, and it has released a research preview of a new class of system it calls interaction models to address it. The core idea behind the research is that interactivity should be native to the model itself, not bolted on as an afterthought.

What's Wrong with Turn-Based AI

If you've built anything with a language model or voice API, you've worked around the limitations of turn-based interaction. The model has no awareness of what's happening while you're still typing or speaking. It can't see you pause mid-sentence, notice your camera feed, or react to something visual in real time. While the model is generating, it is equally blind: perception freezes until it finishes or gets interrupted. This creates a narrow channel for human-AI collaboration, limiting how much of a person's knowledge, intent, and judgment can reach the model, and how much of the model's work the person can understand in return.

To work around this, most real-time AI systems use a harness: a collection of separate components stitched together to simulate responsiveness. A common example is voice-activity detection (VAD), which predicts when a user has finished speaking so a turn-based model knows when to start generating. The harness is built from components that are meaningfully less intelligent than the model itself, and it precludes capabilities like proactive visual reactions, speaking while listening, or responding to cues that are never explicitly stated aloud.

Thinking Machines Lab's argument is a version of the "bitter lesson" in machine learning: hand-crafted systems are eventually outpaced by scaling general capabilities. For interactivity to scale with intelligence, it must be part of the model itself. Under this approach, scaling a model makes it both smarter and a better collaborator.

https://thinkingmachines.ai/blog/interaction-models/

The Architecture: Multi-Stream, Micro-Turn Design

The system has two components working in parallel: an interaction model that maintains a constant real-time exchange with the user, and a background model that handles deeper reasoning tasks asynchronously.

The interaction model is always on, continuously taking in audio, video, and text and producing responses in real time. When a task requires sustained reasoning (tool use, web search, longer-horizon planning), it delegates to the background model by sending a rich context package containing the full conversation rather than a standalone query. Results stream back as the background model produces them, and the interaction model interleaves those updates into the conversation at a moment appropriate to what the user is currently doing, rather than as an abrupt context switch. Both models share their context throughout. Think of it as one person keeping you engaged in conversation while a colleague looks something up in the background and passes notes forward in real time.

The key architectural decision enabling this is time-aligned micro-turns. Rather than consuming a complete user turn and then generating a complete response, the model treats both input and output as streams processed in 200ms chunks, continuously interleaving the processing of 200ms worth of input with the generation of 200ms worth of output. This is what allows the model to speak while listening, react to visual cues without being prompted verbally, handle truly simultaneous speech, and make tool calls or browse the web while the conversation is still in progress, weaving results back in as they arrive. A simplified sketch of this loop is shown below.
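To make the micro-turn cadence concrete, here is a minimal, hypothetical sketch of such a loop in Python. None of the names (`model.step`, `mic.read`, `model.delegate`, and so on) come from Thinking Machines' API; they are placeholders that illustrate how roughly 200ms of input ingestion alternates with roughly 200ms of output generation against one persistent context.

```python
import time

CHUNK_MS = 200  # micro-turn granularity described in the post


def micro_turn_loop(model, mic, camera, speaker):
    """Conceptual micro-turn loop: interleave ~200ms of input with ~200ms of output.

    `model`, `mic`, `camera`, and `speaker` are hypothetical stand-ins; the real
    system is a single transformer served behind a streaming API, not this loop.
    """
    context = []  # persistent sequence shared across micro-turns
    while True:
        t0 = time.monotonic()

        # 1. Ingest the last ~200ms of every input stream (audio, video, text).
        chunk = {
            "audio": mic.read(ms=CHUNK_MS),
            "video": camera.read(ms=CHUNK_MS),
        }
        context.append(("input", chunk))

        # 2. Generate at most ~200ms worth of output conditioned on everything so far.
        #    The model may also stay silent, or emit a delegation/tool request.
        output = model.step(context, max_output_ms=CHUNK_MS)
        context.append(("output", output))

        if output.audio:
            speaker.play(output.audio)  # speak while still listening
        if output.background_request:
            model.delegate(output.background_request)  # hand off to the background model

        # 3. Keep the loop time-aligned with wall-clock time.
        elapsed_ms = (time.monotonic() - t0) * 1000
        time.sleep(max(0.0, (CHUNK_MS - elapsed_ms) / 1000))
```

The important property is that the model is never out of the loop: every 200ms it sees fresh audio and video and gets another chance to speak, stay silent, or delegate.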
Encoder-free early fusion is the specific design choice that makes multimodal processing work at this cadence. Rather than routing audio and video through large, separately pretrained encoders (such as a Whisper-style ASR model or a standalone TTS decoder), the architecture uses minimal pre-processing. Audio signals are ingested as dMel features and passed through a lightweight embedding layer. Video frames are split into 40×40 patches encoded by an hMLP. Audio output is decoded by a flow head. All components are co-trained from scratch together with the transformer; there is no separately pretrained encoder or decoder at any stage.

On the inference side, the 200ms chunk design creates engineering challenges. Existing LLM inference libraries aren't optimized for frequent small prefills and carry significant per-turn overhead. Thinking Machines implemented streaming sessions, in which the client sends each 200ms chunk as a separate request and the inference server appends the chunks into a persistent sequence in GPU memory, avoiding repeated memory reallocations and metadata computations. A version of this has been upstreamed to SGLang, the open-source inference framework. They also use a gather+gemv strategy for MoE kernels instead of standard grouped GEMM, following prior work from PyTorch and Cursor, to optimize for the latency-sensitive shapes required by bidirectional serving. A hypothetical client-side sketch of the streaming-session pattern follows.
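The sketch below shows the streaming-session idea from the client's side, assuming a generic HTTP inference server. The endpoints and JSON fields (`/session/open`, `/session/{id}/append`, and so on) are illustrative placeholders, not SGLang's actual API; the point is that each 200ms chunk is a small separate request appended to one persistent server-side sequence instead of a full re-prefill of the conversation.

```python
import requests  # any HTTP client works; the endpoints and fields here are hypothetical

SERVER = "http://localhost:30000"  # assumed inference server address
CHUNK_MS = 200


def stream_session(chunks):
    """Send a stream of 200ms chunks against one persistent server-side sequence.

    Mirrors the idea of streaming sessions described in the post: the client tags
    every request with a session id, and the server appends the new chunk to a
    sequence it keeps resident in GPU memory rather than re-processing the prefix.
    """
    # Open a persistent session (hypothetical endpoint).
    session_id = requests.post(f"{SERVER}/session/open").json()["session_id"]

    try:
        for chunk in chunks:
            # Each 200ms chunk is its own small request; the server performs a
            # small incremental prefill onto the stored sequence.
            resp = requests.post(
                f"{SERVER}/session/{session_id}/append",
                json={"audio": chunk["audio"], "video": chunk["video"], "ms": CHUNK_MS},
            ).json()

            # Up to ~200ms of generated output comes back with the same request.
            if resp.get("audio_out"):
                yield resp["audio_out"]
    finally:
        requests.post(f"{SERVER}/session/{session_id}/close")
```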
https://thinkingmachines.ai/blog/interaction-models/

Benchmarks: Where It Stands

The model, named TML-Interaction-Small, is a 276B-parameter Mixture-of-Experts (MoE) with 12B active parameters. The benchmark table distinguishes between Instant models (no extended reasoning) and Thinking models (with reasoning); TML-Interaction-Small is an Instant model.

Among all Instant models in the comparison, it achieves the highest score on Audio MultiChallenge APR at 43.4%, above GPT-realtime-2.0 (minimal) at 37.6%, GPT-realtime-1.5 at 34.7%, and Gemini-3.1-flash-live-preview (minimal) at 26.8%. The Thinking models, GPT-realtime-2.0 (xhigh) at 48.5% and Gemini-3.1-flash-live (high) at 36.1%, use extended reasoning to reach their scores.

On FD-bench v1.5, which measures interaction quality across user interruption, backchanneling, talking-to-others, and background-speech scenarios, TML-Interaction-Small scores 77.8 average quality, compared to 54.3 for Gemini-3.1-flash-live (minimal), 48.3 for GPT-realtime-1.5, and 47.8 for GPT-realtime-2.0 (xhigh). On FD-bench v1 turn-taking latency, the model responds in 0.40 seconds, compared to 0.57s for Gemini, 0.59s for GPT-realtime-1.5, and 1.18s for GPT-realtime-2.0 (minimal). On FD-bench v3, which evaluates response quality and tool use (audio + tools combined), TML-Interaction-Small (with the background agent enabled) scores 82.8% Response Quality / 68.0% Pass@1, the highest in the comparison table.

https://thinkingmachines.ai/blog/interaction-models/

The Thinking Machines research team also introduced new internal benchmarks targeting capabilities that no existing model handles:

TimeSpeak: tests whether the model initiates speech at user-specified times with the correct content. TML: 64.7 macro-accuracy vs. 4.3 for GPT-realtime-2.0 (minimal).

CueSpeak: tests whether the model responds to verbal cues at the correct moment. TML: 81.7 vs. 2.9.

RepCount-A (adapted from an existing repetition-counting dataset): tests visual counting of repeated physical actions in a streaming setting. TML: 35.4 off-by-one accuracy vs. 1.3.

ProactiveVideoQA (adapted benchmark): tests whether the model answers a question at the exact moment the answer becomes visually available in a streamed video. TML: 33.5 PAUC@ω=0.5 vs. 25.0 (the no-response baseline).

Charades (adapted for temporal action localization): the model is asked to say "start" and "stop" as an action begins and ends in a streamed video. TML: 32.4 mIoU vs. a clean zero for GPT-realtime-2.0 (minimal); a sketch of how this metric works follows the list.

So far, no existing model can meaningfully perform any of these tasks.
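As a point of reference for the Charades-style score, here is a minimal sketch of temporal IoU under its standard definition: the overlap between the predicted and ground-truth [start, stop] intervals divided by their union, averaged over clips to give mIoU. The post does not publish its exact scoring code, so treat this as an illustration of the metric rather than the evaluation behind the numbers above.

```python
def interval_iou(pred, gt):
    """Temporal IoU between a predicted (start, stop) interval and ground truth.

    Standard definition: overlap length divided by union length, both in seconds.
    """
    p0, p1 = pred
    g0, g1 = gt
    inter = max(0.0, min(p1, g1) - max(p0, g0))
    union = max(p1, g1) - min(p0, g0)
    return inter / union if union > 0 else 0.0


def mean_iou(predictions, ground_truths):
    """mIoU over a set of clips: the average of the per-clip temporal IoU."""
    scores = [interval_iou(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores)


# Example: the model says "start" at 2.1s and "stop" at 5.8s, while the action
# actually spans 2.0s to 6.0s. IoU = 3.7 / 4.0 = 0.925 for this clip.
print(mean_iou([(2.1, 5.8)], [(2.0, 6.0)]))
```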