Sakana AI Introduces KAME: A Tandem Speech-to-Speech Architecture That Injects LLM Knowledge in Real Time

The central tension in conversational AI has long been a binary choice: respond quickly or respond intelligently. Real-time speech-to-speech (S2S) models – the kind that power fluid voice assistants – start talking almost instantly, but their responses are often shallow. Cascaded systems that route speech through a large language model (LLM) are far more knowledgeable, but their pipeline delay is long enough to make the conversation feel stiff and robotic. Researchers at Sakana AI, an AI lab in Tokyo, present KAME (Knowledge-Access Model Extension), a hybrid architecture that preserves the near-zero response latency of a direct S2S system while injecting the rich knowledge of a back-end LLM in real time.

The Problem: Two Paradigms, Two Trade-offs

To understand why KAME matters, it helps to understand the two dominant design paradigms it bridges.

A direct S2S model such as Moshi (developed by Kyutai) is a monolithic transformer that consumes audio tokens and produces audio tokens in a continuous loop. Because it never needs to synchronize with external systems, its response latency is very low – for most queries, the model starts talking before the user finishes speaking. But because acoustic signals are far more information-dense than text, the model must devote significant capacity to modeling paralinguistic features such as tone, emotion, and rhythm. That leaves little room for deep knowledge and reasoning.

A cascaded system, in contrast, runs the user’s speech through an Automatic Speech Recognition (ASR) model, feeds the resulting text into a powerful LLM, and then converts the LLM’s response back into speech through a Text-to-Speech (TTS) engine. Response quality is excellent – you can plug in any frontier LLM – but the system must wait for the user to finish speaking before ASR and LLM processing even begin. The result is an average delay of around 2.1 seconds, long enough to noticeably disrupt the flow of natural conversation.
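
To make the latency arithmetic concrete, here is a minimal Python sketch of a cascaded pipeline. The three stage functions are hypothetical stubs, not any real API, and the sleep values are only illustrative of typical stage latencies; the point is simply that the stages are strictly sequential, so their delays add up.

```python
import time

# Hypothetical stand-ins for real ASR / LLM / TTS services; the sleeps
# model the blocking latency of each stage, with values chosen only to
# roughly match the ~2.1 s average the article cites.
def asr_transcribe(audio: bytes) -> str:
    time.sleep(0.3)   # ASR can only finish after the utterance ends
    return "what is the boiling point of water?"

def llm_respond(text: str) -> str:
    time.sleep(1.5)   # frontier-LLM generation dominates the delay
    return "Water boils at 100 degrees Celsius at sea level."

def tts_synthesize(text: str) -> bytes:
    time.sleep(0.3)   # synthesis happens before any audio is emitted
    return b"<pcm-audio>"

# The cascade is strictly sequential: each stage blocks on the previous
# one, so their latencies sum before the first sound is played.
start = time.time()
reply_audio = tts_synthesize(llm_respond(asr_transcribe(b"<user-utterance>")))
print(f"first audio after {time.time() - start:.1f} s")  # ~2.1 s here
```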

KAME’s Architecture: Talking While Thinking

KAME works as a tandem system with two asynchronous components working in parallel.

The front-end S2S module is based on the Moshi architecture and processes audio in real time as a stream of discrete audio tokens (one step roughly every 80 milliseconds). It begins producing a spoken response immediately. Internally, Moshi’s original three-stream design — input audio, inner monologue (text), and output audio — is extended in KAME with a fourth stream: the oracle stream. This is the core architectural innovation.
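
As a rough illustration of that four-stream layout, the sketch below models one ~80 ms step as a record with one slot per stream. The field names and the `Token` alias are illustrative assumptions, not the authors’ actual identifiers.

```python
from dataclasses import dataclass
from typing import Optional

Token = int  # stand-in for a discrete codec/text token id

@dataclass
class KameFrame:
    """One ~80 ms step of the front-end transformer: Moshi's original
    three streams plus KAME's added oracle stream (names illustrative)."""
    input_audio: Token                # the user's incoming speech, tokenized
    inner_monologue: Optional[Token]  # text token aligned with output speech
    output_audio: Token               # the model's own speech, tokenized
    oracle: Optional[Token]           # knowledge streamed from the back-end LLM
```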

The back-end LLM module pairs a speech-to-text (STT) component with a full-scale LLM. As the user speaks, the STT component continuously produces partial transcripts and periodically forwards them to the back-end LLM. For each partial transcript it receives, the LLM generates a candidate response – called an oracle – and streams it back to the front-end. Because the user is still speaking, these oracles start out as educated guesses and become progressively more accurate as the transcript completes.
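
A minimal asyncio sketch of that loop follows, assuming hypothetical stand-ins for the streaming STT (`partial_transcripts`) and the LLM call (`llm_generate`); it only illustrates the pattern of re-answering each growing transcript prefix.

```python
import asyncio

async def partial_transcripts():
    # A real STT stream would emit growing prefixes as the user speaks.
    for prefix in ["what is", "what is the capital",
                   "what is the capital of France?"]:
        await asyncio.sleep(0.5)
        yield prefix

async def llm_generate(prompt: str) -> str:
    await asyncio.sleep(0.2)  # placeholder for back-end LLM latency
    return f"candidate answer for: {prompt!r}"

async def oracle_loop(oracle_queue: asyncio.Queue) -> None:
    # Every partial transcript triggers a fresh candidate response
    # ("oracle"); early ones are educated guesses, later ones converge
    # on an answer to the user's actual question.
    async for partial in partial_transcripts():
        oracle = await llm_generate(partial)
        await oracle_queue.put(oracle)  # streamed to the front-end

async def main() -> None:
    q: asyncio.Queue = asyncio.Queue()
    producer = asyncio.create_task(oracle_loop(q))
    for _ in range(3):
        print(await q.get())  # the front-end consumes oracles as they arrive
    await producer

asyncio.run(main())
```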

The front-end S2S transformer then conditions its ongoing speech output on both its own internal context and these incoming oracle tokens. When a newer, better oracle arrives, the model can correct course – effectively revising its answer mid-sentence, much as a human would. And because the two modules run in parallel and asynchronously, the initial response delay stays close to zero.
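
One way to picture that conditioning, as a conceptual sketch rather than the authors’ implementation: the front-end checks for a fresher oracle on every ~80 ms step and feeds the newest one into the transformer alongside its running context (`s2s_step` below is a hypothetical stand-in for one forward step).

```python
import queue

def speak(frames, oracle_queue: "queue.Queue[str]", s2s_step):
    context, latest_oracle = [], None
    for audio_in in frames:                            # one iteration ≈ 80 ms
        while not oracle_queue.empty():
            latest_oracle = oracle_queue.get_nowait()  # newest oracle wins
        audio_out = s2s_step(context, audio_in, latest_oracle)
        context.append((audio_in, latest_oracle, audio_out))
        yield audio_out                                # speech flows immediately

# Tiny usage with stub pieces:
q: queue.Queue = queue.Queue()
q.put("oracle: Paris is the capital of France")
step = lambda ctx, a_in, oracle: f"<audio conditioned on {oracle!r}>"
print(next(speak([b"<frame>"], q, step)))
```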

Training on Simulated Oracles

One challenge is that no naturally occurring dataset contains oracle signals. The Sakana AI research team addresses this with a technique called Oracle Augmentation. Using a ‘simulator’ LLM and a standard conversational dataset (user speech plus ground-truth responses), the team generates oracle sequences that mimic what a real-time LLM would produce at different levels of transcript completeness. They define six guidance levels (0–5), ranging from a completely unguided guess at level 0 to a near-ground-truth response at level 5. KAME’s training data was built from 56,582 synthetic conversations derived from MMLU-Pro, GSM8K, and HSSBench, converted to audio with TTS and augmented with these simulated oracles.
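
The sketch below shows one plausible way to simulate such oracles, under stated assumptions: `simulator_llm` is a hypothetical callable standing in for the simulator model, and truncating the question by word count is an illustrative proxy for transcript completeness – the paper’s exact construction may differ.

```python
def simulated_oracle(question: str, answer: str, level: int,
                     simulator_llm) -> str:
    """Guidance level 0 = fully unguided guess ... 5 = near ground truth."""
    if level == 5:
        return answer  # near-ground-truth oracle
    words = question.split()
    partial = " ".join(words[: len(words) * level // 5])  # completeness proxy
    return simulator_llm(
        f"Partial user utterance: {partial!r}\n"
        "Guess the full question and answer it concisely:"
    )

# Usage with a trivial stub standing in for the simulator LLM:
stub = lambda prompt: f"<simulated answer given: {prompt.splitlines()[0]}>"
print(simulated_oracle("what is the capital of France", "Paris", 2, stub))
```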

Results: Near-Cascaded Quality, Near-Zero Latency

Tests on the speech-suitable subset of the MT-Bench multi-turn Q&A benchmark — specifically the Reasoning, STEM, and Humanities domains (Coding, Extraction, Math, Roleplay, and Writing are excluded as unsuitable for spoken interaction) — show dramatic improvements. Moshi alone scores 2.05 on average. KAME with gpt-4.1 as the back-end scores 6.43, and KAME with claude-opus-4-1 scores 6.23 – both at the same latency as Moshi. The strongest cascaded system, Unmute (also backed by gpt-4.1), scores 7.70, but with an average latency of 2.1 seconds versus KAME’s near zero.

To isolate back-end capability from timing effects, the research team also scored the back-end LLM’s text response from the final oracle injection in each KAME session directly – bypassing the early-generation problem entirely. That configuration averaged 7.79 (Reasoning 6.48, STEM 8.34, Humanities 8.56), compared to Unmute’s 7.70. This confirms that KAME’s remaining gap to cascaded systems is not a deficit in the back-end LLM’s knowledge, but the cost of the front-end starting to speak before the user’s full question has been heard.

Importantly, KAME is fully back-end agnostic. The front-end was trained with gpt-4.1-nano as the back-end, yet swapping in claude-opus-4-1 or gemini-2.5-flash at inference time requires no retraining. In Sakana AI’s tests, claude-opus-4-1 tended to outperform gpt-4.1 on reasoning tasks, while gpt-4.1 scored higher on humanities questions – suggesting that practitioners can route queries to the most appropriate LLM without touching the front-end model.
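
In practice that routing could be as simple as a lookup table in front of the back-end, sketched below with hypothetical names; the model identifiers are those tested in the article, and the domain-to-model mapping simply mirrors its observations.

```python
# Hypothetical domain-to-back-end routing; since the front-end consumes
# only oracle text, any frontier LLM can serve as the knowledge source.
BACKENDS = {
    "reasoning": "claude-opus-4-1",  # stronger on reasoning per the article
    "humanities": "gpt-4.1",         # stronger on humanities questions
    "default": "gemini-2.5-flash",
}

def pick_backend(domain: str) -> str:
    return BACKENDS.get(domain, BACKENDS["default"])

print(pick_backend("reasoning"))  # -> claude-opus-4-1
```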

Key Takeaways

  • KAME dissolves the speed-vs-knowledge trade-off in conversational AI by running a speech-to-speech model at the front end and a back-end LLM in parallel – the S2S model responds instantly while the LLM continuously injects progressively refined ‘oracle’ signals in real time, shifting the paradigm from ‘think, then speak’ to ‘speak while you think.’
  • The performance gains come without the latency cost — KAME lifts the MT-Bench score from 2.05 (Moshi baseline) to 6.43, approaching the cascaded Unmute system’s 7.70, while maintaining near-zero response latency versus Unmute’s average 2.1-second delay.
  • The architecture is completely back-end agnostic — the front-end was trained using gpt-4.1-nano but supports plug-and-play swapping of any frontier LLM (gpt-4.1, claude-opus-4-1, gemini-2.5-flash) at runtime without retraining, allowing task-specific LLM selection based on domain capabilities.

Check out the model weights, paper, inference code, and technical details.
