Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose STITCH, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time to generate the tokens in a chunk of spoken response, we use the remaining free time to generate the unspoken CoT tokens. When a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, STITCH matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; STITCH also performs equally well on non-reasoning datasets as those baseline models.
SLMs take speech input and generate speech output. To do so, the model first generates $N_{text}$ text tokens, then generates $N_{speech}$ speech tokens, then generates another $N_{text}$ text tokens, then another $N_{speech}$ speech tokens, and so on. The output interleaved token chunks will look like this:
Humans think before we speak, which helps us give a better spoken response. The inner thinking can be complicated, but we can summarize our inner thinking into a concise and easy-to-follow answer.
Current SLMs do not include an unspoken thinking process before generating the speech response. The text tokens generated by SLMs directly correspond to what will be spoken.
If we want reasoning before speaking, the straightforward approach is to generate a full unspoken reasoning process before speaking the response. The output is like this:
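In code form, a rough sketch of this "think fully, then talk" layout, assuming the `interleaved_output` helper from the sketch above (again, the names are ours for illustration):

```python
# "Reason first, then speak": the entire unspoken chain-of-thought precedes the
# first text-speech chunk, so no audio can be synthesized until every reasoning
# token has been generated. Assumes interleaved_output() from the sketch above.

def reason_then_speak(reasoning_tokens, text_tokens, speech_tokens):
    return [("REASONING", reasoning_tokens)] + \
           interleaved_output(text_tokens, speech_tokens)
```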
Although this is not the main method we propose, we believe we are actually the first to do this: we did not find any prior work on speech-to-speech SLMs that generates unspoken reasoning before speaking.
This achieves really good performance on reasoning datasets, but the latency can be high since we need to wait for the full reasoning process to be completed before we get the first chunk of text-speech tokens to synthesize the audio output.
Key idea:
⇒ Let’s generate some partial reasoning chunks, each with size $N_{reason}$, and alternate those unspoken partial reasoning chunks with the spoken text-speech chunks! The output chunks will look like this:
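The same alternating layout, sketched in code under the chunk sizes from the earlier sketches. `N_REASON` is a placeholder value, not necessarily the chunk size used in the paper.

```python
# A sketch of the STITCH-R generation order: each spoken text-speech chunk is
# preceded by one partial, unspoken reasoning chunk. Reuses N_TEXT / N_SPEECH
# from the earlier sketch; N_REASON is an illustrative placeholder.

N_REASON = 10  # tokens per partial reasoning chunk (illustrative)

def stitch_r_output(reasoning_tokens, text_tokens, speech_tokens):
    output, r, t, s = [], 0, 0, 0
    while t < len(text_tokens) or s < len(speech_tokens):
        if r < len(reasoning_tokens):
            # unspoken partial reasoning chunk comes first ...
            output.append(("REASONING", reasoning_tokens[r:r + N_REASON]))
            r += N_REASON
        # ... followed by the spoken text-speech chunk
        output.append(("TEXT", text_tokens[t:t + N_TEXT]))
        output.append(("SPEECH", speech_tokens[s:s + N_SPEECH]))
        t, s = t + N_TEXT, s + N_SPEECH
    return output
```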
We call this method STITCH-R (Simultaneous Thinking and Talking with Chunked Reasoning). The R stands for reasoning first, and we will have another method called STITCH-S, which is the speech first one.
The above rationale seems reasonable, but does alternating unspoken and spoken chunks in the output make the output speech discontinuous? In other words, will the output speech pause and wait for the unspoken reasoning chunk to be generated?
No!!! The key observation is that to generate a 2-second output speech, we need about 26 speech tokens (and an additional 13 text tokens to guide those speech tokens). When running a 9B model on A100 using vLLM, we can generate 160 tokens in 2 seconds. This means that “the duration of a chunk of audio output” is much longer than “the time for generating the text and speech tokens for synthesizing one chunk of audio output”. So we have plenty of time left, and we use that spare time to generate a partial reasoning chunk.
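The back-of-the-envelope arithmetic, using the numbers above (treat them as rough, hardware-dependent estimates):

```python
# Token budget per 2-second audio chunk (9B model, A100, vLLM, numbers from above):
# ~160 tokens can be generated while one chunk plays, but only ~39 of them are
# needed to synthesize the next spoken chunk.

AUDIO_CHUNK_SEC = 2
SPEECH_TOKENS_PER_CHUNK = 26
TEXT_TOKENS_PER_CHUNK = 13
TOKENS_PER_2_SEC = 160  # measured generation throughput over 2 seconds

spoken_budget = SPEECH_TOKENS_PER_CHUNK + TEXT_TOKENS_PER_CHUNK  # 39 tokens
spare_budget = TOKENS_PER_2_SEC - spoken_budget                   # 121 tokens

print(f"Spare tokens per {AUDIO_CHUNK_SEC}s audio chunk: {spare_budget}")
# -> roughly 121 tokens of unspoken reasoning can be generated "for free"
#    while the previous audio chunk is being played back.
```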
Some illustrative samples are as follows. Note that the speed shown here is an ideal case, and the real speed depends on the hardware and implementation.
STITCH-R still needs to generate a partial reasoning chunk before the first text and speech tokens are generated, so it is not the fastest way to start speaking. Instead, we can generate the speech chunk first, and then generate the reasoning chunk after it. The output is like this:
In this case, the number of tokens we need to wait for before starting to speak is exactly the same as in the no-reasoning case, so the latency is the same as the no-reasoning case. In the illustrative samples, we can see that the speech is generated first, and then the reasoning is generated after the speech.
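A rough way to see the latency argument, reusing the chunk sizes from the sketches above (this ignores prompt processing and other overheads):

```python
# Tokens that must be generated before the first audio chunk can be synthesized,
# as a proxy for first-chunk latency. Reuses N_TEXT, N_SPEECH, N_REASON from the
# earlier sketches; a simplification, not an exact latency model.

def tokens_before_first_audio(method):
    if method == "no_reasoning":
        return N_TEXT + N_SPEECH              # 13 + 26 = 39
    if method == "stitch_r":
        return N_REASON + N_TEXT + N_SPEECH   # one partial reasoning chunk extra
    if method == "stitch_s":
        return N_TEXT + N_SPEECH              # speech first: same as no reasoning
    raise ValueError(f"unknown method: {method}")
```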
We evaluate STITCH-R and STITCH-S on the following datasets:
STITCH-R and STITCH-S perform much better than the baselines without reasoning. TBS (the think-before-speaking approach described above) and STITCH-R/S are comparable on average.
On non-reasoning datasets, all models perform similarly. This means that fine-tuning a model to generate unspoken reasoning does not harm the performance on non-reasoning datasets.
STITCH is purely a research project. Currently, we have no plans to incorporate STITCH into a product or expand access to the public. STITCH can generate speech and can be used for interactive voice response systems, chatbots, and so on. As a spoken language model, it may carry potential risks of misuse, including spoofing voice identification or impersonating a specific speaker. In our current implementation, STITCH is fine-tuned from GLM-4-Voice, which is released under the Apache-2.0 license. Our project follows the license and does not violate the intended use of the original model.
@misc{chiang2025stitchsimultaneousthinkingtalking,
title={STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models},
author={Cheng-Han Chiang and Xiaofei Wang and Linjie Li and Chung-Ching Lin and Kevin Lin and Shujie Liu and Zhendong Wang and Zhengyuan Yang and Hung-yi Lee and Lijuan Wang},
year={2025},
eprint={2507.15375},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.15375},
}