STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

Cheng-Han Chiang1,2, Xiaofei Wang2, Linjie Li2, Chung-Ching Lin2, Kevin Lin2, Shujie Liu2, Zhendong Wang2, Zhengyuan Yang2, Hung-yi Lee1, Lijuan Wang2
1National Taiwan University
2Microsoft

Illustration of STITCH-R, a spoken language model (SLM) that can think and talk simultaneously. Each audio chunk lasts 2 seconds, which is enough time to generate the text and audio tokens for the next audio chunk plus some additional reasoning tokens.

Abstract

Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose STITCH, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time to generate the tokens in a chunk of spoken response, we use the remaining free time to generate the unspoken CoT tokens. When a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, STITCH matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; STITCH also performs equally well on non-reasoning datasets as those baseline models.

Background: Spoken Language Models (SLMs)

SLMs take speech input and generate speech output.

  1. To generate speech outputs, most SLMs generate some discrete speech tokens (also called audio tokens). The speech tokens will be converted into an audio waveform by a speech decoder.
  2. SLMs are mostly fine-tuned from text LLMs, and teaching those LLMs to directly generate speech tokens is very difficult since the speech tokens are very different from the text tokens, which are more familiar to the LLM.
  3. A common solution is to predict some intermediate text tokens before predicting the speech tokens. Those text tokens are the transcription of the upcoming speech tokens, and they help guide the generation of the speech tokens.
  4. To support streaming inference, SLMs generate text tokens (Text) and speech tokens (Speech) chunk by chunk in an interleaved manner. That is, the model first generates $N_{text}$ text tokens, then $N_{speech}$ speech tokens, then the next $N_{text}$ text tokens, then the next $N_{speech}$ speech tokens, and so on. The interleaved output token chunks look like this (a minimal code sketch of this loop follows the illustration below):
No-reasoning token chunks

Illustration of existing SLMs: Cannot generate unspoken reasoning and lowest latency
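
To make this interleaved schedule concrete, here is a minimal Python sketch of the decoding loop. The helpers `generate_tokens`, `decode_speech`, and `play_audio` are hypothetical placeholders for the model's decoder, the speech decoder, and the audio player, and the chunk sizes are illustrative values matching the numbers used later on this page; this is a sketch, not the actual implementation.

```python
# Minimal sketch of the interleaved text/speech decoding loop of a streaming SLM.
# All helper callables are hypothetical placeholders.

N_TEXT = 13    # text tokens per chunk (illustrative)
N_SPEECH = 26  # speech tokens per chunk, roughly 2 seconds of audio (illustrative)
EOS = "<eos>"  # hypothetical end-of-response token

def stream_response(context, generate_tokens, decode_speech, play_audio):
    """Generate a spoken response chunk by chunk, streaming audio as it is produced."""
    while True:
        # 1. A chunk of text tokens: the transcription of the upcoming speech chunk.
        text_chunk = generate_tokens(context, n=N_TEXT)
        context += text_chunk
        # 2. A chunk of speech tokens, guided by the text tokens just generated.
        speech_chunk = generate_tokens(context, n=N_SPEECH)
        context += speech_chunk
        # 3. Convert the speech tokens to a waveform and start playing it immediately.
        play_audio(decode_speech(speech_chunk))
        if EOS in text_chunk:
            break
```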

Problem with Current SLMs

We humans think before we speak, which helps us give better spoken responses. Our inner thinking can be complicated, but we can summarize it into a concise and easy-to-follow answer.

Current SLMs do not include an unspoken thinking process before generating the speech response. The text tokens generated by SLMs directly correspond to what will be spoken.

Trivial Solution: Thinking before Speaking (TBS)

If we want reasoning before speaking, the simplest approach is to generate a complete unspoken reasoning process before speaking the response. The output looks like this:

TBS token chunks

Illustration of TBS: Generates unspoken reasoning before speaking but very high latency

Although this is not the main method we propose, to our knowledge we are the first to do this: we did not find any prior work on speech-to-speech SLMs that generates unspoken reasoning before speaking.

TBS achieves strong performance on reasoning datasets, but its latency can be high: the full reasoning process must finish before the first chunk of text-speech tokens is available to synthesize the audio output.
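
A back-of-the-envelope estimate makes this latency cost concrete. It uses the throughput figure reported later on this page (about 160 tokens per 2 seconds for a 9B model on an A100 with vLLM) and the illustrative chunk sizes of 13 text plus 26 speech tokens; the 800-token CoT below is a hypothetical example, not a measured number.

```python
# Rough time-to-first-audio estimate for TBS vs. a no-reasoning SLM.
TOKENS_PER_SECOND = 80   # ~160 tokens per 2 seconds (9B model, A100, vLLM)
N_TEXT, N_SPEECH = 13, 26

def time_to_first_audio(n_reasoning_tokens: int) -> float:
    """Seconds until the first audio chunk can be synthesized."""
    return (n_reasoning_tokens + N_TEXT + N_SPEECH) / TOKENS_PER_SECOND

print(time_to_first_audio(0))    # no reasoning: ~0.49 s
print(time_to_first_audio(800))  # TBS with a hypothetical 800-token CoT: ~10.5 s
```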

Our Solution: STITCH: Simultaneous Thinking and Talking with Chunked Reasoning

Key idea:

  1. We want to generate reasoning before speaking
  2. But we don’t want to generate the full reasoning since the latency will be high

⇒ Let's generate some partial reasoning chunks, each with size $N_{reason}$, and alternate those unspoken partial reasoning chunks with the spoken text-speech chunks! The output chunks will look like this:

STITCH-R token chunks

We call this method STITCH-R (Simultaneous Thinking and Talking with Chunked Reasoning). The R stands for reasoning-first; we also introduce a variant called STITCH-S, which is speech-first.
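
In terms of the decoding loop, STITCH-R only changes the baseline sketch above by inserting an unspoken reasoning chunk at the start of every iteration. As before, the helper callables and the value of $N_{reason}$ are hypothetical placeholders rather than the actual implementation.

```python
# STITCH-R sketch: each iteration first produces an unspoken reasoning chunk,
# then the text/speech chunk that is actually played. `generate_tokens`,
# `decode_speech`, and `play_audio` are hypothetical placeholders, as before.

N_REASON = 16              # reasoning tokens per chunk (illustrative; a tunable hyperparameter)
N_TEXT, N_SPEECH = 13, 26  # illustrative chunk sizes
EOS = "<eos>"              # hypothetical end-of-response token

def stream_stitch_r(context, generate_tokens, decode_speech, play_audio):
    while True:
        # 1. Unspoken partial reasoning: kept in the context, never synthesized.
        context += generate_tokens(context, n=N_REASON)
        # 2. Spoken text chunk and its speech tokens, exactly as in the baseline.
        text_chunk = generate_tokens(context, n=N_TEXT)
        speech_chunk = generate_tokens(context + text_chunk, n=N_SPEECH)
        context += text_chunk + speech_chunk
        # 3. Synthesize and play this ~2-second audio chunk.
        play_audio(decode_speech(speech_chunk))
        if EOS in text_chunk:
            break
```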

The above rationale sounds reasonable, but does alternating unspoken and spoken chunks make the output speech discontinuous? In other words, will the output speech pause while it waits for the unspoken reasoning chunk to be generated?

No! The key observation is that generating 2 seconds of output speech requires about 26 speech tokens (plus an additional 13 text tokens to guide them). Running a 9B model on an A100 GPU with vLLM, we can generate about 160 tokens in 2 seconds. This means that "the duration of a chunk of audio output" is much longer than "the time needed to generate the text and speech tokens that synthesize that chunk". So we have plenty of time left, and we use that spare time to generate a partial reasoning chunk.
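
A quick sanity check of that budget, using only the numbers from the paragraph above:

```python
# Per-chunk token budget implied by the numbers above.
tokens_per_2s = 160                                 # ~9B model on an A100 with vLLM
speech_tokens, text_tokens = 26, 13                 # one ~2-second audio chunk
spoken_cost = speech_tokens + text_tokens           # 39 tokens must be spoken
spare_for_reasoning = tokens_per_2s - spoken_cost   # ~121 tokens left for unspoken reasoning
print(spoken_cost, spare_for_reasoning)             # 39 121
```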

Illustration of STITCH-R: Thinking when talking with reasoning first; lower latency

Some illustrative samples are as follows. Note that the speed shown here is an ideal case, and the real speed depends on the hardware and implementation.

Illustration of STITCH-R: A real token example

STITCH-S: STITCH with Speaking First

STITCH-R still generates a partial reasoning chunk before the first text and speech tokens, so it does not start speaking as early as possible. To remove this overhead, we can generate the speech first within each chunk and generate the reasoning after it. The output looks like this:

STITCH-S token chunks

In this case, the number of tokens we need to wait for before speaking starts is exactly the same as in the no-reasoning case, so the latency matches the no-reasoning baseline. In the illustrative samples below, we can see that the speech is generated first and the reasoning is generated afterwards.
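
Counting the tokens that must be generated before the first audio chunk makes this latency claim easy to see; the chunk sizes below are the same illustrative values used in the earlier sketches.

```python
# Tokens that must be generated before the first audio chunk can be played.
N_TEXT, N_SPEECH, N_REASON = 13, 26, 16       # illustrative chunk sizes

no_reasoning = N_TEXT + N_SPEECH              # 39 tokens
stitch_r     = N_REASON + N_TEXT + N_SPEECH   # 55 tokens: pays for one reasoning chunk up front
stitch_s     = N_TEXT + N_SPEECH              # 39 tokens: reasoning comes after the speech chunk
```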

Illustration of STITCH-S: Speaking first; same latency as no-reasoning case.

Illustration of STITCH-S: A real token example

Experiments

We evaluate STITCH-R and STITCH-S on the following datasets:

Performance on math reasoning datasets

Accuracy on math reasoning datasets

STITCH-R and STITCH-S perform much better than the baselines without reasoning, while TBS and STITCH-R/S are comparable on average.

Performance on non-reasoning datasets

Accuracy on non-reasoning datasets. AlpacaEval is evaluated with a GPT-4o score.

On non-reasoning datasets, all models perform similarly. This means that fine-tuning a model to generate unspoken reasoning does not harm the performance on non-reasoning datasets.

Ethical Statement

STITCH is purely a research project. Currently, we have no plans to incorporate STITCH into a product or expand access to the public. STITCH can generate speech and can be used for interactive voice response systems, chatbots, and so on. As a spoken language model, it may carry potential risks in the misuse of the model, including spoofing voice identification or impersonating a specific speaker. In our current implementation, STITCH is fine-tuned from GLM-4-Voice, which is released under the Apache-2.0 license. Our project follows the license and does not violate the intended use of the original model.

Paper PDF

BibTeX

@misc{chiang2025stitchsimultaneousthinkingtalking,
      title={STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models},
      author={Cheng-Han Chiang and Xiaofei Wang and Linjie Li and Chung-Ching Lin and Kevin Lin and Shujie Liu and Zhendong Wang and Zhengyuan Yang and Hung-yi Lee and Lijuan Wang},
      year={2025},
      eprint={2507.15375},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.15375},
}