Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user’s turn and can lead to high response latency while it waits to think. Consequently, thinking after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally “think while listening.” In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user–SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends.
Reasoning language models (RLMs) take text input and generate text output. However, an RLM's thinking process happens only after the full user input has been received, which can add a long latency to the response.
Key motivation: Humans can think while listening. We can reason about what we just heard, parse the information, recall relevant knowledge, and prepare a response while the speaker is still speaking.
We propose a general inference method for SLMs called SHANKS (Simultaneous Hearing and Thinking with Chunked Input Speech).
SHANKS assumes that the user's input speech is streamed to the SLM. We chunk the input speech every $t_{chunk}$ seconds, obtaining a user speech chunk $S_{i}$. As soon as chunk $S_{i}$ is received, we feed it to the SLM, which generates an unspoken reasoning chunk $R_{i}$ while the user is still speaking. This achieves simultaneous listening and thinking.
As the user continues speaking, the SLM takes in a new speech chunk $S_{i}$ every $t_{chunk}$ seconds and generates the unspoken reasoning chunk $R_{i}$ conditioned on all previous speech and reasoning. This process continues until the user finishes their turn.
During this process, the SLM can decide whether to interrupt the user and whether to make tool calls to complete the task, as sketched below.
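The following Python sketch makes this inference loop explicit. It is a minimal illustration under our own assumptions: the `ShanksState` container and the `reason`, `should_interrupt`, and `extract_tool_call` callables are hypothetical placeholders standing in for the SLM and its decoding logic, not the released implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class ShanksState:
    """Interleaved history of user speech chunks S_i and unspoken reasoning chunks R_i."""
    speech_chunks: List[bytes] = field(default_factory=list)
    reasoning_chunks: List[str] = field(default_factory=list)


def shanks_loop(
    speech_stream: Iterable[bytes],                      # yields one chunk every t_chunk seconds
    reason: Callable[[ShanksState], str],                # SLM call: previous speech + reasoning -> R_i
    should_interrupt: Callable[[str], bool],             # read an interruption decision off R_i
    extract_tool_call: Callable[[str], Optional[str]],   # read a ready-to-run tool call off R_i
) -> ShanksState:
    state = ShanksState()
    for chunk in speech_stream:            # the user is still speaking while we reason
        state.speech_chunks.append(chunk)  # S_i: the latest t_chunk seconds of audio
        r_i = reason(state)                # unspoken reasoning conditioned on S_1..S_i and R_1..R_{i-1}
        state.reasoning_chunks.append(r_i)
        tool_call = extract_tool_call(r_i)
        if tool_call is not None:
            print(f"[tool] invoked before the end of the user's turn: {tool_call}")
        if should_interrupt(r_i):
            print("[speak] interrupting the user now")
            break                          # hand the turn over to the SLM's spoken response
    return state
```

In practice the chunking, the SLM call, and the decision logic all run on the live audio stream; the callables above only make the control flow explicit.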
SHANKS can be useful in many ways. Here, we focus on one interesting scenario: using the SLM to interrupt the user when the user makes a mistake. In this task, the user describes a math question they want to solve and then starts solving it step by step. We use SHANKS to listen while the user speaks and interrupt them when they are wrong. This is a promising scenario because we expect that, in the future, SLMs may serve as tutors that help students learn by interrupting them when they make a mistake.
In the following figure, we can see that as the user unfolds the question, the model already starts to think about the problem and compute intermediate variables. By the time the user finishes speaking the question, the model already has the answer in mind. As the user continues to solve the problem, the model keeps reasoning about it and interrupts the user when they make a mistake.
We find that SHANKS achieves 37.1% higher interruption accuracy than a baseline that interrupts without thinking. SHANKS also interrupts far less often when the user is correct and more often when the user is wrong. This shows that thinking while listening indeed improves the interruption behavior of the SLM.
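As a concrete illustration of these quantities, here is a toy way one could score interruption decisions against ground truth. The record format and metric definitions below are our own simplification for exposition, not the paper's evaluation protocol.

```python
# Toy scoring sketch: each record is an (interrupted, user_wrong) pair, where
# `interrupted` is the model's decision and `user_wrong` is the ground truth.
from typing import List, Tuple


def interruption_metrics(records: List[Tuple[bool, bool]]) -> dict:
    wrong = [interrupted for interrupted, user_wrong in records if user_wrong]
    right = [interrupted for interrupted, user_wrong in records if not user_wrong]
    return {
        # a decision is counted as correct when the model interrupts exactly on wrong steps
        "interruption_accuracy": sum(i == w for i, w in records) / len(records),
        "interrupt_rate_when_wrong": sum(wrong) / max(len(wrong), 1),
        "interrupt_rate_when_correct": sum(right) / max(len(right), 1),
    }


print(interruption_metrics([(True, True), (False, False), (False, True), (False, False)]))
# {'interruption_accuracy': 0.75, 'interrupt_rate_when_wrong': 0.5, 'interrupt_rate_when_correct': 0.0}
```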
In another application, we focus on a tool-augmented dialogue scenario. Here, the user specifies a request about their travel plans, and the model needs to call tools (Booking.com APIs for flight search or car rental information) to complete the task. Traditionally, the model has to wait for the user to finish their turn before calling the tools, which is slow and far from real-time. With SHANKS, the model can generate tool calls while the user is still speaking. This is an important scenario because we expect that, in the future, SLMs may serve as customer service agents that complete the user's request with tools. SHANKS can greatly reduce response latency and improve the user experience.
In the example below, we can observe that while the user is still speaking, some API calls, including searching for airport information and car rental information, can already be invoked because all the information the API needs has been specified by the user. Among all six API calls, four can be invoked before the user finishes their turn.
On ComplexFuncBench, we find that SHANKS can complete 56.9% of the tool calls before the user finishes their turn.
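To make the early-invocation idea concrete, here is a hedged sketch of scanning each reasoning chunk for fully specified tool calls and firing them immediately. The `<tool_call>` tag convention, the tool names, and the local dispatch table are illustrative placeholders, not the actual ComplexFuncBench or Booking.com API schema.

```python
import json
import re
from typing import Callable, Dict, List

# Hypothetical local stand-ins for the real flight-search / car-rental APIs.
TOOLS: Dict[str, Callable[..., dict]] = {
    "search_flight_location": lambda **kw: {"airport": f"nearest airport to {kw.get('query')}"},
    "search_car_rentals": lambda **kw: {"offers": []},
}

# Assumed convention: the reasoning chunk wraps a JSON tool call in <tool_call> tags.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def dispatch_early_tool_calls(reasoning_chunk: str) -> List[dict]:
    """Scan an unspoken reasoning chunk R_i for fully specified tool calls and
    invoke them while the user is still speaking."""
    results = []
    for payload in TOOL_CALL_PATTERN.findall(reasoning_chunk):
        call = json.loads(payload)  # e.g. {"name": ..., "arguments": {...}}
        fn = TOOLS.get(call["name"])
        if fn is not None:
            results.append(fn(**call["arguments"]))
    return results


# Example: the user has already named the departure city, so the airport
# lookup can be issued before the turn ends.
chunk = (
    'The user wants to fly out of Seattle, so I can already look up the airport: '
    '<tool_call>{"name": "search_flight_location", "arguments": {"query": "Seattle"}}</tool_call>'
)
print(dispatch_early_tool_calls(chunk))  # [{'airport': 'nearest airport to Seattle'}]
```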
SHANKS makes the SLM think while listening. In prior work, we proposed STITCH, a method that makes the SLM think while talking. STITCH is also a general inference method for SLMs that can be used in many different scenarios. Check out STITCH here: STITCH.
SHANKS is purely a research project. Currently, we have no plans to incorporate SHANKS into a product or to expand access to the public. SHANKS can generate speech and can be used for interactive voice response systems, chatbots, and so on. As a spoken language model, it may carry risks of misuse, including spoofing voice identification or impersonating a specific speaker. In our current implementation, SHANKS is fine-tuned from Qwen-2.5-Omni, which is released under the Apache-2.0 license. Our project follows the license and does not violate the intended use of the original model.
@misc{chiang2025shankssimultaneoushearingthinking,
  title={SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models},
  author={Cheng-Han Chiang and Xiaofei Wang and Linjie Li and Chung-Ching Lin and Kevin Lin and Shujie Liu and Zhendong Wang and Zhengyuan Yang and Hung-yi Lee and Lijuan Wang},
  year={2025},
  eprint={2510.06917},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.06917},
}