TY - JOUR
T1 - Controllable timbre cloning and style replication with reference speech examples for multimodal human-computer interaction
AU - Lan, Tianwei
AU - Guo, Yuhang
AU - Deng, Mengyuan
AU - Wang, Jing
AU - Wang, Wenwu
AU - Feng, Chong
N1 - Publisher Copyright:
© 2025 Elsevier B.V.
PY - 2026/3/14
Y1 - 2026/3/14
N2 - Natural and personalized speech interaction is a core requirement for advancing Multimodal Human-Computer Interaction (HCI), with applications in smart home devices, voice assistants, and mobile devices. In recent years, demand for speech in HCI has shifted from basic speech generation to precise customization of speaker timbre and speaking style, aiming to achieve more intuitive and immersive multimodal interaction. However, existing speech personalization technologies have significant limitations: zero-shot speech synthesis methods lack style control, while traditional style-controllable synthesis methods cannot accurately specify speaker timbre, making it difficult to personalize speaker timbre and speaking style simultaneously. To address this issue, we define a new task: Controllable Timbre Cloning and Style Replication with Reference Speech Examples. This task aims to control speaker timbre and speaking style directly through two reference speech examples, using timbre cloning and style replication to generate new timbre-style combinations. To tackle this task, we propose the Control-TTS model, which uses distinct reference speech examples to separately control the timbre and speaking-style features of the synthesized audio, enabling free combinations of timbre and style. This approach generates synthetic speech with rich expressivity, providing a more flexible and customizable solution for speech personalization in HCI scenarios. Experiments on the VccmDataset demonstrate that Control-TTS achieves comparable or state-of-the-art performance on metrics such as naturalness mean opinion score (NMOS), word error rate (WER), speaker similarity, and style similarity. Our demo is available at https://progressivetts.github.io/Control_TTS/.
AB - Natural and personalized speech interaction is a core requirement for advancing Multimodal Human-Computer Interaction (HCI), with applications in smart home devices, voice assistants, and mobile devices. In recent years, demand for speech in HCI has shifted from basic speech generation to precise customization of speaker timbre and speaking style, aiming to achieve more intuitive and immersive multimodal interaction. However, existing speech personalization technologies have significant limitations: zero-shot speech synthesis methods lack style control, while traditional style-controllable synthesis methods cannot accurately specify speaker timbre, making it difficult to personalize speaker timbre and speaking style simultaneously. To address this issue, we define a new task: Controllable Timbre Cloning and Style Replication with Reference Speech Examples. This task aims to control speaker timbre and speaking style directly through two reference speech examples, using timbre cloning and style replication to generate new timbre-style combinations. To tackle this task, we propose the Control-TTS model, which uses distinct reference speech examples to separately control the timbre and speaking-style features of the synthesized audio, enabling free combinations of timbre and style. This approach generates synthetic speech with rich expressivity, providing a more flexible and customizable solution for speech personalization in HCI scenarios. Experiments on the VccmDataset demonstrate that Control-TTS achieves comparable or state-of-the-art performance on metrics such as naturalness mean opinion score (NMOS), word error rate (WER), speaker similarity, and style similarity. Our demo is available at https://progressivetts.github.io/Control_TTS/.
KW - Controllable speech synthesis
KW - Multimodal human-computer interaction
KW - Timbre cloning and style replication
UR - https://www.scopus.com/pages/publications/105026859247
U2 - 10.1016/j.neucom.2025.132529
DO - 10.1016/j.neucom.2025.132529
M3 - Article
AN - SCOPUS:105026859247
SN - 0925-2312
VL - 670
JO - Neurocomputing
JF - Neurocomputing
M1 - 132529
ER -