Controllable timbre cloning and style replication with reference speech examples for multimodal human-computer interaction

Research output: Contribution to journal › Article › peer-review

Abstract

Natural and personalized speech interaction is one of the core requirements for advancing Multimodal Human-Computer Interaction (HCI), with applications widely seen in smart home devices, voice assistants, and mobile devices. In recent years, the demand for speech in the HCI field has shifted from basic speech generation to precise customization of speaker timbre and speaking style, aiming to achieve more intuitive and immersive multimodal human-computer interaction. However, existing speech personalization technologies have significant limitations: zero-shot speech synthesis methods lack the capability for style control, while traditional style-controllable synthesis methods fail to accurately specify speaker timbre, making it difficult to balance personalization between speaker timbre and speaking style. To address this issue, we define a new task: Controllable Timbre Cloning and Style Replication with Reference Speech Examples. This task aims to directly control speaker timbre and speaking style through two reference speech examples, allowing timbre cloning and style replication to generate new timbre-style combinations. To tackle this task, we propose the Control-TTS model. This model utilizes distinct reference speeches to separately control the timbre and speaking style features of the speaker in the synthesized audio, enabling free combinations of timbre and style. This approach generates synthetic speech with rich expressivity, providing a more flexible and customizable solution for speech personalization in HCI scenarios. Our experiments on the VccmDataset demonstrate that Control-TTS achieves comparable or state-of-the-art performance in terms of metrics such as naturalness mean opinion score (NMOS), word error rate (WER), speaker similarity, and style similarity. Our demo is available at https://progressivetts.github.io/Control_TTS/.

Original language: English
Article number: 132529
Journal: Neurocomputing
Volume: 670
Publication status: Published - 14 Mar 2026
Externally published: Yes

Keywords

  • Controllable speech synthesis
  • Multimodal human-computer interaction
  • Timbre cloning and style replication
