SSL-VC: One-Shot Voice Conversion Through Self-Supervised Learning

  • Chenglong Jiang
  • Linrong Pan
  • Ying Gao*
  • Kuanghua Su
  • Gaoze Hou
  • Xiping Hu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Currently, the prevailing approach in voice conversion (VC) is to separate clearer linguistic information from the source audio and then reconstruct it with the identity of the target speaker. However, existing methods, whether they employ information perturbation techniques or carefully designed information bottlenecks, suffer from unsatisfactory separation and insufficient robustness. This article introduces a VC method based on self-supervised learning (SSL-VC). First, it uses a self-supervised speech representation (S-SSR) extraction network with decoupling (Decp-SSEN) to disentangle linguistic information from speech. A purpose-designed prosodic encoder extracts pitch and energy features from the speech to compensate for the nonlinguistic details lost during Decp-SSEN disentangling. This approach yields richer linguistic information that is independent of speaker identity, which underpins the robust performance of the model. Second, it uses high-level S-SSR as the intermediate feature in place of the traditional Mel-spectrogram and builds an end-to-end VC pipeline that eliminates the need for a vocoder, enhancing the expressiveness of the intermediate features and narrowing the gap between real and predicted features. Subjective and objective experiments on both seen and unseen speech corpora demonstrate that SSL-VC achieves high-quality VC and high speaker similarity. Moreover, it outperforms state-of-the-art methods in extracting richer linguistic information. Ablation experiments further examine the necessity of the prosodic encoder.
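
To make the pitch and energy features mentioned in the abstract concrete, the following is a minimal sketch of how frame-level energy and pitch contours can be computed from a waveform. It is not the authors' prosodic encoder, which is a learned network whose details are not given here. It assumes RMS energy and a basic autocorrelation pitch estimator as stand-ins, and the function names, frame sizes, and pitch search range are all hypothetical choices made for illustration.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def rms_energy(frames):
    """Root-mean-square energy of each frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))


def autocorr_pitch(frames, sr, fmin=60.0, fmax=400.0):
    """Estimate one pitch value (Hz) per frame from the autocorrelation peak.

    fmin and fmax are assumed search bounds, not values from the paper.
    Frames with no energy are reported as 0 Hz.
    """
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    f0 = np.zeros(len(frames))
    for i, frame in enumerate(frames):
        frame = frame - frame.mean()
        # Keep non-negative lags only; index 0 is the zero-lag term.
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        if ac[0] <= 0:
            continue  # silent frame
        lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
        f0[i] = sr / lag
    return f0


# Synthetic input: one second of a 200 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 200.0 * t)

frames = frame_signal(tone, frame_len=1024, hop=256)
energy = rms_energy(frames)      # one energy value per frame
f0 = autocorr_pitch(frames, sr)  # one pitch estimate per frame, near 200 Hz
```

For the synthetic tone, the pitch contour should stay close to 200 Hz and the energy contour close to the RMS of the sine wave. Real speech would additionally need voiced/unvoiced handling, which this sketch omits.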

Original language: English
Journal: IEEE Transactions on Computational Social Systems
Publication status: Accepted/In press - 2025
Externally published: Yes

Keywords

  • End-to-end model
  • linguistic information
  • self-supervised learning
  • voice conversion (VC)
