TY - JOUR
T1 - SSL-VC
T2 - One-Shot Voice Conversion Through Self-Supervised Learning
AU - Jiang, Chenglong
AU - Pan, Linrong
AU - Gao, Ying
AU - Su, Kuanghua
AU - Hou, Gaoze
AU - Hu, Xiping
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2025
Y1 - 2025
N2 - Currently, the prevailing approach in voice conversion (VC) involves separating clean linguistic information from the source audio and then reconstructing it with the identity of the target speaker. However, existing methods, whether employing information perturbation techniques or carefully designed information bottleneck methods, encounter challenges related to unsatisfactory audio separation effects and insufficient robustness. This article introduces a VC method based on self-supervised learning (SSL-VC). First, it utilizes a self-supervised speech representation (S-SSR) extraction network with decoupling (Decp-SSEN) to disentangle linguistic information from speech. The designed prosodic encoder extracts pitch and energy features from the speech to compensate for the loss of nonlinguistic details incurred during the Decp-SSEN disentangling process. This approach allows us to obtain richer linguistic information independent of speaker identity, guaranteeing the robust performance of the model. Second, we leverage high-level S-SSR as the intermediate feature, replacing the traditional Mel-spectrogram, and build an end-to-end VC pipeline that eliminates the need for a vocoder, enhancing the expressiveness of intermediate features and reducing the learning difficulty gap between real and predicted features. Subjective and objective experiments conducted on both seen and unseen speech corpora demonstrate that SSL-VC achieves high-quality VC and speaker similarity. Moreover, it outperforms state-of-the-art methods in extracting richer linguistic information. Ablation experiments further confirm the indispensability of the prosodic encoder.
AB - Currently, the prevailing approach in voice conversion (VC) involves separating clean linguistic information from the source audio and then reconstructing it with the identity of the target speaker. However, existing methods, whether employing information perturbation techniques or carefully designed information bottleneck methods, encounter challenges related to unsatisfactory audio separation effects and insufficient robustness. This article introduces a VC method based on self-supervised learning (SSL-VC). First, it utilizes a self-supervised speech representation (S-SSR) extraction network with decoupling (Decp-SSEN) to disentangle linguistic information from speech. The designed prosodic encoder extracts pitch and energy features from the speech to compensate for the loss of nonlinguistic details incurred during the Decp-SSEN disentangling process. This approach allows us to obtain richer linguistic information independent of speaker identity, guaranteeing the robust performance of the model. Second, we leverage high-level S-SSR as the intermediate feature, replacing the traditional Mel-spectrogram, and build an end-to-end VC pipeline that eliminates the need for a vocoder, enhancing the expressiveness of intermediate features and reducing the learning difficulty gap between real and predicted features. Subjective and objective experiments conducted on both seen and unseen speech corpora demonstrate that SSL-VC achieves high-quality VC and speaker similarity. Moreover, it outperforms state-of-the-art methods in extracting richer linguistic information. Ablation experiments further confirm the indispensability of the prosodic encoder.
KW - End-to-end model
KW - linguistic information
KW - self-supervised learning
KW - voice conversion (VC)
UR - https://www.scopus.com/pages/publications/105019589727
U2 - 10.1109/TCSS.2025.3603008
DO - 10.1109/TCSS.2025.3603008
M3 - Article
AN - SCOPUS:105019589727
SN - 2329-924X
JO - IEEE Transactions on Computational Social Systems
JF - IEEE Transactions on Computational Social Systems
ER -