DurIAN-SC: Duration informed attention network based singing voice conversion system

Liqiang Zhang; Chengzhu Yu; Heng Lu; Chao Weng; Chunlei Zhang; Yusong Wu; Xiang Xie; Zijin Li; Dong Yu

doi:10.21437/Interspeech.2020-1789

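School of Information and Electronics

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

20 Citations (Scopus)

Abstract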

Singing voice conversion converts the timbre of the source singing to the target speaker's voice while keeping the singing content the same. However, singing data for a target speaker is much more difficult to collect than normal speech data. In this paper, we introduce a singing voice conversion algorithm that is capable of generating high-quality singing in the target speaker's voice using only his/her normal speech data. First, we integrate the training and conversion processes of speech and singing into one framework by unifying the features used in standard speech synthesis and singing synthesis systems. In this way, normal speech data can also contribute to singing voice conversion training, making the singing voice conversion system more robust, especially when the singing database is small. Moreover, in order to achieve one-shot singing voice conversion, a speaker embedding module is developed using both speech and singing data, which provides target speaker identity information during conversion. Experiments indicate that the proposed singing conversion system can convert source singing to high-quality singing in the target speaker's voice with only 20 seconds of the target speaker's enrollment speech data.

Original language	English
Title of host publication	Interspeech 2020
Publisher	International Speech Communication Association
Pages	1231-1235
Number of pages	5
ISBN (Print)	9781713820697
DOI	https://doi.org/10.21437/Interspeech.2020-1789
Publication status	Published - 2020
Event	21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China. Duration: 25 Oct 2020 → 29 Oct 2020

Publication series

Name	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2020-October
ISSN (Print)	2308-457X
ISSN (Electronic)	1990-9772

Conference

Conference	21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Country/Territory	China
City	Shanghai
Period	25/10/20 → 29/10/20

Cite this

Zhang, L., Yu, C., Lu, H., Weng, C., Zhang, C., Wu, Y., Xie, X., Li, Z., & Yu, D. (2020). DurIAN-SC: Duration informed attention network based singing voice conversion system. In Interspeech 2020 (pp. 1231-1235). (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 2020-October). International Speech Communication Association. https://doi.org/10.21437/Interspeech.2020-1789

@inproceedings{27867fb00dee4f13afbd7e4f99126604,

title = "DurIAN-SC: Duration informed attention network based singing voice conversion system",

abstract = "Singing voice conversion is converting the timbre in the source singing to the target speaker's voice while keeping singing content the same. However, singing data for target speaker is much more difficult to collect compared with normal speech data. In this paper, we introduce a singing voice conversion algorithm that is capable of generating high quality target speaker's singing using only his/her normal speech data. First, we manage to integrate the training and conversion process of speech and singing into one framework by unifying the features used in standard speech synthesis system and singing synthesis system. In this way, normal speech data can also contribute to singing voice conversion training, making the singing voice conversion system more robust especially when the singing database is small. Moreover, in order to achieve one-shot singing voice conversion, a speaker embedding module is developed using both speech and singing data, which provides target speaker identify information during conversion. Experiments indicate proposed sing conversion system can convert source singing to target speaker's high-quality singing with only 20 seconds of target speaker's enrollment speech data.",

keywords = "Singing Synthesis, Singing Voice Conversion, Speaker D-vector, Speaker Embedding",

author = "Liqiang Zhang and Chengzhu Yu and Heng Lu and Chao Weng and Chunlei Zhang and Yusong Wu and Xiang Xie and Zijin Li and Dong Yu",

note = "Publisher Copyright: {\textcopyright} 2020 International Speech Communication Association. All rights reserved.; 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 ; Conference date: 25-10-2020 Through 29-10-2020",

year = "2020",

doi = "10.21437/Interspeech.2020-1789",

language = "English",

isbn = "9781713820697",

series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

publisher = "International Speech Communication Association",

pages = "1231--1235",

booktitle = "Interspeech 2020",

}

Zhang, L, Yu, C, Lu, H, Weng, C, Zhang, C, Wu, Y, Xie, X, Li, Z & Yu, D 2020, DurIAN-SC: Duration informed attention network based singing voice conversion system. in Interspeech 2020. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2020-October, International Speech Communication Association, pp. 1231-1235, 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, Shanghai, China, 25/10/20. https://doi.org/10.21437/Interspeech.2020-1789

DurIAN-SC: Duration informed attention network based singing voice conversion system. / Zhang, Liqiang; Yu, Chengzhu; Lu, Heng et al.
Interspeech 2020. International Speech Communication Association, 2020. p. 1231-1235 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 2020-October).

TY - GEN

T1 - DurIAN-SC

T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020

AU - Zhang, Liqiang

AU - Yu, Chengzhu

AU - Lu, Heng

AU - Weng, Chao

AU - Zhang, Chunlei

AU - Wu, Yusong

AU - Xie, Xiang

AU - Li, Zijin

AU - Yu, Dong

PY - 2020

Y1 - 2020

AB - Singing voice conversion is converting the timbre in the source singing to the target speaker's voice while keeping singing content the same. However, singing data for target speaker is much more difficult to collect compared with normal speech data. In this paper, we introduce a singing voice conversion algorithm that is capable of generating high quality target speaker's singing using only his/her normal speech data. First, we manage to integrate the training and conversion process of speech and singing into one framework by unifying the features used in standard speech synthesis system and singing synthesis system. In this way, normal speech data can also contribute to singing voice conversion training, making the singing voice conversion system more robust especially when the singing database is small. Moreover, in order to achieve one-shot singing voice conversion, a speaker embedding module is developed using both speech and singing data, which provides target speaker identify information during conversion. Experiments indicate proposed sing conversion system can convert source singing to target speaker's high-quality singing with only 20 seconds of target speaker's enrollment speech data.

KW - Singing Synthesis

KW - Singing Voice Conversion

KW - Speaker D-vector

KW - Speaker Embedding

UR - http://www.scopus.com/inward/record.url?scp=85098174610&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2020-1789

DO - 10.21437/Interspeech.2020-1789

M3 - Conference contribution

AN - SCOPUS:85098174610

SN - 9781713820697

T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SP - 1231

EP - 1235

BT - Interspeech 2020

PB - International Speech Communication Association

Y2 - 25 October 2020 through 29 October 2020

ER -

Zhang L, Yu C, Lu H, Weng C, Zhang C, Wu Y et al. DurIAN-SC: Duration informed attention network based singing voice conversion system. In Interspeech 2020. International Speech Communication Association. 2020. p. 1231-1235. (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH). doi: 10.21437/Interspeech.2020-1789