TY - JOUR
T1 - Adversarial Diffusion Probability Model For Cross-domain Speaker Verification Integrating Contrastive Loss
AU - Su, Xinmei
AU - Xie, Xiang
AU - Zhang, Fengrun
AU - Hu, Chenguang
N1 - Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.
PY - 2023
Y1 - 2023
N2 - In speaker verification, performance degradation caused by domain mismatch is a common problem when the test domain lies outside the training distribution. In this paper, we present a novel domain transfer network, the Adversarial Diffusion Probabilistic Model (ADPM), to better alleviate this problem. Specifically, ADPM is used to transfer mel-spectrograms from the source domain into the target domain. To generate the mel-spectrograms, we propose to regard the diffusion model as the generator, and a discriminator is employed for adversarial training. We also explore a contrastive learning objective to retain the context information of the source domain. The generated and the original feature maps from the source domain are fed jointly into the ResNet34 network to perform cross-domain speaker verification. We evaluate the proposed techniques on the VOiCES dataset, and our best model achieves a relative 8.94% Equal Error Rate (EER) reduction compared to previous adaptation methods.
AB - In speaker verification, performance degradation caused by domain mismatch is a common problem when the test domain lies outside the training distribution. In this paper, we present a novel domain transfer network, the Adversarial Diffusion Probabilistic Model (ADPM), to better alleviate this problem. Specifically, ADPM is used to transfer mel-spectrograms from the source domain into the target domain. To generate the mel-spectrograms, we propose to regard the diffusion model as the generator, and a discriminator is employed for adversarial training. We also explore a contrastive learning objective to retain the context information of the source domain. The generated and the original feature maps from the source domain are fed jointly into the ResNet34 network to perform cross-domain speaker verification. We evaluate the proposed techniques on the VOiCES dataset, and our best model achieves a relative 8.94% Equal Error Rate (EER) reduction compared to previous adaptation methods.
KW - contrastive learning
KW - cross-domain
KW - diffusion models
KW - speaker verification
UR - http://www.scopus.com/inward/record.url?scp=85171576079&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2023-1205
DO - 10.21437/Interspeech.2023-1205
M3 - Conference article
AN - SCOPUS:85171576079
SN - 2308-457X
VL - 2023-August
SP - 5336
EP - 5340
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 24th Annual Conference of the International Speech Communication Association, Interspeech 2023
Y2 - 20 August 2023 through 24 August 2023
ER -