TY - JOUR
T1 - A self-supervised model for language identification integrating phonological knowledge
AU - Zhan, Qingran
AU - Xie, Xiang
AU - Hu, Chenguang
AU - Cheng, Haobo
N1 - Publisher Copyright:
© 2021 by the authors. Licensee MDPI, Basel, Switzerland.
PY - 2021/9
Y1 - 2021/9
N2 - In this paper, a self-supervised pre-trained model is proposed and successfully applied to the language identification (LID) task. A Transformer encoder is employed, and a multi-task strategy is used to train the self-supervised model: the first task is to reconstruct the masked spans of input frames, and the second is a supervised task in which phoneme and phonological labels are used with the Connectionist Temporal Classification (CTC) loss. With this multi-task learning loss, the model is expected to capture high-level speech representations in the phonological space. Meanwhile, an adaptive loss is applied to balance the weights between the different tasks. After the pre-training stage, the self-supervised model is used in x-vector systems. Our LID experiments are carried out on the oriental language recognition (OLR) challenge corpus, using the 1 s, 3 s, and full-length test sets. Experimental results show that the feature-extraction approach achieves the best performance on the 1 s test set, while the fine-tuning approach performs best on the 3 s and full-length test sets. Furthermore, our results demonstrate that the multi-task training strategy is effective and that the proposed model achieves the best overall performance.
AB - In this paper, a self-supervised pre-trained model is proposed and successfully applied to the language identification (LID) task. A Transformer encoder is employed, and a multi-task strategy is used to train the self-supervised model: the first task is to reconstruct the masked spans of input frames, and the second is a supervised task in which phoneme and phonological labels are used with the Connectionist Temporal Classification (CTC) loss. With this multi-task learning loss, the model is expected to capture high-level speech representations in the phonological space. Meanwhile, an adaptive loss is applied to balance the weights between the different tasks. After the pre-training stage, the self-supervised model is used in x-vector systems. Our LID experiments are carried out on the oriental language recognition (OLR) challenge corpus, using the 1 s, 3 s, and full-length test sets. Experimental results show that the feature-extraction approach achieves the best performance on the 1 s test set, while the fine-tuning approach performs best on the 3 s and full-length test sets. Furthermore, our results demonstrate that the multi-task training strategy is effective and that the proposed model achieves the best overall performance.
KW - Language identification
KW - Phonological knowledge
KW - Self-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85114810958&partnerID=8YFLogxK
U2 - 10.3390/electronics10182259
DO - 10.3390/electronics10182259
M3 - Article
AN - SCOPUS:85114810958
SN - 2079-9292
VL - 10
JO - Electronics (Switzerland)
JF - Electronics (Switzerland)
IS - 18
M1 - 2259
ER -