TY - GEN
T1 - Semi-supervised Cross-Lingual Speech Recognition Exploiting Articulatory Features
AU - Su, Xinmei
AU - Xie, Xiang
AU - Hu, Chenguang
AU - Wu, Shu
AU - Wang, Jing
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025
Y1 - 2025
N2 - The state-of-the-art (SOTA) Automatic Speech Recognition (ASR) systems are mostly based on data-driven methods. However, low-resource languages may lack sufficient data for training. Articulatory Features (AFs) describe the movements of the vocal organs, which can be shared across languages. Thus, this paper investigates AFs-based semi-supervised techniques to share data between languages. First, traditional acoustic features and AFs are combined as front-end features to provide articulatory information for cross-lingual knowledge transfer. Then, dropout-based decoded lattices are used as pseudo-labels for the unsupervised data to address the problem of data deficiency. In addition, the Lattice-Free Maximum Mutual Information (LF-MMI) objective is adopted to better adapt to small datasets. Experiments show that our system obtains a relative improvement of 58.6% in Character Error Rate (CER) compared to the baseline system. Moreover, the smaller the dataset, the more pronounced the advantage of our system.
AB - The state-of-the-art (SOTA) Automatic Speech Recognition (ASR) systems are mostly based on data-driven methods. However, low-resource languages may lack sufficient data for training. Articulatory Features (AFs) describe the movements of the vocal organs, which can be shared across languages. Thus, this paper investigates AFs-based semi-supervised techniques to share data between languages. First, traditional acoustic features and AFs are combined as front-end features to provide articulatory information for cross-lingual knowledge transfer. Then, dropout-based decoded lattices are used as pseudo-labels for the unsupervised data to address the problem of data deficiency. In addition, the Lattice-Free Maximum Mutual Information (LF-MMI) objective is adopted to better adapt to small datasets. Experiments show that our system obtains a relative improvement of 58.6% in Character Error Rate (CER) compared to the baseline system. Moreover, the smaller the dataset, the more pronounced the advantage of our system.
KW - Articulatory features
KW - Automatic speech recognition
KW - Semi-supervised
UR - http://www.scopus.com/inward/record.url?scp=85211908505&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-80136-5_10
DO - 10.1007/978-3-031-80136-5_10
M3 - Conference contribution
AN - SCOPUS:85211908505
SN - 9783031801358
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 141
EP - 153
BT - Pattern Recognition - 27th International Conference, ICPR 2024, Proceedings
A2 - Antonacopoulos, Apostolos
A2 - Chaudhuri, Subhasis
A2 - Chellappa, Rama
A2 - Liu, Cheng-Lin
A2 - Bhattacharya, Saumik
A2 - Pal, Umapada
PB - Springer Science and Business Media Deutschland GmbH
T2 - 27th International Conference on Pattern Recognition, ICPR 2024
Y2 - 1 December 2024 through 5 December 2024
ER -