TY - GEN
T1 - Speech Emotion Recognition Exploiting ASR-based and Phonological Knowledge Representations
AU - Liang, Shuang
AU - Xie, Xiang
AU - Zhan, Qingran
AU - Cheng, Hao
N1 - Publisher Copyright: © 2022 ACM.
PY - 2022/3/4
Y1 - 2022/3/4
N2 - Speech emotion recognition (SER) is a challenging problem because of insufficient data. This paper addresses the problem from two aspects. First, we exploit two levels of speech representation for the SER task: automatic speech recognition (ASR)-based representations and phonological knowledge representations. Second, we use transfer learning: models are pre-trained on large corpora for non-SER tasks, and their knowledge is transferred to SER. The system is divided into two parts: a two-representation learning module and an SER module. We fuse acoustic features with the ASR-based and phonological knowledge representations, both extracted from pre-trained models, and use the fused features for SER training. We also propose a novel multi-task learning approach in which a shared-encoder, multi-decoder model learns the phonological knowledge representations. The Conformer architecture is introduced for the SER task, and our study indicates that it is effective for SER. Experimental results on IEMOCAP show that the proposed method achieves 77.35% weighted accuracy and 77.99% unweighted accuracy.
KW - Multi-task learning
KW - Speech emotion recognition
KW - Transfer learning
UR - http://www.scopus.com/inward/record.url?scp=85131859972&partnerID=8YFLogxK
U2 - 10.1145/3529466.3529488
DO - 10.1145/3529466.3529488
M3 - Conference contribution
AN - SCOPUS:85131859972
T3 - ACM International Conference Proceeding Series
SP - 216
EP - 220
BT - ICIAI 2022 - 6th International Conference on Innovation in Artificial Intelligence
PB - Association for Computing Machinery
T2 - 6th International Conference on Innovation in Artificial Intelligence, ICIAI 2022
Y2 - 4 March 2022 through 6 March 2022
ER -