TY - JOUR
T1 - Multimodal emotion recognition based on feature selection and extreme learning machine in video clips
AU - Pan, Bei
AU - Hirota, Kaoru
AU - Jia, Zhiyang
AU - Zhao, Linhui
AU - Jin, Xiaoming
AU - Dai, Yaping
N1 - Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.
PY - 2023/3
Y1 - 2023/3
N2 - Multimodal fusion-based emotion recognition has attracted increasing attention in affective computing because different modalities provide complementary information. One of the main challenges in designing a reliable and effective model is to define and extract appropriate emotional features from the different modalities. In this paper, we present a novel multimodal emotion recognition framework for estimating categorical emotions, with visual and audio signals as the multimodal input. The model learns the neutral appearance and key emotion frames using a statistical geometric method, which acts as a pre-processor to save computational power. Discriminative emotion features from the visual and audio modalities are selected through evolutionary optimization and then fed to optimized extreme learning machine (ELM) classifiers for unimodal emotion recognition. Finally, a decision-level fusion strategy integrates the emotions predicted by the different classifiers to improve overall performance. The effectiveness of the proposed method is demonstrated on three public datasets, i.e., the acted CK+ dataset, the acted Enterface05 dataset, and the spontaneous BAUM-1s dataset. Average recognition rates of 93.53% on CK+, 91.62% on Enterface05, and 60.77% on BAUM-1s are obtained. The recognition results obtained by fusing the visual and audio predictions are superior to both unimodal recognition and simple concatenation of the individual features.
KW - Emotion recognition
KW - Evolutionary optimization
KW - Extreme learning machine
KW - Feature selection
KW - Multimodal fusion
UR - http://www.scopus.com/inward/record.url?scp=85111486941&partnerID=8YFLogxK
U2 - 10.1007/s12652-021-03407-2
DO - 10.1007/s12652-021-03407-2
M3 - Article
AN - SCOPUS:85111486941
SN - 1868-5137
VL - 14
SP - 1903
EP - 1917
JO - Journal of Ambient Intelligence and Humanized Computing
JF - Journal of Ambient Intelligence and Humanized Computing
IS - 3
ER -