TY - GEN
T1 - Multimodal Emotion Recognition Based on Multi-Scale Facial Features and Cross-Modal Attention
AU - Bao, Chengao
AU - Chen, Luefeng
AU - Li, Min
AU - Wu, Min
AU - Pedrycz, Witold
AU - Hirota, Kaoru
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - A multimodal emotion recognition method based on a multi-scale facial feature and cross-modal attention (MS-FCA) network is proposed. The MS-FCA model extends the traditional single-branch ViT into a two-branch ViT architecture in which the classification token of each branch interacts with the patch embeddings of the other branch, enabling effective interaction between information at different scales. Audio features are extracted with a ResNet18 network. A cross-modal attention mechanism computes weight matrices between the features of the two modalities, making full use of inter-modal correlation and effectively fusing visual and audio features for more accurate emotion recognition. Experiments are conducted on two datasets, eNTERFACE'05 and RAVDESS. The proposed method achieves accuracies of 85.42% and 83.84% on eNTERFACE'05 and RAVDESS, respectively, demonstrating its effectiveness.
AB - A multimodal emotion recognition method based on a multi-scale facial feature and cross-modal attention (MS-FCA) network is proposed. The MS-FCA model extends the traditional single-branch ViT into a two-branch ViT architecture in which the classification token of each branch interacts with the patch embeddings of the other branch, enabling effective interaction between information at different scales. Audio features are extracted with a ResNet18 network. A cross-modal attention mechanism computes weight matrices between the features of the two modalities, making full use of inter-modal correlation and effectively fusing visual and audio features for more accurate emotion recognition. Experiments are conducted on two datasets, eNTERFACE'05 and RAVDESS. The proposed method achieves accuracies of 85.42% and 83.84% on eNTERFACE'05 and RAVDESS, respectively, demonstrating its effectiveness.
KW - cross-modal attention
KW - Multi-scale features
KW - multimodal emotion recognition
UR - http://www.scopus.com/inward/record.url?scp=105004176275&partnerID=8YFLogxK
U2 - 10.1109/ICIT63637.2025.10965274
DO - 10.1109/ICIT63637.2025.10965274
M3 - Conference contribution
AN - SCOPUS:105004176275
T3 - Proceedings of the IEEE International Conference on Industrial Technology
BT - 2025 International Conference on Industrial Technology, ICIT 2025 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 26th International Conference on Industrial Technology, ICIT 2025
Y2 - 26 March 2025 through 28 March 2025
ER -
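
The abstract describes fusing visual features (from a two-branch ViT) and audio features (from ResNet18) through cross-modal attention. The sketch below is a minimal PyTorch illustration of such a fusion step, not the authors' implementation; the feature dimensions, head count, pooling, and classifier head are assumptions chosen only for demonstration.

import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    # Hypothetical sketch: each modality queries the other, and the attended
    # outputs are pooled, concatenated, and classified into emotion classes.
    def __init__(self, dim=512, num_heads=8, num_classes=6):
        super().__init__()
        self.vis_to_aud = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.aud_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, vis, aud):
        # vis: (B, Nv, dim) visual tokens; aud: (B, Na, dim) audio tokens
        v_att, _ = self.vis_to_aud(query=vis, key=aud, value=aud)  # visual queries attend to audio
        a_att, _ = self.aud_to_vis(query=aud, key=vis, value=vis)  # audio queries attend to visual
        fused = torch.cat([v_att.mean(dim=1), a_att.mean(dim=1)], dim=-1)
        return self.classifier(fused)

if __name__ == "__main__":
    model = CrossModalAttentionFusion()
    vis = torch.randn(2, 197, 512)  # dummy visual token sequence
    aud = torch.randn(2, 64, 512)   # dummy audio token sequence
    print(model(vis, aud).shape)    # expected: torch.Size([2, 6])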