TY - JOUR
T1 - Learning Sequential Variation Information for Dynamic Facial Expression Recognition
AU - Pan, Bei
AU - Hirota, Kaoru
AU - Dai, Yaping
AU - Jia, Zhiyang
AU - Shao, Shuai
AU - She, Jinhua
N1 - Publisher Copyright:
© 2012 IEEE.
PY - 2025
Y1 - 2025
AB - A multiscale sequence information fusion (MSSIF) method is presented for dynamic facial expression recognition (DFER) in video sequences. It exploits multiscale information by integrating features from individual frames, subsequences, and entire sequences through a transformer-based architecture. This hierarchical feature fusion process includes deep feature extraction at the frame level to capture intricate visual details, intrasubsequence fusion using self-attention mechanisms for analyzing adjacent frames, and intersubsequence fusion to synthesize long-term emotional dynamics across time scales. The efficacy of MSSIF is demonstrated through extensive evaluation on three video datasets: eNTERFACE’05, BAUM-1s, and AFEW, where it achieves overall recognition accuracies of 60.1%, 60.7%, and 58.8%, respectively. These results substantiate MSSIF’s superior performance in accurately recognizing facial expressions by managing short- and long-term dependencies within video sequences, making it a potent tool for real-world applications requiring nuanced dynamic facial expression detection.
KW - Dynamic facial expression recognition (DFER)
KW - multiscale feature fusion
KW - spatio-temporal features
KW - transformer
UR - http://www.scopus.com/inward/record.url?scp=105002004644&partnerID=8YFLogxK
U2 - 10.1109/TNNLS.2025.3548669
DO - 10.1109/TNNLS.2025.3548669
M3 - Article
AN - SCOPUS:105002004644
SN - 2162-237X
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
ER -