Learning Sequential Variation Information for Dynamic Facial Expression Recognition

Bei Pan, Kaoru Hirota*, Yaping Dai, Zhiyang Jia, Shuai Shao, Jinhua She

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

A multiscale sequence information fusion (MSSIF) method is presented for dynamic facial expression recognition (DFER) in video sequences. It exploits multiscale information by integrating features from individual frames, subsequences, and entire sequences through a transformer-based architecture. The hierarchical fusion proceeds in three stages: deep feature extraction at the frame level to capture fine-grained visual details, intrasubsequence fusion that applies self-attention over adjacent frames, and intersubsequence fusion that synthesizes long-term emotional dynamics across time scales. The efficacy of MSSIF is demonstrated through extensive evaluation on three video datasets, eNTERFACE’05, BAUM-1s, and AFEW, on which it achieves overall recognition accuracies of 60.1%, 60.7%, and 58.8%, respectively. These results substantiate MSSIF’s superior performance in accurately recognizing facial expressions by capturing both short- and long-term dependencies within video sequences, making it a potent tool for real-world applications requiring nuanced dynamic facial expression recognition.
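
No implementation accompanies this record, and the abstract describes the architecture only at a high level. The PyTorch sketch below is one plausible reading of the three-stage idea (frame-level encoding, intrasubsequence self-attention, intersubsequence fusion over subsequence tokens); every layer size, the subsequence length, the mean-pooling steps, and the name MSSIFSketch are illustrative assumptions, not the authors' design.

import torch
import torch.nn as nn

class MSSIFSketch(nn.Module):
    """Illustrative three-level fusion sketch: frames -> subsequences -> sequence.
    NOT the authors' MSSIF implementation; all sizes are assumed."""

    def __init__(self, feat_dim=256, sub_len=4, n_classes=7):
        super().__init__()
        self.sub_len = sub_len
        # Frame-level encoder (a small stand-in for a deep CNN backbone).
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Intrasubsequence fusion: self-attention over adjacent frames.
        intra_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, dim_feedforward=512, batch_first=True)
        self.intra_fusion = nn.TransformerEncoder(intra_layer, num_layers=1)
        # Intersubsequence fusion: self-attention over subsequence tokens.
        inter_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, dim_feedforward=512, batch_first=True)
        self.inter_fusion = nn.TransformerEncoder(inter_layer, num_layers=1)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, video):                        # video: (B, T, 3, H, W)
        b, t, c, h, w = video.shape
        feats = self.frame_encoder(video.reshape(b * t, c, h, w))
        feats = feats.reshape(b, t, -1)              # per-frame features (B, T, D)
        # Group frames into non-overlapping subsequences of length sub_len.
        s = t // self.sub_len
        sub = feats[:, : s * self.sub_len].reshape(b * s, self.sub_len, -1)
        sub = self.intra_fusion(sub).mean(dim=1)     # one token per subsequence
        tokens = sub.reshape(b, s, -1)               # (B, S, D)
        seq = self.inter_fusion(tokens).mean(dim=1)  # long-term dynamics
        return self.classifier(seq)                  # expression logits

# Toy usage: 2 clips of 16 frames each, 7 expression classes.
logits = MSSIFSketch()(torch.randn(2, 16, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 7])

Pooling each fused subsequence to a single token before the second transformer is one natural way to let the inter-level stage attend over a much shorter sequence, which is what makes modeling long-range dynamics across the whole clip cheap.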

Original language: English
Journal: IEEE Transactions on Neural Networks and Learning Systems
Publication status: Accepted/In press - 2025
Externally published: Yes

Keywords

  • Dynamic facial expression recognition (DFER)
  • multiscale feature fusion
  • spatio-temporal features
  • transformer
