TY - GEN
T1 - Multi-clue fusion for emotion recognition in the wild
AU - Yan, Jingwei
AU - Zheng, Wenming
AU - Cui, Zhen
AU - Tang, Chuangao
AU - Zhang, Tong
AU - Zong, Yuan
AU - Sun, Ning
N1 - Publisher Copyright:
© 2016 ACM.
PY - 2016/10/31
Y1 - 2016/10/31
N2 - In the past three years, the Emotion Recognition in the Wild (EmotiW) Grand Challenge has drawn increasing attention due to its great application potential. In the fourth challenge, which targets video-based emotion recognition, we propose a multi-clue emotion fusion (MCEF) framework that models human emotion from three mutually complementary sources: facial appearance texture, facial action, and audio. To extract high-level emotion features from sequential face images, we employ a CNN-RNN architecture in which the face image from each frame is first fed into a fine-tuned VGG-Face network to extract face features, and the features of all frames are then traversed sequentially by a bidirectional RNN to capture the dynamic changes of facial texture. To capture facial actions more accurately, a facial landmark trajectory model is proposed to explicitly learn the emotion variations of facial components. Furthermore, audio signals are modeled in a CNN framework by extracting low-level energy features from segmented audio clips and stacking them into an image-like map. Finally, we fuse the results generated from the three clues to boost emotion recognition performance. Our proposed MCEF achieves an overall accuracy of 56.66%, a large improvement of 16.19% over the baseline.
AB - In the past three years, the Emotion Recognition in the Wild (EmotiW) Grand Challenge has drawn increasing attention due to its great application potential. In the fourth challenge, which targets video-based emotion recognition, we propose a multi-clue emotion fusion (MCEF) framework that models human emotion from three mutually complementary sources: facial appearance texture, facial action, and audio. To extract high-level emotion features from sequential face images, we employ a CNN-RNN architecture in which the face image from each frame is first fed into a fine-tuned VGG-Face network to extract face features, and the features of all frames are then traversed sequentially by a bidirectional RNN to capture the dynamic changes of facial texture. To capture facial actions more accurately, a facial landmark trajectory model is proposed to explicitly learn the emotion variations of facial components. Furthermore, audio signals are modeled in a CNN framework by extracting low-level energy features from segmented audio clips and stacking them into an image-like map. Finally, we fuse the results generated from the three clues to boost emotion recognition performance. Our proposed MCEF achieves an overall accuracy of 56.66%, a large improvement of 16.19% over the baseline.
KW - AFEW
KW - Convolutional neural network (CNN)
KW - Emotion recognition in the wild
KW - Multi-clue
KW - Recurrent neural network (RNN)
UR - http://www.scopus.com/inward/record.url?scp=85016557815&partnerID=8YFLogxK
U2 - 10.1145/2993148.2997630
DO - 10.1145/2993148.2997630
M3 - Conference contribution
AN - SCOPUS:85016557815
T3 - ICMI 2016 - Proceedings of the 18th ACM International Conference on Multimodal Interaction
SP - 458
EP - 463
BT - ICMI 2016 - Proceedings of the 18th ACM International Conference on Multimodal Interaction
A2 - Pelachaud, Catherine
A2 - Nakano, Yukiko I.
A2 - Nishida, Toyoaki
A2 - Busso, Carlos
A2 - Morency, Louis-Philippe
A2 - Andre, Elisabeth
PB - Association for Computing Machinery, Inc
T2 - 18th ACM International Conference on Multimodal Interaction, ICMI 2016
Y2 - 12 November 2016 through 16 November 2016
ER -