Multi-cue fusion for emotion recognition in the wild

Jingwei Yan, Wenming Zheng*, Zhen Cui, Chuangao Tang, Tong Zhang, Yuan Zong

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

87 Citations (Scopus)

Abstract

Emotion recognition has become a hot research topic in the past several years due to the large demand for this technology in many practical situations. One challenging task in this area is to recognize emotion types in a given video clip collected in the wild. To solve this problem, we propose a multi-cue fusion emotion recognition (MCFER) framework that models human emotions from three complementary cues, i.e., facial texture, facial landmark action, and the audio signal, and then fuses them together. To capture the dynamic change of facial texture, we employ a cascaded convolutional neural network (CNN) and bidirectional recurrent neural network (BRNN) architecture, where the facial image from each frame is first fed into the CNN to extract a high-level texture feature, and the resulting feature sequence is then fed into the BRNN to learn the changes within it. Facial landmark action explicitly models the movement of facial muscles; an SVM and a CNN are deployed to explore the emotion-related patterns in it. The audio signal is also modeled with a CNN by extracting low-level acoustic features from segmented clips and stacking them into an image-like matrix. We fuse these models at both the feature level and the decision level to further boost the overall performance. Experimental results on two challenging databases demonstrate the effectiveness and superiority of the proposed MCFER framework.
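To make the cascaded texture stream concrete, below is a minimal PyTorch sketch of a per-frame CNN feeding a bidirectional GRU, followed by score-averaging decision fusion. The `TextureStream` name, all layer sizes, the GRU choice, and the averaging scheme are illustrative assumptions; the abstract does not specify the paper's actual backbone or fusion rule.

```python
import torch
import torch.nn as nn

class TextureStream(nn.Module):
    """Cascaded CNN -> BRNN over a sequence of face frames.

    A minimal sketch of the facial-texture cue: the backbone, feature
    dimension, and hidden size below are assumptions, not the paper's
    actual configuration.
    """
    def __init__(self, feat_dim=256, hidden=128, num_classes=7):
        super().__init__()
        # Per-frame CNN standing in for the high-level texture extractor.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Bidirectional RNN that reads the per-frame feature sequence.
        self.brnn = nn.GRU(feat_dim, hidden, batch_first=True,
                           bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, clips):                  # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))  # (B*T, feat_dim)
        out, _ = self.brnn(feats.view(b, t, -1))
        return self.head(out.mean(dim=1))      # temporal pooling -> logits

if __name__ == "__main__":
    model = TextureStream()
    clips = torch.randn(2, 16, 3, 64, 64)      # 2 clips of 16 frames each
    tex_logits = model(clips)                  # (2, num_classes)

    # Decision-level fusion, sketched here as averaging per-cue class
    # probabilities; the landmark and audio logits are random stand-ins.
    lm_logits = torch.randn_like(tex_logits)
    audio_logits = torch.randn_like(tex_logits)
    fused = torch.stack([s.softmax(dim=-1) for s in
                         (tex_logits, lm_logits, audio_logits)]).mean(dim=0)
    print(fused.shape)                         # torch.Size([2, 7])
```

Averaging softmax scores is only one simple decision-level rule; weighted sums or learned fusion layers over concatenated features (feature-level fusion) are equally plausible readings of the abstract.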

Original language: English
Pages (from-to): 27-35
Number of pages: 9
Journal: Neurocomputing
Volume: 309
DOI
Publication status: Published - 2 Oct 2018
Externally published: Yes
