Patch attention convolutional vision transformer for facial expression recognition with occlusion

Chang Liu; Kaoru Hirota; Yaping Dai

doi:10.1016/j.ins.2022.11.068

Corresponding author: Yaping Dai

School of Automation

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

56 Citations (Scopus)

Abstract

Despite substantial progress in Facial Expression Recognition (FER) in recent decades, most previous methods have been developed to recognize constrained facial expressions. Real-world occlusions lead to invisible facial regions and contaminated facial features, which undoubtedly increase the difficulty of FER in the wild. Therefore, a Patch Attention Convolutional Vision Transformer (PACVT) is proposed to tackle the occlusion FER problem. The backbone convolutional neural network is used to extract facial feature maps, which are cropped into multiple regional patches to extract local and global features. The Patch Attention Unit (PAU) is designed to perceive occluded regions by adaptively calculating the patch-level attention weights of local features for expression recognition. The facial patches are mapped into sequences of visual tokens, and the Vision Transformer (ViT) is employed to capture the interactions and correlations between these visual tokens from a global perspective. The self-attention in ViT enables the PACVT to focus on the salient patches with discriminative features and ignore the occlusion. Experiments are conducted on three widely used expression datasets and their occlusion subsets, and the results demonstrate that the proposed PACVT outperforms state-of-the-art methods on occlusion FER. Cross-dataset experiment results evidence the generalization ability of the PACVT.
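The patch-level weighting described in the abstract can be illustrated with a minimal sketch. This is not the authors' PAU: the grid size, the average pooling, the sigmoid scoring function, and the names `crop_patches`, `pool_patch`, `patch_attention`, and `score_weights` are all assumptions made for illustration, and the scoring weights would be learned in a real model. Occlusion is crudely simulated by zeroing one patch. The ViT stage is not shown.

```python
import math
import random


def crop_patches(feature_map, grid):
    """Split an H x W x C feature map (nested lists) into grid x grid patches."""
    height, width = len(feature_map), len(feature_map[0])
    ph, pw = height // grid, width // grid
    patches = []
    for gy in range(grid):
        for gx in range(grid):
            rows = feature_map[gy * ph:(gy + 1) * ph]
            patches.append([row[gx * pw:(gx + 1) * pw] for row in rows])
    return patches


def pool_patch(patch):
    """Global average pooling: one C-dimensional vector per patch."""
    cells = [cell for row in patch for cell in row]
    channels = len(cells[0])
    return [sum(cell[c] for cell in cells) / len(cells) for c in range(channels)]


def patch_attention(patch_vectors, score_weights, score_bias=0.0):
    """Scale each patch vector by a sigmoid score in (0, 1) (assumed scoring rule)."""
    weights = []
    for vec in patch_vectors:
        logit = sum(w * x for w, x in zip(score_weights, vec)) + score_bias
        weights.append(1.0 / (1.0 + math.exp(-logit)))
    weighted = [[a * x for x in vec] for a, vec in zip(weights, patch_vectors)]
    return weights, weighted


random.seed(0)
H, W, C, GRID = 4, 4, 3, 2
feature_map = [[[random.random() for _ in range(C)] for _ in range(W)]
               for _ in range(H)]

# Crude stand-in for occlusion: zero out the top-left 2 x 2 region.
for y in range(2):
    for x in range(2):
        feature_map[y][x] = [0.0] * C

patch_vectors = [pool_patch(p) for p in crop_patches(feature_map, GRID)]
score_weights = [2.0, 2.0, 2.0]  # hypothetical values; learned in practice
weights, weighted = patch_attention(patch_vectors, score_weights, score_bias=-1.0)

for index, weight in enumerate(weights):
    print(f"patch {index}: attention weight {weight:.3f}")
```

With these assumed scoring weights, the zeroed patch receives the smallest attention weight, so its contribution to any later fusion step is reduced relative to the unoccluded patches.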

Original language	English
Pages (from-to)	781-794
Number of pages	14
Journal	Information Sciences
Volume	619
DOI	https://doi.org/10.1016/j.ins.2022.11.068
Publication status	Published - Jan 2023

Cite this

Liu, C., Hirota, K., & Dai, Y. (2023). Patch attention convolutional vision transformer for facial expression recognition with occlusion. Information Sciences, 619, 781-794. https://doi.org/10.1016/j.ins.2022.11.068

@article{dd46107f63864f9b8d497c6bacd12f37,

title = "Patch attention convolutional vision transformer for facial expression recognition with occlusion",

abstract = "Despite substantial progress in Facial Expression Recognition (FER) in recent decades, most previous methods have been developed to recognize constrained facial expressions. Real-world occlusions lead to invisible facial regions and contaminated facial features, which undoubtedly increase the difficulty of FER in the wild. Therefore, a Patch Attention Convolutional Vision Transformer (PACVT) is proposed to tackle the occlusion FER problem. The backbone convolutional neural network is used to extract facial feature maps, which are cropped into multiple regional patches to extract local and global features. The Patch Attention Unit (PAU) is designed to perceive occluded regions by adaptively calculating the patch-level attention weights of local features for expression recognition. The facial patches are mapped into sequences of visual tokens, and the Vision Transformer (ViT) is employed to capture the interactions and correlations between these visual tokens from a global perspective. The self-attention in ViT enables the PACVT to focus on the salient patches with discriminative features and ignore the occlusion. Experiments are conducted on three widely used expression datasets and their occlusion subsets, and the results demonstrate that the proposed PACVT outperforms state-of-the-art methods on occlusion FER. Cross-dataset experiment results evidence the generalization ability of the PACVT.",

keywords = "Facial expression recognition, Local and global feature, Occlusion, Self-attention, Vision transformer",
author = "Chang Liu and Kaoru Hirota and Yaping Dai",
note = "Publisher Copyright: {\textcopyright} 2022 Elsevier Inc.",
year = "2023",
month = jan,
doi = "10.1016/j.ins.2022.11.068",
language = "English",
volume = "619",
pages = "781--794",
journal = "Information Sciences",
issn = "0020-0255",
publisher = "Elsevier Inc.",
}

TY - JOUR
T1 - Patch attention convolutional vision transformer for facial expression recognition with occlusion
AU - Liu, Chang
AU - Hirota, Kaoru
AU - Dai, Yaping
PY - 2023/1
Y1 - 2023/1

AB - Despite substantial progress in Facial Expression Recognition (FER) in recent decades, most previous methods have been developed to recognize constrained facial expressions. Real-world occlusions lead to invisible facial regions and contaminated facial features, which undoubtedly increase the difficulty of FER in the wild. Therefore, a Patch Attention Convolutional Vision Transformer (PACVT) is proposed to tackle the occlusion FER problem. The backbone convolutional neural network is used to extract facial feature maps, which are cropped into multiple regional patches to extract local and global features. The Patch Attention Unit (PAU) is designed to perceive occluded regions by adaptively calculating the patch-level attention weights of local features for expression recognition. The facial patches are mapped into sequences of visual tokens, and the Vision Transformer (ViT) is employed to capture the interactions and correlations between these visual tokens from a global perspective. The self-attention in ViT enables the PACVT to focus on the salient patches with discriminative features and ignore the occlusion. Experiments are conducted on three widely used expression datasets and their occlusion subsets, and the results demonstrate that the proposed PACVT outperforms state-of-the-art methods on occlusion FER. Cross-dataset experiment results evidence the generalization ability of the PACVT.

KW - Facial expression recognition
KW - Local and global feature
KW - Occlusion
KW - Self-attention
KW - Vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85142894518&partnerID=8YFLogxK
U2 - 10.1016/j.ins.2022.11.068
DO - 10.1016/j.ins.2022.11.068
M3 - Article
AN - SCOPUS:85142894518
SN - 0020-0255
VL - 619
SP - 781
EP - 794
JO - Information Sciences
JF - Information Sciences
ER -
