Abstract
Facial action unit (AU) detection is an important task in affective computing and has attracted extensive attention in the fields of computer vision and artificial intelligence. Previous studies on AU detection usually encode complex regional feature representations with manually defined facial landmarks and learn to model the relationships among AUs via graph neural networks. Although some progress has been achieved, it remains difficult for existing methods to capture the exclusive and concurrent relationships among different combinations of facial AUs. To circumvent this issue, we propose a new progressive multi-scale vision transformer (PMVT) to capture the complex relationships among different AUs across a wide range of expressions in a data-driven fashion. PMVT is based on a multi-scale self-attention mechanism that can flexibly attend to a sequence of image patches to encode the critical cues for AUs. Compared with previous AU detection methods, the benefits of PMVT are 2-fold: (i) PMVT does not rely on manually defined facial landmarks to extract regional representations, and (ii) PMVT is capable of encoding facial regions with adaptive receptive fields, thus facilitating flexible representation of different AUs. Experimental results show that PMVT improves AU detection accuracy on the popular BP4D and DISFA datasets. Compared with other state-of-the-art AU detection methods, PMVT obtains consistent improvements. Visualization results show that PMVT automatically perceives the discriminative facial regions for robust AU detection.
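To make the multi-scale self-attention idea concrete, the following is a minimal, illustrative NumPy sketch (not the paper's actual implementation): an image is split into non-overlapping patches at two scales, and each patch sequence is processed by scaled dot-product self-attention. The `patchify` helper, random weight matrices, and patch sizes are all hypothetical choices for demonstration; in the real model the projections are learned and the scales are combined progressively.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d_k, seed=0):
    # scaled dot-product self-attention over a token sequence;
    # random projections stand in for learned Q/K/V weights (illustrative only)
    rng = np.random.default_rng(seed)
    d = tokens.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (n_tokens, n_tokens)
    return attn @ V

def patchify(img, p):
    # split an H x W image into non-overlapping p x p patches,
    # each flattened into a token of length p*p
    H, W = img.shape
    return img.reshape(H // p, p, W // p, p).swapaxes(1, 2).reshape(-1, p * p)

img = np.arange(16 * 16, dtype=float).reshape(16, 16)  # toy "face" image
# two scales: coarse 8x8 patches (4 tokens) and fine 4x4 patches (16 tokens);
# coarse tokens cover large regions (e.g. whole cheek), fine tokens small ones
coarse = self_attention(patchify(img, 8), d_k=8)  # shape (4, 8)
fine = self_attention(patchify(img, 4), d_k=8)    # shape (16, 8)
```

Attending at several patch scales in parallel is one simple way to realize "adaptive receptive fields": AUs tied to small regions (e.g. a brow corner) can be captured by fine patches, while AUs spanning larger areas align with coarse patches.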
Original language | English |
---|---|
Article number | 824592 |
Journal | Frontiers in Neurorobotics |
Volume | 15 |
DOIs | |
Publication status | Published - 12 Jan 2022 |
Keywords
- affective computing
- cross-attention
- facial action unit recognition
- multi-scale transformer
- self-attention