Abstract
Despite substantial progress in Facial Expression Recognition (FER) in recent decades, most previous methods were developed to recognize facial expressions under constrained conditions. Real-world occlusions leave facial regions invisible and contaminate facial features, which increases the difficulty of FER in the wild. To tackle the occlusion FER problem, a Patch Attention Convolutional Vision Transformer (PACVT) is proposed. A backbone convolutional neural network extracts facial feature maps, which are cropped into multiple regional patches to extract local and global features. The Patch Attention Unit (PAU) is designed to perceive occluded regions by adaptively calculating patch-level attention weights of the local features for expression recognition. The facial patches are mapped into sequences of visual tokens, and a Vision Transformer (ViT) is employed to capture the interactions and correlations between these visual tokens from a global perspective. The self-attention in ViT enables the PACVT to focus on salient patches with discriminative features and to ignore occluded ones. Experiments are conducted on three widely used expression datasets and their occlusion subsets, and the results demonstrate that the proposed PACVT outperforms state-of-the-art methods on occlusion FER. Cross-dataset experiments further demonstrate the generalization ability of the PACVT.
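The pipeline described in the abstract (crop a feature map into patches, weight each patch, then let the patch tokens attend to one another) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the non-overlapping crop, the average pooling, the sigmoid scoring used as a stand-in for the PAU, the identity projections in the self-attention step, and the way the two attention stages are chained are all assumptions made for illustration.

```python
import math
import random


def crop_patches(feature_map, patch):
    """Split an H x W x C feature map (nested lists) into non-overlapping patch x patch regions."""
    h, w = len(feature_map), len(feature_map[0])
    patches = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            patches.append([feature_map[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def pool(cells):
    """Average-pool the cells of one patch into a single C-dimensional token."""
    n, channels = len(cells), len(cells[0])
    return [sum(cell[k] for cell in cells) / n for k in range(channels)]


def patch_attention(tokens, w, b):
    """Hypothetical PAU stand-in: one weight in (0, 1) per patch via sigmoid(w . token + b)."""
    weights = []
    for t in tokens:
        score = sum(wi * ti for wi, ti in zip(w, t)) + b
        weights.append(1.0 / (1.0 + math.exp(-score)))
    return weights


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity Q/K/V projections (assumed)."""
    d = len(tokens[0])
    mixed = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
        total = sum(exps)
        probs = [e / total for e in exps]
        mixed.append([sum(p * v[j] for p, v in zip(probs, tokens)) for j in range(d)])
    return mixed


# Toy 4 x 4 feature map with 3 channels; random values stand in for CNN backbone output.
rng = random.Random(0)
feature_map = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)] for _ in range(4)]

patches = crop_patches(feature_map, patch=2)            # four 2 x 2 regional patches
tokens = [pool(p) for p in patches]                     # one token per patch
weights = patch_attention(tokens, w=[0.5, -0.3, 0.8], b=0.0)  # illustrative, untrained parameters
weighted = [[wt * x for x in t] for wt, t in zip(weights, tokens)]
mixed = self_attention(weighted)                        # global interaction between patch tokens
```

In a trained model the scoring parameters would be learned, so that occluded patches receive low weights and contribute less to the tokens passed on to the transformer; here the parameters are arbitrary and only the data flow is meaningful.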
Original language | English |
---|---|
Pages (from-to) | 781-794 |
Number of pages | 14 |
Journal | Information Sciences |
Volume | 619 |
Publication status | Published - Jan 2023 |
Keywords
- Facial expression recognition
- Local and global feature
- Occlusion
- Self-attention
- Vision transformer