TY - GEN
T1 - Two-Stage Self-Supervised Learning for Facial Action Unit Recognition
AU - Cheng, Hao
AU - Xie, Xiang
AU - Liang, Shuang
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/3/18
Y1 - 2022/3/18
N2 - This paper proposes a two-stage self-supervised method for facial action unit (AU) recognition. First, an auto-encoder approach is applied, with an encoder that operates on a small proportion (e.g., 40%) of image patches. The decoder reconstructs the original image from latent features and learnable mask tokens. After training, the encoder is adapted to the task of AU recognition, yet poor results are observed for certain AU classes. To address this problem, contrastive learning is proposed to learn discriminative features. The method uses images from the VGG-Face2 dataset, which vary in head pose, age, and background. Experiments on AU recognition show that the two-stage method strengthens representation quality. Compared to previous self-supervised methods, the pre-trained encoder achieves the best linear probing result on the DISFA dataset, with an F1-score of 53.8%. A fine-tuning experiment is also conducted, obtaining an F1-score of 59.9%, roughly 3% below the existing state-of-the-art method. The two-stage training method is easy to implement and extensible for further research.
AB - This paper proposes a two-stage self-supervised method for facial action unit (AU) recognition. First, an auto-encoder approach is applied, with an encoder that operates on a small proportion (e.g., 40%) of image patches. The decoder reconstructs the original image from latent features and learnable mask tokens. After training, the encoder is adapted to the task of AU recognition, yet poor results are observed for certain AU classes. To address this problem, contrastive learning is proposed to learn discriminative features. The method uses images from the VGG-Face2 dataset, which vary in head pose, age, and background. Experiments on AU recognition show that the two-stage method strengthens representation quality. Compared to previous self-supervised methods, the pre-trained encoder achieves the best linear probing result on the DISFA dataset, with an F1-score of 53.8%. A fine-tuning experiment is also conducted, obtaining an F1-score of 59.9%, roughly 3% below the existing state-of-the-art method. The two-stage training method is easy to implement and extensible for further research.
KW - Facial action unit recognition
KW - Self-supervised learning
KW - Vision Transformers
UR - http://www.scopus.com/inward/record.url?scp=85131875457&partnerID=8YFLogxK
U2 - 10.1145/3531232.3531243
DO - 10.1145/3531232.3531243
M3 - Conference contribution
AN - SCOPUS:85131875457
T3 - ACM International Conference Proceeding Series
SP - 80
EP - 84
BT - IVSP 2022 - 2022 4th International Conference on Image, Video and Signal Processing
PB - Association for Computing Machinery
T2 - 4th International Conference on Image, Video and Signal Processing, IVSP 2022
Y2 - 18 March 2022 through 20 March 2022
ER -