Two-Stage Self-Supervised Learning for Facial Action Unit Recognition

Hao Cheng; Xiang Xie; Shuang Liang

doi:10.1145/3531232.3531243

School of Information and Electronics

Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This paper proposes a two-stage self-supervised method for facial action unit (AU) recognition. First, an auto-encoder approach is applied, with an encoder that operates on a small proportion (e.g., 40%) of image patches. The decoder reconstructs the original image from latent features and learnable mask tokens. After training, the encoder is adapted to the task of AU recognition, yet poor results are observed in certain AU classes. To address this problem, contrastive learning is introduced to learn discriminative features. The method uses images from the VGG-Face2 dataset, which vary in head pose, age and background. Experiments on AU recognition show that the two-stage method strengthens representation quality. Compared to previous self-supervised methods, the pre-trained encoder achieves the best linear probing result on the DISFA dataset, with an F1-score of 53.8%. A fine-tuning experiment is also conducted and obtains an F1-score of 59.9%, a gap of roughly 3% to the existing state-of-the-art method. The two-stage training method is easy to implement and extensible for further research.
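
The first stage described in the abstract, where the encoder sees only a fraction of the image patches and the decoder fills the remaining positions with mask tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `split_patches`, the fixed seed, uniform random sampling, and the 224x224 image with 16x16 patches are all assumptions made for illustration. Only the 40% ratio comes from the abstract.

```python
import random


def split_patches(num_patches, keep_ratio=0.4, seed=0):
    """Randomly split patch indices into visible and masked sets.

    keep_ratio=0.4 mirrors the "e.g., 40%" figure in the abstract.
    The name, the seed, and uniform sampling are assumptions.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_keep = int(num_patches * keep_ratio)
    visible = sorted(indices[:num_keep])  # patches the encoder would see
    masked = sorted(indices[num_keep:])   # positions the decoder fills with mask tokens
    return visible, masked


# Assumed sizes: a 224x224 image in 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = split_patches(196)
print(len(visible), len(masked))  # prints "78 118"
```

In a full model, the encoder would embed only the `visible` patches, and the decoder would receive those latent features plus one learnable mask token per index in `masked`.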

Original language	English
Title of host publication	IVSP 2022 - 2022 4th International Conference on Image, Video and Signal Processing
Publisher	Association for Computing Machinery
Pages	80-84
Number of pages	5
ISBN (Electronic)	9781450387415
DOIs	https://doi.org/10.1145/3531232.3531243
Publication status	Published - 18 Mar 2022
Event	4th International Conference on Image, Video and Signal Processing, IVSP 2022 - Virtual, Online, Singapore (Duration: 18 Mar 2022 → 20 Mar 2022)

Publication series

Name	ACM International Conference Proceeding Series

Conference

Conference	4th International Conference on Image, Video and Signal Processing, IVSP 2022
Country/Territory	Singapore
City	Virtual, Online
Period	18/03/22 → 20/03/22

Keywords

Facial action unit recognition
Self-supervised learning
Vision Transformers

Access to Document

10.1145/3531232.3531243

Cite this

@inproceedings{0158ee5490e545b2b99c0c5db7e844ea,
title = "Two-Stage Self-Supervised Learning for Facial Action Unit Recognition",
abstract = "This paper proposes a two-stage self-supervised method for facial action unit (AU) recognition. First, an auto-encoder approach is applied, with an encoder that operates on a small proportion (e.g., 40\%) of image patches. The decoder reconstructs the original image from latent features and learnable mask tokens. After training, the encoder is adapted to the task of AU recognition, yet poor results are observed in certain AU classes. To address this problem, contrastive learning is introduced to learn discriminative features. The method uses images from the VGG-Face2 dataset, which vary in head pose, age and background. Experiments on AU recognition show that the two-stage method strengthens representation quality. Compared to previous self-supervised methods, the pre-trained encoder achieves the best linear probing result on the DISFA dataset, with an F1-score of 53.8\%. A fine-tuning experiment is also conducted and obtains an F1-score of 59.9\%, a gap of roughly 3\% to the existing state-of-the-art method. The two-stage training method is easy to implement and extensible for further research.",
keywords = "Facial action unit recognition, Self-supervised learning, Vision Transformers",
author = "Hao Cheng and Xiang Xie and Shuang Liang",
note = "Publisher Copyright: {\textcopyright} 2022 ACM.; 4th International Conference on Image, Video and Signal Processing, IVSP 2022 ; Conference date: 18-03-2022 Through 20-03-2022",
year = "2022",
month = mar,
day = "18",
doi = "10.1145/3531232.3531243",
language = "English",
series = "ACM International Conference Proceeding Series",
publisher = "Association for Computing Machinery",
pages = "80--84",
booktitle = "IVSP 2022 - 2022 4th International Conference on Image, Video and Signal Processing",
}

Cheng, H, Xie, X & Liang, S 2022, Two-Stage Self-Supervised Learning for Facial Action Unit Recognition. in IVSP 2022 - 2022 4th International Conference on Image, Video and Signal Processing. ACM International Conference Proceeding Series, Association for Computing Machinery, pp. 80-84, 4th International Conference on Image, Video and Signal Processing, IVSP 2022, Virtual, Online, Singapore, 18/03/22. https://doi.org/10.1145/3531232.3531243

Two-Stage Self-Supervised Learning for Facial Action Unit Recognition. / Cheng, Hao; Xie, Xiang; Liang, Shuang.
IVSP 2022 - 2022 4th International Conference on Image, Video and Signal Processing. Association for Computing Machinery, 2022. p. 80-84 (ACM International Conference Proceeding Series).

TY - GEN
T1 - Two-Stage Self-Supervised Learning for Facial Action Unit Recognition
AU - Cheng, Hao
AU - Xie, Xiang
AU - Liang, Shuang
PY - 2022/3/18
Y1 - 2022/3/18
AB - This paper proposes a two-stage self-supervised method for facial action unit (AU) recognition. First, an auto-encoder approach is applied, with an encoder that operates on a small proportion (e.g., 40%) of image patches. The decoder reconstructs the original image from latent features and learnable mask tokens. After training, the encoder is adapted to the task of AU recognition, yet poor results are observed in certain AU classes. To address this problem, contrastive learning is introduced to learn discriminative features. The method uses images from the VGG-Face2 dataset, which vary in head pose, age and background. Experiments on AU recognition show that the two-stage method strengthens representation quality. Compared to previous self-supervised methods, the pre-trained encoder achieves the best linear probing result on the DISFA dataset, with an F1-score of 53.8%. A fine-tuning experiment is also conducted and obtains an F1-score of 59.9%, a gap of roughly 3% to the existing state-of-the-art method. The two-stage training method is easy to implement and extensible for further research.
KW - Facial action unit recognition
KW - Self-supervised learning
KW - Vision Transformers
UR - http://www.scopus.com/inward/record.url?scp=85131875457&partnerID=8YFLogxK
U2 - 10.1145/3531232.3531243
DO - 10.1145/3531232.3531243
M3 - Conference contribution
AN - SCOPUS:85131875457
T3 - ACM International Conference Proceeding Series
SP - 80
EP - 84
BT - IVSP 2022 - 2022 4th International Conference on Image, Video and Signal Processing
PB - Association for Computing Machinery
T2 - 4th International Conference on Image, Video and Signal Processing, IVSP 2022
Y2 - 18 March 2022 through 20 March 2022
ER -
