TY - GEN
T1 - Learning higher representations from pre-trained deep models with data augmentation for the COMPARE 2020 challenge mask task
AU - Koike, Tomoya
AU - Qian, Kun
AU - Schuller, Björn W.
AU - Yamamoto, Yoshiharu
N1 - Publisher Copyright:
Copyright © 2020 ISCA
PY - 2020
Y1 - 2020
N2 - Human hand-crafted features are often regarded as expensive, time-consuming, and difficult to design in almost all machine-learning-related tasks. First, such well-designed features rely heavily on human expert domain knowledge, which may hinder collaboration across fields. Second, features extracted in such a brute-force scenario may not transfer easily to another task, which means a new set of features has to be designed. To this end, we introduce a method based on a transfer learning strategy combined with data augmentation techniques for the COMPARE 2020 Challenge Mask Sub-Challenge. Unlike previous studies mainly based on models pre-trained on image data, we use a model pre-trained on large-scale audio data, i.e., AudioSet. In addition, the SpecAugment and mixup methods are used to improve the generalisation of the deep models. Experimental results demonstrate that the best proposed model significantly (p < .001, one-tailed z-test) improves the unweighted average recall (UAR) from 71.8% (baseline) to 76.2% on the test set. Finally, the best result, i.e., a UAR of 77.5% on the test set, is achieved by a late fusion of the two best proposed models and the best single model of the baseline.
AB - Human hand-crafted features are often regarded as expensive, time-consuming, and difficult to design in almost all machine-learning-related tasks. First, such well-designed features rely heavily on human expert domain knowledge, which may hinder collaboration across fields. Second, features extracted in such a brute-force scenario may not transfer easily to another task, which means a new set of features has to be designed. To this end, we introduce a method based on a transfer learning strategy combined with data augmentation techniques for the COMPARE 2020 Challenge Mask Sub-Challenge. Unlike previous studies mainly based on models pre-trained on image data, we use a model pre-trained on large-scale audio data, i.e., AudioSet. In addition, the SpecAugment and mixup methods are used to improve the generalisation of the deep models. Experimental results demonstrate that the best proposed model significantly (p < .001, one-tailed z-test) improves the unweighted average recall (UAR) from 71.8% (baseline) to 76.2% on the test set. Finally, the best result, i.e., a UAR of 77.5% on the test set, is achieved by a late fusion of the two best proposed models and the best single model of the baseline.
KW - Computational Paralinguistics
KW - Data Augmentation
KW - Deep Learning
KW - Speech under Mask
KW - Transfer Learning
UR - http://www.scopus.com/inward/record.url?scp=85098193731&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-1552
DO - 10.21437/Interspeech.2020-1552
M3 - Conference contribution
AN - SCOPUS:85098193731
SN - 9781713820697
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 2047
EP - 2051
BT - Interspeech 2020
PB - International Speech Communication Association
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -