TY - GEN
T1 - Learning higher representations from pre-trained deep models with data augmentation for the COMPARE 2020 challenge mask task
AU - Koike, Tomoya
AU - Qian, Kun
AU - Schuller, Björn W.
AU - Yamamoto, Yoshiharu
N1 - Publisher Copyright:
Copyright © 2020 ISCA
PY - 2020
Y1 - 2020
N2 - Human hand-crafted features are often regarded as expensive, time-consuming, and difficult to design in almost all machine-learning-related tasks. First, such well-designed features rely heavily on human expert domain knowledge, which may hinder collaboration across fields. Second, features extracted in such a brute-force scenario may not transfer easily to another task, which means a new set of features has to be designed. To this end, we introduce a method based on a transfer learning strategy combined with data augmentation techniques for the COMPARE 2020 Challenge Mask Sub-Challenge. Unlike previous studies mainly based on models pre-trained on image data, we use a model pre-trained on large-scale audio data, i.e., AudioSet. In addition, the SpecAugment and mixup methods are used to improve the generalisation of the deep models. Experimental results demonstrate that the best proposed model significantly (p < .001, one-tailed z-test) improves the unweighted average recall (UAR) from 71.8% (baseline) to 76.2% on the test set. Finally, the best result, i.e., a UAR of 77.5% on the test set, is achieved by a late fusion of the two best proposed models and the best single model of the baseline.
AB - Human hand-crafted features are often regarded as expensive, time-consuming, and difficult to design in almost all machine-learning-related tasks. First, such well-designed features rely heavily on human expert domain knowledge, which may hinder collaboration across fields. Second, features extracted in such a brute-force scenario may not transfer easily to another task, which means a new set of features has to be designed. To this end, we introduce a method based on a transfer learning strategy combined with data augmentation techniques for the COMPARE 2020 Challenge Mask Sub-Challenge. Unlike previous studies mainly based on models pre-trained on image data, we use a model pre-trained on large-scale audio data, i.e., AudioSet. In addition, the SpecAugment and mixup methods are used to improve the generalisation of the deep models. Experimental results demonstrate that the best proposed model significantly (p < .001, one-tailed z-test) improves the unweighted average recall (UAR) from 71.8% (baseline) to 76.2% on the test set. Finally, the best result, i.e., a UAR of 77.5% on the test set, is achieved by a late fusion of the two best proposed models and the best single model of the baseline.
KW - Computational Paralinguistics
KW - Data Augmentation
KW - Deep Learning
KW - Speech under Mask
KW - Transfer Learning
UR - http://www.scopus.com/inward/record.url?scp=85098193731&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-1552
DO - 10.21437/Interspeech.2020-1552
M3 - Conference contribution
AN - SCOPUS:85098193731
SN - 9781713820697
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 2047
EP - 2051
BT - Interspeech 2020
PB - International Speech Communication Association
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -