TY - GEN
T1 - Are You Speaking with a Mask? An Investigation on Attention Based Deep Temporal Convolutional Neural Networks for Mask Detection Task
AU - Qiao, Yu
AU - Qian, Kun
AU - Zhao, Ziping
AU - Zhao, Xiaojing
N1 - Publisher Copyright:
© 2021, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
PY - 2021
Y1 - 2021
N2 - At the time of writing, COVID-19, as a global pandemic, has affected more than 200 countries and territories and led to more than 694,000 deaths. Wearing a mask is one of the most convenient, cheap, and efficient precautions. Moreover, guaranteeing speech quality when a mask is worn is crucial in real-world telecommunication technologies. To this end, the goal of the ComParE 2020 Mask condition recognition of speakers sub-challenge is to recognize whether speakers are wearing facial masks. In this work, we present three modeling methods under the deep neural network framework, namely a Convolutional Recurrent Neural Network (CRNN), a Convolutional Temporal Convolutional Network (CTCNs), and CTCNs combined with utterance-level features. Furthermore, we use a cycle mode to pad the samples to further enhance system performance. In the CTCNs model, we evaluated different network depths. Finally, the experimental results demonstrate the effectiveness of the CTCNs network structure, which reaches an unweighted average recall (UAR) of 66.4% on the development set. This exceeds the baseline result of 64.4% obtained with the S2SAE+SVM network (significant at p < 0.001 by a one-tailed z-test), demonstrating the good performance of our proposed network.
AB - At the time of writing, COVID-19, as a global pandemic, has affected more than 200 countries and territories and led to more than 694,000 deaths. Wearing a mask is one of the most convenient, cheap, and efficient precautions. Moreover, guaranteeing speech quality when a mask is worn is crucial in real-world telecommunication technologies. To this end, the goal of the ComParE 2020 Mask condition recognition of speakers sub-challenge is to recognize whether speakers are wearing facial masks. In this work, we present three modeling methods under the deep neural network framework, namely a Convolutional Recurrent Neural Network (CRNN), a Convolutional Temporal Convolutional Network (CTCNs), and CTCNs combined with utterance-level features. Furthermore, we use a cycle mode to pad the samples to further enhance system performance. In the CTCNs model, we evaluated different network depths. Finally, the experimental results demonstrate the effectiveness of the CTCNs network structure, which reaches an unweighted average recall (UAR) of 66.4% on the development set. This exceeds the baseline result of 64.4% obtained with the S2SAE+SVM network (significant at p < 0.001 by a one-tailed z-test), demonstrating the good performance of our proposed network.
KW - Computational paralinguistics
KW - Deep learning framework
KW - Mask condition recognition
KW - Speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85105893175&partnerID=8YFLogxK
U2 - 10.1007/978-981-16-1649-5_14
DO - 10.1007/978-981-16-1649-5_14
M3 - Conference contribution
AN - SCOPUS:85105893175
SN - 9789811616488
T3 - Lecture Notes in Electrical Engineering
SP - 163
EP - 174
BT - Proceedings of the 8th Conference on Sound and Music Technology - Selected Papers from CSMT
A2 - Shao, Xi
A2 - Qian, Kun
A2 - Zhou, Li
A2 - Wang, Xin
A2 - Zhao, Ziping
PB - Springer Science and Business Media Deutschland GmbH
T2 - 8th Conference on Sound and Music Technology, CSMT 2020
Y2 - 5 November 2020 through 8 November 2020
ER -