TY - GEN
T1 - Are You Speaking with a Mask? An Investigation on Attention Based Deep Temporal Convolutional Neural Networks for Mask Detection Task
AU - Qiao, Yu
AU - Qian, Kun
AU - Zhao, Ziping
AU - Zhao, Xiaojing
N1 - Publisher Copyright:
© 2021, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
PY - 2021
Y1 - 2021
N2 - At the time of writing, COVID-19, as a global pandemic, has affected more than 200 countries and territories and led to more than 694,000 deaths. Wearing a mask is one of the most convenient, cheap, and efficient precautions. Moreover, guaranteeing speech quality when a mask is worn is crucial in real-world telecommunication technologies. To this end, the goal of the ComParE 2020 Mask condition recognition of speakers sub-challenge is to recognize whether speakers are wearing facial masks. In this work, we present three modeling methods under the deep neural network framework, namely a Convolutional Recurrent Neural Network (CRNN), a Convolutional Temporal Convolutional Network (CTCNs), and CTCNs combined with utterance-level features. Furthermore, we use a cycle mode to pad the samples to further enhance system performance. In the CTCNs model, we evaluated different network depths. Finally, the experimental results demonstrate the effectiveness of the CTCNs network structure, which reaches an unweighted average recall (UAR) of 66.4% on the development set. This exceeds the baseline result of 64.4% obtained with the S2SAE+SVM network (significant at p < 0.001 by a one-tailed z-test), demonstrating the good performance of our proposed network.
AB - At the time of writing, COVID-19, as a global pandemic, has affected more than 200 countries and territories and led to more than 694,000 deaths. Wearing a mask is one of the most convenient, cheap, and efficient precautions. Moreover, guaranteeing speech quality when a mask is worn is crucial in real-world telecommunication technologies. To this end, the goal of the ComParE 2020 Mask condition recognition of speakers sub-challenge is to recognize whether speakers are wearing facial masks. In this work, we present three modeling methods under the deep neural network framework, namely a Convolutional Recurrent Neural Network (CRNN), a Convolutional Temporal Convolutional Network (CTCNs), and CTCNs combined with utterance-level features. Furthermore, we use a cycle mode to pad the samples to further enhance system performance. In the CTCNs model, we evaluated different network depths. Finally, the experimental results demonstrate the effectiveness of the CTCNs network structure, which reaches an unweighted average recall (UAR) of 66.4% on the development set. This exceeds the baseline result of 64.4% obtained with the S2SAE+SVM network (significant at p < 0.001 by a one-tailed z-test), demonstrating the good performance of our proposed network.
KW - Computational paralinguistics
KW - Deep learning framework
KW - Mask condition recognition
KW - Speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85105893175&partnerID=8YFLogxK
U2 - 10.1007/978-981-16-1649-5_14
DO - 10.1007/978-981-16-1649-5_14
M3 - Conference contribution
AN - SCOPUS:85105893175
SN - 9789811616488
T3 - Lecture Notes in Electrical Engineering
SP - 163
EP - 174
BT - Proceedings of the 8th Conference on Sound and Music Technology - Selected Papers from CSMT
A2 - Shao, Xi
A2 - Qian, Kun
A2 - Zhou, Li
A2 - Wang, Xin
A2 - Zhao, Ziping
PB - Springer Science and Business Media Deutschland GmbH
T2 - 8th Conference on Sound and Music Technology, CSMT 2020
Y2 - 5 November 2020 through 8 November 2020
ER -