Capturing Time Dynamics From Speech Using Neural Networks for Surgical Mask Detection

Shuo Liu; Adria Mallol-Ragolta; Tianhao Yan; Kun Qian; Emilia Parada-Cabaleiro; Bin Hu; Björn W. Schuller

doi:10.1109/JBHI.2022.3173128

Shuo Liu^*, Adria Mallol-Ragolta, Tianhao Yan, Kun Qian^*, Emilia Parada-Cabaleiro, Bin Hu^*, Björn W. Schuller

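^*Corresponding author for this work

School of Medical Technology

Research output: Contribution to journal › Article › Peer-reviewed

4 Citations (Scopus)

Abstract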

The importance of detecting whether a person wears a face mask while speaking has tremendously increased since the outbreak of SARS-CoV-2 (COVID-19), as wearing a mask can help to reduce the spread of the virus and mitigate the public health crisis. Besides affecting human speech characteristics related to frequency, face masks cause temporal interferences in speech, altering the pace, rhythm, and pronunciation speed. In this regard, this paper presents two effective neural network models to detect surgical masks from audio. The proposed architectures are both based on Convolutional Neural Networks (CNNs), chosen as an optimal approach for the spatial processing of the audio signals. One architecture applies a Long Short-Term Memory (LSTM) network to model the time-dependencies. Through an additional attention mechanism, the LSTM-based architecture enables the extraction of more salient temporal information. The other architecture (named ConvTx) retrieves the relative position of a sequence through the positional encoder of a transformer module. In order to assess to what extent both architectures can complement each other when modelling temporal dynamics, we also explore the combination of LSTM and Transformers in three hybrid models. Finally, we also investigate whether data augmentation techniques, such as using transitions between audio frames and considering gender-dependent frameworks, might impact the performance of the proposed architectures. Our experimental results show that one of the hybrid models achieves the best performance, surpassing existing state-of-the-art results for the task at hand.
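
The abstract does not say which positional encoder ConvTx uses. As a rough illustration only, the sketch below assumes the fixed sinusoidal encoding of the original Transformer, added element-wise to a sequence of per-frame feature vectors. The function names `sinusoidal_positional_encoding` and `add_positions`, and the requirement that the feature dimension be even, are assumptions made for this example and are not taken from the paper.

```python
import math


def sinusoidal_positional_encoding(num_frames, d_model):
    """Return a num_frames x d_model table of sinusoidal position codes.

    Assumption: the standard Transformer formulation,
    PE(pos, 2k) = sin(pos / 10000^(2k/d)), PE(pos, 2k+1) = cos(same angle).
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even in this sketch")
    table = []
    for pos in range(num_frames):
        row = []
        # i walks over the even feature indices 0, 2, 4, ...
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even feature index
            row.append(math.cos(angle))  # following odd feature index
        table.append(row)
    return table


def add_positions(frames):
    """Add position codes element-wise to equal-length frame feature vectors."""
    codes = sinusoidal_positional_encoding(len(frames), len(frames[0]))
    return [
        [x + p for x, p in zip(frame, code)]
        for frame, code in zip(frames, codes)
    ]
```

In a setup like the one described, `frames` would stand for the per-frame features produced by the convolutional front end, and the position-augmented sequence would then be passed to the attention layers.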

Original language	English
Pages (from-to)	4291-4302
Number of pages	12
Journal	IEEE Journal of Biomedical and Health Informatics
Volume	26
Issue number	8
DOI	https://doi.org/10.1109/JBHI.2022.3173128
Publication status	Published - 1 Aug 2022

Access to Document

10.1109/JBHI.2022.3173128

Cite this

Liu, S., Mallol-Ragolta, A., Yan, T., Qian, K., Parada-Cabaleiro, E., Hu, B., & Schuller, B. W. (2022). Capturing Time Dynamics From Speech Using Neural Networks for Surgical Mask Detection. IEEE Journal of Biomedical and Health Informatics, 26(8), 4291-4302. https://doi.org/10.1109/JBHI.2022.3173128

@article{f712258f20f04ae2b784bfb282cdf037,

title = "Capturing Time Dynamics From Speech Using Neural Networks for Surgical Mask Detection",

abstract = "The importance of detecting whether a person wears a face mask while speaking has tremendously increased since the outbreak of SARS-CoV-2 (COVID-19), as wearing a mask can help to reduce the spread of the virus and mitigate the public health crisis. Besides affecting human speech characteristics related to frequency, face masks cause temporal interferences in speech, altering the pace, rhythm, and pronunciation speed. In this regard, this paper presents two effective neural network models to detect surgical masks from audio. The proposed architectures are both based on Convolutional Neural Networks (CNNs), chosen as an optimal approach for the spatial processing of the audio signals. One architecture applies a Long Short-Term Memory (LSTM) network to model the time-dependencies. Through an additional attention mechanism, the LSTM-based architecture enables the extraction of more salient temporal information. The other architecture (named ConvTx) retrieves the relative position of a sequence through the positional encoder of a transformer module. In order to assess to which extent both architectures can complement each other when modelling temporal dynamics, we also explore the combination of LSTM and Transformers in three hybrid models. Finally, we also investigate whether data augmentation techniques, such as, using transitions between audio frames and considering gender-dependent frameworks might impact the performance of the proposed architectures. Our experimental results show that one of the hybrid models achieves the best performance, surpassing existing state-of-the-art results for the task at hand.",

keywords = "Face mask detection, audio processing, convolutional recurrent neural network, convolutional transformer network, multi-head attention",

author = "Shuo Liu and Adria Mallol-Ragolta and Tianhao Yan and Kun Qian and Emilia Parada-Cabaleiro and Bin Hu and Schuller, {Bj{\"o}rn W.}",

note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",

year = "2022",

month = aug,

day = "1",

doi = "10.1109/JBHI.2022.3173128",

language = "English",

volume = "26",

pages = "4291--4302",

journal = "IEEE Journal of Biomedical and Health Informatics",

issn = "2168-2194",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "8",

}

TY - JOUR

T1 - Capturing Time Dynamics From Speech Using Neural Networks for Surgical Mask Detection

AU - Liu, Shuo

AU - Mallol-Ragolta, Adria

AU - Yan, Tianhao

AU - Qian, Kun

AU - Parada-Cabaleiro, Emilia

AU - Hu, Bin

AU - Schuller, Björn W.

PY - 2022/8/1

Y1 - 2022/8/1

N2 - The importance of detecting whether a person wears a face mask while speaking has tremendously increased since the outbreak of SARS-CoV-2 (COVID-19), as wearing a mask can help to reduce the spread of the virus and mitigate the public health crisis. Besides affecting human speech characteristics related to frequency, face masks cause temporal interferences in speech, altering the pace, rhythm, and pronunciation speed. In this regard, this paper presents two effective neural network models to detect surgical masks from audio. The proposed architectures are both based on Convolutional Neural Networks (CNNs), chosen as an optimal approach for the spatial processing of the audio signals. One architecture applies a Long Short-Term Memory (LSTM) network to model the time-dependencies. Through an additional attention mechanism, the LSTM-based architecture enables the extraction of more salient temporal information. The other architecture (named ConvTx) retrieves the relative position of a sequence through the positional encoder of a transformer module. In order to assess to which extent both architectures can complement each other when modelling temporal dynamics, we also explore the combination of LSTM and Transformers in three hybrid models. Finally, we also investigate whether data augmentation techniques, such as, using transitions between audio frames and considering gender-dependent frameworks might impact the performance of the proposed architectures. Our experimental results show that one of the hybrid models achieves the best performance, surpassing existing state-of-the-art results for the task at hand.

AB - The importance of detecting whether a person wears a face mask while speaking has tremendously increased since the outbreak of SARS-CoV-2 (COVID-19), as wearing a mask can help to reduce the spread of the virus and mitigate the public health crisis. Besides affecting human speech characteristics related to frequency, face masks cause temporal interferences in speech, altering the pace, rhythm, and pronunciation speed. In this regard, this paper presents two effective neural network models to detect surgical masks from audio. The proposed architectures are both based on Convolutional Neural Networks (CNNs), chosen as an optimal approach for the spatial processing of the audio signals. One architecture applies a Long Short-Term Memory (LSTM) network to model the time-dependencies. Through an additional attention mechanism, the LSTM-based architecture enables the extraction of more salient temporal information. The other architecture (named ConvTx) retrieves the relative position of a sequence through the positional encoder of a transformer module. In order to assess to which extent both architectures can complement each other when modelling temporal dynamics, we also explore the combination of LSTM and Transformers in three hybrid models. Finally, we also investigate whether data augmentation techniques, such as, using transitions between audio frames and considering gender-dependent frameworks might impact the performance of the proposed architectures. Our experimental results show that one of the hybrid models achieves the best performance, surpassing existing state-of-the-art results for the task at hand.

KW - Face mask detection

KW - audio processing

KW - convolutional recurrent neural network

KW - convolutional transformer network

KW - multi-head attention

UR - http://www.scopus.com/inward/record.url?scp=85132531912&partnerID=8YFLogxK

U2 - 10.1109/JBHI.2022.3173128

DO - 10.1109/JBHI.2022.3173128

M3 - Article

C2 - 35522639

AN - SCOPUS:85132531912

SN - 2168-2194

VL - 26

SP - 4291

EP - 4302

JO - IEEE Journal of Biomedical and Health Informatics

JF - IEEE Journal of Biomedical and Health Informatics

IS - 8

ER -