Semi-Supervised Sound Event Detection with Pre-Trained Model

Liang Xu; Lizhong Wang; Sijun Bi; Hanyue Liu; Jing Wang

doi:10.1109/ICASSP49357.2023.10095687

Liang Xu^*, Lizhong Wang, Sijun Bi^*, Hanyue Liu^*, Jing Wang^*

^*Corresponding author for this work

School of Information and Electronics

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

11 Citations (Scopus)

Abstract

Sound event detection (SED) is an interesting but challenging task because data are scarce and real-life sound events are diverse. In this paper, we focus on the semi-supervised SED task and incorporate a pre-trained model from another field to improve detection performance. Pre-trained models have been widely used in various tasks in the field of speech, such as automatic speech recognition and audio tagging. If the training dataset is large and general enough, the embedding features extracted by the pre-trained model will cover the potential information in the original task. We use the pre-trained model PANNs, which is suitable for the SED task, and propose two methods to fuse the features from PANNs with those of the original model. In addition, we propose a weight-raised temporal contrastive loss to improve the model's switching speed at event boundaries and the smoothness within events. Experimental results show that using pre-trained model features outperforms the baseline by 8.5% and 9.1% on the DESED public evaluation dataset in terms of polyphonic sound detection score (PSDS).
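This record does not say what the two fusion methods are. As a purely illustrative sketch, the following assumes one simple option: frame-level concatenation, after repeating the coarser pre-trained embedding sequence so that it matches the time resolution of the SED features. The function names `upsample_repeat` and `fuse_by_concatenation` and the toy dimensions are hypothetical and are not taken from the paper.

```python
from typing import List

Frame = List[float]


def upsample_repeat(frames: List[Frame], target_len: int) -> List[Frame]:
    """Stretch a sequence to target_len frames by nearest-neighbour repetition."""
    if not frames or target_len <= 0:
        raise ValueError("need at least one frame and a positive target length")
    n = len(frames)
    # Map each output index back to a source index; clamp to the last frame.
    return [frames[min(i * n // target_len, n - 1)] for i in range(target_len)]


def fuse_by_concatenation(sed_frames: List[Frame],
                          pretrained_frames: List[Frame]) -> List[Frame]:
    """Concatenate each SED frame with its time-aligned pre-trained embedding.

    Assumption: fusion happens along the feature axis, one frame at a time.
    """
    aligned = upsample_repeat(pretrained_frames, len(sed_frames))
    return [s + p for s, p in zip(sed_frames, aligned)]


# Toy input: 4 frames of 2-dim SED features, 2 frames of 3-dim embeddings.
sed = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
emb = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
fused = fuse_by_concatenation(sed, emb)
```

In this toy case each fused frame has 2 + 3 = 5 values, and each embedding frame is reused for two consecutive SED frames. A real system would operate on tensors and would likely learn a projection after the concatenation.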

Original language	English
Title of host publication	ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)	9781728163277
DOIs	https://doi.org/10.1109/ICASSP49357.2023.10095687
Publication status	Published - 2023
Event	48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023 - Rhodes Island, Greece; Duration: 4 Jun 2023 → 10 Jun 2023

Publication series

Name	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume	2023-June
ISSN (Print)	1520-6149

Conference

Conference	48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
Country/Territory	Greece
City	Rhodes Island
Period	4 Jun 2023 → 10 Jun 2023

Keywords

mean-teacher
pre-trained
sound event detection
temporal contrastive loss

Access to Document

10.1109/ICASSP49357.2023.10095687

Cite this

Xu, L., Wang, L., Bi, S., Liu, H., & Wang, J. (2023). Semi-Supervised Sound Event Detection with Pre-Trained Model. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2023-June). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP49357.2023.10095687

Xu, Liang ; Wang, Lizhong ; Bi, Sijun et al. / Semi-Supervised Sound Event Detection with Pre-Trained Model. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings. Institute of Electrical and Electronics Engineers Inc., 2023. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings).

@inproceedings{3c613e81af964b2bbbee3beb72839e88,
title = "Semi-Supervised Sound Event Detection with Pre-Trained Model",
abstract = "Sound event detection (SED) is an interesting but challenging task because data are scarce and real-life sound events are diverse. In this paper, we focus on the semi-supervised SED task and incorporate a pre-trained model from another field to improve detection performance. Pre-trained models have been widely used in various tasks in the field of speech, such as automatic speech recognition and audio tagging. If the training dataset is large and general enough, the embedding features extracted by the pre-trained model will cover the potential information in the original task. We use the pre-trained model PANNs, which is suitable for the SED task, and propose two methods to fuse the features from PANNs with those of the original model. In addition, we propose a weight-raised temporal contrastive loss to improve the model's switching speed at event boundaries and the smoothness within events. Experimental results show that using pre-trained model features outperforms the baseline by 8.5\% and 9.1\% on the DESED public evaluation dataset in terms of polyphonic sound detection score (PSDS).",
keywords = "mean-teacher, pre-trained, sound event detection, temporal contrastive loss",
author = "Liang Xu and Lizhong Wang and Sijun Bi and Hanyue Liu and Jing Wang",
note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023 ; Conference date: 04-06-2023 Through 10-06-2023",
year = "2023",
doi = "10.1109/ICASSP49357.2023.10095687",
language = "English",
series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
booktitle = "ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings",
address = "United States",
}

Xu, L, Wang, L, Bi, S, Liu, H & Wang, J 2023, Semi-Supervised Sound Event Detection with Pre-Trained Model. in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2023-June, Institute of Electrical and Electronics Engineers Inc., 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023, Rhodes Island, Greece, 4/06/23. https://doi.org/10.1109/ICASSP49357.2023.10095687

Semi-Supervised Sound Event Detection with Pre-Trained Model. / Xu, Liang; Wang, Lizhong; Bi, Sijun et al.
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings. Institute of Electrical and Electronics Engineers Inc., 2023. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2023-June).


TY  - GEN
T1  - Semi-Supervised Sound Event Detection with Pre-Trained Model
AU  - Xu, Liang
AU  - Wang, Lizhong
AU  - Bi, Sijun
AU  - Liu, Hanyue
AU  - Wang, Jing
PY  - 2023
Y1  - 2023
N2  - Sound event detection (SED) is an interesting but challenging task because data are scarce and real-life sound events are diverse. In this paper, we focus on the semi-supervised SED task and incorporate a pre-trained model from another field to improve detection performance. Pre-trained models have been widely used in various tasks in the field of speech, such as automatic speech recognition and audio tagging. If the training dataset is large and general enough, the embedding features extracted by the pre-trained model will cover the potential information in the original task. We use the pre-trained model PANNs, which is suitable for the SED task, and propose two methods to fuse the features from PANNs with those of the original model. In addition, we propose a weight-raised temporal contrastive loss to improve the model's switching speed at event boundaries and the smoothness within events. Experimental results show that using pre-trained model features outperforms the baseline by 8.5% and 9.1% on the DESED public evaluation dataset in terms of polyphonic sound detection score (PSDS).
AB  - Sound event detection (SED) is an interesting but challenging task because data are scarce and real-life sound events are diverse. In this paper, we focus on the semi-supervised SED task and incorporate a pre-trained model from another field to improve detection performance. Pre-trained models have been widely used in various tasks in the field of speech, such as automatic speech recognition and audio tagging. If the training dataset is large and general enough, the embedding features extracted by the pre-trained model will cover the potential information in the original task. We use the pre-trained model PANNs, which is suitable for the SED task, and propose two methods to fuse the features from PANNs with those of the original model. In addition, we propose a weight-raised temporal contrastive loss to improve the model's switching speed at event boundaries and the smoothness within events. Experimental results show that using pre-trained model features outperforms the baseline by 8.5% and 9.1% on the DESED public evaluation dataset in terms of polyphonic sound detection score (PSDS).
KW  - mean-teacher
KW  - pre-trained
KW  - sound event detection
KW  - temporal contrastive loss
UR  - http://www.scopus.com/inward/record.url?scp=85177596038&partnerID=8YFLogxK
U2  - 10.1109/ICASSP49357.2023.10095687
DO  - 10.1109/ICASSP49357.2023.10095687
M3  - Conference contribution
AN  - SCOPUS:85177596038
T3  - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
BT  - ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings
PB  - Institute of Electrical and Electronics Engineers Inc.
T2  - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
Y2  - 4 June 2023 through 10 June 2023
ER  -

Xu L, Wang L, Bi S, Liu H, Wang J. Semi-Supervised Sound Event Detection with Pre-Trained Model. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings. Institute of Electrical and Electronics Engineers Inc. 2023. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). doi: 10.1109/ICASSP49357.2023.10095687
