TY - GEN
T1 - Semi-Supervised Sound Event Detection with Pre-Trained Model
AU - Xu, Liang
AU - Wang, Lizhong
AU - Bi, Sijun
AU - Liu, Hanyue
AU - Wang, Jing
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Sound event detection (SED) is an interesting but challenging task due to the scarcity of data and the diversity of sound events in real life. In this paper, we focus on the semi-supervised SED task and incorporate a pre-trained model from another field to help improve detection performance. Pre-trained models have been widely used in various speech tasks, such as automatic speech recognition and audio tagging. If the training dataset is large and general enough, the embedding features extracted by the pre-trained model will cover the latent information relevant to the original task. We use the pre-trained model PANNs, which is well suited to the SED task, and propose two methods to fuse the features from PANNs with those of the original model. In addition, we propose a weight-raised temporal contrastive loss to improve the model's switching speed at event boundaries and the smoothness within events. Experimental results show that using pre-trained model features outperforms the baseline by 8.5% and 9.1% on the DESED public evaluation dataset in terms of the polyphonic sound detection score (PSDS).
AB - Sound event detection (SED) is an interesting but challenging task due to the scarcity of data and the diversity of sound events in real life. In this paper, we focus on the semi-supervised SED task and incorporate a pre-trained model from another field to help improve detection performance. Pre-trained models have been widely used in various speech tasks, such as automatic speech recognition and audio tagging. If the training dataset is large and general enough, the embedding features extracted by the pre-trained model will cover the latent information relevant to the original task. We use the pre-trained model PANNs, which is well suited to the SED task, and propose two methods to fuse the features from PANNs with those of the original model. In addition, we propose a weight-raised temporal contrastive loss to improve the model's switching speed at event boundaries and the smoothness within events. Experimental results show that using pre-trained model features outperforms the baseline by 8.5% and 9.1% on the DESED public evaluation dataset in terms of the polyphonic sound detection score (PSDS).
KW - mean-teacher
KW - pre-trained
KW - sound event detection
KW - temporal contrastive loss
UR - http://www.scopus.com/inward/record.url?scp=85177596038&partnerID=8YFLogxK
U2 - 10.1109/ICASSP49357.2023.10095687
DO - 10.1109/ICASSP49357.2023.10095687
M3 - Conference contribution
AN - SCOPUS:85177596038
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
BT - ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
Y2 - 4 June 2023 through 10 June 2023
ER -