TY - GEN
T1 - Explainable Stuttering Recognition Using Axial Attention
AU - Ma, Yu
AU - Huang, Yuting
AU - Yuan, Kaixiang
AU - Xuan, Guangzhe
AU - Yu, Yongzi
AU - Zhong, Hengrui
AU - Li, Rui
AU - Shen, Jian
AU - Qian, Kun
AU - Hu, Bin
AU - Schuller, Björn W.
AU - Yamamoto, Yoshiharu
N1 - Publisher Copyright: © 2023, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
PY - 2023
Y1 - 2023
N2 - Stuttering is a complex speech disorder that disrupts the flow of speech, and recognizing persons who stutter (PWS) and understanding their significant struggles is crucial. With advancements in computer vision, deep neural networks offer potential for recognizing stuttering events through image-based features. In this paper, we extract Wavelet Transformation (WT) and Histogram of Oriented Gradients (HOG) image features from audio signals. We also generate explainable images using Gradient-weighted Class Activation Mapping (Grad-CAM) as input for our final recognition model, an axial attention-based EfficientNetV2, which is trained on the Kassel State of Fluency Dataset (KSoF) to perform 8-class recognition. Our experiments achieved a relative increase of 4.4% in unweighted average recall (UAR) over the ComParE 2022 baseline, demonstrating that the axial attention-based EfficientNetV2, combined with the explainable input, can detect and recognize multiple types of stuttering.
AB - Stuttering is a complex speech disorder that disrupts the flow of speech, and recognizing persons who stutter (PWS) and understanding their significant struggles is crucial. With advancements in computer vision, deep neural networks offer potential for recognizing stuttering events through image-based features. In this paper, we extract Wavelet Transformation (WT) and Histogram of Oriented Gradients (HOG) image features from audio signals. We also generate explainable images using Gradient-weighted Class Activation Mapping (Grad-CAM) as input for our final recognition model, an axial attention-based EfficientNetV2, which is trained on the Kassel State of Fluency Dataset (KSoF) to perform 8-class recognition. Our experiments achieved a relative increase of 4.4% in unweighted average recall (UAR) over the ComParE 2022 baseline, demonstrating that the axial attention-based EfficientNetV2, combined with the explainable input, can detect and recognize multiple types of stuttering.
KW - Histogram of Oriented Gradient
KW - Speech
KW - Stuttering Recognition
KW - Wavelet Transformation
UR - http://www.scopus.com/inward/record.url?scp=85174804382&partnerID=8YFLogxK
U2 - 10.1007/978-981-99-4749-2_18
DO - 10.1007/978-981-99-4749-2_18
M3 - Conference contribution
AN - SCOPUS:85174804382
SN - 9789819947485
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 209
EP - 220
BT - Advanced Intelligent Computing Technology and Applications - 19th International Conference, ICIC 2023, Proceedings
A2 - Huang, De-Shuang
A2 - Premaratne, Prashan
A2 - Jin, Baohua
A2 - Qu, Boyang
A2 - Jo, Kang-Hyun
A2 - Hussain, Abir
PB - Springer Science and Business Media Deutschland GmbH
T2 - 19th International Conference on Intelligent Computing, ICIC 2023
Y2 - 10 August 2023 through 13 August 2023
ER -