TY - CHAP
T1 - Parallelising 2D-CNNs and Transformers
T2 - A Cognitive-based approach for Automatic Recognition of Learners’ English Proficiency
AU - Song, Meishu
AU - Parada-Cabaleiro, Emilia
AU - Yang, Zijiang
AU - Jing, Xin
AU - Togami, Kazumasa
AU - Qian, Kun
AU - Schuller, Björn W.
AU - Yamamoto, Yoshiharu
N1 - Publisher Copyright:
© 2022, AHFE International. All rights reserved.
PY - 2022
Y1 - 2022
N2 - Learning English as a foreign language requires an extensive use of cognitive capacity, memory, and motor skills in order to orally express one’s thoughts in a clear manner. Current speech recognition intelligence focuses on recognising learners’ oral proficiency from the perspectives of fluency, prosody, pronunciation, and grammar. However, the capacity of clearly and naturally expressing an idea is a high-level cognitive behaviour which can hardly be represented by these detailed and segmental dimensions, which indeed do not fulfil English learners’ and teachers’ requirements. This work aims to utilise state-of-the-art deep learning techniques to recognise English speaking proficiency at a cognitive level, i.e., a learner’s ability to clearly organise their own thoughts when expressing an idea in English as a foreign language. For this, we collected the “Oral English for Japanese Learners” Dataset (OEJL-DB), a corpus of recordings by 82 students of a Japanese high school expressing their ideas in English on 5 different topics. Annotations concerning the clarity of learners’ thoughts are given by 5 English teachers according to 2 classes: clear and unclear. In total, the dataset includes 7.6 hours of audio data with an average length for each oral English presentation of 66 seconds. As an initial cognitive-based method to identify learners’ speaking proficiency, we propose an architecture based on the parallelization of CNNs and Transformers. With the strengths of the CNNs in spatial feature representation and the Transformer in sequence encoding, we achieve an 89.4% accuracy and 87.6% Unweighted Average Recall (UAR), results which outperform those from the ResNet architectures (89.2% accuracy and 86.3% UAR). Our promising outcomes reveal that speech intelligence can be efficiently applied to “grasp” high-level cognitive behaviours, a new area of research which seems to have great potential for further investigation.
AB - Learning English as a foreign language requires an extensive use of cognitive capacity, memory, and motor skills in order to orally express one’s thoughts in a clear manner. Current speech recognition intelligence focuses on recognising learners’ oral proficiency from the perspectives of fluency, prosody, pronunciation, and grammar. However, the capacity of clearly and naturally expressing an idea is a high-level cognitive behaviour which can hardly be represented by these detailed and segmental dimensions, which indeed do not fulfil English learners’ and teachers’ requirements. This work aims to utilise state-of-the-art deep learning techniques to recognise English speaking proficiency at a cognitive level, i.e., a learner’s ability to clearly organise their own thoughts when expressing an idea in English as a foreign language. For this, we collected the “Oral English for Japanese Learners” Dataset (OEJL-DB), a corpus of recordings by 82 students of a Japanese high school expressing their ideas in English on 5 different topics. Annotations concerning the clarity of learners’ thoughts are given by 5 English teachers according to 2 classes: clear and unclear. In total, the dataset includes 7.6 hours of audio data with an average length for each oral English presentation of 66 seconds. As an initial cognitive-based method to identify learners’ speaking proficiency, we propose an architecture based on the parallelization of CNNs and Transformers. With the strengths of the CNNs in spatial feature representation and the Transformer in sequence encoding, we achieve an 89.4% accuracy and 87.6% Unweighted Average Recall (UAR), results which outperform those from the ResNet architectures (89.2% accuracy and 86.3% UAR). Our promising outcomes reveal that speech intelligence can be efficiently applied to “grasp” high-level cognitive behaviours, a new area of research which seems to have great potential for further investigation.
KW - CNNs
KW - English Speaking
KW - transformer
UR - https://www.scopus.com/pages/publications/105036061752
U2 - 10.54941/ahfe1001000
DO - 10.54941/ahfe1001000
M3 - Chapter
AN - SCOPUS:105036061752
T3 - Applied Human Factors and Ergonomics International
BT - Applied Human Factors and Ergonomics International
PB - AHFE International
ER -