
Parallelising 2D-CNNs and Transformers: A Cognitive-based approach for Automatic Recognition of Learners’ English Proficiency

  • Meishu Song
  • Emilia Parada-Cabaleiro
  • Zijiang Yang
  • Xin Jing
  • Kazumasa Togami
  • Kun Qian*
  • Björn W. Schuller
  • Yoshiharu Yamamoto
  • *Corresponding author for this work
  • Augsburg University
  • The University of Tokyo
  • Johannes Kepler University Linz
  • Beijing Institute of Technology
  • Imperial College London

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review

Abstract

Learning English as a foreign language requires extensive use of cognitive capacity, memory, and motor skills in order to orally express one’s thoughts in a clear manner. Current speech recognition intelligence focuses on recognising learners’ oral proficiency from the perspectives of fluency, prosody, pronunciation, and grammar. However, the capacity to express an idea clearly and naturally is a high-level cognitive behaviour that can hardly be represented by these detailed, segmental dimensions, which do not fulfil the requirements of English learners and teachers. This work aims to utilise state-of-the-art deep learning techniques to recognise English speaking proficiency at a cognitive level, i.e., a learner’s ability to clearly organise their own thoughts when expressing an idea in English as a foreign language. For this, we collected the “Oral English for Japanese Learners” Dataset (OEJL-DB), a corpus of recordings by 82 students of a Japanese high school expressing their ideas in English on 5 different topics. Annotations concerning the clarity of learners’ thoughts are given by 5 English teachers according to 2 classes: clear and unclear. In total, the dataset includes 7.6 hours of audio data, with an average length of 66 seconds for each oral English presentation. As an initial cognitive-based method to identify learners’ speaking proficiency, we propose an architecture based on the parallelisation of CNNs and Transformers. Combining the strength of CNNs in spatial feature representation with that of the Transformer in sequence encoding, we achieve an accuracy of 89.4% and an Unweighted Average Recall (UAR) of 87.6%, results which outperform those of the ResNet architectures (89.2% accuracy and 86.3% UAR). Our promising outcomes reveal that speech intelligence can be efficiently applied to “grasp” high-level cognitive behaviours, a new area of research which seems to have great potential for further investigation.
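
The abstract reports both accuracy and Unweighted Average Recall (UAR). The following is a minimal sketch of how the two metrics can diverge on a two-class task, assuming UAR is computed in the usual way as the unweighted mean of per-class recalls. The function names and the toy labels are hypothetical illustrations; they are not taken from OEJL-DB or from the authors' evaluation code.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the reference."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally,
    regardless of how many samples it has."""
    total = defaultdict(int)  # samples per reference class
    hits = defaultdict(int)   # correctly recognised samples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, deliberately imbalanced toy labels (not real data).
y_true = ["clear"] * 8 + ["unclear"] * 2
y_pred = ["clear"] * 8 + ["clear", "unclear"]

print(accuracy(y_true, y_pred))                   # 9 of 10 correct
print(unweighted_average_recall(y_true, y_pred))  # mean of 8/8 and 1/2
```

On imbalanced data such as this toy example, accuracy is dominated by the majority class while UAR penalises the missed minority-class sample, which is presumably why both figures are reported.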

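Original language: English
Title of host publication: Applied Human Factors and Ergonomics International
Publisher: AHFE International
Edition: 22
DOI
Publication status: Published - 2022
Externally published: Yes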

Publication series

Name: Applied Human Factors and Ergonomics International
Number: 22
ISSN (Electronic): 2771-0718
