TY - GEN
T1 - 3D Contextual Transformer & Double Inception Network for Human Action Recognition
AU - Liu, Enqi
AU - Hirota, Kaoru
AU - Liu, Chang
AU - Dai, Yaping
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - The 3D Contextual Transformer & Double Inception Network, called CoTDIL-Net, is proposed for human action recognition. A spatio-temporal enrichment module based on a 3D Contextual Transformer (CoT3D) is proposed to enhance the features of adjacent frames. In addition, 3D Inception and 2D Inception are combined to form a feature extractor, called DIFE, that captures short-term contextual features. Moreover, an LSTM is used to obtain long-term action change features, and a multi-stream input framework is introduced to obtain fuller contextual information. Unlike single-convolution methods, the network aims to obtain multi-scale spatio-temporal features: CoT3D combines contextual action information, DIFE captures short-term features, and the LSTM fuses long-term features. Experiments are carried out on a laptop with 32 GB of RAM and a GeForce RTX 3070 GPU with 8 GB of memory using the KTH dataset, and the results show a recognition accuracy of 97.2%. The obtained results indicate that the proposed CoTDIL-Net promotes the convolutional structure's understanding of human action changes.
AB - The 3D Contextual Transformer & Double Inception Network, called CoTDIL-Net, is proposed for human action recognition. A spatio-temporal enrichment module based on a 3D Contextual Transformer (CoT3D) is proposed to enhance the features of adjacent frames. In addition, 3D Inception and 2D Inception are combined to form a feature extractor, called DIFE, that captures short-term contextual features. Moreover, an LSTM is used to obtain long-term action change features, and a multi-stream input framework is introduced to obtain fuller contextual information. Unlike single-convolution methods, the network aims to obtain multi-scale spatio-temporal features: CoT3D combines contextual action information, DIFE captures short-term features, and the LSTM fuses long-term features. Experiments are carried out on a laptop with 32 GB of RAM and a GeForce RTX 3070 GPU with 8 GB of memory using the KTH dataset, and the results show a recognition accuracy of 97.2%. The obtained results indicate that the proposed CoTDIL-Net promotes the convolutional structure's understanding of human action changes.
KW - 2D Inception
KW - 3D Inception
KW - Contextual Transformer
KW - Human action recognition
KW - Long Short-Term Memory
KW - multi-stream input
UR - http://www.scopus.com/inward/record.url?scp=85181824416&partnerID=8YFLogxK
U2 - 10.1109/CCDC58219.2023.10326469
DO - 10.1109/CCDC58219.2023.10326469
M3 - Conference contribution
AN - SCOPUS:85181824416
T3 - Proceedings of the 35th Chinese Control and Decision Conference, CCDC 2023
SP - 1795
EP - 1800
BT - Proceedings of the 35th Chinese Control and Decision Conference, CCDC 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 35th Chinese Control and Decision Conference, CCDC 2023
Y2 - 20 May 2023 through 22 May 2023
ER -