3D Contextual Transformer & Double Inception Network for Human Action Recognition

Enqi Liu, Kaoru Hirota, Chang Liu, Yaping Dai*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

A 3D Contextual Transformer & Double Inception Network, called CoTDIL-Net, is proposed for human action recognition. A spatio-temporal enrichment module based on a 3D Contextual Transformer (CoT3D) enhances the features of adjacent frames. In addition, 3D Inception and 2D Inception are combined to form a feature extractor, called DIFE, that captures short-term contextual features. An LSTM is then used to obtain long-term action change features, and a multi-stream input framework is introduced to obtain fuller contextual information. Compared with single-convolution methods, the network aims to obtain multi-scale spatio-temporal features: CoT3D combines contextual action information, DIFE captures short-term features, and the LSTM fuses long-term features. Experiments are carried out on the KTH dataset using a laptop with 32 GB of RAM and a GeForce RTX 3070 GPU with 8 GB of memory, and the results show a recognition accuracy of 97.2%. These results indicate that the proposed CoTDIL-Net improves the ability of convolutional structures to understand changes in human actions.
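
The multi-scale idea behind an Inception-style extractor can be illustrated with a toy one-dimensional sketch. This is not the DIFE module itself, whose layer configuration the abstract does not give; the function names, the kernel widths (1, 3, 5), and the use of plain averaging filters are all assumptions chosen for illustration. The sketch runs several filters of different widths in parallel over the same input and stacks their outputs as separate channels, which is the general pattern Inception blocks use to view one signal at several scales at once.

```python
from typing import List, Sequence


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Zero-padded 1D correlation whose output has the same length as the input.

    Hypothetical helper for illustration; requires an odd kernel width so the
    padding is symmetric.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel width must be odd for symmetric 'same' padding")
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def inception_branches(
    signal: Sequence[float], kernel_sizes: Sequence[int] = (1, 3, 5)
) -> List[List[float]]:
    """Apply parallel averaging filters of several widths to one signal.

    Each branch sees the same input at a different scale; the branch outputs
    are stacked as channels (one list per branch), mirroring how an Inception
    block concatenates its parallel paths. The kernel sizes are assumed values.
    """
    return [conv1d_same(signal, [1.0 / k] * k) for k in kernel_sizes]


if __name__ == "__main__":
    impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
    for width, channel in zip((1, 3, 5), inception_branches(impulse)):
        print(width, channel)
```

With a unit impulse as input, the width-1 branch leaves the signal unchanged, while the width-3 and width-5 branches spread it over progressively wider neighbourhoods, so the stacked output describes the same event at three scales.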

Original language: English
Title of host publication: Proceedings of the 35th Chinese Control and Decision Conference, CCDC 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1795-1800
Number of pages: 6
ISBN (Electronic): 9798350334722
DOIs
Publication status: Published - 2023
Event: 35th Chinese Control and Decision Conference, CCDC 2023 - Yichang, China
Duration: 20 May 2023 to 22 May 2023

Publication series

Name: Proceedings of the 35th Chinese Control and Decision Conference, CCDC 2023

Conference

Conference: 35th Chinese Control and Decision Conference, CCDC 2023
Country/Territory: China
City: Yichang
Period: 20/05/23 to 22/05/23

Keywords

  • 2D Inception
  • 3D Inception
  • Contextual Transformer
  • Human action recognition
  • Long Short-Term Memory
  • Multi-stream input
