Bidirectional LSTM with saliency-aware 3D-CNN features for human action recognition

Sheeraz Arif; Jing Wang; Adnan Ahmed Siddiqui; Rashid Hussain; Fida Hussain

doi:10.36909/jer.v9i3A.8383

Corresponding author: Sheeraz Arif

School of Information and Electronics

Research output: Contribution to journal › Article › peer-review

8 citations (Scopus)

Abstract

Deep convolutional neural networks (DCNNs) and recurrent neural networks (RNNs) have become an important research area in multimedia understanding and have achieved remarkable action recognition performance. However, videos contain rich motion information with varying dimensions. Existing recurrent-based pipelines fail to capture long-term motion dynamics in videos with various motion scales and complex actions performed by multiple actors. Considering contextual and salient features matters more than mapping a video frame into a static video representation. This work proposes a novel pipeline that analyzes and processes video information using a 3D convolutional (C3D) network and a newly introduced deep bidirectional LSTM. Like the popular two-stream ConvNet, we introduce a two-stream framework with one modification: we replace the optical flow stream with a saliency-aware stream to avoid its computational complexity. First, we generate a saliency-aware video stream by applying the saliency-aware method. Second, a two-stream 3D convolutional network (C3D) is applied to two different types of streams, i.e., the RGB stream and the saliency-aware video stream, to collect both spatial and semantic temporal features. Next, a deep bidirectional LSTM network is used to learn sequential deep temporal dynamics. Finally, a time-series pooling layer and a softmax layer classify human activity and behavior. The introduced system can learn long-term temporal dependencies and can predict complex human actions. Experimental results demonstrate a significant improvement in action recognition accuracy on several benchmark datasets.
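
The data flow described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several things are assumptions made only for illustration: the two streams are fused by concatenating per-clip feature vectors, a plain tanh recurrence with scalar weights stands in for the LSTM cell, the time-series pooling is a mean over time steps, and all function names, weights, and toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(rgb_feats, saliency_feats):
    """Concatenate per-clip features of the two streams (assumed fusion)."""
    return [r + s for r, s in zip(rgb_feats, saliency_feats)]


def recurrent_pass(sequence, w_in, w_rec):
    """Element-wise tanh recurrence; a simple stand-in for an LSTM cell."""
    hidden = [0.0] * len(sequence[0])
    outputs = []
    for step in sequence:
        hidden = [math.tanh(w_in * x + w_rec * h) for x, h in zip(step, hidden)]
        outputs.append(hidden)
    return outputs


def bidirectional_pass(sequence, w_in=0.5, w_rec=0.5):
    """Run the recurrence forward and backward, then concatenate per step."""
    forward = recurrent_pass(sequence, w_in, w_rec)
    backward = recurrent_pass(sequence[::-1], w_in, w_rec)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def mean_pool_over_time(sequence):
    """Average each feature over all time steps (assumed pooling)."""
    steps = len(sequence)
    return [sum(column) / steps for column in zip(*sequence)]


def classify(pooled, weights, bias):
    """Linear layer followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Toy input: 3 clips, 2 features per stream per clip (made-up numbers that
# stand in for C3D features of the RGB and saliency-aware streams).
rgb_feats = [[0.2, 0.1], [0.4, 0.3], [0.6, 0.5]]
saliency_feats = [[0.0, 0.9], [0.1, 0.8], [0.2, 0.7]]

fused = fuse_streams(rgb_feats, saliency_feats)   # 3 steps x 4 features
context = bidirectional_pass(fused)               # 3 steps x 8 features
pooled = mean_pool_over_time(context)             # 8 features
weights = [[0.1 * (c + 1)] * len(pooled) for c in range(3)]  # 3 hypothetical classes
bias = [0.0, 0.0, 0.0]
probs = classify(pooled, weights, bias)
print(probs)
```

The backward pass sees later clips first, so each time step's output combines context from both the past and the future of the sequence, which is the property the bidirectional design relies on.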

Original language	English
Pages (from-to)	115-133
Number of pages	19
Journal	Journal of Engineering Research (Kuwait)
Volume	9
Issue number	3
DOI	https://doi.org/10.36909/jer.v9i3A.8383
Publication status	Published - 2 Sept 2021

Cite this

@article{2927380bee074ff1a3a726aab8f5479e,

title = "Bidirectional LSTM with saliency-aware 3D-CNN features for human action recognition",

abstract = "Deep convolutional neural networks (DCNNs) and recurrent neural networks (RNNs) have become an important research area in multimedia understanding and have achieved remarkable action recognition performance. However, videos contain rich motion information with varying dimensions. Existing recurrent-based pipelines fail to capture long-term motion dynamics in videos with various motion scales and complex actions performed by multiple actors. Considering contextual and salient features matters more than mapping a video frame into a static video representation. This work proposes a novel pipeline that analyzes and processes video information using a 3D convolutional (C3D) network and a newly introduced deep bidirectional LSTM. Like the popular two-stream ConvNet, we introduce a two-stream framework with one modification: we replace the optical flow stream with a saliency-aware stream to avoid its computational complexity. First, we generate a saliency-aware video stream by applying the saliency-aware method. Second, a two-stream 3D convolutional network (C3D) is applied to two different types of streams, i.e., the RGB stream and the saliency-aware video stream, to collect both spatial and semantic temporal features. Next, a deep bidirectional LSTM network is used to learn sequential deep temporal dynamics. Finally, a time-series pooling layer and a softmax layer classify human activity and behavior. The introduced system can learn long-term temporal dependencies and can predict complex human actions. Experimental results demonstrate a significant improvement in action recognition accuracy on several benchmark datasets.",

keywords = "Action recognition, Convolutional neural network (CNN), Long-short-term-memory (LSTM), Recurrent neural network (RNN), Saliency",

author = "Sheeraz Arif and Jing Wang and Siddiqui, {Adnan Ahmed} and Rashid Hussain and Fida Hussain",

year = "2021",

month = sep,

day = "2",

doi = "10.36909/jer.v9i3A.8383",

language = "English",

volume = "9",

pages = "115--133",

journal = "Journal of Engineering Research (Kuwait)",

issn = "2307-1877",

publisher = "University of Kuwait",

number = "3",

}

TY - JOUR

T1 - Bidirectional LSTM with saliency-aware 3D-CNN features for human action recognition

AU - Arif, Sheeraz

AU - Wang, Jing

AU - Siddiqui, Adnan Ahmed

AU - Hussain, Rashid

AU - Hussain, Fida

PY - 2021/9/2

Y1 - 2021/9/2

AB - Deep convolutional neural networks (DCNNs) and recurrent neural networks (RNNs) have become an important research area in multimedia understanding and have achieved remarkable action recognition performance. However, videos contain rich motion information with varying dimensions. Existing recurrent-based pipelines fail to capture long-term motion dynamics in videos with various motion scales and complex actions performed by multiple actors. Considering contextual and salient features matters more than mapping a video frame into a static video representation. This work proposes a novel pipeline that analyzes and processes video information using a 3D convolutional (C3D) network and a newly introduced deep bidirectional LSTM. Like the popular two-stream ConvNet, we introduce a two-stream framework with one modification: we replace the optical flow stream with a saliency-aware stream to avoid its computational complexity. First, we generate a saliency-aware video stream by applying the saliency-aware method. Second, a two-stream 3D convolutional network (C3D) is applied to two different types of streams, i.e., the RGB stream and the saliency-aware video stream, to collect both spatial and semantic temporal features. Next, a deep bidirectional LSTM network is used to learn sequential deep temporal dynamics. Finally, a time-series pooling layer and a softmax layer classify human activity and behavior. The introduced system can learn long-term temporal dependencies and can predict complex human actions. Experimental results demonstrate a significant improvement in action recognition accuracy on several benchmark datasets.

KW - Action recognition

KW - Convolutional neural network (CNN)

KW - Long-short-term-memory (LSTM)

KW - Recurrent neural network (RNN)

KW - Saliency

UR - http://www.scopus.com/inward/record.url?scp=85114193096&partnerID=8YFLogxK

U2 - 10.36909/jer.v9i3A.8383

DO - 10.36909/jer.v9i3A.8383

M3 - Article

AN - SCOPUS:85114193096

SN - 2307-1877

VL - 9

SP - 115

EP - 133

JO - Journal of Engineering Research (Kuwait)

JF - Journal of Engineering Research (Kuwait)

IS - 3

ER -
