TY - GEN
T1 - Action recognition with bootstrapping based long-range temporal context attention
AU - Liu, Ziming
AU - Gao, Guangyu
AU - Qin, A. K.
AU - Wu, Tong
AU - Liu, Chi Harold
N1 - Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/10/15
Y1 - 2019/10/15
N2 - Actions involve complex visual variations within long, redundant video sequences. Instead of focusing on a limited-range sequence, i.e., convolution over adjacent frames, in this paper we propose an action recognition approach with bootstrapping-based long-range temporal context attention. Specifically, to capture the visual variations of a local region across frames, we propose the Temporal Pixels based Parallel-head Attention (TPPA) block to model temporal context. In TPPA, we apply a self-attention mechanism between local regions at the same spatial position across temporal frames to capture their interactions. Meanwhile, to handle video redundancy and capture long-range context, TPPA is extended into the Random Frames based Bootstrapping Attention (RFBA) framework. Since the bootstrap-sampled frames follow the same distribution as the whole video sequence, RFBA not only captures longer temporal context from only a few sampled frames but also yields a comprehensive representation through multiple samplings. Furthermore, we also apply this temporal context attention to image-based action recognition by transforming each image into a “pseudo video” via spatial shifts. Finally, we conduct extensive experiments and empirical evaluations on two of the most popular datasets: UCF101 for videos and Stanford40 for images. In particular, our approach achieves a top-1 accuracy of 91.7% on UCF101 and an mAP of 90.9% on Stanford40.
AB - Actions involve complex visual variations within long, redundant video sequences. Instead of focusing on a limited-range sequence, i.e., convolution over adjacent frames, in this paper we propose an action recognition approach with bootstrapping-based long-range temporal context attention. Specifically, to capture the visual variations of a local region across frames, we propose the Temporal Pixels based Parallel-head Attention (TPPA) block to model temporal context. In TPPA, we apply a self-attention mechanism between local regions at the same spatial position across temporal frames to capture their interactions. Meanwhile, to handle video redundancy and capture long-range context, TPPA is extended into the Random Frames based Bootstrapping Attention (RFBA) framework. Since the bootstrap-sampled frames follow the same distribution as the whole video sequence, RFBA not only captures longer temporal context from only a few sampled frames but also yields a comprehensive representation through multiple samplings. Furthermore, we also apply this temporal context attention to image-based action recognition by transforming each image into a “pseudo video” via spatial shifts. Finally, we conduct extensive experiments and empirical evaluations on two of the most popular datasets: UCF101 for videos and Stanford40 for images. In particular, our approach achieves a top-1 accuracy of 91.7% on UCF101 and an mAP of 90.9% on Stanford40.
KW - Action recognition
KW - Bootstrapping attention
KW - Context
KW - Self-attention
UR - http://www.scopus.com/inward/record.url?scp=85074850177&partnerID=8YFLogxK
U2 - 10.1145/3343031.3350916
DO - 10.1145/3343031.3350916
M3 - Conference contribution
AN - SCOPUS:85074850177
T3 - MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia
SP - 583
EP - 591
BT - MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 27th ACM International Conference on Multimedia, MM 2019
Y2 - 21 October 2019 through 25 October 2019
ER -