TY - JOUR
T1 - Spatio-temporal attention mechanisms based model for collective activity recognition
AU - Lu, Lihua
AU - Di, Huijun
AU - Lu, Yao
AU - Zhang, Lin
AU - Wang, Shunzhou
N1 - Publisher Copyright:
© 2019
PY - 2019/5
Y1 - 2019/5
N2 - Collective activity recognition, which involves multiple people acting and interacting in a collective scenario, is a widely studied but challenging problem in computer vision. The key to this task is how to efficiently explore the spatial and temporal evolution of collective activities. In this paper we propose a model based on spatio-temporal attention mechanisms to exploit spatial configurations and temporal dynamics in collective scenes. We present spatio-temporal attention mechanisms built from both deep RGB features and articulated human poses to capture the spatio-temporal evolution of individuals’ actions and the collective activity. Benefiting from these attention mechanisms, our model learns to spatially capture unbalanced person–group interactions for each person while updating each individual’s state based on these interactions, and to temporally assess the reliability of different video frames when predicting the final label of the collective activity. Furthermore, long-range temporal variability and consistency are handled by a two-stage Gated Recurrent Units (GRUs) network. Finally, to ensure effective training of our model, we jointly optimize losses at both the person and group levels to drive the learning process. Experimental results indicate that our method outperforms the state of the art on the Volleyball dataset. Further verification experiments and visual results demonstrate the effectiveness and practicability of the proposed model.
AB - Collective activity recognition, which involves multiple people acting and interacting in a collective scenario, is a widely studied but challenging problem in computer vision. The key to this task is how to efficiently explore the spatial and temporal evolution of collective activities. In this paper we propose a model based on spatio-temporal attention mechanisms to exploit spatial configurations and temporal dynamics in collective scenes. We present spatio-temporal attention mechanisms built from both deep RGB features and articulated human poses to capture the spatio-temporal evolution of individuals’ actions and the collective activity. Benefiting from these attention mechanisms, our model learns to spatially capture unbalanced person–group interactions for each person while updating each individual’s state based on these interactions, and to temporally assess the reliability of different video frames when predicting the final label of the collective activity. Furthermore, long-range temporal variability and consistency are handled by a two-stage Gated Recurrent Units (GRUs) network. Finally, to ensure effective training of our model, we jointly optimize losses at both the person and group levels to drive the learning process. Experimental results indicate that our method outperforms the state of the art on the Volleyball dataset. Further verification experiments and visual results demonstrate the effectiveness and practicability of the proposed model.
KW - Attention mechanisms
KW - Gated Recurrent Units (GRUs) network
KW - Multi-modal data
KW - Multi-person activity recognition
KW - Spatio-temporal model
UR - http://www.scopus.com/inward/record.url?scp=85062449442&partnerID=8YFLogxK
U2 - 10.1016/j.image.2019.02.012
DO - 10.1016/j.image.2019.02.012
M3 - Article
AN - SCOPUS:85062449442
SN - 0923-5965
VL - 74
SP - 162
EP - 174
JO - Signal Processing: Image Communication
JF - Signal Processing: Image Communication
ER -