Spatial–Temporal Relation Reasoning for Action Prediction in Videos

Xinxiao Wu*, Ruiqi Wang, Jingyi Hou, Hanxi Lin, Jiebo Luo

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

23 Citations (Scopus)

Abstract

Action prediction in videos refers to inferring the action category label from an early observation of a video. Existing studies mainly focus on exploiting multiple visual cues to enhance the discriminative power of feature representations, while neglecting important structural information in videos, including the interactions and correlations between different object entities. In this paper, we focus on reasoning about the spatial–temporal relations between persons and contextual objects to interpret the observed part of a video for predicting action categories. With this in mind, we propose a novel spatial–temporal relation reasoning approach that extracts the spatial relations between persons and objects in still frames and explores how these spatial relations change over time. Specifically, for spatial relation reasoning, we propose an improved gated graph neural network to perform spatial relation reasoning between the visual objects in video frames. For temporal relation reasoning, we propose a long short-term graph network to model both the short-term and long-term varying dynamics of the spatial relations with multi-scale receptive fields. In this way, our approach can accurately recognize the video content in terms of fine-grained object relations in both the spatial and temporal domains to make prediction decisions. Moreover, in order to learn the latent correlations between spatial–temporal object relations and action categories in videos, a visual semantic relation loss is proposed to model the triple constraints between objects in the semantic domain via VTransE. Extensive experiments on five public video datasets (i.e., 20BN-something-something, CAD120, UCF101, BIT-Interaction and HMDB51) demonstrate the effectiveness of the proposed spatial–temporal relation reasoning for action prediction.
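The spatial relation reasoning described above propagates information between person and object nodes with gated updates. The following is a minimal NumPy sketch of one gated message-passing step in the spirit of a gated graph neural network: messages from neighboring object nodes are aggregated through a relation-weighted adjacency matrix and fused into each node state via GRU-style gates. The single shared message matrix, the weight shapes, and the row-normalized adjacency are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_graph_step(h, adj, W_msg, W_z, U_z, W_r, U_r, W_h, U_h):
    """One gated message-passing step over object/person nodes.

    h   : (n, d) node features (e.g., region features of persons/objects)
    adj : (n, n) relation-weighted adjacency matrix (row-normalized)
    """
    m = adj @ (h @ W_msg)               # aggregate messages from neighbors
    z = sigmoid(m @ W_z + h @ U_z)      # update gate
    r = sigmoid(m @ W_r + h @ U_r)      # reset gate
    h_tilde = np.tanh(m @ W_h + (r * h) @ U_h)  # candidate state
    return (1 - z) * h + z * h_tilde    # GRU-style gated update

# Toy usage: 4 nodes with 8-dim features (hypothetical sizes).
rng = np.random.default_rng(0)
n, d = 4, 8
h = rng.standard_normal((n, d))
adj = rng.random((n, n))
adj /= adj.sum(axis=1, keepdims=True)   # row-normalize edge weights
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(7)]
h_new = gated_graph_step(h, adj, *Ws)
```

Stacking several such steps lets each node's state reflect multi-hop person–object relations before the temporal graph network models how those relations evolve across frames.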

Original language: English
Pages (from-to): 1484-1505
Number of pages: 22
Journal: International Journal of Computer Vision
Volume: 129
Issue number: 5
DOIs
Publication status: Published - May 2021

Keywords

  • Action prediction
  • Improved gated graph neural network
  • Long short-term graph network
  • Spatial–temporal relation reasoning
