Spatial–Temporal Relation Reasoning for Action Prediction in Videos

Xinxiao Wu; Ruiqi Wang; Jingyi Hou; Hanxi Lin; Jiebo Luo

doi:10.1007/s11263-020-01409-9

Xinxiao Wu^*, Ruiqi Wang, Jingyi Hou, Hanxi Lin, Jiebo Luo

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

24 Citations (Scopus)

Abstract

Action prediction in videos refers to inferring the action category label by an early observation of a video. Existing studies mainly focus on exploiting multiple visual cues to enhance the discriminative power of feature representation while neglecting important structure information in videos including interactions and correlations between different object entities. In this paper, we focus on reasoning about the spatial–temporal relations between persons and contextual objects to interpret the observed video part for predicting action categories. With this in mind, we propose a novel spatial–temporal relation reasoning approach that extracts the spatial relations between persons and objects in still frames and explores how these spatial relations change over time. Specifically, for spatial relation reasoning, we propose an improved gated graph neural network to perform spatial relation reasoning between the visual objects in video frames. For temporal relation reasoning, we propose a long short-term graph network to model both the short-term and long-term varying dynamics of the spatial relations with multi-scale receptive fields. By this means, our approach can accurately recognize the video content in terms of fine-grained object relations in both spatial and temporal domains to make prediction decisions. Moreover, in order to learn the latent correlations between spatial–temporal object relations and action categories in videos, a visual semantic relation loss is proposed to model the triple constraints between objects in semantic domain via VTransE. Extensive experiments on five public video datasets (i.e., 20BN-something-something, CAD120, UCF101, BIT-Interaction and HMDB51) demonstrate the effectiveness of the proposed spatial–temporal relation reasoning on action prediction.
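The "triple constraints" mentioned in the abstract can be read as a translation-style relation between a subject, a predicate and an object, in the spirit of translation embeddings such as VTransE: the projected subject feature plus a predicate vector should land close to the projected object feature. The following is only a minimal sketch of that general idea, not the loss used in the paper; the function names, the projection matrices, the feature dimensions and the toy values are all assumptions made for illustration.

```python
# Hypothetical illustration of a translation-style triple constraint
# (subject + predicate ~ object). Nothing here is taken from the paper:
# names, dimensions, and values are assumptions.

def project(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def translation_residual(w_subj, w_obj, predicate, subj_feat, obj_feat):
    """Squared distance between (W_s x_s + t_p) and (W_o x_o)."""
    s = project(w_subj, subj_feat)
    o = project(w_obj, obj_feat)
    return sum((si + pi - oi) ** 2 for si, pi, oi in zip(s, predicate, o))


# Toy example: 3-d visual features projected into a 2-d relation space.
w_subj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_obj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
person = [0.0, 1.0, 5.0]
cup = [1.0, 1.0, 7.0]
holding = [1.0, 0.0]    # predicate vector that fits the (person, cup) pair
unrelated = [0.0, 2.0]  # predicate vector that does not fit

good = translation_residual(w_subj, w_obj, holding, person, cup)
bad = translation_residual(w_subj, w_obj, unrelated, person, cup)
```

In this toy setup a well-fitting predicate gives a smaller residual than a poorly fitting one, which is the property a loss built on such a constraint would exploit during training.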

Original language	English
Pages (from-to)	1484-1505
Number of pages	22
Journal	International Journal of Computer Vision
Volume	129
Issue number	5
DOIs	https://doi.org/10.1007/s11263-020-01409-9
Publication status	Published - May 2021

Keywords

Action prediction
Improved gated graph neural network
Long short-term graph network
Spatial–temporal relation reasoning

Access to Document

10.1007/s11263-020-01409-9

Cite this

@article{8a16b4b00c3f4493b76bd7e98561a997,

title = "Spatial–Temporal Relation Reasoning for Action Prediction in Videos",

abstract = "Action prediction in videos refers to inferring the action category label by an early observation of a video. Existing studies mainly focus on exploiting multiple visual cues to enhance the discriminative power of feature representation while neglecting important structure information in videos including interactions and correlations between different object entities. In this paper, we focus on reasoning about the spatial–temporal relations between persons and contextual objects to interpret the observed video part for predicting action categories. With this in mind, we propose a novel spatial–temporal relation reasoning approach that extracts the spatial relations between persons and objects in still frames and explores how these spatial relations change over time. Specifically, for spatial relation reasoning, we propose an improved gated graph neural network to perform spatial relation reasoning between the visual objects in video frames. For temporal relation reasoning, we propose a long short-term graph network to model both the short-term and long-term varying dynamics of the spatial relations with multi-scale receptive fields. By this means, our approach can accurately recognize the video content in terms of fine-grained object relations in both spatial and temporal domains to make prediction decisions. Moreover, in order to learn the latent correlations between spatial–temporal object relations and action categories in videos, a visual semantic relation loss is proposed to model the triple constraints between objects in semantic domain via VTransE. Extensive experiments on five public video datasets (i.e., 20BN-something-something, CAD120, UCF101, BIT-Interaction and HMDB51) demonstrate the effectiveness of the proposed spatial–temporal relation reasoning on action prediction.",

keywords = "Action prediction, Improved gated graph neural network, Long short-term graph network, Spatial–temporal relation reasoning",

author = "Xinxiao Wu and Ruiqi Wang and Jingyi Hou and Hanxi Lin and Jiebo Luo",

note = "Publisher Copyright: {\textcopyright} 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.",

year = "2021",

month = may,

doi = "10.1007/s11263-020-01409-9",

language = "English",

volume = "129",

pages = "1484--1505",

journal = "International Journal of Computer Vision",

issn = "0920-5691",

publisher = "Springer Netherlands",

number = "5",

}

TY - JOUR

T1 - Spatial–Temporal Relation Reasoning for Action Prediction in Videos

AU - Wu, Xinxiao

AU - Wang, Ruiqi

AU - Hou, Jingyi

AU - Lin, Hanxi

AU - Luo, Jiebo

PY - 2021/5

Y1 - 2021/5

AB - Action prediction in videos refers to inferring the action category label by an early observation of a video. Existing studies mainly focus on exploiting multiple visual cues to enhance the discriminative power of feature representation while neglecting important structure information in videos including interactions and correlations between different object entities. In this paper, we focus on reasoning about the spatial–temporal relations between persons and contextual objects to interpret the observed video part for predicting action categories. With this in mind, we propose a novel spatial–temporal relation reasoning approach that extracts the spatial relations between persons and objects in still frames and explores how these spatial relations change over time. Specifically, for spatial relation reasoning, we propose an improved gated graph neural network to perform spatial relation reasoning between the visual objects in video frames. For temporal relation reasoning, we propose a long short-term graph network to model both the short-term and long-term varying dynamics of the spatial relations with multi-scale receptive fields. By this means, our approach can accurately recognize the video content in terms of fine-grained object relations in both spatial and temporal domains to make prediction decisions. Moreover, in order to learn the latent correlations between spatial–temporal object relations and action categories in videos, a visual semantic relation loss is proposed to model the triple constraints between objects in semantic domain via VTransE. Extensive experiments on five public video datasets (i.e., 20BN-something-something, CAD120, UCF101, BIT-Interaction and HMDB51) demonstrate the effectiveness of the proposed spatial–temporal relation reasoning on action prediction.

KW - Action prediction

KW - Improved gated graph neural network

KW - Long short-term graph network

KW - Spatial–temporal relation reasoning

UR - http://www.scopus.com/inward/record.url?scp=85101026889&partnerID=8YFLogxK

U2 - 10.1007/s11263-020-01409-9

DO - 10.1007/s11263-020-01409-9

M3 - Article

AN - SCOPUS:85101026889

SN - 0920-5691

VL - 129

SP - 1484

EP - 1505

JO - International Journal of Computer Vision

JF - International Journal of Computer Vision

IS - 5

ER -
