TY - JOUR
T1 - 3-D Relation Network for visual relation recognition in videos
AU - Cao, Qianwen
AU - Huang, Heyan
AU - Shang, Xindi
AU - Wang, Boran
AU - Chua, Tat-Seng
N1 - Publisher Copyright:
© 2020 Elsevier B.V.
PY - 2021/4/7
Y1 - 2021/4/7
AB - Video visual relation recognition aims to mine dynamic relation instances between objects in the form of 〈subject, predicate, object〉, such as “person1-towards-person2” and “person-ride-bicycle”. Existing solutions treat the problem as several independent sub-tasks, i.e., image object detection, video object tracking, and trajectory-based relation prediction. We argue that such separation blocks the information flow between the sub-models, creating redundant representations while preventing the sub-tasks from sharing a common set of features. To this end, we connect the three sub-tasks in an end-to-end manner by proposing the 3-D relation proposal, which serves as a bridge for relation feature learning. Specifically, we put forward a novel deep neural network, named 3DRN, that fuses spatio-temporal visual characteristics, object label features, and spatial interactive features to learn relation instances from multi-modal cues. In addition, a three-stage training strategy is provided to facilitate large-scale parameter optimization. We conduct extensive experiments on two public datasets with different emphases to demonstrate the effectiveness of the proposed end-to-end feature learning method for visual relation recognition in videos. Furthermore, we verify the potential of our approach by tackling the video relation detection task.
KW - Computer vision
KW - Deep neural network
KW - Video visual relation recognition
KW - Visual relation detection
UR - http://www.scopus.com/inward/record.url?scp=85098980087&partnerID=8YFLogxK
DO - 10.1016/j.neucom.2020.12.029
M3 - Article
AN - SCOPUS:85098980087
SN - 0925-2312
VL - 432
SP - 91
EP - 100
JO - Neurocomputing
JF - Neurocomputing
ER -