TY - JOUR
T1 - Attention Guided Relation Detection Approach for Video Visual Relation Detection
AU - Cao, Qianwen
AU - Huang, Heyan
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Video Visual Relation Detection (VidVRD) aims at detecting relation instances between two observed objects in the form of <subject-predicate-object>. Unlike image visual relation detection, the introduction of the time dimension requires both the various predicates and the spatio-temporal locations to be handled, making the task challenging. To tackle these challenges, most existing works perform the task in two phases: first predicting relationships in segmented clips to capture the motions, and then associating them into relation instances with proper locations in videos. These works detect different relationships by collecting cues from multiple aspects, but treat the cues equally without distinction. Furthermore, due to dynamic scenes and the drifting problem in object tracking, the rigid spatial overlap used to determine associations in previous works is insufficient, which leads to missing associations. To address these problems, in this paper we propose a novel attention-guided relation detection approach for VidVRD. To model the distinction among different cues and strengthen the salient characteristics, we assign attention weights to these cues for relationship prediction and association decision-making. In addition, to comprehensively measure whether relationships should be merged, we put forward a customized network that takes both visual appearance and geometric location into account. Extensive experimental results on the ImageNet-VidVRD and VidOR datasets demonstrate the effectiveness of our proposed approach, and abundant ablation studies verify that the components designed in the approach are essential.
AB - Video Visual Relation Detection (VidVRD) aims at detecting relation instances between two observed objects in the form of <subject-predicate-object>. Unlike image visual relation detection, the introduction of the time dimension requires both the various predicates and the spatio-temporal locations to be handled, making the task challenging. To tackle these challenges, most existing works perform the task in two phases: first predicting relationships in segmented clips to capture the motions, and then associating them into relation instances with proper locations in videos. These works detect different relationships by collecting cues from multiple aspects, but treat the cues equally without distinction. Furthermore, due to dynamic scenes and the drifting problem in object tracking, the rigid spatial overlap used to determine associations in previous works is insufficient, which leads to missing associations. To address these problems, in this paper we propose a novel attention-guided relation detection approach for VidVRD. To model the distinction among different cues and strengthen the salient characteristics, we assign attention weights to these cues for relationship prediction and association decision-making. In addition, to comprehensively measure whether relationships should be merged, we put forward a customized network that takes both visual appearance and geometric location into account. Extensive experimental results on the ImageNet-VidVRD and VidOR datasets demonstrate the effectiveness of our proposed approach, and abundant ablation studies verify that the components designed in the approach are essential.
KW - Video visual relation detection
KW - attention mechanism
KW - neural network
KW - visual relation tagging
UR - http://www.scopus.com/inward/record.url?scp=85114727000&partnerID=8YFLogxK
U2 - 10.1109/TMM.2021.3109430
DO - 10.1109/TMM.2021.3109430
M3 - Article
AN - SCOPUS:85114727000
SN - 1520-9210
VL - 24
SP - 3896
EP - 3907
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -