TY - JOUR
T1 - Attention Guided Relation Detection Approach for Video Visual Relation Detection
AU - Cao, Qianwen
AU - Huang, Heyan
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Video Visual Relation Detection (VidVRD) aims at detecting relation instances between two observed objects in the form of <subject-predicate-object>. Unlike image visual relation detection, the introduction of the time dimension requires both the various predicates and the spatio-temporal locations to be handled, making the task challenging. To tackle these challenges, most existing works perform the task in two phases: first predicting relationships in segmented clips to capture the motions, and then associating them into relation instances with proper locations in videos. These works detect different relationships by collecting cues from multiple aspects, but treat the cues equally without distinction. Furthermore, due to dynamic scenes and the drifting problem in object tracking, the rigid spatial overlap used to determine associations in previous works is insufficient, which leads to missing associations. To address these problems, in this paper we propose a novel attention-guided relation detection approach for VidVRD. To model the distinction among different cues and strengthen the salient characteristics, we assign attention weights to these cues for relationship prediction and association decision-making. In addition, to comprehensively measure whether relationships should be merged, we put forward a customized network that takes both visual appearance and geometric location into account. Extensive experimental results on the ImageNet-VidVRD and VidOR datasets demonstrate the effectiveness of our proposed approach, and abundant ablation studies verify that the components designed in the approach are essential.
AB - Video Visual Relation Detection (VidVRD) aims at detecting relation instances between two observed objects in the form of <subject-predicate-object>. Unlike image visual relation detection, the introduction of the time dimension requires both the various predicates and the spatio-temporal locations to be handled, making the task challenging. To tackle these challenges, most existing works perform the task in two phases: first predicting relationships in segmented clips to capture the motions, and then associating them into relation instances with proper locations in videos. These works detect different relationships by collecting cues from multiple aspects, but treat the cues equally without distinction. Furthermore, due to dynamic scenes and the drifting problem in object tracking, the rigid spatial overlap used to determine associations in previous works is insufficient, which leads to missing associations. To address these problems, in this paper we propose a novel attention-guided relation detection approach for VidVRD. To model the distinction among different cues and strengthen the salient characteristics, we assign attention weights to these cues for relationship prediction and association decision-making. In addition, to comprehensively measure whether relationships should be merged, we put forward a customized network that takes both visual appearance and geometric location into account. Extensive experimental results on the ImageNet-VidVRD and VidOR datasets demonstrate the effectiveness of our proposed approach, and abundant ablation studies verify that the components designed in the approach are essential.
KW - Video visual relation detection
KW - attention mechanism
KW - neural network
KW - visual relation tagging
UR - http://www.scopus.com/inward/record.url?scp=85114727000&partnerID=8YFLogxK
U2 - 10.1109/TMM.2021.3109430
DO - 10.1109/TMM.2021.3109430
M3 - Article
AN - SCOPUS:85114727000
SN - 1520-9210
VL - 24
SP - 3896
EP - 3907
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -