TY - JOUR
T1 - 3-D Relation Network for visual relation recognition in videos
AU - Cao, Qianwen
AU - Huang, Heyan
AU - Shang, Xindi
AU - Wang, Boran
AU - Chua, Tat-Seng
N1 - Publisher Copyright:
© 2020 Elsevier B.V.
PY - 2021/4/7
Y1 - 2021/4/7
AB - Video visual relation recognition aims to mine dynamic relation instances between objects in the form of 〈subject, predicate, object〉, such as “person1-towards-person2” and “person-ride-bicycle”. Existing solutions treat the problem as several independent sub-tasks, i.e., image object detection, video object tracking, and trajectory-based relation prediction. We argue that such separation blocks the information flow between the sub-models, creating redundant representations while preventing the sub-tasks from sharing a common set of features. To this end, we connect the three sub-tasks in an end-to-end manner by proposing the 3-D relation proposal, which serves as a bridge for relation feature learning. Specifically, we put forward a novel deep neural network, named 3DRN, that fuses spatio-temporal visual characteristics, object label features, and spatial interactive features to learn relation instances from multi-modal cues. In addition, a three-stage training strategy is provided to facilitate large-scale parameter optimization. We conduct extensive experiments on two public datasets with different emphases to demonstrate the effectiveness of the proposed end-to-end feature learning method for visual relation recognition in videos. Furthermore, we verify the potential of our approach by tackling the video relation detection task.
KW - Computer vision
KW - Deep neural network
KW - Video visual relation recognition
KW - Visual relation detection
UR - http://www.scopus.com/inward/record.url?scp=85098980087&partnerID=8YFLogxK
DO - 10.1016/j.neucom.2020.12.029
M3 - Article
AN - SCOPUS:85098980087
SN - 0925-2312
VL - 432
SP - 91
EP - 100
JO - Neurocomputing
JF - Neurocomputing
ER -