VSRN: Visual-Semantic Relation Network for Video Visual Relation Inference

Qianwen Cao, Heyan Huang*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Video visual relation inference is the task of automatically detecting the relation triplets between the observed objects in a video in the form $\langle subject, predicate, object \rangle$, which requires correctly labeling each detected object and their interaction predicates. Despite recent advances in image visual relation detection using deep learning techniques, relation inference in videos remains a challenging topic. On one hand, because of the temporal dimension, the rich spatio-temporal visual information of objects and videos must be modeled. On the other hand, wild videos are often annotated with incomplete relation triplet tags, and some of these tags overlap semantically. However, previous methods adopt hand-crafted visual features extracted from trajectories, which describe only the local appearance characteristics of isolated objects, and they treat the problem as a multi-class classification task, which makes the relation tags mutually exclusive. To address these issues, we propose a novel model, termed Visual-Semantic Relation Network (VSRN). In this network, we leverage three-dimensional convolution kernels to capture spatio-temporal features and encode global visual features of videos through a pooling operation on each time slice. Moreover, the semantic collocations between objects are also incorporated so as to obtain comprehensive representations of the relationships. For relation classification, we treat the problem as a multi-label classification task and regard each tag as independent, so that multiple relationships can be predicted. Additionally, we modify the commonly used evaluation metric, video-wise recall, into a pair-wise metric ($R_{oop}$) for testing how well models predict multiple relationships for each object pair. Extensive experimental results on two large-scale datasets demonstrate the effectiveness of our proposed model, which significantly outperforms previous works.
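The following is a minimal sketch, not the authors' implementation, of the two ingredients the abstract describes: a 3D convolution with per-time-slice pooling for spatio-temporal visual encoding, and an independent-per-tag (multi-label) relation head instead of a mutually exclusive softmax. The module name, tensor shapes, and the predicate count are illustrative assumptions written in PyTorch style.

```python
import torch
import torch.nn as nn

class VisualRelationHead(nn.Module):
    """Illustrative sketch: 3D-conv encoding + multi-label predicate scores."""

    def __init__(self, in_channels=3, feat_dim=256, num_predicates=132):
        super().__init__()
        # 3D convolution captures spatio-temporal features of the clip.
        self.conv3d = nn.Conv3d(in_channels, feat_dim, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        # Pooling over each time slice yields one global visual feature per frame.
        self.spatial_pool = nn.AdaptiveAvgPool3d((None, 1, 1))
        # Independent sigmoid outputs make this multi-label, so several
        # predicates can hold simultaneously for one <subject, object> pair.
        self.classifier = nn.Linear(feat_dim, num_predicates)

    def forward(self, clip):                                  # clip: (B, C, T, H, W)
        x = self.relu(self.conv3d(clip))                      # (B, feat_dim, T, H, W)
        x = self.spatial_pool(x).squeeze(-1).squeeze(-1)      # (B, feat_dim, T)
        x = x.mean(dim=2)                                     # aggregate over time
        return torch.sigmoid(self.classifier(x))              # per-predicate probabilities

# Training such a head would use a per-tag binary cross-entropy loss,
# e.g. nn.BCELoss()(scores, multi_hot_targets), rather than softmax cross-entropy.
```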
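For the evaluation side, a rough sketch of a pair-wise recall is given below, under the assumption that ground-truth and predicted predicates are grouped per $\langle subject, object \rangle$ pair; the function name, data layout, and the top-k cutoff are illustrative and not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (subject, object)

def pairwise_recall(gt: Dict[Pair, Set[str]],
                    pred: Dict[Pair, List[str]],
                    k: int = 5) -> float:
    """Average, over object pairs, of the fraction of ground-truth
    predicates recovered among that pair's top-k predicted predicates."""
    recalls = []
    for pair, gt_predicates in gt.items():
        topk = set(pred.get(pair, [])[:k])
        recalls.append(len(gt_predicates & topk) / len(gt_predicates))
    return sum(recalls) / len(recalls) if recalls else 0.0
```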

Original language: English
Pages (from-to): 768-777
Number of pages: 10
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 32
Issue number: 2
DOI
Publication status: Published - 1 Feb 2022
