TY - JOUR
T1 - Video Visual Relation Detection With Contextual Knowledge Embedding
AU - Cao, Qianwen
AU - Huang, Heyan
N1 - Publisher Copyright:
© 1989-2012 IEEE.
PY - 2023/12/1
Y1 - 2023/12/1
N2 - Video visual relation detection (VidVRD) aims at abstracting structured relations in the form of <subject, predicate, object> triplets from videos. This triplet formulation makes the search space extremely large and the relation distribution unbalanced. Existing works usually predict relationships from visual, spatial, and semantic cues. Among them, semantic cues are responsible for exploring the semantic connections between objects, which is crucial for transferring knowledge across relations. However, most of these works extract semantic cues by simply mapping object labels to classification features, ignoring the contextual surroundings and resulting in poor performance on low-frequency relations. To alleviate these issues, we propose a novel network, termed Contextual Knowledge Embedded Relation Network (CKERN), which facilitates VidVRD by establishing contextual knowledge embeddings for detected object pairs in relations from two aspects: commonsense attributes and prior linguistic dependencies. Specifically, we take the <subject, object> pair as a query to extract relational facts from a commonsense knowledge base, then encode them to explicitly construct the semantic surroundings of a relation. In addition, the statistics of object pairs with different predicates, distilled from large-scale visual relations, are taken into account to represent the linguistic regularity of relations. Extensive experimental results on benchmark datasets demonstrate the effectiveness and robustness of our proposed model.
AB - Video visual relation detection (VidVRD) aims at abstracting structured relations in the form of <subject, predicate, object> triplets from videos. This triplet formulation makes the search space extremely large and the relation distribution unbalanced. Existing works usually predict relationships from visual, spatial, and semantic cues. Among them, semantic cues are responsible for exploring the semantic connections between objects, which is crucial for transferring knowledge across relations. However, most of these works extract semantic cues by simply mapping object labels to classification features, ignoring the contextual surroundings and resulting in poor performance on low-frequency relations. To alleviate these issues, we propose a novel network, termed Contextual Knowledge Embedded Relation Network (CKERN), which facilitates VidVRD by establishing contextual knowledge embeddings for detected object pairs in relations from two aspects: commonsense attributes and prior linguistic dependencies. Specifically, we take the <subject, object> pair as a query to extract relational facts from a commonsense knowledge base, then encode them to explicitly construct the semantic surroundings of a relation. In addition, the statistics of object pairs with different predicates, distilled from large-scale visual relations, are taken into account to represent the linguistic regularity of relations. Extensive experimental results on benchmark datasets demonstrate the effectiveness and robustness of our proposed model.
KW - Computer vision
KW - knowledge embedding
KW - video understanding
KW - video visual relation detection
KW - visual relation tagging
UR - http://www.scopus.com/inward/record.url?scp=85159708574&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2023.3270328
DO - 10.1109/TKDE.2023.3270328
M3 - Article
AN - SCOPUS:85159708574
SN - 1041-4347
VL - 35
SP - 13083
EP - 13095
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 12
ER -