Abstract
Video visual relation detection (VidVRD) aims to abstract structured relations in the form of <subject-predicate-object> triplets from videos. The triplet form makes the search space extremely large and the distribution of relations highly unbalanced. Existing works usually predict relationships from visual, spatial, and semantic cues. Among them, semantic cues are responsible for exploring the semantic connections between objects, which is crucial for transferring knowledge across relations. However, most of these works extract semantic cues by simply mapping object labels to classified features, ignoring the contextual surroundings and resulting in poor performance on low-frequency relations. To alleviate these issues, we propose a novel network, termed Contextual Knowledge Embedded Relation Network (CKERN), to facilitate VidVRD by establishing contextual knowledge embeddings for detected object pairs in relations from two aspects: commonsense attributes and prior linguistic dependencies. Specifically, we take the object pair as a query to extract relational facts from a commonsense knowledge base, then encode them to explicitly construct the semantic surroundings of the relation. In addition, statistics of object pairs occurring with different predicates, distilled from large-scale visual relation annotations, are taken into account to represent the linguistic regularities of relations. Extensive experimental results on benchmark datasets demonstrate the effectiveness and robustness of our proposed model.
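The abstract describes two semantic cues built for each detected object pair: relational facts retrieved from a commonsense knowledge base, and predicate statistics distilled from large-scale visual relation annotations. The sketch below is a minimal illustration of these two steps under stated assumptions, not the authors' implementation: the toy fact table, the training triplets, and the function names are hypothetical stand-ins for a real knowledge base (e.g., ConceptNet) and for the dataset statistics.

```python
from collections import Counter

# --- Cue 1: commonsense attributes queried with the object pair -------------
# Hypothetical fact store: object label -> list of (relation, attribute) facts.
COMMONSENSE_FACTS = {
    "dog":    [("IsA", "animal"), ("CapableOf", "run"), ("AtLocation", "park")],
    "person": [("IsA", "human"), ("CapableOf", "walk"), ("Desires", "play")],
}

def query_commonsense(subject: str, obj: str):
    """Retrieve relational facts for both members of the object pair."""
    return COMMONSENSE_FACTS.get(subject, []) + COMMONSENSE_FACTS.get(obj, [])

# --- Cue 2: linguistic prior from predicate statistics ----------------------
# Hypothetical training triplets <subject, predicate, object>.
TRAINING_TRIPLETS = [
    ("dog", "chase", "person"),
    ("dog", "chase", "person"),
    ("dog", "run_toward", "person"),
    ("person", "walk_with", "dog"),
]

def predicate_prior(subject: str, obj: str):
    """Normalized distribution of predicates observed for this object pair."""
    counts = Counter(p for s, p, o in TRAINING_TRIPLETS if s == subject and o == obj)
    total = sum(counts.values()) or 1
    return {pred: c / total for pred, c in counts.items()}

if __name__ == "__main__":
    pair = ("dog", "person")
    print(query_commonsense(*pair))  # facts forming the semantic surroundings
    print(predicate_prior(*pair))    # most probability mass falls on "chase"
```

In CKERN the retrieved facts and the prior distribution would then be encoded into embeddings and combined with visual and spatial features; the sketch stops at the retrieval and counting stage.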
| Original language | English |
| --- | --- |
| Pages (from-to) | 13083-13095 |
| Number of pages | 13 |
| Journal | IEEE Transactions on Knowledge and Data Engineering |
| Volume | 35 |
| Issue number | 12 |
| DOIs | |
| Publication status | Published - 1 Dec 2023 |
Keywords
- Computer vision
- knowledge embedding
- video understanding
- video visual relation detection
- visual relation tagging