TY - JOUR
T1 - Video Visual Relation Detection With Contextual Knowledge Embedding
AU - Cao, Qianwen
AU - Huang, Heyan
N1 - Publisher Copyright:
© 1989-2012 IEEE.
PY - 2023/12/1
Y1 - 2023/12/1
N2 - Video visual relation detection (VidVRD) aims at abstracting structured relations in the form of <subject, predicate, object> triplets from videos. This triplet formulation makes the search space extremely large and the relation distribution unbalanced. Existing works usually predict relationships from visual, spatial, and semantic cues. Among them, semantic cues are responsible for exploring the semantic connections between objects, which is crucial for transferring knowledge across relations. However, most of these works extract semantic cues by simply mapping object labels to classification features, ignoring the contextual surroundings and resulting in poor performance on low-frequency relations. To alleviate these issues, we propose a novel network, termed Contextual Knowledge Embedded Relation Network (CKERN), which facilitates VidVRD by establishing contextual knowledge embeddings for detected object pairs in relations from two aspects: commonsense attributes and prior linguistic dependencies. Specifically, we take the <subject, object> pair as a query to extract relational facts from a commonsense knowledge base, then encode them to explicitly construct the semantic surroundings of a relation. In addition, the statistics of object pairs with different predicates, distilled from large-scale visual relations, are taken into account to represent the linguistic regularity of relations. Extensive experimental results on benchmark datasets demonstrate the effectiveness and robustness of our proposed model.
AB - Video visual relation detection (VidVRD) aims at abstracting structured relations in the form of <subject, predicate, object> triplets from videos. This triplet formulation makes the search space extremely large and the relation distribution unbalanced. Existing works usually predict relationships from visual, spatial, and semantic cues. Among them, semantic cues are responsible for exploring the semantic connections between objects, which is crucial for transferring knowledge across relations. However, most of these works extract semantic cues by simply mapping object labels to classification features, ignoring the contextual surroundings and resulting in poor performance on low-frequency relations. To alleviate these issues, we propose a novel network, termed Contextual Knowledge Embedded Relation Network (CKERN), which facilitates VidVRD by establishing contextual knowledge embeddings for detected object pairs in relations from two aspects: commonsense attributes and prior linguistic dependencies. Specifically, we take the <subject, object> pair as a query to extract relational facts from a commonsense knowledge base, then encode them to explicitly construct the semantic surroundings of a relation. In addition, the statistics of object pairs with different predicates, distilled from large-scale visual relations, are taken into account to represent the linguistic regularity of relations. Extensive experimental results on benchmark datasets demonstrate the effectiveness and robustness of our proposed model.
KW - Computer vision
KW - knowledge embedding
KW - video understanding
KW - video visual relation detection
KW - visual relation tagging
UR - http://www.scopus.com/inward/record.url?scp=85159708574&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2023.3270328
DO - 10.1109/TKDE.2023.3270328
M3 - Article
AN - SCOPUS:85159708574
SN - 1041-4347
VL - 35
SP - 13083
EP - 13095
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 12
ER -