TY - JOUR
T1 - Concept-Enhanced Relation Network for Video Visual Relation Inference
AU - Cao, Qianwen
AU - Huang, Heyan
AU - Ren, Mucheng
AU - Yuan, Changsen
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023/5/1
Y1 - 2023/5/1
N2 - Video visual relation inference aims at extracting relation triplets in the form of <subject-predicate-object> from videos. With the development of deep learning, existing approaches are built on data-driven neural networks. However, the datasets are often biased in terms of objects and relation triplets, which makes relation inference challenging. Existing approaches typically describe relationships through visual, spatial, and semantic characteristics. The semantic description plays a key role in indicating the potential linguistic connections between objects, which are crucial for transferring knowledge across relationships, especially for determining novel relations. However, in these works, the semantic features are not emphasized but are simply obtained by mapping object labels, which cannot reflect sufficient linguistic meaning. To alleviate the above issues, we propose a novel network, termed Concept-Enhanced Relation Network (CERN), to facilitate video visual relation inference. Thanks to the attributes and linguistic contexts implied in concepts, semantic representations aggregated with the related concept knowledge of objects benefit relation inference. To this end, we incorporate retrieved concepts with the local semantics of objects via a gating mechanism to generate concept-enhanced semantic representations. Extensive experimental results show that our approach achieves state-of-the-art performance on two public datasets: ImageNet-VidVRD and VidOR.
AB - Video visual relation inference aims at extracting relation triplets in the form of <subject-predicate-object> from videos. With the development of deep learning, existing approaches are built on data-driven neural networks. However, the datasets are often biased in terms of objects and relation triplets, which makes relation inference challenging. Existing approaches typically describe relationships through visual, spatial, and semantic characteristics. The semantic description plays a key role in indicating the potential linguistic connections between objects, which are crucial for transferring knowledge across relationships, especially for determining novel relations. However, in these works, the semantic features are not emphasized but are simply obtained by mapping object labels, which cannot reflect sufficient linguistic meaning. To alleviate the above issues, we propose a novel network, termed Concept-Enhanced Relation Network (CERN), to facilitate video visual relation inference. Thanks to the attributes and linguistic contexts implied in concepts, semantic representations aggregated with the related concept knowledge of objects benefit relation inference. To this end, we incorporate retrieved concepts with the local semantics of objects via a gating mechanism to generate concept-enhanced semantic representations. Extensive experimental results show that our approach achieves state-of-the-art performance on two public datasets: ImageNet-VidVRD and VidOR.
KW - Video visual relation inference
KW - concept knowledge base
KW - feature learning
KW - neural network
KW - visual understanding
UR - http://www.scopus.com/inward/record.url?scp=85141560636&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2022.3220426
DO - 10.1109/TCSVT.2022.3220426
M3 - Article
AN - SCOPUS:85141560636
SN - 1051-8215
VL - 33
SP - 2233
EP - 2244
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 5
ER -