TY - JOUR
T1 - Online video visual relation detection with hierarchical multi-modal fusion
AU - He, Yuxuan
AU - Gan, Ming Gang
AU - Ma, Qianzhao
N1 - Publisher Copyright:
© 2024, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2024
Y1 - 2024
N2 - With the development of artificial intelligence technology, visual scene understanding has become a hot research topic. Online visual relation detection plays an important role in dynamic visual scene understanding. However, completely modeling dynamic relations and exploiting a large amount of video content to infer visual relations are two difficult problems that need to be solved. Therefore, we propose a Hierarchical Multi-Modal Fusion network for online video visual relation detection. We propose ASE-GCN to model dynamic scenes from different perspectives in order to fully capture visual relations in dynamic scenes. Meanwhile, we use trajectory features and natural language features as additional auxiliary features to describe the visual scene together with the high-level visual features constructed by ASE-GCN. In order to make full use of this information to infer visual relations, we design a Hierarchical Fusion module before the relation predictor, which fuses the multi-role and multi-modal features using methods based on attention and trilinear pooling. Comparative experiments on the ImageNet-VidVRD dataset demonstrate that our network outperforms other methods, while ablation studies verify that the proposed modules are effective.
AB - With the development of artificial intelligence technology, visual scene understanding has become a hot research topic. Online visual relation detection plays an important role in dynamic visual scene understanding. However, completely modeling dynamic relations and exploiting a large amount of video content to infer visual relations are two difficult problems that need to be solved. Therefore, we propose a Hierarchical Multi-Modal Fusion network for online video visual relation detection. We propose ASE-GCN to model dynamic scenes from different perspectives in order to fully capture visual relations in dynamic scenes. Meanwhile, we use trajectory features and natural language features as additional auxiliary features to describe the visual scene together with the high-level visual features constructed by ASE-GCN. In order to make full use of this information to infer visual relations, we design a Hierarchical Fusion module before the relation predictor, which fuses the multi-role and multi-modal features using methods based on attention and trilinear pooling. Comparative experiments on the ImageNet-VidVRD dataset demonstrate that our network outperforms other methods, while ablation studies verify that the proposed modules are effective.
KW - Graph network
KW - Multi-modal features
KW - Visual relation detection
UR - http://www.scopus.com/inward/record.url?scp=85182686845&partnerID=8YFLogxK
U2 - 10.1007/s11042-023-15310-3
DO - 10.1007/s11042-023-15310-3
M3 - Article
AN - SCOPUS:85182686845
SN - 1380-7501
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
ER -