TY - JOUR
T1 - Online video visual relation detection with hierarchical multi-modal fusion
AU - He, Yuxuan
AU - Gan, Ming Gang
AU - Ma, Qianzhao
N1 - Publisher Copyright:
© 2024, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2024
Y1 - 2024
N2 - With the development of artificial intelligence technology, visual scene understanding has become a hot research topic. Online visual relation detection plays an important role in dynamic visual scene understanding. However, completely modeling dynamic relations and exploiting a large amount of video content to infer visual relations are two difficult problems that need to be solved. Therefore, we propose a Hierarchical Multi-Modal Fusion network for online video visual relation detection. We propose ASE-GCN to model dynamic scenes from different perspectives in order to fully capture visual relations in dynamic scenes. Meanwhile, we use trajectory features and natural language features as additional auxiliary features to describe the visual scene together with the high-level visual features constructed by ASE-GCN. In order to make full use of this information to infer visual relations, we design a Hierarchical Fusion module before the relation predictor, which fuses the multi-role and multi-modal features using methods based on attention and trilinear pooling. Comparative experiments on the ImageNet-VidVRD dataset demonstrate that our network outperforms other methods, while ablation studies verify that the proposed modules are effective.
AB - With the development of artificial intelligence technology, visual scene understanding has become a hot research topic. Online visual relation detection plays an important role in dynamic visual scene understanding. However, completely modeling dynamic relations and exploiting a large amount of video content to infer visual relations are two difficult problems that need to be solved. Therefore, we propose a Hierarchical Multi-Modal Fusion network for online video visual relation detection. We propose ASE-GCN to model dynamic scenes from different perspectives in order to fully capture visual relations in dynamic scenes. Meanwhile, we use trajectory features and natural language features as additional auxiliary features to describe the visual scene together with the high-level visual features constructed by ASE-GCN. In order to make full use of this information to infer visual relations, we design a Hierarchical Fusion module before the relation predictor, which fuses the multi-role and multi-modal features using methods based on attention and trilinear pooling. Comparative experiments on the ImageNet-VidVRD dataset demonstrate that our network outperforms other methods, while ablation studies verify that the proposed modules are effective.
KW - Graph network
KW - Multi-modal features
KW - Visual relation detection
UR - http://www.scopus.com/inward/record.url?scp=85182686845&partnerID=8YFLogxK
U2 - 10.1007/s11042-023-15310-3
DO - 10.1007/s11042-023-15310-3
M3 - Article
AN - SCOPUS:85182686845
SN - 1380-7501
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
ER -