Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering

Chuanqi Zang*, Hanqing Wang, Mingtao Pei, Wei Liang*

*Corresponding author for this work

Research output: Contribution to journal, Conference article, peer-reviewed

23 Citations (Scopus)

Abstract

Video Question Answering (VideoQA) is challenging because it requires capturing accurate correlations between modalities from redundant information. Recent methods focus on the explicit challenges of the task, e.g., multimodal feature extraction, video-text alignment, and fusion. Their frameworks reason about the answer by relying on statistical evidence, which ignores potential bias in the multimodal data. In our work, we investigate relational structure from a causal representation perspective on multimodal data and propose a novel inference framework. For visual data, question-irrelevant objects may establish simple matching associations with the answer. For textual data, the model prefers local phrase semantics, which may deviate from the global semantics in long sentences. Therefore, to enhance the generalization of the model, we discover the real association by explicitly capturing visual features that are causally related to the question semantics and weakening the impact of local language semantics on question answering. The experimental results on two large causal VideoQA datasets verify that our proposed framework 1) improves the accuracy of the existing VideoQA backbone and 2) demonstrates robustness on complex scenes and questions. The code will be released at https://github.com/Chuanqi-Zang/Discovering-the-Real-Association.

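Original language: English
Pages (from-to): 19027-19036
Number of pages: 10
Journal: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
DOI: https://doi.org/10.1109/CVPR52729.2023.01824
Publication status: Published - 2023
Event: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 - Vancouver, Canada
Duration: 18 Jun 2023 to 22 Jun 2023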

Cite this

Zang, C., Wang, H., Pei, M., & Liang, W. (2023). Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 19027-19036. https://doi.org/10.1109/CVPR52729.2023.01824