Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering

Chuanqi Zang; Hanqing Wang; Mingtao Pei; Wei Liang

doi:10.1109/CVPR52729.2023.01824

Chuanqi Zang^*, Hanqing Wang, Mingtao Pei, Wei Liang^*

^*Corresponding author for this work

School of Computer Science

Beijing Institute of Technology

Research output: Contribution to journal › Conference article › Peer-reviewed

23 citations (Scopus)

Abstract

Video Question Answering (VideoQA) is challenging because it requires capturing accurate correlations between modalities from redundant information. Recent methods focus on the explicit challenges of the task, e.g. multimodal feature extraction, video-text alignment, and fusion. Their frameworks infer the answer by relying on statistical evidence, which ignores potential bias in the multimodal data. In our work, we investigate relational structure from a causal representation perspective on multimodal data and propose a novel inference framework. For visual data, question-irrelevant objects may establish simple matching associations with the answer. For textual data, the model prefers local phrase semantics, which may deviate from the global semantics in long sentences. Therefore, to enhance the generalization of the model, we discover the real association by explicitly capturing visual features that are causally related to the question semantics and weakening the impact of local language semantics on question answering. Experimental results on two large causal VideoQA datasets verify that our proposed framework 1) improves the accuracy of the existing VideoQA backbone and 2) demonstrates robustness on complex scenes and questions. The code will be released at https://github.com/Chuanqi-Zang/Discovering-the-Real-Association.

Original language	English
Pages (from-to)	19027-19036
Number of pages	10
Journal	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
DOI	https://doi.org/10.1109/CVPR52729.2023.01824
Publication status	Published - 2023
Event	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 - Vancouver, Canada; Duration: 18 Jun 2023 → 22 Jun 2023

Cite this

Zang, C., Wang, H., Pei, M., & Liang, W. (2023). Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 19027-19036. https://doi.org/10.1109/CVPR52729.2023.01824

@article{d5bbd65747dc43acb9c278ddb6afcde9,

title = "Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering",

abstract = "Video Question Answering (VideoQA) is challenging as it requires capturing accurate correlations between modalities from redundant information. Recent methods focus on the explicit challenges of the task, e.g. multimodal feature extraction, video-text alignment and fusion. Their frameworks reason the answer relying on statistical evidence causes, which ignores potential bias in the multimodal data. In our work, we investigate relational structure from a causal representation perspective on multimodal data and propose a novel inference framework. For visual data, question-irrelevant objects may establish simple matching associations with the answer. For textual data, the model prefers the local phrase semantics which may deviate from the global semantics in long sentences. Therefore, to enhance the generalization of the model, we discover the real association by explicitly capturing visual features that are causally related to the question semantics and weakening the impact of local language semantics on question answering. The experimental results on two large causal VideoQA datasets verify that our proposed framework 1) improves the accuracy of the existing VideoQA backbone, 2) demonstrates robustness on complex scenes and questions. The code will be released at https://github.com/Chuanqi-Zang/Discovering-the-Real-Association.",

keywords = "language, reasoning, Vision",

author = "Chuanqi Zang and Hanqing Wang and Mingtao Pei and Wei Liang",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 ; Conference date: 18-06-2023 Through 22-06-2023",

year = "2023",

doi = "10.1109/CVPR52729.2023.01824",

language = "English",

pages = "19027--19036",

journal = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",

issn = "1063-6919",

publisher = "IEEE Computer Society",

}

TY - JOUR

T1 - Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering

T2 - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023

AU - Zang, Chuanqi

AU - Wang, Hanqing

AU - Pei, Mingtao

AU - Liang, Wei

PY - 2023

Y1 - 2023

N2 - Video Question Answering (VideoQA) is challenging as it requires capturing accurate correlations between modalities from redundant information. Recent methods focus on the explicit challenges of the task, e.g. multimodal feature extraction, video-text alignment and fusion. Their frameworks reason the answer relying on statistical evidence causes, which ignores potential bias in the multimodal data. In our work, we investigate relational structure from a causal representation perspective on multimodal data and propose a novel inference framework. For visual data, question-irrelevant objects may establish simple matching associations with the answer. For textual data, the model prefers the local phrase semantics which may deviate from the global semantics in long sentences. Therefore, to enhance the generalization of the model, we discover the real association by explicitly capturing visual features that are causally related to the question semantics and weakening the impact of local language semantics on question answering. The experimental results on two large causal VideoQA datasets verify that our proposed framework 1) improves the accuracy of the existing VideoQA backbone, 2) demonstrates robustness on complex scenes and questions. The code will be released at https://github.com/Chuanqi-Zang/Discovering-the-Real-Association.

KW - language

KW - reasoning

KW - Vision

UR - http://www.scopus.com/inward/record.url?scp=85172438569&partnerID=8YFLogxK

U2 - 10.1109/CVPR52729.2023.01824

DO - 10.1109/CVPR52729.2023.01824

M3 - Conference article

AN - SCOPUS:85172438569

SN - 1063-6919

SP - 19027

EP - 19036

JO - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

JF - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Y2 - 18 June 2023 through 22 June 2023

ER -
