Adversarial Multimodal Network for Movie Story Question Answering

Zhaoquan Yuan, Siyuan Sun, Lixin Duan*, Changsheng Li*, Xiao Wu, Changsheng Xu

*Corresponding authors of this work

Research output: Contribution to journal › Article › peer-review

16 Citations (Scopus)

Abstract

Visual question answering using information from multiple modalities has attracted increasing attention in recent years. However, it is a very challenging task, as visual content and natural language have quite different statistical properties. In this work, we present a method called Adversarial Multimodal Network (AMN) to better understand video stories for question answering. In AMN, we propose to learn multimodal feature representations by finding a more coherent subspace for video clips and the corresponding texts (e.g., subtitles and questions) based on generative adversarial networks. Moreover, a self-attention mechanism is developed to enforce our newly introduced consistency constraint in order to preserve the self-correlation between the visual cues of the original video clips in the learned multimodal representations. Extensive experiments on the benchmark MovieQA and TVQA datasets show the effectiveness of our proposed AMN over other published state-of-the-art methods.
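To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (a) adversarially aligning video and text features in a shared subspace and (b) a self-attention module with a consistency term over the original video-clip features. All module names, dimensions, and loss formulations here are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative sketch of GAN-style multimodal alignment and a self-attention
# consistency term, loosely following the abstract. Dimensions, losses, and
# module names are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Maps modality-specific features (video or text) into a shared subspace."""
    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, x):
        return self.net(x)


class ModalityDiscriminator(nn.Module):
    """Predicts whether a shared-subspace feature came from video (1) or text (0)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2), nn.ReLU(),
            nn.Linear(hidden_dim // 2, 1),
        )

    def forward(self, z):
        return self.net(z)


def adversarial_losses(video_z, text_z, discriminator):
    """Standard GAN-style losses: the discriminator separates modalities,
    while the encoders try to make them indistinguishable."""
    bce = nn.BCEWithLogitsLoss()
    v_logit = discriminator(video_z.detach())
    t_logit = discriminator(text_z.detach())
    d_loss = bce(v_logit, torch.ones_like(v_logit)) + \
             bce(t_logit, torch.zeros_like(t_logit))
    # Encoder (generator) side: flip the labels to fool the discriminator.
    v_logit_g = discriminator(video_z)
    t_logit_g = discriminator(text_z)
    g_loss = bce(v_logit_g, torch.zeros_like(v_logit_g)) + \
             bce(t_logit_g, torch.ones_like(t_logit_g))
    return d_loss, g_loss


# Self-attention over video-clip features; the consistency term below is one
# possible way to keep the attended representation correlated with the
# original clip features (an assumption, not the paper's exact constraint).
attention = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

def consistency_loss(video_feats):
    attended, _ = attention(video_feats, video_feats, video_feats)
    return F.mse_loss(attended, video_feats)
```

In a training loop, the discriminator loss and the encoder (generator) loss would typically be optimized in alternation, with the consistency term added to the encoder objective.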

Original language: English
Article number: 9117168
Pages (from-to): 1744-1756
Number of pages: 13
Journal: IEEE Transactions on Multimedia
Volume: 23
DOI
Publication status: Published - 2021
