Adversarial Multimodal Network for Movie Story Question Answering

Zhaoquan Yuan, Siyuan Sun, Lixin Duan*, Changsheng Li*, Xiao Wu, Changsheng Xu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

16 Citations (Scopus)

Abstract

Visual question answering using information from multiple modalities has attracted increasing attention in recent years. However, it remains a very challenging task, as visual content and natural language have quite different statistical properties. In this work, we present a method called Adversarial Multimodal Network (AMN) to better understand video stories for question answering. In AMN, we propose to learn multimodal feature representations by finding a more coherent subspace for video clips and the corresponding texts (e.g., subtitles and questions) based on generative adversarial networks. Moreover, a self-attention mechanism is developed to enforce our newly introduced consistency constraint, which preserves the self-correlation among the visual cues of the original video clips in the learned multimodal representations. Extensive experiments on the benchmark MovieQA and TVQA datasets show the effectiveness of our proposed AMN over other published state-of-the-art methods.
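
The abstract outlines two components: an adversarial game that pulls video and text features into a shared subspace, and a self-attention mechanism regularized to preserve the self-correlation of the original video clips. The sketch below illustrates how such a setup could look in PyTorch; it is not the authors' implementation, and the module names, feature dimensions, and training loop (Encoder, ModalityDiscriminator, VIDEO_DIM, etc.) are illustrative assumptions rather than details from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

VIDEO_DIM, TEXT_DIM, COMMON_DIM = 2048, 300, 512  # assumed feature sizes


class Encoder(nn.Module):
    """Projects one modality into the shared (common) subspace."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, COMMON_DIM), nn.ReLU(),
                                 nn.Linear(COMMON_DIM, COMMON_DIM))

    def forward(self, x):
        return self.net(x)


class ModalityDiscriminator(nn.Module):
    """Predicts whether a common-space feature came from video (1) or text (0)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(COMMON_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, z):
        return self.net(z).squeeze(-1)


class SelfAttention(nn.Module):
    """Scaled dot-product self-attention over a sequence of clip features."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, clips, dim)
        attn = torch.softmax(self.q(x) @ self.k(x).transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        return attn @ self.v(x), attn


def consistency_loss(attn, raw_video):
    """Keep the learned attention map close to the self-correlation of the raw clips."""
    corr = torch.softmax(raw_video @ raw_video.transpose(1, 2), dim=-1)
    return F.mse_loss(attn, corr)


# Toy alternating update on random tensors (stand-ins for pre-extracted features).
video = torch.randn(4, 10, VIDEO_DIM)   # 4 stories x 10 clips
text = torch.randn(4, 10, TEXT_DIM)     # matching subtitle/question features

enc_v, enc_t = Encoder(VIDEO_DIM), Encoder(TEXT_DIM)
disc, attn_block = ModalityDiscriminator(), SelfAttention(COMMON_DIM)
opt_g = torch.optim.Adam(list(enc_v.parameters()) + list(enc_t.parameters())
                         + list(attn_block.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

for step in range(2):
    z_v, z_t = enc_v(video), enc_t(text)

    # Discriminator step: learn to tell the two modalities apart.
    d_loss = (F.binary_cross_entropy_with_logits(disc(z_v.detach()), torch.ones(4, 10))
              + F.binary_cross_entropy_with_logits(disc(z_t.detach()), torch.zeros(4, 10)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: confuse the discriminator while preserving self-correlation.
    # The attended features would feed a downstream answer scorer in a full model (omitted here).
    _, attn = attn_block(z_v)
    g_loss = (F.binary_cross_entropy_with_logits(disc(z_v), torch.zeros(4, 10))
              + F.binary_cross_entropy_with_logits(disc(z_t), torch.ones(4, 10))
              + consistency_loss(attn, video))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

The alternating update mirrors standard GAN training: the discriminator is fit on detached features, then the encoders are updated with flipped labels plus the consistency term, so the shared subspace becomes modality-invariant without discarding the structure of the original video clips.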

Original language: English
Article number: 9117168
Pages (from-to): 1744-1756
Number of pages: 13
Journal: IEEE Transactions on Multimedia
Volume: 23
DOIs
Publication status: Published - 2021

Keywords

  • Movie question answering
  • adversarial network
  • multimodal understanding
