Adversarial Multimodal Network for Movie Story Question Answering

Zhaoquan Yuan, Siyuan Sun, Lixin Duan*, Changsheng Li*, Xiao Wu, Changsheng Xu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

16 Citations (Scopus)

Abstract

Visual question answering using information from multiple modalities has attracted increasing attention in recent years. However, it remains a very challenging task, as visual content and natural language have quite different statistical properties. In this work, we present a method called Adversarial Multimodal Network (AMN) to better understand video stories for question answering. In AMN, we propose to learn multimodal feature representations by finding a more coherent subspace for video clips and the corresponding texts (e.g., subtitles and questions) based on generative adversarial networks. Moreover, a self-attention mechanism is developed to enforce our newly introduced consistency constraint, which preserves the self-correlation among the visual cues of the original video clips in the learned multimodal representations. Extensive experiments on the benchmark MovieQA and TVQA datasets show the effectiveness of our proposed AMN over other published state-of-the-art methods.
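
The abstract outlines two components: an adversarial game that pulls video and text features into a shared subspace, and a self-attention mechanism regularized to preserve the self-correlation of the original video clips. The sketch below illustrates how such a setup could look in PyTorch; it is not the authors' implementation, and the module names, feature dimensions, and training loop (Encoder, ModalityDiscriminator, VIDEO_DIM, etc.) are illustrative assumptions rather than details from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

VIDEO_DIM, TEXT_DIM, COMMON_DIM = 2048, 300, 512  # assumed feature sizes


class Encoder(nn.Module):
    """Projects one modality into the shared (common) subspace."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, COMMON_DIM), nn.ReLU(),
                                 nn.Linear(COMMON_DIM, COMMON_DIM))

    def forward(self, x):
        return self.net(x)


class ModalityDiscriminator(nn.Module):
    """Predicts whether a common-space feature came from video (1) or text (0)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(COMMON_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, z):
        return self.net(z).squeeze(-1)


class SelfAttention(nn.Module):
    """Scaled dot-product self-attention over a sequence of clip features."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, clips, dim)
        attn = torch.softmax(self.q(x) @ self.k(x).transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        return attn @ self.v(x), attn


def consistency_loss(attn, raw_video):
    """Keep the learned attention map close to the self-correlation of the raw clips."""
    corr = torch.softmax(raw_video @ raw_video.transpose(1, 2), dim=-1)
    return F.mse_loss(attn, corr)


# Toy alternating update on random tensors (stand-ins for pre-extracted features).
video = torch.randn(4, 10, VIDEO_DIM)   # 4 stories x 10 clips
text = torch.randn(4, 10, TEXT_DIM)     # matching subtitle/question features

enc_v, enc_t = Encoder(VIDEO_DIM), Encoder(TEXT_DIM)
disc, attn_block = ModalityDiscriminator(), SelfAttention(COMMON_DIM)
opt_g = torch.optim.Adam(list(enc_v.parameters()) + list(enc_t.parameters())
                         + list(attn_block.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

for step in range(2):
    z_v, z_t = enc_v(video), enc_t(text)

    # Discriminator step: learn to tell the two modalities apart.
    d_loss = (F.binary_cross_entropy_with_logits(disc(z_v.detach()), torch.ones(4, 10))
              + F.binary_cross_entropy_with_logits(disc(z_t.detach()), torch.zeros(4, 10)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: confuse the discriminator while preserving self-correlation.
    # The attended features would feed a downstream answer scorer in a full model (omitted here).
    _, attn = attn_block(z_v)
    g_loss = (F.binary_cross_entropy_with_logits(disc(z_v), torch.zeros(4, 10))
              + F.binary_cross_entropy_with_logits(disc(z_t), torch.ones(4, 10))
              + consistency_loss(attn, video))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

The alternating update mirrors standard GAN training: the discriminator is fit on detached features, then the encoders are updated with flipped labels plus the consistency term, so the shared subspace becomes modality-invariant without discarding the structure of the original video clips.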

Original language: English
Article number: 9117168
Pages (from-to): 1744-1756
Number of pages: 13
Journal: IEEE Transactions on Multimedia
Volume: 23
DOIs
Publication status: Published - 2021

Keywords

  • Movie question answering
  • adversarial network
  • multimodal understanding
