TY - JOUR
T1 - Adversarial Multimodal Network for Movie Story Question Answering
AU - Yuan, Zhaoquan
AU - Sun, Siyuan
AU - Duan, Lixin
AU - Li, Changsheng
AU - Wu, Xiao
AU - Xu, Changsheng
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021
AB - Visual question answering using information from multiple modalities has attracted increasing attention in recent years. However, it remains a challenging task, as visual content and natural language have quite different statistical properties. In this work, we present the Adversarial Multimodal Network (AMN) to better understand video stories for question answering. In AMN, we learn multimodal feature representations by finding a more coherent subspace for video clips and the corresponding texts (e.g., subtitles and questions) based on generative adversarial networks. Moreover, a self-attention mechanism is developed to enforce a newly introduced consistency constraint that preserves the self-correlation among the visual cues of the original video clips in the learned multimodal representations. Extensive experiments on the benchmark MovieQA and TVQA datasets show the effectiveness of the proposed AMN over other published state-of-the-art methods.
KW - Movie question answering
KW - adversarial network
KW - multimodal understanding
UR - http://www.scopus.com/inward/record.url?scp=85087482936&partnerID=8YFLogxK
DO - 10.1109/TMM.2020.3002667
M3 - Article
AN - SCOPUS:85087482936
SN - 1520-9210
VL - 23
SP - 1744
EP - 1756
JF - IEEE Transactions on Multimedia
M1 - 9117168
ER -