TY - JOUR
T1 - Adaptive Latent Graph Representation Learning for Image-Text Matching
AU - Tian, Mengxiao
AU - Wu, Xinxiao
AU - Jia, Yunde
PY - 2023
Y1 - 2023
AB - Image-text matching is a challenging task due to the modality gap. Many recent methods focus on modeling entity relationships to learn a common embedding space of image and text. However, these methods suffer from distractions in entity relationships, such as irrelevant visual regions in an image and noisy textual words in a text. In this paper, we propose an adaptive latent graph representation learning method to reduce the distractions of entity relationships for image-text matching. Specifically, we use an improved graph variational autoencoder to separate the distracting factors from the latent factors of relationships and to jointly learn latent textual graph representations, latent visual graph representations, and a visual-textual graph embedding space. We also introduce an adaptive cross-attention mechanism that performs feature attending on the latent graph representations across images and texts, further narrowing the modality gap and boosting matching performance. Extensive experiments on two public datasets, Flickr30K and COCO, show the effectiveness of our method.
KW - Image-text matching
KW - graph variational autoencoder
KW - latent representation learning
UR - http://www.scopus.com/inward/record.url?scp=85146257145&partnerID=8YFLogxK
U2 - 10.1109/TIP.2022.3229631
DO - 10.1109/TIP.2022.3229631
M3 - Article
AN - SCOPUS:85146257145
SN - 1057-7149
VL - 32
SP - 471
EP - 482
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
ER -