Adaptive Latent Graph Representation Learning for Image-Text Matching

Mengxiao Tian; Xinxiao Wu; Yunde Jia

doi:10.1109/TIP.2022.3229631

Mengxiao Tian, Xinxiao Wu^*, Yunde Jia

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

11 Citations (Scopus)

Abstract

Image-text matching is a challenging task due to the modality gap. Many recent methods focus on modeling entity relationships to learn a common embedding space of image and text. However, these methods suffer from distractions of entity relationships such as irrelevant visual regions in an image and noisy textual words in a text. In this paper, we propose an adaptive latent graph representation learning method to reduce the distractions of entity relationships for image-text matching. Specifically, we use an improved graph variational autoencoder to separate the distracting factors and latent factor of relationships and jointly learn latent textual graph representations, latent visual graph representations, and a visual-textual graph embedding space. We also introduce an adaptive cross-attention mechanism to perform feature attending on the latent graph representations across images and texts, thus further narrowing the modality gap to boost the matching performance. Extensive experiments on two public datasets, Flickr30K and COCO, show the effectiveness of our method.

Original language	English
Pages (from-to)	471-482
Number of pages	12
Journal	IEEE Transactions on Image Processing
Volume	32
DOIs	https://doi.org/10.1109/TIP.2022.3229631
Publication status	Published - 2023

Keywords

Image-text matching
graph variational autoencoder
latent representation learning

Cite this

@article{d2264ba6b7ec4fdda339b5423433a2f4,

title = "Adaptive Latent Graph Representation Learning for Image-Text Matching",

abstract = "Image-text matching is a challenging task due to the modality gap. Many recent methods focus on modeling entity relationships to learn a common embedding space of image and text. However, these methods suffer from distractions of entity relationships such as irrelevant visual regions in an image and noisy textual words in a text. In this paper, we propose an adaptive latent graph representation learning method to reduce the distractions of entity relationships for image-text matching. Specifically, we use an improved graph variational autoencoder to separate the distracting factors and latent factor of relationships and jointly learn latent textual graph representations, latent visual graph representations, and a visual-textual graph embedding space. We also introduce an adaptive cross-attention mechanism to perform feature attending on the latent graph representations across images and texts, thus further narrowing the modality gap to boost the matching performance. Extensive experiments on two public datasets, Flickr30K and COCO, show the effectiveness of our method.",

keywords = "Image-text matching, graph variational autoencoder, latent representation learning",

author = "Mengxiao Tian and Xinxiao Wu and Yunde Jia",

note = "Publisher Copyright: {\textcopyright} 1992-2012 IEEE.",

year = "2023",

doi = "10.1109/TIP.2022.3229631",

language = "English",

volume = "32",

pages = "471--482",

journal = "IEEE Transactions on Image Processing",

issn = "1057-7149",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Adaptive Latent Graph Representation Learning for Image-Text Matching

AU - Tian, Mengxiao

AU - Wu, Xinxiao

AU - Jia, Yunde

PY - 2023

Y1 - 2023

AB - Image-text matching is a challenging task due to the modality gap. Many recent methods focus on modeling entity relationships to learn a common embedding space of image and text. However, these methods suffer from distractions of entity relationships such as irrelevant visual regions in an image and noisy textual words in a text. In this paper, we propose an adaptive latent graph representation learning method to reduce the distractions of entity relationships for image-text matching. Specifically, we use an improved graph variational autoencoder to separate the distracting factors and latent factor of relationships and jointly learn latent textual graph representations, latent visual graph representations, and a visual-textual graph embedding space. We also introduce an adaptive cross-attention mechanism to perform feature attending on the latent graph representations across images and texts, thus further narrowing the modality gap to boost the matching performance. Extensive experiments on two public datasets, Flickr30K and COCO, show the effectiveness of our method.

KW - Image-text matching

KW - graph variational autoencoder

KW - latent representation learning

UR - http://www.scopus.com/inward/record.url?scp=85146257145&partnerID=8YFLogxK

U2 - 10.1109/TIP.2022.3229631

DO - 10.1109/TIP.2022.3229631

M3 - Article

AN - SCOPUS:85146257145

SN - 1057-7149

VL - 32

SP - 471

EP - 482

JO - IEEE Transactions on Image Processing

JF - IEEE Transactions on Image Processing

ER -
