TY - JOUR
T1 - Adaptive Latent Graph Representation Learning for Image-Text Matching
AU - Tian, Mengxiao
AU - Wu, Xinxiao
AU - Jia, Yunde
PY - 2023
Y1 - 2023
AB - Image-text matching is a challenging task due to the modality gap. Many recent methods focus on modeling entity relationships to learn a common embedding space of image and text. However, these methods suffer from distractions in entity relationships, such as irrelevant visual regions in an image and noisy textual words in a text. In this paper, we propose an adaptive latent graph representation learning method to reduce the distractions of entity relationships for image-text matching. Specifically, we use an improved graph variational autoencoder to separate the distracting factors from the latent factors of relationships and to jointly learn latent textual graph representations, latent visual graph representations, and a visual-textual graph embedding space. We also introduce an adaptive cross-attention mechanism that performs feature attending on the latent graph representations across images and texts, further narrowing the modality gap and boosting matching performance. Extensive experiments on two public datasets, Flickr30K and COCO, show the effectiveness of our method.
KW - Image-text matching
KW - graph variational autoencoder
KW - latent representation learning
UR - http://www.scopus.com/inward/record.url?scp=85146257145&partnerID=8YFLogxK
U2 - 10.1109/TIP.2022.3229631
DO - 10.1109/TIP.2022.3229631
M3 - Article
AN - SCOPUS:85146257145
SN - 1057-7149
VL - 32
SP - 471
EP - 482
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
ER -