Multi-scale image–text matching network for scene and spatio-temporal images

Runde Yu, Fusheng Jin*, Zhuang Qiao, Ye Yuan, Guoren Wang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

In recent years, with the development of deep learning technology, computer vision and natural language processing have made significant progress, and establishing the relationship between the two fields has attracted increasing attention. Our main focus is on spatio-temporal images taken by satellites or aircraft and on scene images containing people and other objects. Existing methods have yielded excellent results in image–text matching, but there is still room for improvement in effectively using coarse- and fine-grained information. We propose a method that addresses this problem using multi-scale graph convolutional neural networks. We extract multi-scale features of images and texts separately for matching. Global and local matching are used to compute the overall image–sentence similarity and the local region–word similarity. Local matching is divided into two stages: first, node-level matching learns the correspondence between regions and words; then, structure-level matching learns the correspondence between regions and phrases to make the matching more comprehensive. Finally, we validate our model on the Flickr30k, MSCOCO and RSICD datasets.
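The following is a minimal sketch, not the authors' implementation, of the matching scheme the abstract describes: a global image–sentence similarity combined with a two-stage local similarity (node level: region–word; structure level: region–phrase, where phrases are formed by a simple graph-convolution step over word nodes). The module name, feature dimensions, adjacency construction, and equal fusion weights are all illustrative assumptions.

# Minimal sketch of multi-scale image-text matching (assumed design, PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleMatcher(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # One graph-convolution-style layer that mixes each word with its
        # neighbours (given by an adjacency matrix) to form phrase-level nodes.
        self.phrase_gcn = nn.Linear(dim, dim)

    @staticmethod
    def global_similarity(img_regions, txt_words):
        # Global matching: pool regions/words into one vector each, then cosine.
        img_global = F.normalize(img_regions.mean(dim=1), dim=-1)   # (B, D)
        txt_global = F.normalize(txt_words.mean(dim=1), dim=-1)     # (B, D)
        return (img_global * txt_global).sum(dim=-1)                # (B,)

    @staticmethod
    def node_similarity(img_regions, txt_words):
        # Node-level local matching: region-word cosine similarities; each word
        # takes its best-matching region, averaged over words.
        r = F.normalize(img_regions, dim=-1)                        # (B, R, D)
        w = F.normalize(txt_words, dim=-1)                          # (B, W, D)
        sim = torch.bmm(w, r.transpose(1, 2))                       # (B, W, R)
        return sim.max(dim=-1).values.mean(dim=-1)                  # (B,)

    def structure_similarity(self, img_regions, txt_words, word_adj):
        # Structure-level local matching: aggregate neighbouring words into
        # phrase nodes with one GCN step, then match regions against phrases.
        phrases = F.relu(self.phrase_gcn(torch.bmm(word_adj, txt_words)))
        return self.node_similarity(img_regions, phrases)

    def forward(self, img_regions, txt_words, word_adj):
        # Fuse the three scales; equal weights here are an arbitrary choice.
        return (self.global_similarity(img_regions, txt_words)
                + self.node_similarity(img_regions, txt_words)
                + self.structure_similarity(img_regions, txt_words, word_adj)) / 3.0


if __name__ == "__main__":
    B, R, W, D = 2, 36, 12, 256          # batch, regions, words, feature dim
    model = MultiScaleMatcher(dim=D)
    regions = torch.randn(B, R, D)       # e.g. detector region features
    words = torch.randn(B, W, D)         # e.g. RNN/Transformer word features
    adj = torch.eye(W).unsqueeze(0).repeat(B, 1, 1)  # toy word-graph adjacency
    print(model(regions, words, adj).shape)          # torch.Size([2])

In practice the word-graph adjacency would come from a dependency parse or learned attention rather than the identity matrix used in this toy example.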

Original language: English
Pages (from-to): 292-300
Number of pages: 9
Journal: Future Generation Computer Systems
Volume: 142
DOI
Publication status: Published - May 2023
