Multi-scale image–text matching network for scene and spatio-temporal images

Runde Yu, Fusheng Jin*, Zhuang Qiao, Ye Yuan, Guoren Wang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

In recent years, with the development of deep learning technology, computer vision and natural language processing have made significant progress, and establishing the relationship between the two fields has attracted increasing attention. Our main focus is on spatio-temporal images taken by satellites or aircraft and on scene images containing people and other objects. Existing methods have yielded excellent results in image–text matching, but there is still room for improvement in effectively using coarse- and fine-grained information. We propose a method that addresses this problem using multi-scale graph convolutional neural networks. We extract multi-scale features of images and texts separately for matching. Global and local matching are used to compute the overall image–sentence similarity and the local region–word similarity. Local matching is divided into two stages: first, node-level matching learns the correspondence between regions and words; then, structure-level matching learns the correspondence between regions and phrases to make the matching more comprehensive. Finally, we validate our model on the Flickr30k, MSCOCO and RSICD datasets.
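The following is a minimal sketch, not the authors' implementation, of the matching scheme the abstract describes: a global image–sentence similarity combined with a two-stage local similarity (node level: region–word; structure level: region–phrase, where phrases are formed by a simple graph-convolution step over word nodes). The module name, feature dimensions, adjacency construction, and equal fusion weights are all illustrative assumptions.

# Minimal sketch of multi-scale image-text matching (assumed design, PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleMatcher(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # One graph-convolution-style layer that mixes each word with its
        # neighbours (given by an adjacency matrix) to form phrase-level nodes.
        self.phrase_gcn = nn.Linear(dim, dim)

    @staticmethod
    def global_similarity(img_regions, txt_words):
        # Global matching: pool regions/words into one vector each, then cosine.
        img_global = F.normalize(img_regions.mean(dim=1), dim=-1)   # (B, D)
        txt_global = F.normalize(txt_words.mean(dim=1), dim=-1)     # (B, D)
        return (img_global * txt_global).sum(dim=-1)                # (B,)

    @staticmethod
    def node_similarity(img_regions, txt_words):
        # Node-level local matching: region-word cosine similarities; each word
        # takes its best-matching region, averaged over words.
        r = F.normalize(img_regions, dim=-1)                        # (B, R, D)
        w = F.normalize(txt_words, dim=-1)                          # (B, W, D)
        sim = torch.bmm(w, r.transpose(1, 2))                       # (B, W, R)
        return sim.max(dim=-1).values.mean(dim=-1)                  # (B,)

    def structure_similarity(self, img_regions, txt_words, word_adj):
        # Structure-level local matching: aggregate neighbouring words into
        # phrase nodes with one GCN step, then match regions against phrases.
        phrases = F.relu(self.phrase_gcn(torch.bmm(word_adj, txt_words)))
        return self.node_similarity(img_regions, phrases)

    def forward(self, img_regions, txt_words, word_adj):
        # Fuse the three scales; equal weights here are an arbitrary choice.
        return (self.global_similarity(img_regions, txt_words)
                + self.node_similarity(img_regions, txt_words)
                + self.structure_similarity(img_regions, txt_words, word_adj)) / 3.0


if __name__ == "__main__":
    B, R, W, D = 2, 36, 12, 256          # batch, regions, words, feature dim
    model = MultiScaleMatcher(dim=D)
    regions = torch.randn(B, R, D)       # e.g. detector region features
    words = torch.randn(B, W, D)         # e.g. RNN/Transformer word features
    adj = torch.eye(W).unsqueeze(0).repeat(B, 1, 1)  # toy word-graph adjacency
    print(model(regions, words, adj).shape)          # torch.Size([2])

In practice the word-graph adjacency would come from a dependency parse or learned attention rather than the identity matrix used in this toy example.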

Original language: English
Pages (from-to): 292-300
Number of pages: 9
Journal: Future Generation Computer Systems
Volume: 142
DOI
Publication status: Published - May 2023
