TY - GEN
T1 - Image Semantic Feature Multiple Interactive Network for Remote Sensing Image Captioning
AU - Hou, Junzhu
AU - Li, Wei
AU - Li, Yang
AU - Li, Qiaoyi
AU - Cheng, Qiyuan
AU - Wang, Zhengjie
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.
PY - 2024
Y1 - 2024
N2 - Remote sensing image captioning is widely used in disaster warning, disaster rescue, geographic positioning, and other fields because it takes remote sensing images as input and outputs accurate, comprehensive, and fluent text. Traditional remote sensing image captioning usually uses a convolutional neural network (CNN) as the encoder to extract image features and a recurrent neural network (RNN) as the decoder to generate text. However, the image features extracted by the CNN encoder lack semantic information that corresponds directly to the text, and the RNN decoder cannot make full use of the features extracted by the encoder, so the generated text is neither accurate nor rich enough. To address these two problems, we propose an image semantic feature multiple interactive network based on the encoder-decoder model. We use the pre-trained image encoder of CLIP as our remote sensing image semantic feature extraction network, narrowing the modal gap between input images and output text by extracting features that are highly sensitive to image semantic information. The multiple interactive network serves as our decoder. To prevent feature redundancy, we add a gated recurrent unit (GRU) network to the multiple interactive network so that the features are fully interacted with and utilized. Experimental results show that our proposed network generates richer, more accurate, and more comprehensive text than the comparison methods.
KW - Pre-trained model
KW - Remote sensing image captioning
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85209826171&partnerID=8YFLogxK
U2 - 10.1007/978-981-97-8658-9_7
DO - 10.1007/978-981-97-8658-9_7
M3 - Conference contribution
AN - SCOPUS:85209826171
SN - 9789819786572
T3 - Lecture Notes in Electrical Engineering
SP - 63
EP - 74
BT - Proceedings of 2024 Chinese Intelligent Systems Conference
A2 - Jia, Yingmin
A2 - Zhang, Weicun
A2 - Fu, Yongling
A2 - Yang, Huihua
PB - Springer Science and Business Media Deutschland GmbH
T2 - 20th Chinese Intelligent Systems Conference, CISC 2024
Y2 - 26 October 2024 through 27 October 2024
ER -