Popeye: A Unified Visual-Language Model for Multisource Ship Detection From Remote Sensing Imagery

Wei Zhang, Miaoxin Cai, Tong Zhang, Guoqiang Lei, Yin Zhuang*, Xuerui Mao*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Ship detection requires identifying ship locations in remote sensing (RS) scenes. Owing to different imaging payloads, the varied appearances of ships, and complex background interference in the bird's-eye view, it is difficult to establish a unified paradigm for multisource ship detection. To address this challenge, this article proposes Popeye, a unified visual-language model for multisource ship detection from RS imagery that leverages the powerful generalization ability of large language models. Specifically, to bridge the interpretation gap across multisource images for ship detection, a novel unified labeling paradigm is designed to integrate different visual modalities and the various ship detection formats, i.e., horizontal bounding boxes and oriented bounding boxes. Subsequently, a hybrid-experts encoder is designed to refine multiscale visual features, thereby enhancing visual perception. A visual-language alignment method is then developed for Popeye to strengthen interactive comprehension between visual and language content. Furthermore, an instruction adaptation mechanism is proposed to transfer the pretrained visual-language knowledge from natural scenes into the RS domain for multisource ship detection. In addition, the segment anything model is seamlessly integrated into Popeye to achieve pixel-level ship segmentation without additional training cost. Finally, extensive experiments are conducted on the newly constructed ship instruction dataset named MMShip, and the results indicate that Popeye outperforms current specialist, open-vocabulary, and other visual-language models on zero-shot multisource ship detection tasks.

Original language: English
Pages (from-to): 20050-20063
Number of pages: 14
Journal: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Volume: 17
DOI: https://doi.org/10.1109/JSTARS.2024.3488034
Publication status: Published - 2024

Keywords

  • Multisource imagery
  • natural language interaction
  • ship detection
  • visual-language alignment

Cite this

Zhang, W., Cai, M., Zhang, T., Lei, G., Zhuang, Y., & Mao, X. (2024). Popeye: A Unified Visual-Language Model for Multisource Ship Detection From Remote Sensing Imagery. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17, 20050-20063. https://doi.org/10.1109/JSTARS.2024.3488034