LLaMA-Unidetector: An LLaMA-Based Universal Framework for Open-Vocabulary Object Detection in Remote Sensing Imagery

Jianlin Xie, Guanqun Wang*, Tong Zhang, Yikang Sun, He Chen, Yin Zhuang, Jun Li

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Object detection is a crucial computer vision task for remote sensing applications. However, traditional methods rely on predefined, trained object categories, which limits their applicability in open-world scenarios. A key challenge in open-vocabulary object detection is accurately identifying unseen objects: existing approaches often focus solely on locating objects and struggle to recognize the categories of previously unseen targets. To address this issue, we propose a novel benchmark in which models are trained on known base classes and evaluated on how well they detect and recognize unseen, or novel, classes. We also introduce LLaMA-Unidetector, a universal framework that incorporates textual information into a closed-set detector, enabling generalization to open-set scenarios. LLaMA-Unidetector uses a decoupled learning strategy that separates localization from recognition. In the first stage, a class-agnostic detector identifies objects, distinguishing only between foreground and background. In the second stage, the detected foreground objects are passed to TerraOV-LLM, a multimodal large language model (MLLM), for recognition, which draws on the strong generalization capabilities of large language models to infer the correct categories. We build TerraVQA, a visual question answering (VQA) dataset for remote sensing, and conduct extensive experiments on the NWPU-VHR10, DOTA1.0, and DIOR datasets. On the zero-shot detection benchmarks, LLaMA-Unidetector achieves 75.46% AP on NWPU-VHR10, 50.22% AP on DOTA1.0, and 51.38% AP on DIOR.
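The decoupled two-stage flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`detect_open_vocabulary`, `toy_propose`, `toy_recognize`, `min_objectness`, `IMAGE`) is hypothetical, the image is a toy grid of pixel values, and the two toy functions are simple stand-ins for the class-agnostic detector and for TerraOV-LLM. It only shows how localization and recognition can be kept as separate, swappable steps.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels
Image = List[List[int]]           # toy stand-in: 2D grid of pixel values


@dataclass
class Detection:
    box: Box
    objectness: float  # foreground score from stage 1
    label: str         # category name from stage 2


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by `box` out of the image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def detect_open_vocabulary(
    image: Image,
    propose: Callable[[Image], Sequence[Tuple[Box, float]]],
    recognize: Callable[[Image, Sequence[str]], str],
    vocabulary: Sequence[str],
    min_objectness: float = 0.5,  # assumed threshold, not from the paper
) -> List[Detection]:
    """Stage 1 localizes foreground regions; stage 2 names each one."""
    detections = []
    for box, score in propose(image):
        if score < min_objectness:
            continue  # treated as background
        label = recognize(crop(image, box), vocabulary)
        detections.append(Detection(box, score, label))
    return detections


def toy_propose(image: Image) -> List[Tuple[Box, float]]:
    """Stand-in for a class-agnostic detector: fixed boxes with scores."""
    return [((0, 0, 2, 2), 0.9), ((2, 2, 4, 4), 0.8), ((0, 2, 2, 4), 0.1)]


def toy_recognize(patch: Image, vocabulary: Sequence[str]) -> str:
    """Stand-in for the recognition model: picks a name by mean brightness."""
    pixels = [p for row in patch for p in row]
    mean = sum(pixels) / len(pixels)
    return vocabulary[0] if mean < 128 else vocabulary[1]


IMAGE: Image = [
    [10, 10, 0, 0],
    [10, 10, 0, 0],
    [0, 0, 200, 200],
    [0, 0, 200, 200],
]

if __name__ == "__main__":
    for det in detect_open_vocabulary(
        IMAGE, toy_propose, toy_recognize, ["airplane", "ship"]
    ):
        print(det)
```

Because the vocabulary is passed in only at recognition time, the category list can be changed without touching the localization step, which is the property the decoupled design relies on.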

Original language: English
Article number: 4409318
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 63
Publication status: Published - 2025

Keywords

  • Decoupled learning
  • open vocabulary
  • remote sensing object detection
