TY - JOUR
T1 - LLaMA-Unidetector
T2 - An LLaMA-Based Universal Framework for Open-Vocabulary Object Detection in Remote Sensing Imagery
AU - Xie, Jianlin
AU - Wang, Guanqun
AU - Zhang, Tong
AU - Sun, Yikang
AU - Chen, He
AU - Zhuang, Yin
AU - Li, Jun
N1 - Publisher Copyright:
© 1980-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Object detection is a crucial task in computer vision for remote sensing applications. However, traditional methods rely on predefined, trained object categories, which limits their applicability in open-world scenarios. A key challenge in open-vocabulary object detection lies in accurately identifying unseen objects. Existing approaches often focus solely on detecting object locations and struggle to recognize the categories of previously unseen targets. To address this issue, we propose a novel benchmark in which models are trained on known base classes and evaluated on their ability to detect and recognize unseen or novel classes. To this end, we introduce LLaMA-Unidetector, a universal framework that incorporates textual information into a closed-set detector, enabling generalization to open-set scenarios. LLaMA-Unidetector leverages a decoupled learning strategy that separates localization from recognition. In the first stage, a class-agnostic detector identifies objects, distinguishing only between foreground and background. In the second stage, the detected foreground objects are passed through TerraOV-LLM, a multimodal large language model (MLLM), for recognition, exploiting the strong generalization capabilities of large language models to infer the correct categories. We also construct a visual question answering (VQA) remote sensing dataset, TerraVQA, and conduct extensive experiments on the NWPU-VHR10, DOTA1.0, and DIOR datasets. LLaMA-Unidetector achieves impressive results, with 75.46% AP, 50.22% AP, and 51.38% AP on the zero-shot detection benchmarks for NWPU-VHR10, DOTA1.0, and DIOR, respectively.
AB - Object detection is a crucial task in computer vision for remote sensing applications. However, traditional methods rely on predefined, trained object categories, which limits their applicability in open-world scenarios. A key challenge in open-vocabulary object detection lies in accurately identifying unseen objects. Existing approaches often focus solely on detecting object locations and struggle to recognize the categories of previously unseen targets. To address this issue, we propose a novel benchmark in which models are trained on known base classes and evaluated on their ability to detect and recognize unseen or novel classes. To this end, we introduce LLaMA-Unidetector, a universal framework that incorporates textual information into a closed-set detector, enabling generalization to open-set scenarios. LLaMA-Unidetector leverages a decoupled learning strategy that separates localization from recognition. In the first stage, a class-agnostic detector identifies objects, distinguishing only between foreground and background. In the second stage, the detected foreground objects are passed through TerraOV-LLM, a multimodal large language model (MLLM), for recognition, exploiting the strong generalization capabilities of large language models to infer the correct categories. We also construct a visual question answering (VQA) remote sensing dataset, TerraVQA, and conduct extensive experiments on the NWPU-VHR10, DOTA1.0, and DIOR datasets. LLaMA-Unidetector achieves impressive results, with 75.46% AP, 50.22% AP, and 51.38% AP on the zero-shot detection benchmarks for NWPU-VHR10, DOTA1.0, and DIOR, respectively.
KW - Decoupled learning
KW - open vocabulary
KW - remote sensing object detection
UR - http://www.scopus.com/inward/record.url?scp=105003693780&partnerID=8YFLogxK
U2 - 10.1109/TGRS.2025.3564332
DO - 10.1109/TGRS.2025.3564332
M3 - Article
AN - SCOPUS:105003693780
SN - 0196-2892
VL - 63
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 4409318
ER -