TY - JOUR
T1 - Efficient Grounding DINO
T2 - Efficient Cross-Modality Fusion and Efficient Label Assignment for Visual Grounding in Remote Sensing
AU - Hu, Zibo
AU - Gao, Kun
AU - Zhang, Xiaodian
AU - Yang, Zhijia
AU - Cai, Mingfeng
AU - Zhu, Zhenyu
AU - Li, Wei
N1 - Publisher Copyright: © 1980-2012 IEEE.
PY - 2025
Y1 - 2025
AB - Visual grounding for remote sensing (RSVG) aims to detect objects in remote sensing scenes based on textual descriptions. While existing methods perform well on RSVG datasets, they are limited to single-object predictions, making them unsuitable for multi-object candidate category datasets. Open-set methods can be applied to both RSVG and candidate datasets, but their use in remote sensing remains rare. To bridge this gap, we introduce the open-set approach to RSVG and propose Efficient Grounding DINO, using Grounding DINO as a baseline. Open-set methods rely on two key modules: cross-modality fusion and label assignment. Existing cross-modality fusion methods simultaneously update text and multi-scale visual features, which hampers the model's ability to generalize under different texts and increases learning complexity. Existing methods predict a single object, allowing direct use as a positive example for loss calculation, while open-set methods for multi-objects require one-to-one matching to assign positive and negative samples. However, background interference in the RSVG datasets causes frequent misassignments, slowing model convergence. We address these issues with two innovations: the multi-scale image-to-text fusion module (MSITFM), which updates text features using self-attention to maintain independence from visual features and employs scale-specific cross-attention for multi-scale visual feature fusion to reduce learning complexity, achieving a 3% parameter and 21.6% GFLOPs reduction. Text confidence matching (TCM) incorporates IoU-based confidence into label assignment to reduce mismatches and enhance model performance. Experiments on DIOR-RSVG, RSVG-HR, and DOTA datasets validate the effectiveness of our approach.
KW - Cross-modality fusion module
KW - misassignment
KW - multi-scale image-to-text fusion module (MSITFM)
KW - text confidence matching (TCM)
KW - visual grounding for remote sensing (RSVG)
UR - http://www.scopus.com/inward/record.url?scp=85216611409&partnerID=8YFLogxK
U2 - 10.1109/TGRS.2025.3536015
DO - 10.1109/TGRS.2025.3536015
M3 - Article
AN - SCOPUS:85216611409
SN - 0196-2892
VL - 63
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5609414
ER -