Efficient Grounding DINO: Efficient Cross-Modality Fusion and Efficient Label Assignment for Visual Grounding in Remote Sensing

Zibo Hu, Kun Gao*, Xiaodian Zhang, Zhijia Yang, Mingfeng Cai, Zhenyu Zhu*, Wei Li

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Visual grounding for remote sensing (RSVG) aims to detect objects in remote sensing scenes based on textual descriptions. Although existing methods perform well on RSVG datasets, they are limited to single-object predictions, making them unsuitable for multi-object candidate-category datasets. Open-set methods can be applied to both RSVG and candidate datasets, but their use in remote sensing remains rare. To bridge this gap, we introduce the open-set approach to RSVG and propose Efficient Grounding DINO, using Grounding DINO as a baseline. Open-set methods rely on two key modules: cross-modality fusion and label assignment. Existing cross-modality fusion methods update text and multi-scale visual features simultaneously, which hampers the model's ability to generalize across different texts and increases learning complexity. Moreover, existing RSVG methods predict a single object, so the prediction can be used directly as the positive sample for loss calculation, whereas open-set methods for multiple objects require one-to-one matching to assign positive and negative samples; background interference in RSVG datasets then causes frequent misassignments, slowing model convergence. We address these issues with two innovations. The multi-scale image-to-text fusion module (MSITFM) updates text features with self-attention so that they remain independent of visual features, and fuses multi-scale visual features with scale-specific cross-attention to reduce learning complexity, yielding a 3% reduction in parameters and a 21.6% reduction in GFLOPs. Text confidence matching (TCM) incorporates IoU-based confidence into label assignment to reduce mismatches and improve model performance. Experiments on the DIOR-RSVG, RSVG-HR, and DOTA datasets validate the effectiveness of our approach.
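
The abstract describes MSITFM at a high level: text features are updated only through self-attention, so they stay independent of the visual stream, while each visual scale fuses text information through its own cross-attention. The sketch below is a rough illustration of that idea in PyTorch; the class name, dimensions, residual connection, and module layout are assumptions made for illustration and are not taken from the paper's implementation.

```python
# Illustrative sketch only (not the authors' released code): a minimal
# MSITFM-style fusion layer, assuming PyTorch and hypothetical dimensions.
import torch
import torch.nn as nn


class MSITFMSketch(nn.Module):
    """Text updates via self-attention only (staying image-independent);
    each visual scale fuses text via its own cross-attention module."""

    def __init__(self, dim=256, num_heads=8, num_scales=4):
        super().__init__()
        # Text branch: self-attention keeps text features independent of the image.
        self.text_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Visual branch: one scale-specific cross-attention per feature-map level.
        self.img2text_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_scales)]
        )

    def forward(self, text_feats, visual_feats_per_scale):
        # text_feats: (B, L, dim); visual_feats_per_scale: list of (B, N_s, dim)
        text_out, _ = self.text_self_attn(text_feats, text_feats, text_feats)
        fused_scales = []
        for attn, vis in zip(self.img2text_attn, visual_feats_per_scale):
            # Visual tokens (queries) attend to text tokens (keys/values):
            # image-to-text fusion, applied separately at each scale.
            fused, _ = attn(vis, text_out, text_out)
            fused_scales.append(fused + vis)  # residual connection (assumption)
        return text_out, fused_scales


# Toy usage with random tensors, just to show the shapes involved.
if __name__ == "__main__":
    B, L, dim = 2, 12, 256
    module = MSITFMSketch(dim=dim, num_scales=4)
    text = torch.randn(B, L, dim)
    visual = [torch.randn(B, n, dim) for n in (1024, 256, 64, 16)]
    new_text, new_visual = module(text, visual)
    print(new_text.shape, [v.shape for v in new_visual])
```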

Original language: English
Article number: 5609414
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 63
DOIs
Publication status: Published - 2025
Externally published: Yes

Keywords

  • Cross-modality fusion module
  • misassignment
  • multi-scale image-to-text fusion module (MSITFM)
  • text confidence matching (TCM)
  • visual grounding for remote sensing (RSVG)
