Test-Time Candidate-Aware Dual Refinement for Remote Sensing Image–Text Retrieval

  • Bofan Zhang
  • Hao Wu*
  • *Corresponding author for this work
  • Beijing Institute of Technology

Research output: Contribution to journal, Article, peer-review

Abstract

Remote sensing image–text retrieval (RSITR) aims to achieve efficient bidirectional matching between visual content and textual descriptions in large-scale remote sensing databases. It faces a fundamental challenge: the severe information asymmetry between sparse, abstract captions and dense, multi-scale overhead imagery. Prior work predominantly focuses on learning static cross-modal representations during training; because inference is then frozen, these models cannot dynamically compensate for missing details or resolve visual ambiguities in heterogeneous scenes, which limits how far they can bridge the asymmetry. To overcome this limitation, we propose CADRE (Test-Time Candidate-Aware Dual Refinement), a retrieval-backbone-agnostic framework that exploits retrieved candidates as feedback for bidirectional alignment. Operating on a novel Inject-and-Suppress paradigm, CADRE comprises two complementary modules. First, the Visual-Context Injection (VCI) module addresses textual sparsity: an adaptive filtering mechanism mines hierarchical visual evidence from high-confidence candidates, and a domain-adapted Multimodal Large Language Model (MLLM) injects that evidence into the query. Second, the Query-Guided Disambiguation (QGD) module targets visual ambiguity by generating multi-view visual hypotheses and using the query as a semantic probe to suppress background noise. Extensive experiments on three standard benchmarks (RSICD, RSITMD, and UCM) demonstrate good transferability across several strong RSITR backbones.
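
The general idea of using retrieved candidates as test-time feedback can be illustrated with a small sketch. This is not the CADRE implementation: the abstract describes an MLLM-based query rewrite (VCI) and a multi-view disambiguation step (QGD), neither of which is reproduced here. Instead, the sketch swaps in a much simpler Rocchio-style pseudo-relevance feedback on embedding vectors, purely to show the retrieve, filter, inject, and re-rank flow. The function name `rerank_with_candidate_feedback` and the parameters `top_k`, `conf_threshold`, and `alpha` are hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_with_candidate_feedback(query, gallery, top_k=3,
                                   conf_threshold=0.5, alpha=0.3):
    """Return gallery indices, re-ranking the top_k shortlist at test time.

    Assumption: embeddings come from some frozen retrieval backbone.
    The "injection" here is a plain vector average, a stand-in for the
    MLLM-based query enrichment described in the abstract.
    """
    # Stage 1: initial ranking from the frozen backbone similarities.
    base_scores = [cosine(query, g) for g in gallery]
    order = sorted(range(len(gallery)), key=lambda i: base_scores[i], reverse=True)
    shortlist = order[:top_k]

    # Stage 2: keep only high-confidence candidates as feedback (hypothetical filter).
    trusted = [i for i in shortlist if base_scores[i] >= conf_threshold]
    if not trusted:
        return order  # No reliable feedback: fall back to the initial ranking.

    # Stage 3: inject averaged candidate context into the query embedding.
    dim = len(query)
    context = [sum(gallery[i][d] for i in trusted) / len(trusted) for d in range(dim)]
    refined = [(1 - alpha) * q + alpha * c for q, c in zip(query, context)]

    # Stage 4: re-score only the shortlist; the tail keeps its initial order.
    reranked = sorted(shortlist, key=lambda i: cosine(refined, gallery[i]), reverse=True)
    return reranked + order[top_k:]


if __name__ == "__main__":
    text_query = [1.0, 0.0]
    image_gallery = [[1.0, 0.2], [0.6, 0.8], [0.9, 0.5], [0.0, 1.0]]
    print(rerank_with_candidate_feedback(text_query, image_gallery))
```

Because only the shortlist is re-scored, a step of this kind sits on top of any backbone that produces a ranked list, which is one plausible reading of what "retrieval-backbone-agnostic" requires.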

Original language: English
Article number: 1389
Journal: Remote Sensing
Volume: 18
Issue number: 9
Publication status: Published - May 2026
Externally published: Yes

Keywords

  • cross-modal alignment
  • multimodal large language models (MLLMs)
  • re-ranking
  • remote sensing image-text retrieval
  • test-time refinement
