Abstract
To address the challenges posed by large modality differences in image matching tasks, this article proposes a cascaded learning framework. It guides the optimization of an end-to-end matching model via a dynamic data engine, which supplies abundant cross-modal training data so that the model can fully adapt to cross-modal features. The data engine integrates a random homography transformation module and a lightweight image generation model, enabling the online synthesis of cross-modal image pairs with geometric variations and diverse styles and thereby exposing the matching model to rich cross-modal variation. The matching model adopts a hybrid architecture that combines a convolutional neural network (CNN) backbone with Transformer attention mechanisms, integrating multiscale local feature extraction with global context modeling. The proposed stepwise aggregation strategy keeps feature extraction efficient. A coarse-to-fine matching strategy is then employed to achieve accurate and robust feature alignment. Comprehensive experiments on both self-collected and public cross-modal image matching datasets demonstrate that the proposed data generation for image matching (DGIM) framework outperforms existing state-of-the-art approaches in cross-modal matching performance while striking a good balance between efficiency and effectiveness. It also shows broad practical potential across multiple fields and scenarios. This work provides novel solutions and evaluation benchmarks for cross-modal image matching tasks. The code and testing dataset will be made publicly available at https://github.com/LotrL/DGIM.
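To make the data engine concrete, the following is a minimal sketch of online cross-modal pair synthesis as the abstract describes it: a random homography supplies the geometric variation, and a stand-in `generator` callable plays the role of the lightweight image generation model. The function names and parameters here (`max_shift`, `generator`) are illustrative assumptions, not the released DGIM implementation.

```python
import cv2
import numpy as np

def random_homography(h, w, max_shift=0.25, rng=None):
    """Sample a random homography by jittering the four image corners.

    `max_shift` (maximum corner displacement as a fraction of image size)
    is an assumed parameter, not one taken from the paper.
    """
    rng = rng or np.random.default_rng()
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = rng.uniform(-max_shift, max_shift, size=(4, 2))
    dst = (src + jitter * np.float32([w, h])).astype(np.float32)
    return cv2.getPerspectiveTransform(src, dst)

def synthesize_pair(image, generator):
    """Create one pseudo cross-modal training pair from a single image.

    `generator` is a placeholder for the lightweight image generation
    model (any style-transfer network mapping the input to another
    modality); its interface here is an assumption.
    """
    h, w = image.shape[:2]
    H = random_homography(h, w)
    warped = cv2.warpPerspective(image, H, (w, h))  # geometric variation
    styled = generator(warped)                      # modality/style change
    # H is the known geometric relation that supervises the matcher.
    return image, styled, H
```

Likewise, a generic coarse-to-fine matching stage might look like the sketch below: mutual-nearest-neighbor selection with a dual-softmax score on coarse descriptors, followed by subpixel refinement via a correlation expectation over a fine-feature window. The exact scoring and refinement used in DGIM may differ; this only illustrates the general scheme.

```python
import torch
import torch.nn.functional as F

def coarse_matches(desc0, desc1, thresh=0.2):
    """Mutual-nearest-neighbor matching on coarse descriptors.

    desc0: (N, D), desc1: (M, D), assumed L2-normalized.
    Returns (K, 2) index pairs; `thresh` is an assumed cutoff.
    """
    sim = desc0 @ desc1.t()                            # (N, M) similarity
    p = F.softmax(sim, dim=1) * F.softmax(sim, dim=0)  # dual-softmax score
    mask = (
        (p == p.max(dim=1, keepdim=True).values)       # row-wise maximum
        & (p == p.max(dim=0, keepdim=True).values)     # column-wise maximum
        & (p > thresh)
    )
    return mask.nonzero(as_tuple=False)

def refine(feat_f0, feat_f1_window):
    """Subpixel refinement of each coarse match.

    feat_f0: (K, D) fine descriptors at coarse-match centers in image 0;
    feat_f1_window: (K, D, w, w) fine-feature windows in image 1.
    Returns the (K, 2) expected offset from each window center.
    """
    K, D, w, _ = feat_f1_window.shape
    corr = torch.einsum('kd,kdhw->khw', feat_f0, feat_f1_window)
    prob = F.softmax(corr.view(K, -1) / D ** 0.5, dim=1).view(K, w, w)
    ys, xs = torch.meshgrid(torch.arange(w), torch.arange(w), indexing='ij')
    grid = torch.stack([xs, ys], dim=-1).float()       # (w, w, 2)
    # Expectation over the correlation heatmap gives a subpixel correction.
    return (prob.unsqueeze(-1) * grid).sum(dim=(1, 2)) - (w - 1) / 2
```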
| Original language | English |
|---|---|
| Article number | 4708616 |
| Journal | IEEE Transactions on Geoscience and Remote Sensing |
| Volume | 63 |
| DOIs | |
| Publication status | Published - 2025 |
| Externally published | Yes |
Keywords
- Cross-modal
- data augmentation
- feature aggregation
- generative model
- image matching