TY - JOUR
T1 - GenGeo
T2 - Robust Cross-View Geo-Localization via Foundation Model and Dynamic Feature Aggregation
AU - Wang, Rong
AU - Yuan, Wen
AU - Yuan, Wu
AU - Liu, Tong
AU - Xi, Xiao
AU - Zhu, Yaokai
N1 - Publisher Copyright: © 2026 by the authors.
PY - 2026/4
Y1 - 2026/4
N2 - Cross-view geo-localization (CVGL) aims to match ground-level images with geo-tagged aerial imagery for precise localization, but remains challenging due to severe viewpoint discrepancies, partial correspondence, and significant domain shifts across geographic regions. While existing methods achieve high accuracy within specific datasets, their generalization ability to unseen environments is limited. In this paper, we propose GenGeo, a unified framework that integrates vision foundation model representations with a matching-aware aggregation mechanism to address these challenges. Specifically, we leverage DINOv2 to extract semantically rich and transferable features, and revisit the SALAD aggregation module in the context of CVGL. By employing a shared clustering strategy, the proposed framework projects cross-view features into a unified assignment space, enabling implicit semantic alignment across views, while the dustbin mechanism effectively filters unmatched and non-informative regions arising from partial correspondence. Extensive experiments on three large-scale benchmarks (CVUSA, CVACT, and VIGOR) demonstrate that GenGeo achieves state-of-the-art performance in cross-dataset generalization and consistently improves robustness under severe domain shifts and spatial misalignment. Notably, our method outperforms the baseline by 14.65% in Top-1 Recall on the CVUSA-to-CVACT transfer task. These results highlight the effectiveness of combining foundation model representations with matching-aware aggregation, and suggest that enforcing semantic consistency in a shared assignment space is a promising direction for generalizable cross-view geo-localization.
AB - Cross-view geo-localization (CVGL) aims to match ground-level images with geo-tagged aerial imagery for precise localization, but remains challenging due to severe viewpoint discrepancies, partial correspondence, and significant domain shifts across geographic regions. While existing methods achieve high accuracy within specific datasets, their generalization ability to unseen environments is limited. In this paper, we propose GenGeo, a unified framework that integrates vision foundation model representations with a matching-aware aggregation mechanism to address these challenges. Specifically, we leverage DINOv2 to extract semantically rich and transferable features, and revisit the SALAD aggregation module in the context of CVGL. By employing a shared clustering strategy, the proposed framework projects cross-view features into a unified assignment space, enabling implicit semantic alignment across views, while the dustbin mechanism effectively filters unmatched and non-informative regions arising from partial correspondence. Extensive experiments on three large-scale benchmarks (CVUSA, CVACT, and VIGOR) demonstrate that GenGeo achieves state-of-the-art performance in cross-dataset generalization and consistently improves robustness under severe domain shifts and spatial misalignment. Notably, our method outperforms the baseline by 14.65% in Top-1 Recall on the CVUSA-to-CVACT transfer task. These results highlight the effectiveness of combining foundation model representations with matching-aware aggregation, and suggest that enforcing semantic consistency in a shared assignment space is a promising direction for generalizable cross-view geo-localization.
KW - cross-view geo-localization
KW - feature aggregation
KW - generalization capacity
KW - remote sensing imagery
KW - vision foundation models
UR - https://www.scopus.com/pages/publications/105036840515
U2 - 10.3390/rs18081116
DO - 10.3390/rs18081116
M3 - Article
AN - SCOPUS:105036840515
SN - 2072-4292
VL - 18
JO - Remote Sensing
JF - Remote Sensing
IS - 8
M1 - 1116
ER -