D2VT: Better Detection and Description of Local Features with Vision Transformers

Yifei Yang, Zihao Wang, Zhen Li, Fang Deng, Yidian Huang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Constrained by the local nature of CNNs, existing local feature description methods often overlook global and contextual spatial information. Vision Transformers (ViT) address this by using self-attention to capture long-range dependencies and to preserve spatial details more effectively than CNNs. Our work introduces a hybrid architecture that combines CNNs for local feature extraction with ViT for global feature capture, improving performance across diverse vision tasks. We propose a novel hierarchical Transformer encoder that adapts to various image resolutions and yields multi-scale features without positional encoding. Additionally, we introduce a consistent attention-weighted triplet loss that produces the attention map and optimizes local descriptors for matching. Using a feature pyramid, our method predicts keypoints at multiple scales, which improves localization accuracy. Experiments show that our approach is competitive with leading contrastive learning methods on image matching benchmarks and generalizes robustly to tasks such as visual odometry.
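
The attention-weighted loss mentioned in the abstract is not specified on this page. As a rough illustration only, the sketch below assumes a generic form: a triplet margin term per correspondence, averaged with weights given by the product of two attention scores. The function names, the weighting scheme, and the margin value are assumptions made for this example and are not taken from the paper.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def attention_weighted_triplet_loss(anchors, positives, negatives,
                                    attn_a, attn_p, margin=1.0):
    """Weighted average of per-correspondence triplet margin terms.

    anchors, positives, negatives: lists of descriptor vectors, one triplet
        per index.
    attn_a, attn_p: hypothetical attention scores for the anchor keypoint and
        its matching keypoint (an assumption of this sketch).
    """
    # Each correspondence is weighted by the product of its two attention scores.
    weights = [sa * sp for sa, sp in zip(attn_a, attn_p)]
    total = sum(weights)
    if total == 0:
        return 0.0  # no correspondence carries any weight

    loss = 0.0
    for a, p, n, w in zip(anchors, positives, negatives, weights):
        # Hinge term: the positive should be closer than the negative by `margin`.
        hinge = max(0.0, margin + l2_distance(a, p) - l2_distance(a, n))
        loss += (w / total) * hinge
    return loss


if __name__ == "__main__":
    anchors = [[0.0, 0.0], [0.0, 0.0]]
    positives = [[0.0, 0.0], [3.0, 4.0]]
    negatives = [[3.0, 4.0], [0.0, 0.0]]
    print(attention_weighted_triplet_loss(
        anchors, positives, negatives, attn_a=[1.0, 1.0], attn_p=[1.0, 1.0]))
```

In this form, correspondences with low attention contribute little to the loss, so minimizing it pushes attention toward locations whose descriptors already separate well. Whether the paper's loss follows this exact form cannot be confirmed from the abstract.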

Original language: English
Title of host publication: Proceedings - 2024 China Automation Congress, CAC 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 7110-7115
Number of pages: 6
ISBN (Electronic): 9798350368604
Publication status: Published - 2024
Event: 2024 China Automation Congress, CAC 2024 - Qingdao, China
Duration: 1 Nov 2024 - 3 Nov 2024

Publication series

Name: Proceedings - 2024 China Automation Congress, CAC 2024

Conference

Conference: 2024 China Automation Congress, CAC 2024
Country/Territory: China
City: Qingdao
Period: 1/11/24 - 3/11/24

Keywords

  • Deep Learning
  • Feature Description
  • Feature Detection
  • Global Information
  • Vision Transformer
