@inproceedings{391a11e79d3747788fb17024981844db,
  title     = {{D2VT}: Better Detection and Description of Local Features with {Vision Transformers}},
  abstract  = {Constrained by the local nature of CNNs, existing local feature description methods often overlook global and contextual spatial information. Vision Transformers (ViT) address this by leveraging self-attention to capture long-range dependencies and preserve spatial details more effectively than CNNs. Our work introduces a hybrid architecture that merges CNNs for local feature extraction with ViT for global feature capture, enhancing performance across diverse vision tasks. We propose a novel hierarchical Transformer encoder adaptable to various image resolutions, yielding multi-scale features without positional encoding. Additionally, we introduce a consistent attention-weighted triple loss to get the attention map and to optimize and match local descriptors. Utilizing a feature pyramid, our method predicts keypoints at multiple scales, leading to improved localization accuracy. Experiments have shown that our approach is competitive with the leading contrastive learning methods in image matching benchmarks and demonstrates robust generalization in tasks like visual odometry.},
  keywords  = {Deep Learning, Feature Description, Feature Detection, Global Information, Vision Transformer},
  author    = {Yang, Yifei and Wang, Zihao and Li, Zhen and Deng, Fang and Huang, Yidian},
  year      = {2024},
  booktitle = {Proceedings - 2024 China Automation Congress, CAC 2024},
  series    = {Proceedings - 2024 China Automation Congress, CAC 2024},
  pages     = {7110--7115},
  publisher = {Institute of Electrical and Electronics Engineers Inc.},
  address   = {United States},
  doi       = {10.1109/CAC63892.2024.10864608},
  language  = {English},
  note      = {Publisher Copyright: {\textcopyright} 2024 IEEE.; 2024 China Automation Congress, CAC 2024 ; Conference date: 01-11-2024 Through 03-11-2024},
}