Learning Transformation-Predictive Representations for Detection and Description of Local Features

Zihao Wang; Chunxu Wu; Yifei Yang; Zhen Li

doi:10.1109/CVPR52729.2023.01103

Zihao Wang, Chunxu Wu, Yifei Yang, Zhen Li*

*Corresponding author for this work

Office of International Students

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Citations (Scopus)

Abstract

Keypoint detection and description aims to estimate stable locations and discriminative representations of local features, and is a fundamental task in visual applications. However, the rough hard positive or negative labels generated from one-to-one correspondences among images may introduce indistinguishable samples, such as false positives or false negatives, which act as inconsistent supervision. These false samples, mixed with hard samples, prevent neural networks from learning descriptions for more accurate matching. To tackle this challenge, we propose to learn transformation-predictive representations with self-supervised contrastive learning. We maximize the similarity between corresponding views of the same 3D point (landmark) without using any negative sample pairs, while avoiding collapsing solutions. Furthermore, we adopt self-supervised generation learning and curriculum learning to soften the hard positive labels into soft continuous targets. The aggressively updated soft labels help overcome the training bottleneck (derived from the label noise of false positives) and facilitate model training under a stronger transformation paradigm. Our self-supervised training pipeline greatly decreases the computation load and memory usage, and outperforms the state of the art on standard image matching benchmarks by noticeable margins, demonstrating excellent generalization capability on multiple downstream tasks.
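The two ideas in the abstract, a similarity objective that uses no negative pairs and hard positive labels softened into continuous targets, can be illustrated with a small sketch. This is not the paper's implementation: the function names, the squared-error form of the loss, and the linear blending rule controlled by `alpha` are all assumptions made for illustration, and the abstract does not say how collapse is avoided or how the soft labels are actually generated.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def soften_labels(hard_labels, predicted_similarity, alpha):
    """Hypothetical softening rule: blend hard positive labels (1.0) with
    predicted similarities in [0, 1]. alpha = 0 keeps the hard labels;
    a curriculum could raise alpha as training progresses."""
    return (1.0 - alpha) * hard_labels + alpha * predicted_similarity

def negative_free_loss(desc_a, desc_b, targets=None):
    """Mean squared gap between the target and the cosine similarity of
    corresponding descriptor rows. Only positive pairs (row i of desc_a
    with row i of desc_b) are used; no negative pairs appear.
    In an autodiff framework, one branch would typically be detached
    (stop-gradient) to help avoid collapsed solutions (an assumption here)."""
    cos = np.sum(l2_normalize(desc_a) * l2_normalize(desc_b), axis=-1)
    if targets is None:
        targets = np.ones_like(cos)  # hard positive labels
    return float(np.mean((targets - cos) ** 2))

# Two correspondences: the first matches well, the second is a likely false positive.
a = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([[1.0, 0.0], [1.0, 0.0]])
soft = soften_labels(np.ones(2), np.array([1.0, 0.2]), alpha=0.5)

print(negative_free_loss(a, b))        # hard labels penalize the noisy pair fully
print(negative_free_loss(a, b, soft))  # soft targets reduce that penalty
```

In this toy setup the second pair behaves like a mislabeled positive: with hard labels it dominates the loss, while the softened target lowers its contribution, which is the intuition behind using soft continuous targets to cope with label noise.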

Original language	English
Title of host publication	Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Publisher	IEEE Computer Society
Pages	11464-11473
Number of pages	10
ISBN (Electronic)	9798350301298
DOIs	https://doi.org/10.1109/CVPR52729.2023.01103
Publication status	Published - 2023
Event	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 - Vancouver, Canada; Duration: 18 Jun 2023 → 22 Jun 2023

Publication series

Name	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume	2023-June
ISSN (Print)	1063-6919

Conference

Conference	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Country/Territory	Canada
City	Vancouver
Period	18/06/23 → 22/06/23

Keywords

3D from multi-view and sensors

Access to Document

10.1109/CVPR52729.2023.01103

Cite this

Wang, Z., Wu, C., Yang, Y., & Li, Z. (2023). Learning Transformation-Predictive Representations for Detection and Description of Local Features. In Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 (pp. 11464-11473). (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2023-June). IEEE Computer Society. https://doi.org/10.1109/CVPR52729.2023.01103

Wang, Zihao ; Wu, Chunxu ; Yang, Yifei et al. / Learning Transformation-Predictive Representations for Detection and Description of Local Features. Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023. IEEE Computer Society, 2023. pp. 11464-11473 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition).

@inproceedings{f44ced17e9954475a355151ddf075cc6,

title = "Learning Transformation-Predictive Representations for Detection and Description of Local Features",

abstract = "The task of key-points detection and description is to estimate the stable location and discriminative representation of local features, which is a fundamental task in visual applications. However, either the rough hard positive or negative labels generated from one-to-one correspondences among images may bring indistinguishable samples, like false positives or negatives, which acts as inconsistent supervision. Such resultant false samples mixed with hard samples prevent neural networks from learning descriptions for more accurate matching. To tackle this challenge, we propose to learn the transformation-predictive representations with self-supervised contrastive learning. We maximize the similarity between corresponding views of the same 3D point (landmark) by using none of the negative sample pairs and avoiding collapsing solutions. Furthermore, we adopt self-supervised generation learning and curriculum learning to soften the hard positive labels into soft continuous targets. The aggressively updated soft labels contribute to overcoming the training bottleneck (derived from the label noise of false positives) and facilitating the model training under a stronger transformation paradigm. Our self-supervised training pipeline greatly decreases the computation load and memory usage, and outperforms the sota on the standard image matching benchmarks by noticeable margins, demonstrating excellent generalization capability on multiple downstream tasks.",

keywords = "3D from multi-view and sensors",

author = "Zihao Wang and Chunxu Wu and Yifei Yang and Zhen Li",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 ; Conference date: 18-06-2023 Through 22-06-2023",

year = "2023",

doi = "10.1109/CVPR52729.2023.01103",

language = "English",

series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",

publisher = "IEEE Computer Society",

pages = "11464--11473",

booktitle = "Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023",

address = "United States",

}

Wang, Z, Wu, C, Yang, Y & Li, Z 2023, Learning Transformation-Predictive Representations for Detection and Description of Local Features. in Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2023-June, IEEE Computer Society, pp. 11464-11473, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, Canada, 18/06/23. https://doi.org/10.1109/CVPR52729.2023.01103

Learning Transformation-Predictive Representations for Detection and Description of Local Features. / Wang, Zihao; Wu, Chunxu; Yang, Yifei et al.
Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023. IEEE Computer Society, 2023. p. 11464-11473 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2023-June).


TY - GEN

T1 - Learning Transformation-Predictive Representations for Detection and Description of Local Features

AU - Wang, Zihao

AU - Wu, Chunxu

AU - Yang, Yifei

AU - Li, Zhen

PY - 2023

Y1 - 2023

N2 - The task of key-points detection and description is to estimate the stable location and discriminative representation of local features, which is a fundamental task in visual applications. However, either the rough hard positive or negative labels generated from one-to-one correspondences among images may bring indistinguishable samples, like false positives or negatives, which acts as inconsistent supervision. Such resultant false samples mixed with hard samples prevent neural networks from learning descriptions for more accurate matching. To tackle this challenge, we propose to learn the transformation-predictive representations with self-supervised contrastive learning. We maximize the similarity between corresponding views of the same 3D point (landmark) by using none of the negative sample pairs and avoiding collapsing solutions. Furthermore, we adopt self-supervised generation learning and curriculum learning to soften the hard positive labels into soft continuous targets. The aggressively updated soft labels contribute to overcoming the training bottleneck (derived from the label noise of false positives) and facilitating the model training under a stronger transformation paradigm. Our self-supervised training pipeline greatly decreases the computation load and memory usage, and outperforms the sota on the standard image matching benchmarks by noticeable margins, demonstrating excellent generalization capability on multiple downstream tasks.

AB - The task of key-points detection and description is to estimate the stable location and discriminative representation of local features, which is a fundamental task in visual applications. However, either the rough hard positive or negative labels generated from one-to-one correspondences among images may bring indistinguishable samples, like false positives or negatives, which acts as inconsistent supervision. Such resultant false samples mixed with hard samples prevent neural networks from learning descriptions for more accurate matching. To tackle this challenge, we propose to learn the transformation-predictive representations with self-supervised contrastive learning. We maximize the similarity between corresponding views of the same 3D point (landmark) by using none of the negative sample pairs and avoiding collapsing solutions. Furthermore, we adopt self-supervised generation learning and curriculum learning to soften the hard positive labels into soft continuous targets. The aggressively updated soft labels contribute to overcoming the training bottleneck (derived from the label noise of false positives) and facilitating the model training under a stronger transformation paradigm. Our self-supervised training pipeline greatly decreases the computation load and memory usage, and outperforms the sota on the standard image matching benchmarks by noticeable margins, demonstrating excellent generalization capability on multiple downstream tasks.

KW - 3D from multi-view and sensors

UR - http://www.scopus.com/inward/record.url?scp=85173929519&partnerID=8YFLogxK

U2 - 10.1109/CVPR52729.2023.01103

DO - 10.1109/CVPR52729.2023.01103

M3 - Conference contribution

AN - SCOPUS:85173929519

T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

SP - 11464

EP - 11473

BT - Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023

PB - IEEE Computer Society

T2 - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023

Y2 - 18 June 2023 through 22 June 2023

ER -

Wang Z, Wu C, Yang Y, Li Z. Learning Transformation-Predictive Representations for Detection and Description of Local Features. In Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023. IEEE Computer Society. 2023. p. 11464-11473. (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). doi: 10.1109/CVPR52729.2023.01103
