TY - JOUR
T1 - Mix-Teaching: A Simple, Unified and Effective Semi-Supervised Learning Framework for Monocular 3D Object Detection
T2 - IEEE Transactions on Circuits and Systems for Video Technology
AU - Yang, Lei
AU - Zhang, Xinyu
AU - Li, Jun
AU - Wang, Li
AU - Zhu, Minghan
AU - Zhang, Chuang
AU - Liu, Huaping
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023/11/1
Y1 - 2023/11/1
AB - Semi-supervised learning (SSL) has promising potential for improving model performance using both labelled and unlabelled data. Since recovering 3D information from 2D images is an ill-posed problem, current state-of-the-art methods for monocular 3D object detection (Mono3D) have relatively low precision and recall, which makes semi-supervised learning for Mono3D challenging and understudied. In this work, we propose a unified and effective semi-supervised learning framework called Mix-Teaching that can be applied to most monocular 3D object detectors. Based on the idea of decomposition and recombination, unlabelled samples are first decomposed into collections of image patches with high-quality predictions and collections of background images containing no objects. The student model is then trained on mixed images, generated by the recombination operation, that contain dense instances with high-quality pseudo-labels. In addition, we propose an uncertainty-based filter to separate high-quality pseudo-labels from noisy predictions during the decomposition step. On the KITTI and nuScenes benchmarks, Mix-Teaching consistently improves MonoFlex and GUPNet by significant margins under various labelling ratios. Our method achieves around +6.34% AP3D improvement over GUPNet on the validation set when using only 10% labelled data. Using the full training set and the additional 38K raw images from KITTI, it further improves MonoFlex by an absolute +4.65% AP3D for car detection, reaching 18.54% AP3D, which ranks first among all monocular-based methods on the KITTI test leaderboard.
KW - 3D object detection
KW - Semi-supervised learning
KW - autonomous driving
UR - http://www.scopus.com/inward/record.url?scp=85159717570&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2023.3270728
DO - 10.1109/TCSVT.2023.3270728
M3 - Article
AN - SCOPUS:85159717570
SN - 1051-8215
VL - 33
SP - 6832
EP - 6844
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 11
ER -