TY - JOUR
T1 - Unsupervised Optical-Sensor Extrinsic Calibration via Dual-Transformer Alignment
AU - Wang, Yuhao
AU - Zuo, Yong
AU - Tang, Yi
AU - Hong, Xiaobin
AU - Wu, Jian
AU - Bian, Ziyu
N1 - Publisher Copyright:
© 2025 by the authors.
PY - 2025/11
Y1 - 2025/11
AB - Accurate extrinsic calibration between optical sensors, such as cameras and LiDAR, is crucial for multimodal perception. Traditional methods based on specific calibration targets exhibit poor robustness in complex optical environments such as glare, reflections, or low light, and they rely on cumbersome manual operations. To address this, we propose a fully unsupervised, end-to-end calibration framework. Our approach adopts a dual-Transformer architecture: a Vision Transformer extracts semantic features from the image stream, while a Point Transformer captures the geometric structure of the 3D LiDAR point cloud. These cross-modal representations are aligned and fused through a neural network, and a regression algorithm is used to obtain the 6-DoF extrinsic transformation matrix. A multi-constraint loss function is designed to enhance structural consistency between modalities, thereby improving calibration stability and accuracy. On the KITTI benchmark, our method achieves a mean rotation error of 0.21° and a translation error of 3.31 cm; on a self-collected dataset, it attains an average reprojection error of 1.52 pixels. These results demonstrate a generalizable and robust solution for optical-sensor extrinsic calibration, enabling precise and self-sufficient perception in real-world applications.
KW - LiDAR–camera calibration
KW - extrinsic parameters
KW - sensor fusion
KW - unsupervised
UR - https://www.scopus.com/pages/publications/105022927054
DO - 10.3390/s25226944
M3 - Article
AN - SCOPUS:105022927054
SN - 1424-8220
VL - 25
JO - Sensors
JF - Sensors
IS - 22
M1 - 6944
ER -