TY - JOUR
T1 - Self-Supervised Monocular Depth Estimation for All-Day Images Based on Dual-Axis Transformer
AU - Hou, Shengyu
AU - Fu, Mengyin
AU - Wang, Rongchuan
AU - Yang, Yi
AU - Song, Wenjie
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - All-day self-supervised monocular depth estimation is of strong practical significance for autonomous systems that must continuously perceive 3D information about the world. However, night-time scenes suffer from weak texture due to low illumination and violate the brightness-consistency assumption due to varying lighting; as a result, most existing self-supervised models can handle only day-time scenes. To address this problem, we propose a unified self-supervised monocular depth estimation framework for all-day scenarios with three key components: 1) an Illumination Compensation PoseNet (ICP), based on classic Phong illumination theory, that compensates for lighting changes between adjacent frames by estimating per-pixel transformations; 2) a Dual-Axis Transformer (DAT) block, used as the backbone of the depth encoder, that infers the depth of locally low-illumination areas from spatial-channel global context information in night-time images; 3) a cross-layer Adaptive Fusion Module (AFM), inserted between DAT blocks, that learns attention weights across layer features and adaptively fuses cross-layer features with those weights, enhancing the complementarity of features from different layers. The framework was evaluated on the RobotCar, Waymo, and KITTI datasets, achieving state-of-the-art results in both day-time and night-time scenarios.
AB - All-day self-supervised monocular depth estimation is of strong practical significance for autonomous systems that must continuously perceive 3D information about the world. However, night-time scenes suffer from weak texture due to low illumination and violate the brightness-consistency assumption due to varying lighting; as a result, most existing self-supervised models can handle only day-time scenes. To address this problem, we propose a unified self-supervised monocular depth estimation framework for all-day scenarios with three key components: 1) an Illumination Compensation PoseNet (ICP), based on classic Phong illumination theory, that compensates for lighting changes between adjacent frames by estimating per-pixel transformations; 2) a Dual-Axis Transformer (DAT) block, used as the backbone of the depth encoder, that infers the depth of locally low-illumination areas from spatial-channel global context information in night-time images; 3) a cross-layer Adaptive Fusion Module (AFM), inserted between DAT blocks, that learns attention weights across layer features and adaptively fuses cross-layer features with those weights, enhancing the complementarity of features from different layers. The framework was evaluated on the RobotCar, Waymo, and KITTI datasets, achieving state-of-the-art results in both day-time and night-time scenarios.
KW - Monocular depth estimation
KW - multi-task learning
KW - transformer network
KW - unsupervised estimation
UR - http://www.scopus.com/inward/record.url?scp=85194852013&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2024.3406043
DO - 10.1109/TCSVT.2024.3406043
M3 - Article
AN - SCOPUS:85194852013
SN - 1051-8215
VL - 34
SP - 9939
EP - 9953
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 10
ER -