TY - GEN
T1 - Dual-Scale Attention Networks for Efficient Monocular Depth Estimation
AU - He, Zhen
AU - Sun, Zhongqi
AU - Yang, Jialong
AU - Du, Changkun
AU - Xia, Yuanqing
N1 - Publisher Copyright:
© 2025 Technical Committee on Control Theory, Chinese Association of Automation.
PY - 2025
Y1 - 2025
N2 - This paper proposes an innovative self-supervised monocular depth estimation algorithm, the Dual-Scale Attention Module (DSAM). The method combines the strengths of Convolutional Neural Networks (CNNs) and Transformers by adapting the CNN architecture and introducing a spatial-channel synergistic attention mechanism (UniSA) for multi-scale feature processing, significantly improving the accuracy and robustness of depth estimation. Specifically, the CNN adaptation enhances local feature extraction and expands the receptive field by stacking depthwise separable dilated convolutions with different dilation rates. Compared with existing self-supervised monocular depth estimation methods, DSAM adapts better to complex scenes and dynamic objects, making notable progress in capturing fine-grained depth variations and handling abrupt depth changes. Built on a self-supervised learning framework, the method does not rely on manually labeled depth data and performs well across multiple datasets. Experimental results show that DSAM outperforms existing methods on several key metrics, with particularly large improvements on the KITTI dataset. The contributions of this paper are a new dual-scale attention mechanism, a self-supervised depth estimation framework, and an adapted CNN architecture, providing innovative solutions for feature extraction, feature fusion, and global context modeling in depth estimation tasks.
AB - This paper proposes an innovative self-supervised monocular depth estimation algorithm, the Dual-Scale Attention Module (DSAM). The method combines the strengths of Convolutional Neural Networks (CNNs) and Transformers by adapting the CNN architecture and introducing a spatial-channel synergistic attention mechanism (UniSA) for multi-scale feature processing, significantly improving the accuracy and robustness of depth estimation. Specifically, the CNN adaptation enhances local feature extraction and expands the receptive field by stacking depthwise separable dilated convolutions with different dilation rates. Compared with existing self-supervised monocular depth estimation methods, DSAM adapts better to complex scenes and dynamic objects, making notable progress in capturing fine-grained depth variations and handling abrupt depth changes. Built on a self-supervised learning framework, the method does not rely on manually labeled depth data and performs well across multiple datasets. Experimental results show that DSAM outperforms existing methods on several key metrics, with particularly large improvements on the KITTI dataset. The contributions of this paper are a new dual-scale attention mechanism, a self-supervised depth estimation framework, and an adapted CNN architecture, providing innovative solutions for feature extraction, feature fusion, and global context modeling in depth estimation tasks.
KW - Convolutional Neural Networks (CNN)
KW - Dual-Scale Attention
KW - Monocular Depth Estimation
KW - Self-supervised Learning
UR - https://www.scopus.com/pages/publications/105020270633
U2 - 10.23919/CCC64809.2025.11178391
DO - 10.23919/CCC64809.2025.11178391
M3 - Conference contribution
AN - SCOPUS:105020270633
T3 - Chinese Control Conference, CCC
SP - 9187
EP - 9192
BT - Proceedings of the 44th Chinese Control Conference, CCC 2025
A2 - Sun, Jian
A2 - Yin, Hongpeng
PB - IEEE Computer Society
T2 - 44th Chinese Control Conference, CCC 2025
Y2 - 28 July 2025 through 30 July 2025
ER -