SwinFVO: Self-Supervised Visual Odometry with Enhanced Global Spatiotemporal Perception

  • Rujun Song
  • Ruoqi Li
  • Zhuoling Xiao*
  • Bo Yan

  * Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Pose estimation using visual sensors has become a fundamental component in robotic navigation and autonomous driving systems. Learning-based monocular visual odometry (VO) has attracted substantial attention due to its resilience to camera parameter variations and dynamic environments. Given that camera movement manifests as pixel-level motion across the entire image in optical flow data, capturing both global contextual information and local feature details is crucial for accurate pose estimation. To address this challenge, we propose SwinFVO, a novel self-supervised visual odometry framework that incorporates enhanced motion perception to achieve global spatial dependency modeling with temporal continuity. Leveraging quadrant-based motion characteristics, we perform cross-regional feature interaction through a refined Swin Transformer architecture. Two robust spatiotemporal feature extractors are designed to extend the single-frame-based Swin Transformer to a temporally-aware framework for sequential understanding. Through the exploration of long-range spatial correlations and preservation of temporal consistency, SwinFVO delivers accurate and consistent pose estimation. Extensive experiments across multiple datasets demonstrate the superior performance and generalization capability of SwinFVO in both pose and depth estimation tasks. It achieves competitive results against classical algorithms and outperforms related state-of-the-art (SOTA) methods by up to 20.6% and 72.4% on average translational and rotational evaluations, respectively.

Original language: English
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Publication status: Accepted/In press - 2025
Externally published: Yes

Keywords

  • optical flow
  • self-attention mechanism
  • spatiotemporal perception
  • vision Transformer
  • visual odometry
