ST-Gaze: Self-supervised multi-view gaze estimation via eye-guided decoupling and spatio-temporal fusion

  • Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Self-supervised gaze estimation methods reduce reliance on annotated data but still face two challenges. First, existing approaches primarily depend on single-view inputs and suffer from performance degradation under large head poses due to self-occlusion. Second, in the absence of explicit supervision, subtle gaze cues are often overshadowed by dominant factors such as head pose and appearance, preventing models from learning discriminative gaze representations. To address these challenges, we propose ST-Gaze, a self-supervised multi-view gaze estimation network. It explicitly decouples gaze from appearance and viewpoint by leveraging local eye priors and global spatio-temporal consistency constraints, and is trained exclusively with a reconstruction loss. Specifically, the Eye-Guided Feature Decoupling (EGFD) module uses local eye features to dynamically modulate full-face features, guiding the initial decoupling of gaze and appearance. Subsequently, the Spatio-Temporal Feature Fusion (STFF) module fuses semantically consistent features across views and timestamps by leveraging the distinct spatio-temporal attributes of appearance, viewpoint, and gaze, yielding robust global representations. Our method outperforms current self-supervised methods on EVE and ETH-XGaze while remaining competitive under full supervision. On ETH-XGaze, which features large head-pose variations, it achieves mean angular errors of 7.89° and 3.23° in the self-supervised and fully supervised settings, respectively. Visualizations further validate the proposed framework, demonstrating effective feature decoupling among gaze, appearance, and viewpoint representations.
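
For readers unfamiliar with the metric, the mean angular error quoted above is conventionally the angle between predicted and ground-truth 3D gaze direction vectors, averaged over samples. The snippet below is a minimal sketch of that general definition, not the authors' evaluation code; the function names `angular_error_deg` and `mean_angular_error_deg` are hypothetical, and inputs are assumed to be non-zero 3D vectors.

```python
import math


def angular_error_deg(pred, true):
    """Angle in degrees between two non-zero 3D gaze direction vectors."""
    dot = sum(p * t for p, t in zip(pred, true))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    norm_true = math.sqrt(sum(t * t for t in true))
    # Clamp the cosine to [-1, 1] so floating-point drift cannot break acos.
    cos_sim = max(-1.0, min(1.0, dot / (norm_pred * norm_true)))
    return math.degrees(math.acos(cos_sim))


def mean_angular_error_deg(preds, trues):
    """Average angular error over paired predictions and ground truths."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, trues)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Illustrative vectors only; these are not data from the paper.
    preds = [(0.0, 0.0, -1.0), (1.0, 0.0, 0.0)]
    trues = [(0.0, 0.0, -1.0), (0.0, 1.0, 0.0)]
    print(mean_angular_error_deg(preds, trues))
```

Because the vectors are normalized inside the function, the inputs do not need to be unit length; only their directions matter.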

Original language: English
Article number: 103483
Journal: Displays
Volume: 94
DOI
Publication status: Published - Sep 2026
Externally published
