
ST-Gaze: Self-supervised multi-view gaze estimation via eye-guided decoupling and spatio-temporal fusion

  • Tianqi Zhang
  • Jing Chen*
  • Shanbin Qin
  • Shanfeng Fu
  • Jian Yang
  • *Corresponding author for this work
  • Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Self-supervised gaze estimation methods, while reducing reliance on annotated data, still face two challenges. First, existing approaches primarily depend on single-view inputs and suffer from performance degradation under large head poses due to self-occlusion. Second, in the absence of explicit supervision, subtle gaze cues are often overshadowed by dominant factors like head pose and appearance, preventing models from learning discriminative gaze representations. To address these challenges, we propose ST-Gaze, a self-supervised multi-view gaze estimation network. It explicitly decouples gaze from appearance and viewpoint by leveraging local eye priors and global spatio-temporal consistency constraints, and is trained exclusively with a reconstruction loss. Specifically, the Eye-Guided Feature Decoupling (EGFD) module leverages local eye features to dynamically modulate full-face features, guiding the initial decoupling of gaze and appearance. Subsequently, the Spatio-Temporal Feature Fusion (STFF) module fuses semantically consistent features across views and timestamps by leveraging the distinct spatio-temporal attributes of appearance, viewpoint, and gaze, yielding robust global representations. Our method outperforms current self-supervised methods on EVE and ETH-XGaze while remaining competitive under full supervision. On ETH-XGaze, which features large head-pose variations, it achieves mean angular errors of 7.89° and 3.23° in the self-supervised and fully supervised settings, respectively. Visualizations further validate the proposed framework, demonstrating effective feature decoupling among gaze, appearance, and viewpoint representations.
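
The mean angular error reported above is, in gaze estimation generally, the average angle between predicted and ground-truth gaze directions. The abstract does not give the evaluation code, so the following is only a minimal sketch of that standard metric, assuming gaze is represented as 3D direction vectors; the function names `angular_error_deg` and `mean_angular_error_deg` are hypothetical and not taken from the paper.

```python
import math

def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze direction vectors.

    Assumption: both inputs are non-zero 3D vectors (x, y, z);
    they do not need to be unit length.
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    cos_sim = dot / (norm_p * norm_t)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_sim = max(-1.0, min(1.0, cos_sim))
    return math.degrees(math.acos(cos_sim))

def mean_angular_error_deg(preds, targets):
    """Average angular error over paired predictions and ground truths."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

For example, `mean_angular_error_deg([(0, 0, -1), (1, 0, 0)], [(0, 0, -1), (0, 1, 0)])` averages one exact match and one perpendicular pair, so it evaluates to approximately 45 degrees.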

Original language: English
Article number: 103483
Journal: Displays
Volume: 94
Publication status: Published - Sept 2026
Externally published: Yes

Keywords

  • Eye tracking
  • Feature decoupling
  • Gaze estimation
  • Multi-view
  • Self-supervised learning
  • Spatio-temporal constraints
