Abstract
Self-supervised gaze estimation methods reduce reliance on annotated data but still face two challenges. First, existing approaches depend primarily on single-view inputs and degrade under large head poses because of self-occlusion. Second, in the absence of explicit supervision, subtle gaze cues are often overshadowed by dominant factors such as head pose and appearance, which prevents models from learning discriminative gaze representations. To address these challenges, we propose ST-Gaze, a self-supervised multi-view gaze estimation network. It explicitly decouples gaze from appearance and viewpoint by leveraging local eye priors and global spatio-temporal consistency constraints, and is trained exclusively with a reconstruction loss. Specifically, the Eye-Guided Feature Decoupling (EGFD) module uses local eye features to dynamically modulate full-face features, guiding the initial decoupling of gaze and appearance. Subsequently, the Spatio-Temporal Feature Fusion (STFF) module fuses semantically consistent features across views and timestamps by exploiting the distinct spatio-temporal attributes of appearance, viewpoint, and gaze, yielding robust global representations. Our method outperforms current self-supervised methods on EVE and ETH-XGaze while remaining competitive under full supervision. On ETH-XGaze, which features large head-pose variations, it achieves mean angular errors of 7.89° and 3.23° in the self-supervised and fully supervised settings, respectively. Visualizations further validate the proposed framework, demonstrating effective feature decoupling among gaze, appearance, and viewpoint representations.
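The results above are reported as mean angular error. As a minimal sketch of how this metric is commonly computed, assuming gaze is represented as 3D direction vectors (the abstract does not specify the representation or the evaluation code, and the function names here are hypothetical), the error is the angle between the predicted and ground-truth directions, averaged over samples:

```python
import math


def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze direction vectors.

    Assumption: gaze is given as non-zero 3D vectors; they need not be
    unit length because the cosine is normalized below.
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    norm_target = math.sqrt(sum(t * t for t in target))
    cos_sim = dot / (norm_pred * norm_target)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_sim = max(-1.0, min(1.0, cos_sim))
    return math.degrees(math.acos(cos_sim))


def mean_angular_error_deg(preds, targets):
    """Average angular error over paired predictions and ground truths."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

Under this definition, identical directions give an error of 0° and orthogonal directions give 90°, so lower values indicate more accurate gaze predictions.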
| Original language | English |
|---|---|
| Article number | 103483 |
| Journal | Displays |
| Volume | 94 |
| DOIs | |
| Publication status | Published - Sept 2026 |
| Externally published | Yes |
Keywords
- Eye tracking
- Feature decoupling
- Gaze estimation
- Multi-view
- Self-supervised learning
- Spatio-temporal constraints
Fingerprint
Dive into the research topics of 'ST-Gaze: Self-supervised multi-view gaze estimation via eye-guided decoupling and spatio-temporal fusion'. Together they form a unique fingerprint.