TY - JOUR
T1 - ST-Gaze
T2 - Self-supervised multi-view gaze estimation via eye-guided decoupling and spatio-temporal fusion
AU - Zhang, Tianqi
AU - Chen, Jing
AU - Qin, Shanbin
AU - Fu, Shanfeng
AU - Yang, Jian
N1 - Publisher Copyright: © 2026 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
PY - 2026/9
Y1 - 2026/9
N2 - Self-supervised gaze estimation methods, while reducing reliance on annotated data, still face two challenges. First, existing approaches primarily depend on single-view inputs and suffer from performance degradation under large head poses due to self-occlusion. Second, in the absence of explicit supervision, subtle gaze cues are often overshadowed by dominant factors like head pose and appearance, preventing models from learning discriminative gaze representations. Therefore, we propose ST-Gaze, a self-supervised multi-view gaze estimation network. It explicitly decouples gaze from appearance and viewpoint by leveraging local eye priors and global spatio-temporal consistency constraints, trained exclusively with a reconstruction loss. Specifically, the Eye-Guided Feature Decoupling (EGFD) module leverages local eye features to dynamically modulate full-face features, guiding the initial decoupling of gaze and appearance. Subsequently, the Spatio-Temporal Feature Fusion (STFF) module fuses semantically consistent features across views and timestamps by leveraging the distinct spatio-temporal attributes of appearance, viewpoint, and gaze, yielding robust global representations. Our method outperforms current self-supervised methods on EVE and ETH-XGaze while remaining competitive under full supervision. On ETH-XGaze, which features large head-pose variations, it achieves mean angular errors of 7.89° and 3.23° in the self-supervised and fully supervised settings, respectively. Visualizations further validate the proposed framework, demonstrating effective feature decoupling among gaze, appearance, and viewpoint representations.
AB - Self-supervised gaze estimation methods, while reducing reliance on annotated data, still face two challenges. First, existing approaches primarily depend on single-view inputs and suffer from performance degradation under large head poses due to self-occlusion. Second, in the absence of explicit supervision, subtle gaze cues are often overshadowed by dominant factors like head pose and appearance, preventing models from learning discriminative gaze representations. Therefore, we propose ST-Gaze, a self-supervised multi-view gaze estimation network. It explicitly decouples gaze from appearance and viewpoint by leveraging local eye priors and global spatio-temporal consistency constraints, trained exclusively with a reconstruction loss. Specifically, the Eye-Guided Feature Decoupling (EGFD) module leverages local eye features to dynamically modulate full-face features, guiding the initial decoupling of gaze and appearance. Subsequently, the Spatio-Temporal Feature Fusion (STFF) module fuses semantically consistent features across views and timestamps by leveraging the distinct spatio-temporal attributes of appearance, viewpoint, and gaze, yielding robust global representations. Our method outperforms current self-supervised methods on EVE and ETH-XGaze while remaining competitive under full supervision. On ETH-XGaze, which features large head-pose variations, it achieves mean angular errors of 7.89° and 3.23° in the self-supervised and fully supervised settings, respectively. Visualizations further validate the proposed framework, demonstrating effective feature decoupling among gaze, appearance, and viewpoint representations.
KW - Eye tracking
KW - Feature decoupling
KW - Gaze estimation
KW - Multi-view
KW - Self-supervised learning
KW - Spatio-temporal constraints
UR - https://www.scopus.com/pages/publications/105036742243
U2 - 10.1016/j.displa.2026.103483
DO - 10.1016/j.displa.2026.103483
M3 - Article
AN - SCOPUS:105036742243
SN - 0141-9382
VL - 94
JO - Displays
JF - Displays
M1 - 103483
ER -