Abstract
This paper addresses the challenge of capturing global temporal dependencies in long video sequences for Video Object Segmentation (VOS). Existing architectures often fail to model these dependencies effectively across extended temporal horizons. To overcome this limitation, we introduce GISE-TTT, a novel architecture that integrates Test-Time Training (TTT) layers into transformer-based frameworks through a co-designed hierarchical approach. The TTT layer systematically condenses historical temporal information into hidden states that encode globally coherent contextual representations. By leveraging multistage contextual aggregation through hierarchical concatenation, our framework progressively refines spatiotemporal dependencies across network layers. This design provides the first systematic empirical evidence that distributing global information across multiple network layers is critical for optimal dependency utilization in video segmentation tasks. Ablation studies demonstrate that incorporating TTT modules at high-level feature stages significantly enhances global modeling capabilities, thereby improving the network's ability to capture long-range temporal relationships. Extensive experiments on DAVIS 2017 show that GISE-TTT achieves a 3.2% improvement in segmentation accuracy over the baseline model, providing comprehensive evidence that global information should be strategically leveraged throughout the network architecture.
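The idea of condensing temporal history into a hidden state can be illustrated with a minimal sketch of a generic TTT-style linear layer. This is not the GISE-TTT architecture: the function name `ttt_linear_scan`, the identity projections, the plain reconstruction loss, and the learning rate are all assumptions chosen for illustration. The hidden state is a weight matrix `W` that takes one gradient step per frame on a self-supervised loss, so it accumulates information from every frame seen so far.

```python
import numpy as np

def ttt_linear_scan(frames, lr=0.1):
    """Scan per-frame features with a TTT-style linear hidden state.

    frames: (T, d) array of per-frame feature vectors (hypothetical input).
    Returns the (T, d) outputs and the final (d, d) hidden state W.
    """
    T, d = frames.shape
    W = np.zeros((d, d))                      # hidden state: weights of a linear model
    outputs = np.empty_like(frames, dtype=float)
    for t in range(T):
        x = frames[t]
        # Assumed self-supervised loss: 0.5 * ||W x - x||^2 (reconstruction).
        err = W @ x - x
        grad = np.outer(err, x)               # gradient of the loss w.r.t. W
        W = W - lr * grad                     # one test-time training step
        outputs[t] = W @ x                    # output uses the updated state
    return outputs, W
```

In this sketch the state `W` has a fixed size regardless of sequence length, which is why such a layer is a candidate for summarizing long videos. Under the same assumptions, a call such as `ttt_linear_scan(np.random.randn(100, 16))` returns 100 output vectors and a single 16 x 16 state.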
| Original language | English |
|---|---|
| Title of host publication | 2025 IEEE 8th International Conference on Computer and Communication Engineering Technology, CCET 2025 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 226-230 |
| Number of pages | 5 |
| Edition | 2025 |
| ISBN (Electronic) | 9798331558109 |
| Publication status | Published - 2025 |
| Externally published | Yes |
| Event | 8th IEEE International Conference on Computer and Communication Engineering Technology, CCET 2025 - Beijing, China; Duration: 15 Aug 2025 → 17 Aug 2025 |
Conference
| Conference | 8th IEEE International Conference on Computer and Communication Engineering Technology, CCET 2025 |
|---|---|
| Country/Territory | China |
| City | Beijing |
| Period | 15/08/25 → 17/08/25 |
Keywords
- Global Information
- TTT
- Video Object Segmentation