TY - JOUR
T1 - You can only watch the past: track attention network for online spatio-temporal action detection
AU - Su, Shaowen
AU - Gan, Minggang
AU - Zhang, Yan
N1 - Publisher Copyright: © Science China Press 2026.
PY - 2026/2
Y1 - 2026/2
N2 - Online spatio-temporal action detection (OSTAD) aims to identify and localize action instances in real-time video streams without accessing future frames. However, the online setting imposes strict constraints of incremental inference, limited memory, and causal processing, which severely restrict the availability of effective information. To address this, we propose the track attention network (TAN), introducing a history-aware track-and-detect paradigm. Instead of detecting actions independently at each frame, TAN leverages historical detection results and spatio-temporal continuity to enhance current-frame features. Specifically, we propose three strategies. First, a history-aware actor distribution prediction strategy estimates current actor distributions based on spatial continuity and appearance similarity. Second, an actor distribution inference strategy via track attention introduces two attention modules—track channel attention and track efficient attention—to model semantic relations among actor distributions for robust fusion. Third, a history-aware feature modulation strategy injects localization priors from actor distributions into action features, improving representation quality and detection accuracy. Extensive experiments on the JHMDB21 and UCF24 benchmarks demonstrate the effectiveness of our method. TAN achieves 80.3% frame-level mAP (f-mAP) and 88.3% video-level mAP (v-mAP) on JHMDB21, and 88.1% f-mAP and 54.8% v-mAP on UCF24, outperforming existing online methods and even several offline approaches.
KW - actor distribution
KW - historical detection
KW - online spatio-temporal action detection
KW - track attention
KW - track-and-detect
UR - https://www.scopus.com/pages/publications/105028282628
U2 - 10.1007/s11432-024-4501-3
DO - 10.1007/s11432-024-4501-3
M3 - Article
AN - SCOPUS:105028282628
SN - 1674-733X
VL - 69
JO - Science China Information Sciences
JF - Science China Information Sciences
IS - 2
M1 - 122107
ER -