Abstract
Online spatio-temporal action detection (OSTAD) is a crucial task in video understanding: it identifies and categorizes action instances in video streams as the frames arrive. This paper presents an approach that uses adaptive sampling and hierarchical modulation to improve OSTAD. Traditional methods typically sample at a fixed rate, which produces redundant frames when actions are slow and misses essential details when they are fast. Our dynamic sampling strategy, informed by speed estimation, adjusts the sampling interval based on speed attention and visual differential features, thereby increasing the informational content of each sampled video clip. In addition, our method incorporates a hierarchical modulation mechanism that combines high-level semantic and low-level spatial information, improving action localization and classification accuracy. The resulting adaptive sampling network with hierarchical modulation shows substantial improvements on benchmark datasets such as JHMDB21 and UCF24, demonstrating the method's effectiveness in handling diverse and dynamic action sequences in an online setting.
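To make the idea of speed-dependent sampling concrete, the following is a minimal sketch and not the paper's network. It assumes a hand-written speed proxy (mean absolute difference between consecutive frames) in place of the learned speed attention and visual differential features, and all names and parameters (`motion_score`, `interval_from_speed`, `adaptive_sample`, `min_step`, `max_step`, `scale`) are hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[float]  # assumption: a flattened grayscale frame, values in [0, 1]


def motion_score(prev: Frame, curr: Frame) -> float:
    """Mean absolute pixel difference: a crude stand-in for a learned speed estimate."""
    if len(prev) != len(curr) or not curr:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)


def interval_from_speed(speed: float, min_step: int = 1, max_step: int = 8,
                        scale: float = 0.25) -> int:
    """Map a speed estimate to a sampling step: faster motion gives a smaller step."""
    ratio = min(max(speed / scale, 0.0), 1.0)  # clamp normalized speed to [0, 1]
    return max_step - round(ratio * (max_step - min_step))


def adaptive_sample(frames: List[Frame], min_step: int = 1, max_step: int = 8,
                    scale: float = 0.25) -> List[int]:
    """Return indices of sampled frames, looking only at frames already seen (online)."""
    picked: List[int] = []
    i = 0
    while i < len(frames):
        picked.append(i)
        if i == 0:
            # No previous frame yet, so start with the densest sampling.
            step = min_step
        else:
            speed = motion_score(frames[i - 1], frames[i])
            step = interval_from_speed(speed, min_step, max_step, scale)
        i += step
    return picked


if __name__ == "__main__":
    static_clip = [[0.0] * 4 for _ in range(20)]
    fast_clip = [[float(t % 2)] * 4 for t in range(6)]
    print(adaptive_sample(static_clip))  # sparse sampling for a static clip
    print(adaptive_sample(fast_clip))    # dense sampling for a rapidly changing clip
```

In this sketch a static clip is sampled sparsely and a rapidly changing clip is sampled at every frame, which mirrors the redundancy and missed-detail trade-off described in the abstract.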
Original language | English |
---|---|
Article number | 349 |
Journal | Multimedia Systems |
Volume | 30 |
Issue number | 6 |
Publication status | Published - Dec 2024 |
Keywords
- Adaptive sampling
- Hierarchical modulation
- Online
- Spatio-temporal action detection
- Speed estimation