Abstract
Weakly supervised temporal action localization aims to localize action instances with only video-level supervision. Because frame-level annotations are absent, effectively separating action snippets from background in semantically ambiguous features is a major challenge for this task. To address this issue from a generative modeling perspective, we propose a novel two-stage diffusion-based network. First, we design a local masking module that learns local semantic information and generates binary masks in the first stage; these masks (1) are used to perform action-background separation and (2) serve as the pseudo-ground truth required by the diffusion module. Then, in the second stage, we propose a diffusion module that generates high-quality action predictions under the supervision of this pseudo-ground truth. In addition, we further optimize the new-refining operation in the local masking module to improve its efficiency. Experimental results demonstrate that the proposed method achieves promising performance on the publicly available mainstream datasets THUMOS14 and ActivityNet. The code is available at https://github.com/Rlab123/action_diff.
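As a rough illustration of how binary masks can act as pseudo-ground truth for a diffusion model, the sketch below thresholds per-snippet scores into a mask and then noises that mask with a standard DDPM-style forward step. It is only a minimal sketch under stated assumptions: the abstract does not specify the masking rule or the diffusion formulation, so the fixed threshold, the Gaussian forward process, and the names `binary_mask` and `forward_diffuse` are hypothetical and are not taken from the authors' code.

```python
import math
import random


def binary_mask(scores, threshold=0.5):
    """Threshold per-snippet scores into a 0/1 mask (assumed rule, not the paper's)."""
    return [1.0 if s >= threshold else 0.0 for s in scores]


def forward_diffuse(x0, alpha_bar_t, rng):
    """Sample x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I).

    This is the generic DDPM-style forward process, assumed here for illustration.
    """
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]


# Hypothetical per-snippet actionness scores for a five-snippet video.
scores = [0.1, 0.8, 0.9, 0.3, 0.7]
mask = binary_mask(scores)  # pseudo-ground truth: 1.0 = action, 0.0 = background

rng = random.Random(0)
noisy = forward_diffuse(mask, alpha_bar_t=0.9, rng=rng)
```

In a setup like this, a denoising network would be trained to recover `mask` from `noisy`, which is how the pseudo-ground truth would supervise the second stage.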
| Original language | English |
|---|---|
| Article number | 111207 |
| Journal | Pattern Recognition |
| Volume | 160 |
| DOIs | |
| Publication status | Published - Apr 2025 |
Keywords
- Diffusion
- Mask learning
- Temporal action localization
- Weakly-supervised learning
Fingerprint
Dive into the research topics of 'Diffusion-based framework for weakly-supervised temporal action localization'. Together they form a unique fingerprint.