Abstract
Weakly supervised temporal action localization (WTAL) aims to identify action instances in untrimmed videos with only video-level supervision. Despite recent advances in WTAL methods, accurate boundary localization remains a significant challenge. A key reason is that WTAL networks following a localization-by-classification pipeline tend to focus on the most discriminative features and neglect ambiguous features that may still contain action instances. To make the WTAL model attend to these low-discriminative features, we propose an action-to-action diffusion (ActionDiff) network. The network exploits the smoothness of data generated by diffusion models: a diffusion model outputs smooth, high-quality features that weaken the discriminative action features from the base branch, thereby improving WTAL performance. First, we develop a top-k-based masking strategy to generate binary masks that serve as pseudo-labels for diffusion model learning. Then, we propose a diffusion branch that generates a high-quality latent action space by iteratively removing noise, guided by the designed pseudo-labels and conditional information. To enhance the diffusion branch's ability to generate human behavioral features, we design an action-related conditional strategy that obtains conditional information and uses it to guide the diffusion branch in modeling human behavior knowledge. Comprehensive experiments demonstrate that the proposed method achieves promising performance on three benchmark datasets: THUMOS14, ActivityNet v1.2, and ActivityNet v1.3.
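As an illustration of the top-k-based masking step mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the base branch yields one score per temporal snippet and that the mask simply marks the k highest-scoring snippets with 1; the function name `topk_binary_mask`, the example scores, and the choice of k are all hypothetical.

```python
import numpy as np


def topk_binary_mask(scores, k):
    """Return a 0/1 mask marking the k highest-scoring snippets.

    Hypothetical helper: `scores` is assumed to be a 1-D sequence of
    per-snippet scores; the abstract does not specify how they are computed.
    """
    scores = np.asarray(scores, dtype=float)
    if scores.ndim != 1:
        raise ValueError("scores must be 1-D")
    if not 1 <= k <= scores.shape[0]:
        raise ValueError("k must be between 1 and the number of snippets")
    # argpartition places the indices of the k largest scores in the last k slots.
    top_idx = np.argpartition(scores, -k)[-k:]
    mask = np.zeros(scores.shape[0], dtype=np.int64)
    mask[top_idx] = 1
    return mask


# Made-up scores for six snippets of one video.
snippet_scores = np.array([0.10, 0.85, 0.40, 0.92, 0.05, 0.60])
mask = topk_binary_mask(snippet_scores, k=3)
print(mask.tolist())  # [0, 1, 0, 1, 0, 1]
```

In this sketch the resulting mask would play the role of the binary pseudo-label; how the paper selects k and how the mask guides the diffusion branch are not stated in the abstract.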
| Original language | English |
|---|---|
| Pages (from-to) | 9371-9384 |
| Number of pages | 14 |
| Journal | IEEE Transactions on Multimedia |
| Volume | 27 |
| DOIs | |
| Publication status | Published - 2025 |
| Externally published | Yes |
Keywords
- Temporal action localization
- diffusion models
- temporal enhancement
- weakly supervised learning
Fingerprint
Dive into the research topics of 'Action-to-Action Diffusion Network for Weakly Supervised Temporal Action Localization'. Together they form a unique fingerprint.