Action-to-Action Diffusion Network for Weakly Supervised Temporal Action Localization

  • Yuanbing Zou
  • Qingjie Zhao*
  • Prodip Kumar Sarker
  • Le Yang*
  • Binglu Wang

* Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

Abstract

Weakly supervised temporal action localization (WTAL) aims to identify action instances in untrimmed videos with only video-level supervision. Despite recent advances in WTAL methods, accurate boundary localization remains a significant challenge. A key reason is that WTAL networks following a localization-by-classification pipeline tend to focus on the most discriminative features, neglecting ambiguous features that may contain action instances. To make the WTAL model attend to low-discriminative features that include action instances, we propose an action-to-action diffusion (ActionDiff) network. The network exploits the smoothness of data generated by a diffusion model: the diffusion model outputs smooth, high-quality features that weaken the discriminative action features from the base branch, thereby improving WTAL performance. First, we develop a top-k-based masking strategy to generate binary masks that serve as pseudo-labels for diffusion model learning. Then, we propose a diffusion branch that generates a high-quality latent action space by iteratively removing noise, guided by the designed pseudo-labels and conditional information. To enhance the diffusion branch’s capability to generate human behavioral features, we design an action-related conditional strategy that obtains conditional information and uses it to guide the diffusion branch’s modeling of human behavior knowledge. Comprehensive experiments demonstrate that the proposed method achieves promising performance on three benchmark datasets: THUMOS14, ActivityNet v1.2, and ActivityNet v1.3.
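
As a rough illustration of the top-k-based masking idea mentioned in the abstract, the sketch below marks the k highest-scoring temporal snippets of a video with 1 and all others with 0. It is not the authors' implementation: the abstract does not say which scores are ranked or how k is chosen, so the per-snippet `scores`, the function name `topk_binary_mask`, and the fixed `k` are all assumptions made for this example.

```python
from typing import List, Sequence


def topk_binary_mask(scores: Sequence[float], k: int) -> List[int]:
    """Return a 0/1 mask that marks the k highest-scoring snippets.

    `scores` holds one hypothetical actionness score per temporal snippet.
    What is actually ranked in the paper is not stated in the abstract.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of snippets")

    # Snippet indices ordered from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)

    # Set the mask to 1 at the k top-ranked positions, 0 elsewhere.
    mask = [0] * len(scores)
    for t in ranked[:k]:
        mask[t] = 1
    return mask


# Illustrative scores for a six-snippet video (made-up values).
scores = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7]
mask = topk_binary_mask(scores, k=3)
print(mask)  # [0, 1, 0, 1, 0, 1]
```

In a pipeline like the one the abstract outlines, a mask of this kind would play the role of a pseudo-label for the diffusion branch; how the mask enters the training objective is not described in the abstract and is left out here.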

Original language: English
Pages (from-to): 9371-9384
Number of pages: 14
Journal: IEEE Transactions on Multimedia
Volume: 27
Publication status: Published - 2025
Externally published: Yes

Keywords

  • Temporal action localization
  • diffusion models
  • temporal enhancement
  • weakly supervised learning
