Diffusion-based framework for weakly-supervised temporal action localization

Yuanbing Zou, Qingjie Zhao*, Prodip Kumar Sarker, Shanshan Li, Lei Wang, Wangwang Liu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Weakly supervised temporal action localization aims to localize action instances using only video-level supervision. In the absence of frame-level annotations, effectively separating action snippets from background in semantically ambiguous features becomes an arduous challenge. To address this issue from a generative-modeling perspective, we propose a novel two-stage diffusion-based network. In the first stage, we design a local masking module that learns local semantic information and generates binary masks, which (1) perform action-background separation and (2) serve as the pseudo-ground truth required by the diffusion module. In the second stage, we propose a diffusion module that generates high-quality action predictions under pseudo-ground-truth supervision. In addition, we further optimize the refining operation in the local masking module to improve its efficiency. Experimental results demonstrate that the proposed method achieves promising performance on the mainstream public benchmarks THUMOS14 and ActivityNet. The code is available at https://github.com/Rlab123/action_diff.
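The abstract outlines a two-stage pipeline: a local masking module first produces binary snippet masks, which then supervise a diffusion module as pseudo-ground truth. Below is a minimal PyTorch sketch of how such a pipeline could be wired together; the class names, layer choices, feature dimensions, and the simple linear noise schedule are all illustrative assumptions, not the authors' implementation (see the repository linked above for the actual code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalMaskingModule(nn.Module):
    """Stage 1 (hypothetical sketch): learn local semantics over snippet
    features and emit a binary action/background mask."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, 1, kernel_size=1),
        )

    def forward(self, feats):  # feats: (B, T, D) snippet features
        logits = self.encoder(feats.transpose(1, 2)).squeeze(1)  # (B, T)
        probs = torch.sigmoid(logits)
        mask = (probs > 0.5).float()  # binary mask; also the pseudo-GT
        return probs, mask


class DiffusionModule(nn.Module):
    """Stage 2 (hypothetical sketch): denoise a noisy mask back toward the
    stage-1 pseudo-ground truth, conditioned on the snippet features."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.denoiser = nn.Sequential(
            nn.Conv1d(feat_dim + 1, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, 1, kernel_size=1),
        )

    def forward(self, feats, noisy_mask):  # (B, T, D), (B, T)
        x = torch.cat(
            [feats.transpose(1, 2), noisy_mask.unsqueeze(1)], dim=1
        )  # (B, D + 1, T)
        return self.denoiser(x).squeeze(1)  # predicted clean mask, (B, T)


def training_step(masking, diffusion, feats, num_steps=1000):
    """One training step: the stage-1 mask supervises stage-2 denoising.
    The linear noise schedule here is a toy stand-in for a real one."""
    _, pseudo_gt = masking(feats)                       # stage 1
    pseudo_gt = pseudo_gt.detach()                      # fixed pseudo-GT
    t = torch.randint(0, num_steps, (feats.size(0),))   # random timestep
    alpha = 1.0 - t.float() / num_steps                 # (B,) mixing weight
    noise = torch.randn_like(pseudo_gt)
    noisy = alpha[:, None] * pseudo_gt + (1 - alpha[:, None]) * noise
    pred = diffusion(feats, noisy)                      # stage 2 denoising
    return F.mse_loss(pred, pseudo_gt)                  # pseudo-GT supervision
```

A forward pass with, e.g., `feats = torch.randn(2, 100, 2048)` exercises both stages; in the paper's setting the features would come from a pretrained video backbone, and the denoised mask would be thresholded into action proposals.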

Original language: English
Article number: 111207
Journal: Pattern Recognition
Volume: 160
Publication status: Published - Apr 2025

Keywords

  • Diffusion
  • Mask learning
  • Temporal action localization
  • Weakly-supervised learning
