Weakly supervised temporal action localization via a multimodal feature map diffusion process

Yuanbing Zou, Qingjie Zhao*, Shanshan Li

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

With the rapid growth of video data, understanding video content has become increasingly important. Weakly supervised temporal action localization (WTAL), as a critical task, has received significant attention. The goal of WTAL is to learn temporal class activation maps (TCAMs) using only video-level annotations and to perform temporal action localization via post-processing steps. However, because video-level annotations lack detailed behavioral information, the separability between foreground and background in the learned TCAM is poor, leading to incomplete action predictions. To this end, we leverage the inherent strength of the Contrastive Language-Image Pre-training (CLIP) model in generating highly semantic visual features. By integrating CLIP-based visual information, we further enhance the representational capability of action features. We propose a novel multimodal feature map generation method based on diffusion models to fully exploit the complementary relationships between modalities. Specifically, we design a hard masking strategy that generates hard masks, which serve as frame-level pseudo-ground-truth inputs to the diffusion model; these masks convey human behavior knowledge and enhance the model's generative capacity. Subsequently, the concatenated multimodal feature maps are employed as conditional inputs to guide the generation of diffusion feature maps. This design enables the model to extract rich action cues from diverse modalities. Experimental results demonstrate that our approach achieves state-of-the-art performance on two popular benchmarks. These results highlight the proposed method's capability to achieve precise and efficient temporal action detection under weak supervision, contributing significantly to the advancement of large-scale video data analysis.
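The two components named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the thresholds, feature dimensions, and function names below are illustrative assumptions. The first function turns per-frame TCAM scores into a hard pseudo-ground-truth mask (confident foreground, confident background, ignored ambiguous frames); the second concatenates per-frame features from the two modalities into the conditional input for the diffusion model.

```python
import numpy as np

def hard_mask_from_tcam(tcam, high=0.7, low=0.3):
    """Threshold frame-level TCAM scores into a hard pseudo-ground-truth
    mask: 1 = confident foreground, 0 = confident background,
    -1 = ambiguous frame to ignore. Thresholds are illustrative."""
    mask = np.full(tcam.shape, -1, dtype=np.int8)
    mask[tcam >= high] = 1
    mask[tcam <= low] = 0
    return mask

def condition_features(rgb_feats, clip_feats):
    """Concatenate per-frame features from the two modalities to form
    the conditional input that guides diffusion feature-map generation."""
    return np.concatenate([rgb_feats, clip_feats], axis=-1)

# Toy example: 4 frames with per-frame foreground scores from a TCAM.
tcam = np.array([0.9, 0.8, 0.5, 0.1])
mask = hard_mask_from_tcam(tcam)  # [1, 1, -1, 0]

# 4 frames of 2-D RGB features and 3-D CLIP features -> 5-D condition.
cond = condition_features(np.zeros((4, 2)), np.ones((4, 3)))
```

The ignore label (-1) reflects a common practice in pseudo-labeling: only confident frames supervise the model, while uncertain frames are excluded from the loss.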

Original language: English
Article number: 111044
Journal: Engineering Applications of Artificial Intelligence
Volume: 156
DOIs
Publication status: Published - 15 Sept 2025
Externally published: Yes

Keywords

  • Diffusion models
  • Multimodal feature fusion
  • Temporal action localization
  • Weakly-supervised learning
