Advancing zero-shot humorous video understanding with test-time humor knowledge augmentation

  • Yayun Qi
  • Xinxiao Wu*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Zero-shot humorous video understanding aims to identify where the humor in a video comes from without task-specific training, a task more complex than conventional video understanding. The main challenge is that the humor-related knowledge needed to associate visual content with humor is scarce and difficult to acquire. Without such knowledge, subtle visual cues that evoke humor are easily misinterpreted as ordinary events. To address this challenge, we propose a test-time humor knowledge augmentation method, called “perceive, retrieve, verify, and generate”, which enhances humorous video comprehension by integrating humor-related knowledge obtained through the collaboration of pre-trained models. Given a humorous video, our method first segments it into semantically coherent scenes and employs a Vision-Language Model (VLM) to perceive and describe the visual content of each scene. These scene-level perception results then serve as queries to retrieve knowledge about spatial and temporal humor elements from a Large Language Model (LLM) via different prompting strategies. The retrieved temporal humor elements are subsequently verified by checking whether the visual evidence inferred for each element is consistent with the video content, so that noisy knowledge irrelevant to the video is discarded. Finally, the refined knowledge is integrated as contextual information for a VLM to generate humor interpretations. Promising results on the FunQA and ExFunTube datasets demonstrate the effectiveness of our method.
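
The four stages named in the abstract can be read as a simple control flow. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `interpret_humor`, the four callables (`describe`, `ask_llm`, `is_consistent`, `generate`), and the prompt strings are all hypothetical placeholders for the pre-trained models. The sketch also assumes that scenes arrive already segmented, that spatial elements are queried per scene, and that temporal elements are queried over the whole scene sequence.

```python
from typing import Callable, List


def interpret_humor(
    scenes: List[str],
    describe: Callable[[str], str],
    ask_llm: Callable[[str], List[str]],
    is_consistent: Callable[[str, str], bool],
    generate: Callable[[List[str], List[str]], str],
) -> str:
    """Hypothetical four-stage flow; every callable stands in for a pre-trained model."""
    # Perceive: describe each (already segmented) scene with a VLM placeholder.
    descriptions = [describe(scene) for scene in scenes]

    # Retrieve: two assumed prompting strategies, one per scene (spatial)
    # and one over the whole scene sequence (temporal).
    spatial: List[str] = []
    for text in descriptions:
        spatial.extend(ask_llm("spatial humor elements in: " + text))
    temporal = ask_llm("temporal humor elements across: " + " | ".join(descriptions))

    # Verify: keep only temporal elements judged consistent with the video content.
    video_text = " ".join(descriptions)
    verified = [element for element in temporal if is_consistent(element, video_text)]

    # Generate: hand the refined knowledge to a VLM placeholder as context.
    return generate(descriptions, spatial + verified)
```

Passing the models in as callables keeps the sketch independent of any particular VLM or LLM, which matches the test-time, training-free setting the abstract describes.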

Original language: English
Article number: 113014
Journal: Pattern Recognition
Volume: 174
Publication status: Published - Jun 2026
Externally published: Yes

Keywords

  • Humorous video understanding
  • Knowledge
  • Pre-trained models
  • Zero-shot
