DA-SWTS: Dual-attention and temporal sampling make long video understanding efficient

  • Xin Sun*
  • Feng Zhang
  • Xiangyu Ren

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

The rapid progress of Large Language Models (LLMs) is enabling a new generation of video-based dialogue systems. However, existing Video Large Language Models (VideoLLMs) are primarily designed for short videos and face two critical challenges with long videos: visual information overload due to excessive numbers of visual tokens, and visual information loss resulting from hard compression strategies. We introduce DA-SWTS, a model for long video understanding that incorporates Dual-Attention and Sliding Window Temporal Sampling mechanisms to tackle these issues. Our key idea is to condense the essential information of each frame into two informative visual tokens and apply a sliding window strategy to achieve soft compression. The dual-attention mechanism transforms each frame into two compact yet information-rich tokens: an inter-modal context token, which captures query-relevant visual cues via cross-modal interaction with the user query, and an intra-modal local token, which distills the frame's intrinsic visual semantics. Meanwhile, the sliding window temporal sampling mechanism focuses on temporal integration and maintains a consistent compression ratio across videos of varying lengths. Our approach generates high-quality yet low-quantity visual tokens, enabling more effective long video understanding. In our experiments, DA-SWTS yields accuracy gains of 4.6 %, 6.2 %, and 2.1 % on MSVD-QA, MSRVTT-QA, and ActivityNet-QA, respectively, and improves the average quality score on VCGBench by 0.54, indicating notable performance improvements. Furthermore, our model reduces inference time by over 60 %, demonstrating both effectiveness and efficiency.
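The abstract describes two mechanisms: condensing each frame into an inter-modal context token (cross-attention with the user query) and an intra-modal local token (pooling the frame's own patches), plus a sliding-window sampler that keeps the compression ratio constant across video lengths. The following is a minimal illustrative sketch of those two ideas, not the paper's actual implementation; the attention simplifications (single-head dot-product attention, mean-patch pooling query, center-frame window sampling) are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention_tokens(patches, query):
    """Condense one frame into two tokens (illustrative simplification).

    patches: [P, D] array of patch features for a single frame.
    query:   [D] text-query embedding.
    Returns (context_token, local_token), each of shape [D].
    """
    d = patches.shape[1]
    # Inter-modal context token: the query attends over the frame's patches,
    # so query-relevant visual cues dominate the pooled token.
    w_ctx = softmax(patches @ query / np.sqrt(d))
    context_token = w_ctx @ patches
    # Intra-modal local token: attention pooling with the mean patch as the
    # query, distilling the frame's intrinsic visual semantics.
    mean_q = patches.mean(axis=0)
    w_loc = softmax(patches @ mean_q / np.sqrt(d))
    local_token = w_loc @ patches
    return context_token, local_token

def sliding_window_sample(num_frames, window, stride):
    """Pick one representative frame index per window.

    With stride == window this yields num_frames / window sampled frames,
    i.e. a compression ratio that stays fixed regardless of video length.
    """
    return [min(start + window // 2, num_frames - 1)
            for start in range(0, num_frames, stride)]
```

With `window = stride = 8`, a 64-frame clip is reduced to 8 sampled frames and, at two tokens per frame, 16 visual tokens total; a 640-frame video would yield 160 tokens, preserving the same 8:1 frame compression ratio.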

Original language: English
Article number: 122908
Journal: Information Sciences
Volume: 731
DOIs
Publication status: Published - 5 Apr 2026
Externally published: Yes

Keywords

  • Large language model
  • Long video understanding
  • Multimodal
