TY - JOUR
T1 - DA-SWTS: Dual-attention and temporal sampling make long video understanding efficient
AU - Sun, Xin
AU - Zhang, Feng
AU - Ren, Xiangyu
N1 - Publisher Copyright:
© 2025 Elsevier Inc.
PY - 2026/4/5
Y1 - 2026/4/5
N2 - The rapid progress of Large Language Models (LLMs) is enabling a new generation of video-based dialogue systems. However, existing Video Large Language Models (VideoLLMs) are primarily designed for short videos and face two critical challenges with long videos: visual information overload due to excessive numbers of visual tokens, and visual information loss resulting from hard compression strategies. We introduce DA-SWTS, a model for long video understanding that incorporates Dual-Attention and Sliding Window Temporal Sampling mechanisms to tackle these issues. Our key idea is to condense the essential information of each frame into two informative visual tokens and apply a sliding window strategy to achieve soft compression. The dual-attention mechanism transforms each frame into two compact yet information-rich tokens: an inter-modal context token, which captures query-relevant visual cues via cross-modal interaction with the user query, and an intra-modal local token, which distills the frame's intrinsic visual semantics. Meanwhile, the sliding window temporal sampling mechanism focuses on temporal integration and maintains a consistent compression ratio across videos of varying lengths. Our approach generates high-quality yet low-quantity visual tokens, enabling more effective long video understanding. In our experiments, DA-SWTS yields accuracy gains of 4.6%, 6.2%, and 2.1% on MSVD-QA, MSRVTT-QA, and ActivityNet-QA, respectively, and improves the average quality score on VCGBench by 0.54, indicating notable performance improvements. Furthermore, our model reduces inference time by over 60%, demonstrating both effectiveness and efficiency.
AB - The rapid progress of Large Language Models (LLMs) is enabling a new generation of video-based dialogue systems. However, existing Video Large Language Models (VideoLLMs) are primarily designed for short videos and face two critical challenges with long videos: visual information overload due to excessive numbers of visual tokens, and visual information loss resulting from hard compression strategies. We introduce DA-SWTS, a model for long video understanding that incorporates Dual-Attention and Sliding Window Temporal Sampling mechanisms to tackle these issues. Our key idea is to condense the essential information of each frame into two informative visual tokens and apply a sliding window strategy to achieve soft compression. The dual-attention mechanism transforms each frame into two compact yet information-rich tokens: an inter-modal context token, which captures query-relevant visual cues via cross-modal interaction with the user query, and an intra-modal local token, which distills the frame's intrinsic visual semantics. Meanwhile, the sliding window temporal sampling mechanism focuses on temporal integration and maintains a consistent compression ratio across videos of varying lengths. Our approach generates high-quality yet low-quantity visual tokens, enabling more effective long video understanding. In our experiments, DA-SWTS yields accuracy gains of 4.6%, 6.2%, and 2.1% on MSVD-QA, MSRVTT-QA, and ActivityNet-QA, respectively, and improves the average quality score on VCGBench by 0.54, indicating notable performance improvements. Furthermore, our model reduces inference time by over 60%, demonstrating both effectiveness and efficiency.
KW - Large language model
KW - Long video understanding
KW - Multimodal
UR - https://www.scopus.com/pages/publications/105022801462
U2 - 10.1016/j.ins.2025.122908
DO - 10.1016/j.ins.2025.122908
M3 - Article
AN - SCOPUS:105022801462
SN - 0020-0255
VL - 731
JO - Information Sciences
JF - Information Sciences
M1 - 122908
ER -