Abstract
Multimodal video reasoning and anomaly detection remain key challenges for Large Language Models (LLMs) due to limited video-text alignment, narrow knowledge coverage, and difficulty handling complex or weakly video-related queries. To address these limitations, we propose a Hierarchical Multi-Agent Retrieval-Augmented Generation (HM-RAG) framework that integrates internal temporal understanding with external knowledge retrieval. Our approach operates as a coordinated pipeline. A question decomposition agent first reformulates complex queries into structured sub-tasks. Multi-source reasoning agents then process these sub-tasks: a web agent handles external retrieval, and a memory-enhanced model captures long-range temporal dependencies. Finally, a decision agent synthesizes these multi-source insights to resolve contradictions and generate precise predictions. By hierarchically coordinating agents across retrieval and reasoning modalities, our framework achieves effective knowledge fusion. Extensive evaluations demonstrate that HM-RAG significantly improves performance on standard multimodal video reasoning benchmarks and also effectively identifies irregular events, validating its robustness in video anomaly detection tasks. Code is available at https://github.com/hanzif1/HM-RAG.
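The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). Every name below (`Evidence`, `decompose`, `web_agent`, `memory_agent`, `decide`, `answer`) is hypothetical, the agents are stubs that return placeholder strings, and the highest-confidence decision rule is an assumption chosen only to show how conflicting answers from several sources might be reconciled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Evidence:
    source: str        # which agent produced this answer
    sub_question: str  # the sub-task it responds to
    answer: str
    confidence: float  # hypothetical score in [0, 1]


def decompose(question: str) -> List[str]:
    """Stub decomposition agent: naively split a query on ' and '."""
    parts = [p.strip(" ?") for p in question.split(" and ")]
    return [p + "?" for p in parts if p]


def web_agent(sub_question: str) -> Evidence:
    """Stub for external retrieval; a real agent would query the web."""
    return Evidence("web", sub_question, f"web answer to: {sub_question}", 0.6)


def memory_agent(sub_question: str) -> Evidence:
    """Stub for the memory-enhanced video model."""
    return Evidence("memory", sub_question, f"video answer to: {sub_question}", 0.8)


def decide(evidence: Sequence[Evidence]) -> Dict[str, Evidence]:
    """Stub decision agent: keep the highest-confidence answer per sub-question.

    This tie-breaking rule is an assumption for illustration only.
    """
    best: Dict[str, Evidence] = {}
    for ev in evidence:
        current = best.get(ev.sub_question)
        if current is None or ev.confidence > current.confidence:
            best[ev.sub_question] = ev
    return best


def answer(
    question: str,
    agents: Sequence[Callable[[str], Evidence]] = (web_agent, memory_agent),
) -> Dict[str, Evidence]:
    """Run decomposition, multi-source reasoning, then the decision step."""
    sub_questions = decompose(question)
    evidence = [agent(sq) for sq in sub_questions for agent in agents]
    return decide(evidence)


if __name__ == "__main__":
    for sq, ev in answer("Who enters the room and what happens next?").items():
        print(f"{sq} -> [{ev.source}] {ev.answer}")
```

In a real system each stub would wrap a model or retrieval call, and the decision step would likely reason over the content of the answers instead of comparing fixed scores.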
| Original language | English |
|---|---|
| Article number | 113622 |
| Journal | Pattern Recognition |
| Volume | 179 |
| DOI | |
| Publication status | Published - Nov 2026 |
| Externally published | Yes |
Fingerprint
Dive into the research topics of 'HM-RAG: Long video reasoning and anomaly detection via hierarchical multi-agent retrieval-augmented generation'. Together they form a unique fingerprint.