Abstract
Multimodal video reasoning and anomaly detection remain key challenges for Large Language Models (LLMs) due to limited video-text alignment, narrow knowledge coverage, and difficulty handling complex or weakly video-related queries. To address these limitations, we propose a Hierarchical Multi-Agent Retrieval-Augmented Generation (HM-RAG) framework that integrates internal temporal understanding with external knowledge retrieval. Specifically, our approach operates through a coordinated pipeline. It begins with a question decomposition agent that reformulates complex queries into structured sub-tasks. These sub-tasks are passed to multi-source reasoning agents, comprising a web agent for external retrieval and a memory-enhanced model for long-range temporal dependencies. Finally, a decision agent synthesizes these multi-source insights to resolve contradictions and generate precise predictions. By hierarchically coordinating agents across retrieval and reasoning modalities, our framework achieves effective knowledge fusion. Extensive evaluations demonstrate that HM-RAG not only significantly improves performance on standard multimodal video reasoning benchmarks but also effectively identifies irregular events, validating its robustness in video anomaly detection tasks. Code is available at https://github.com/hanzif1/HM-RAG.
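The three-stage pipeline described in the abstract (decomposition, multi-source reasoning, decision) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Evidence`, `decompose`, `web_agent`, `memory_agent`, `decide`, `answer`) is hypothetical, the agents are stubs that return placeholder strings instead of calling a model or search engine, and picking the highest-confidence answer is only an assumed stand-in for however the decision agent actually resolves contradictions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Evidence:
    source: str         # which agent produced this answer
    sub_question: str   # the sub-task it responds to
    answer: str
    confidence: float   # assumed score in [0, 1]; not from the paper


def decompose(question: str) -> List[str]:
    """Stub decomposition agent: split a compound query on ' and '."""
    parts = [p.strip(" ?") for p in question.split(" and ")]
    return [p + "?" for p in parts if p]


def web_agent(sub_question: str) -> Evidence:
    """Stub for external retrieval; a real agent would query the web."""
    return Evidence("web", sub_question, f"web answer to: {sub_question}", 0.6)


def memory_agent(sub_question: str) -> Evidence:
    """Stub for the memory-enhanced video model."""
    return Evidence("memory", sub_question, f"video answer to: {sub_question}", 0.8)


def decide(evidence: Sequence[Evidence]) -> Dict[str, Evidence]:
    """Stub decision agent: keep the most confident answer per sub-question."""
    best: Dict[str, Evidence] = {}
    for ev in evidence:
        current = best.get(ev.sub_question)
        if current is None or ev.confidence > current.confidence:
            best[ev.sub_question] = ev
    return best


def answer(
    question: str,
    agents: Sequence[Callable[[str], Evidence]] = (web_agent, memory_agent),
) -> Dict[str, Evidence]:
    """Run decomposition, then every reasoning agent, then the decision step."""
    sub_questions = decompose(question)
    evidence = [agent(sq) for sq in sub_questions for agent in agents]
    return decide(evidence)


if __name__ == "__main__":
    result = answer("What happens at the gate and is it unusual?")
    for sub_question, ev in result.items():
        print(f"{sub_question} -> [{ev.source}] {ev.answer}")
```

In this sketch the hierarchy is simply the call order inside `answer`; swapping a stub for a real retrieval or video-language component would not change the control flow.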
| Original language | English |
|---|---|
| Article number | 113622 |
| Journal | Pattern Recognition |
| Volume | 179 |
| DOIs | |
| Publication status | Published - Nov 2026 |
| Externally published | Yes |
Keywords
- Knowledge fusion
- Multi-agent system
- Multimodal video reasoning
- Retrieval-augmented generation
- Video anomaly detection
- Video-language understanding