
HM-RAG: Long video reasoning and anomaly detection via hierarchical multi-agent retrieval-augmented generation

  • Jisheng Dang
  • Quan Wan
  • Dewei Liu
  • Ziyue Wang
  • Bimei Wang*
  • Pei Liu
  • Hong Peng
  • Bin Hu
  • Tat Seng Chua

*Corresponding author for this work

  • Lanzhou University
  • Hong Kong University of Science and Technology
  • Beijing Institute of Technology
  • National University of Singapore

Research output: Contribution to journal › Article › peer-review

Abstract

Multimodal video reasoning and anomaly detection remain key challenges for Large Language Models (LLMs) due to limited video-text alignment, narrow knowledge coverage, and difficulties in handling complex or weakly video-related queries. To address these limitations, we propose a Hierarchical Multi-Agent Retrieval-Augmented Generation (HM-RAG) framework that integrates internal temporal understanding with external knowledge retrieval. Specifically, our approach operates through a coordinated pipeline. It begins with a question decomposition agent that reformulates complex queries into structured sub-tasks. These are passed to multi-source reasoning agents, comprising a web agent for external retrieval and a memory-enhanced model for long-range temporal dependencies. Finally, a decision agent synthesizes these multi-source insights to resolve contradictions and generate precise predictions. By hierarchically coordinating agents across retrieval and reasoning modalities, our framework achieves effective knowledge fusion. Extensive evaluations demonstrate that HM-RAG not only significantly improves performance on standard multimodal video reasoning benchmarks but also effectively identifies irregular events, validating its robustness in video anomaly detection tasks. Code is available at https://github.com/hanzif1/HM-RAG.
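
The three-stage pipeline described in the abstract (decomposition, multi-source reasoning, decision) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which is in the linked repository. Every name here (`Evidence`, `decompose`, `web_agent`, `memory_agent`, `decide`, `run_pipeline`) is hypothetical, the agents are stubs that return placeholder strings instead of calling a model or a search service, and the "keep the highest-confidence answer" rule for resolving contradictions is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Evidence:
    """One agent's answer to one sub-task (hypothetical structure)."""
    source: str        # which agent produced the answer
    sub_task: str      # the sub-question being answered
    answer: str        # placeholder answer text
    confidence: float  # assumed self-reported score in [0, 1]


# An agent maps a sub-task string to a piece of evidence.
Agent = Callable[[str], Evidence]


def decompose(question: str) -> List[str]:
    """Stub decomposition agent: naively split the query on ' and '."""
    parts = [p.strip(" ?") for p in question.split(" and ")]
    return [p + "?" for p in parts if p]


def web_agent(sub_task: str) -> Evidence:
    """Stub for external retrieval; a real agent would query the web."""
    return Evidence("web", sub_task, f"external facts for: {sub_task}", 0.6)


def memory_agent(sub_task: str) -> Evidence:
    """Stub for the memory-enhanced video model (long-range context)."""
    return Evidence("memory", sub_task, f"temporal context for: {sub_task}", 0.8)


def decide(evidence: Sequence[Evidence]) -> Dict[str, Evidence]:
    """Stub decision agent: per sub-task, keep the most confident answer.

    Assumption: contradictions are resolved by confidence alone.
    """
    best: Dict[str, Evidence] = {}
    for ev in evidence:
        current = best.get(ev.sub_task)
        if current is None or ev.confidence > current.confidence:
            best[ev.sub_task] = ev
    return best


def run_pipeline(
    question: str,
    agents: Sequence[Agent] = (web_agent, memory_agent),
) -> Dict[str, Evidence]:
    """Decompose the query, gather evidence from every agent, then decide."""
    sub_tasks = decompose(question)
    evidence = [agent(task) for task in sub_tasks for agent in agents]
    return decide(evidence)


if __name__ == "__main__":
    result = run_pipeline("Who enters the room and what happens after the alarm?")
    for task, ev in result.items():
        print(f"{task} -> [{ev.source}] {ev.answer}")
```

Keeping the agents behind a common callable type means a further retrieval source could be passed in through `agents` without changing the decision step. Whether the actual framework is organized this way is not stated in the abstract.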

Original language: English
Article number: 113622
Journal: Pattern Recognition
Volume: 179
Publication status: Published - Nov 2026
Externally published: Yes

Keywords

  • Knowledge fusion
  • Multi-agent system
  • Multimodal video reasoning
  • Retrieval-augmented generation
  • Video anomaly detection
  • Video-language understanding
