A Two-Stage Cognitive Framework for Referring Multi-Object Tracking: Mimicking Human Cognitive Processes

Lian Zhang, Yuzhen Wu, Lingxue Wang*, Mingkun Chen, Yi Cai

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Peer-reviewed

Abstract

In this study, we propose a two-stage cognitive architecture for Referring Multi-Object Tracking (RMOT), inspired by human cognitive processes in event understanding. The framework distinguishes between simple tasks, which can be comprehended rapidly, and complex tasks, which require deeper analysis. In the fast-understanding phase, our method achieves 25 FPS using a language-guided detector based on the GroundingDINO model, which quickly infers detection instances of the primary targets specified by the input text. These instances are matched with tracking trajectories by an association module that combines an extra exiting decision mechanism with a Re-ID model of minimized feature-extraction overhead, which significantly accelerates matching while maintaining high accuracy. In the subsequent slow-understanding phase, our approach re-evaluates the textual semantics relative to each target along its tracking trajectory, ensuring accurate correlation between detected objects and their respective descriptions. Notably, our method saves 12 minutes compared to the latest algorithms, improving efficiency without compromising accuracy. Central to this two-stage framework is the cascade attention architecture within the knowledge unification module. We employ the agent attention mechanism, enabling the model to focus selectively on relevant features within both local object and global scene contexts. By dynamically weighting feature contributions, agent attention enhances the model's ability to discern critical information, improving both tracking precision and contextual awareness. Overall, our two-stage cognitive architecture demonstrates significant improvements in performance and speed, achieving a relative performance improvement of 2.17 HOTA.
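The abstract names agent attention but does not say how it is configured inside the knowledge unification module. As a rough illustration only, the sketch below follows the general agent attention idea: a small set of agent tokens first aggregates information from the keys and values, then broadcasts it back to the queries. The function names, the average-pooling of queries into agents, and the omission of bias and convolution terms are all assumptions made for this example, not details taken from the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Plain scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def pool_agents(q, num_agents):
    """Hypothetical agent construction: average-pool consecutive queries.

    Yields at most num_agents agent tokens.
    """
    size = math.ceil(len(q) / num_agents)
    agents = []
    for start in range(0, len(q), size):
        chunk = q[start:start + size]
        agents.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return agents

def agent_attention(q, k, v, num_agents):
    """Two-step attention routed through a few agent tokens."""
    agents = pool_agents(q, num_agents)
    # Step 1 (aggregation): agents act as queries over the keys and values.
    agent_values = attention(agents, k, v)
    # Step 2 (broadcast): original queries attend to the agents.
    return attention(q, agents, agent_values)

if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    v = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    for row in agent_attention(q, k, v, num_agents=2):
        print([round(x, 3) for x in row])
```

With n queries, m keys, and a agents, the two steps score n·a + a·m pairs instead of the n·m pairs of full attention, which is why this style of attention is attractive when the agent count is small. Each output row is still a convex combination of the value rows, as in ordinary softmax attention.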

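Original language: English
Title of host publication: Tenth Symposium on Novel Optoelectronic Detection Technology and Applications
Editor: Chen Ping
Publisher: SPIE
ISBN (electronic): 9781510688148
DOI: https://doi.org/10.1117/12.3054715
Publication status: Published - 2025
Event: 10th Symposium on Novel Optoelectronic Detection Technology and Applications - Taiyuan, China
Duration: 1 Nov 2024 to 3 Nov 2024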

Publication series

Name: Proceedings of SPIE - The International Society for Optical Engineering
Volume: 13511
ISSN (print): 0277-786X
ISSN (electronic): 1996-756X

Conference

Conference: 10th Symposium on Novel Optoelectronic Detection Technology and Applications
Country/Territory: China
City: Taiyuan
Period: 1/11/24 to 3/11/24

Cite this

Zhang, L., Wu, Y., Wang, L., Chen, M., & Cai, Y. (2025). A Two-Stage Cognitive Framework for Referring Multi-Object Tracking: Mimicking Human Cognitive Processes. In C. Ping (Ed.), Tenth Symposium on Novel Optoelectronic Detection Technology and Applications, Article 135110L (Proceedings of SPIE - The International Society for Optical Engineering; Vol. 13511). SPIE. https://doi.org/10.1117/12.3054715