TY - GEN
T1 - A Two-Stage Cognitive Framework for Referring Multi-Object Tracking
T2 - 10th Symposium on Novel Optoelectronic Detection Technology and Applications
AU - Zhang, Lian
AU - Wu, Yuzhen
AU - Wang, Lingxue
AU - Chen, Mingkun
AU - Cai, Yi
N1 - Publisher Copyright: © 2025 SPIE.
PY - 2025
Y1 - 2025
AB - In this study, we propose a two-stage cognitive architecture for Referring Multi-Object Tracking (RMOT), inspired by human cognitive processes in event understanding. The framework distinguishes between simple tasks, which can be comprehended rapidly, and complex tasks, which require deeper analysis. In the fast-understanding phase, our method achieves 25 FPS using a language-guided detector based on the GroundingDINO model, which quickly infers detection instances of the primary targets specified by the input text. These instances are matched with tracking trajectories by an association module that combines an additional exit decision mechanism with a Re-ID model designed for minimal feature-extraction overhead. This significantly accelerates matching while maintaining high accuracy. In the subsequent slow-understanding phase, our approach re-evaluates the textual semantics relative to each target along its tracking trajectory, ensuring accurate correspondence between detected objects and their descriptions. Notably, our method saves 12 minutes compared to the latest algorithms, improving efficiency without compromising accuracy. Central to this two-stage framework is the cascade attention architecture within the knowledge unification module. We employ the agent attention mechanism, which enables the model to focus selectively on relevant features in both local object and global scene contexts. By dynamically weighting feature contributions, agent attention enhances the model's ability to discern critical information, improving both tracking precision and contextual awareness. Overall, our two-stage cognitive architecture improves both performance and speed, achieving a relative performance improvement of 2.17 HOTA.
KW - Agent Attention
KW - Referring Multi-Object Tracking
KW - Two-stage Cognitive Architecture
UR - http://www.scopus.com/inward/record.url?scp=85219217667&partnerID=8YFLogxK
U2 - 10.1117/12.3054715
DO - 10.1117/12.3054715
M3 - Conference contribution
AN - SCOPUS:85219217667
T3 - Proceedings of SPIE - The International Society for Optical Engineering
BT - Tenth Symposium on Novel Optoelectronic Detection Technology and Applications
A2 - Ping, Chen
PB - SPIE
Y2 - 1 November 2024 through 3 November 2024
ER -