TY - JOUR
T1 - Revisiting color-event based tracking
T2 - A unified network, dataset, and metric
AU - Tang, Chuanming
AU - Wang, Xiao
AU - Huang, Ju
AU - Jiang, Bo
AU - Zhu, Lin
AU - Chen, Shifeng
AU - Zhang, Jianlin
AU - Wang, Yaowei
AU - Tian, Yonghong
N1 - Publisher Copyright:
© 2025 Elsevier Ltd
PY - 2026/4
Y1 - 2026/4
N2 - Combining color and event cameras (also called Dynamic Vision Sensors, DVS) for robust object tracking has emerged as a research topic in recent years. Existing color-event tracking frameworks usually comprise multiple scattered modules (feature extraction, fusion, matching, interactive learning, etc.), which can lead to low efficiency and high computational complexity. In this paper, we propose a single-stage backbone network for Color-Event Unified Tracking (CEUTrack) that performs these functions simultaneously. Given the event points and color frames, we first transform the points into voxels and crop the template and search regions for both modalities. These regions are then projected into tokens and jointly fed into an adaptive vision Transformer network. The output features are fed into a tracking head for target object localization. Our proposed CEUTrack is simple, effective, and efficient, achieving over 75 FPS and state-of-the-art performance. To better validate the effectiveness of our model and address the data deficiency of the color-event tracking task, we propose a generic, large-scale benchmark dataset for color-event tracking, termed COESOT, which contains 90 categories and 1354 video sequences. Furthermore, we propose a new evaluation criterion that better assesses tracking results by measuring the difficulty level of video frames. We hope the newly proposed method and dataset provide a better platform for color-event-based tracking. The dataset, toolkit, and source code have been released at https://github.com/Event-AHU/COESOT.
KW - Color-event tracking
KW - Dataset and unified network
KW - Visual tracking
UR - https://www.scopus.com/pages/publications/105021571897
U2 - 10.1016/j.patcog.2025.112718
DO - 10.1016/j.patcog.2025.112718
M3 - Article
AN - SCOPUS:105021571897
SN - 0031-3203
VL - 172
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 112718
ER -