TY - GEN
T1 - Event Stream-Based Visual Object Tracking
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
AU - Wang, Xiao
AU - Wang, Shiao
AU - Tang, Chuanming
AU - Zhu, Lin
AU - Jiang, Bo
AU - Tian, Yonghong
AU - Tang, Jin
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Tracking with bio-inspired event cameras has garnered increasing interest in recent years. Existing works either utilize aligned RGB and event data for accurate tracking or directly learn an event-based tracker. The former incurs higher inference costs, while the latter may be susceptible to noisy events or sparse spatial resolution. In this paper, we propose a novel hierarchical knowledge distillation framework that can fully utilize multimodal/multi-view information during training to facilitate knowledge transfer, enabling us to achieve high-speed, low-latency visual tracking during testing by using only event signals. Specifically, a teacher Transformer-based multimodal tracking framework is first trained by feeding the RGB frame and event stream simultaneously. Then, we design a new hierarchical knowledge distillation strategy that includes pairwise similarity, feature representation, and response map-based knowledge distillation to guide the learning of the student Transformer network. In particular, since existing event-based tracking datasets are all low-resolution (346 × 260), we propose the first large-scale high-resolution (1280 × 720) dataset, named EventVOT. It contains 1141 videos and covers a wide range of categories such as pedestrians, vehicles, UAVs, and ping pong. Extensive experiments on both the low-resolution datasets (FE240hz, VisEvent, COESOT) and our newly proposed high-resolution EventVOT dataset fully validate the effectiveness of our proposed method.
AB - Tracking with bio-inspired event cameras has garnered increasing interest in recent years. Existing works either utilize aligned RGB and event data for accurate tracking or directly learn an event-based tracker. The former incurs higher inference costs, while the latter may be susceptible to noisy events or sparse spatial resolution. In this paper, we propose a novel hierarchical knowledge distillation framework that can fully utilize multimodal/multi-view information during training to facilitate knowledge transfer, enabling us to achieve high-speed, low-latency visual tracking during testing by using only event signals. Specifically, a teacher Transformer-based multimodal tracking framework is first trained by feeding the RGB frame and event stream simultaneously. Then, we design a new hierarchical knowledge distillation strategy that includes pairwise similarity, feature representation, and response map-based knowledge distillation to guide the learning of the student Transformer network. In particular, since existing event-based tracking datasets are all low-resolution (346 × 260), we propose the first large-scale high-resolution (1280 × 720) dataset, named EventVOT. It contains 1141 videos and covers a wide range of categories such as pedestrians, vehicles, UAVs, and ping pong. Extensive experiments on both the low-resolution datasets (FE240hz, VisEvent, COESOT) and our newly proposed high-resolution EventVOT dataset fully validate the effectiveness of our proposed method.
KW - Event-based Tracking
KW - Visual Tracking
KW - Benchmark Dataset
KW - EventVOT dataset
UR - http://www.scopus.com/inward/record.url?scp=85191648892&partnerID=8YFLogxK
U2 - 10.1109/CVPR52733.2024.01821
DO - 10.1109/CVPR52733.2024.01821
M3 - Conference contribution
AN - SCOPUS:85191648892
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 19248
EP - 19257
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
PB - IEEE Computer Society
Y2 - 16 June 2024 through 22 June 2024
ER -