TY - JOUR
T1 - SpikingViT: A Multiscale Spiking Vision Transformer Model for Event-Based Object Detection
AU - Yu, Lixing
AU - Chen, Hanqi
AU - Wang, Ziming
AU - Zhan, Shaojie
AU - Shao, Jiankun
AU - Liu, Qingjie
AU - Xu, Shu
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2025
Y1 - 2025
N2 - Event cameras have unique advantages for object detection: they capture asynchronous events rather than continuous frames, excel in high-dynamic-range, low-latency, and high-speed motion scenarios, and consume less power. However, aggregating event data into image frames causes information loss and reduced detection performance, and applying traditional neural networks to event camera outputs is challenging because of the distinct characteristics of event data. In this study, we present a novel object detection model based on spiking neural networks (SNNs), the spiking vision transformer (SpikingViT), to address these issues. First, we design a dedicated event data conversion module that effectively captures the unique characteristics of event data, mitigating the risk of information loss while preserving spatiotemporal features. Second, we introduce SpikingViT itself, which leverages SNNs capable of extracting spatiotemporal information from event data. SpikingViT combines the advantages of SNNs and transformer models, incorporating mechanisms such as attention and residual voltage memory to further enhance detection performance. Extensive experiments substantiate the strong performance of SpikingViT on event-based object detection, positioning it as a competitive approach. Our proposed method effectively retains the spatiotemporal information inherent in event data, yielding a substantial improvement in detection performance.
AB - Event cameras have unique advantages for object detection: they capture asynchronous events rather than continuous frames, excel in high-dynamic-range, low-latency, and high-speed motion scenarios, and consume less power. However, aggregating event data into image frames causes information loss and reduced detection performance, and applying traditional neural networks to event camera outputs is challenging because of the distinct characteristics of event data. In this study, we present a novel object detection model based on spiking neural networks (SNNs), the spiking vision transformer (SpikingViT), to address these issues. First, we design a dedicated event data conversion module that effectively captures the unique characteristics of event data, mitigating the risk of information loss while preserving spatiotemporal features. Second, we introduce SpikingViT itself, which leverages SNNs capable of extracting spatiotemporal information from event data. SpikingViT combines the advantages of SNNs and transformer models, incorporating mechanisms such as attention and residual voltage memory to further enhance detection performance. Extensive experiments substantiate the strong performance of SpikingViT on event-based object detection, positioning it as a competitive approach. Our proposed method effectively retains the spatiotemporal information inherent in event data, yielding a substantial improvement in detection performance.
KW - DVS data converting
KW - object detection
KW - residual voltage memory
KW - spiking transformer
UR - https://www.scopus.com/pages/publications/85197566338
U2 - 10.1109/TCDS.2024.3422873
DO - 10.1109/TCDS.2024.3422873
M3 - Article
AN - SCOPUS:85197566338
SN - 2379-8920
VL - 17
SP - 130
EP - 146
JO - IEEE Transactions on Cognitive and Developmental Systems
JF - IEEE Transactions on Cognitive and Developmental Systems
IS - 1
ER -