TY - JOUR
T1 - SSTFormer
T2 - Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition
AU - Wang, Xiao
AU - Rong, Yao
AU - Wu, Zongzhen
AU - Zhu, Lin
AU - Jiang, Bo
AU - Tang, Jin
AU - Tian, Yonghong
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2025
Y1 - 2025
N2 - Event camera-based pattern recognition has emerged as a new research topic in recent years. Current researchers usually transform the event streams into images, graphs, or voxels, and adopt deep neural networks for event-based classification. Although good performance can be achieved on simple event recognition datasets, their results may still be limited by the following two issues. Firstly, they adopt only spatially sparse event streams for recognition, which may fail to capture color and detailed texture information well. Secondly, they adopt either Spiking Neural Networks (SNNs) for energy-efficient recognition with suboptimal results, or Artificial Neural Networks (ANNs) for energy-intensive, high-performance recognition; few of them consider achieving a balance between these two aspects. In this paper, we formally propose to recognize patterns by fusing RGB frames and event streams simultaneously, and we present a new RGB frame-event recognition framework to address the aforementioned issues. The proposed method contains four main modules, i.e., a memory support Transformer network for RGB frame encoding, a spiking neural network for raw event stream encoding, a multi-modal bottleneck fusion module for RGB-Event feature aggregation, and a prediction head. Due to the scarcity of RGB-Event based classification datasets, we also propose a large-scale PokerEvent dataset, which contains 114 classes and 27,102 frame-event pairs recorded using a DVS346 event camera. Extensive experiments on two RGB-Event based classification datasets fully validate the effectiveness of our proposed framework.
AB - Event camera-based pattern recognition has emerged as a new research topic in recent years. Current researchers usually transform the event streams into images, graphs, or voxels, and adopt deep neural networks for event-based classification. Although good performance can be achieved on simple event recognition datasets, their results may still be limited by the following two issues. Firstly, they adopt only spatially sparse event streams for recognition, which may fail to capture color and detailed texture information well. Secondly, they adopt either Spiking Neural Networks (SNNs) for energy-efficient recognition with suboptimal results, or Artificial Neural Networks (ANNs) for energy-intensive, high-performance recognition; few of them consider achieving a balance between these two aspects. In this paper, we formally propose to recognize patterns by fusing RGB frames and event streams simultaneously, and we present a new RGB frame-event recognition framework to address the aforementioned issues. The proposed method contains four main modules, i.e., a memory support Transformer network for RGB frame encoding, a spiking neural network for raw event stream encoding, a multi-modal bottleneck fusion module for RGB-Event feature aggregation, and a prediction head. Due to the scarcity of RGB-Event based classification datasets, we also propose a large-scale PokerEvent dataset, which contains 114 classes and 27,102 frame-event pairs recorded using a DVS346 event camera. Extensive experiments on two RGB-Event based classification datasets fully validate the effectiveness of our proposed framework.
KW - Bio-inspired Computing
KW - Bottleneck Mechanism
KW - Spiking Neural Networks
KW - Transformer Networks
KW - Video Classification
UR - http://www.scopus.com/inward/record.url?scp=105005320513&partnerID=8YFLogxK
U2 - 10.1109/TCDS.2025.3568833
DO - 10.1109/TCDS.2025.3568833
M3 - Article
AN - SCOPUS:105005320513
SN - 2379-8920
JO - IEEE Transactions on Cognitive and Developmental Systems
JF - IEEE Transactions on Cognitive and Developmental Systems
ER -