TY - JOUR
T1 - Event Stream based Human Action Recognition
T2 - A High-Definition Benchmark Dataset and Algorithms
AU - Wang, Xiao
AU - Wang, Shiao
AU - Shao, Pengpeng
AU - Zhu, Lin
AU - Jiang, Bo
AU - Tian, Yonghong
N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2026.
PY - 2026/4
Y1 - 2026/4
AB - Human Action Recognition (HAR) is a pivotal research area in computer vision and artificial intelligence, and RGB cameras remain the dominant sensor for work in this field. In real-world applications, however, RGB cameras face numerous challenges, including lighting conditions, fast motion, and privacy concerns. Consequently, bio-inspired event cameras have attracted increasing attention owing to advantages such as low energy consumption and high dynamic range. Nevertheless, most existing event-based HAR datasets are low resolution (346×260). In this paper, we propose a large-scale, high-definition (1280×800) human action recognition dataset based on the CeleX-V event camera, termed CeleX-HAR. It covers 150 common action categories and comprises a total of 124,625 video sequences. Factors such as multiple views, illumination, action speed, and occlusion are considered when recording the data. To build a more comprehensive benchmark, we report results for over 20 mainstream HAR models as baselines for future work to compare against. In addition, we propose a novel Mamba vision backbone network for event stream based HAR, termed EVMamba, which is equipped with spatial-plane multi-directional scanning and a novel voxel temporal scanning mechanism. By encoding and mining the spatio-temporal information of event streams, EVMamba achieves favorable results across multiple datasets. Both the dataset and source code have been released at https://github.com/Event-AHU/CeleX-HAR.
KW - Event Camera
KW - Human Action Recognition
KW - Mamba Network
KW - Spatio-temporal Feature Learning
UR - https://www.scopus.com/pages/publications/105033693123
U2 - 10.1007/s11263-026-02769-4
DO - 10.1007/s11263-026-02769-4
M3 - Article
AN - SCOPUS:105033693123
SN - 0920-5691
VL - 134
JO - International Journal of Computer Vision
JF - International Journal of Computer Vision
IS - 4
M1 - 181
ER -