HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors

doi:10.1609/aaai.v38i6.28372

Xiao Wang, Zongzhen Wu, Bo Jiang^*, Zhimin Bao, Lin Zhu, Guoqi Li, Yaowei Wang, Yonghong Tian

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Contribution to journal › Conference article › peer-review

8 Citations (Scopus)

Abstract

Mainstream human activity recognition (HAR) algorithms are developed for RGB cameras, which usually suffer from illumination changes, fast motion, privacy concerns, and high energy consumption. Meanwhile, biologically inspired event cameras have attracted great interest due to their unique features, such as high dynamic range, dense temporal but sparse spatial resolution, low latency, and low power. Because the sensor is newly emerging, there is not yet a realistic large-scale dataset for HAR. Considering its great practical value, in this paper we propose a large-scale benchmark dataset to bridge this gap, termed HARDVS, which contains 300 categories and more than 100K event sequences. We evaluate and report the performance of multiple popular HAR algorithms, which provide extensive baselines for future work to compare against. More importantly, we propose a novel spatial-temporal feature learning and fusion framework, termed ESTF, for event-stream-based human activity recognition. It first projects the event streams into spatial and temporal embeddings using StemNet, then encodes and fuses the dual-view representations using Transformer networks. Finally, the dual features are concatenated and fed into a classification head for activity prediction. Extensive experiments on multiple datasets validate the effectiveness of our model. Both the dataset and source code will be released at https://github.com/EventAHU/HARDVS.
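The data flow described in the abstract (project into two views, encode each view, concatenate, classify) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function name `estf_like_forward`, the input layout (a list of flattened event-count frames), the way the spatial and temporal tokens are built, the random linear projection standing in for StemNet, and the single parameter-free self-attention layer standing in for the Transformer networks are all assumptions made for illustration. The sketch also fuses the two views only by concatenation. Only the 300 output categories come from the record above.

```python
import math
import random


def linear(vec, weights):
    """Apply a weight matrix (list of output rows) to a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def random_matrix(rows, cols, rng):
    """Untrained placeholder weights drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def self_attention(tokens):
    """Parameter-free scaled dot-product self-attention.

    Assumption: a stand-in for a Transformer encoder, with identity
    query/key/value projections and a single head.
    """
    dim = len(tokens[0])
    encoded = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        encoded.append([
            sum((e / total) * tok[d] for e, tok in zip(exps, tokens))
            for d in range(dim)
        ])
    return encoded


def mean_pool(tokens):
    """Average a token sequence into one feature vector."""
    return [sum(tok[d] for tok in tokens) / len(tokens)
            for d in range(len(tokens[0]))]


def estf_like_forward(frames, num_classes=300, dim=8, seed=0):
    """Hypothetical dual-view forward pass.

    frames: list of T frames, each a flat list of H*W event counts.
    Returns (fused_feature, logits).
    """
    rng = random.Random(seed)
    num_frames, num_pixels = len(frames), len(frames[0])

    # Temporal view (assumption): one token per frame.
    w_temporal = random_matrix(dim, num_pixels, rng)
    temporal_tokens = [linear(frame, w_temporal) for frame in frames]

    # Spatial view (assumption): one token per pixel, described by its
    # event counts across all frames.
    w_spatial = random_matrix(dim, num_frames, rng)
    spatial_tokens = [linear([frame[p] for frame in frames], w_spatial)
                      for p in range(num_pixels)]

    # Encode each view, then pool to a single vector per view.
    spatial_feat = mean_pool(self_attention(spatial_tokens))
    temporal_feat = mean_pool(self_attention(temporal_tokens))

    # Concatenate the dual features and apply a linear classification head.
    fused = spatial_feat + temporal_feat
    w_head = random_matrix(num_classes, 2 * dim, rng)
    return fused, linear(fused, w_head)


if __name__ == "__main__":
    # Toy input: 4 frames of a 3x3 sensor, flattened to 9 values each.
    toy_frames = [[(t + p) % 3 for p in range(9)] for t in range(4)]
    fused, logits = estf_like_forward(toy_frames)
    print(len(fused), len(logits))
```

With untrained random weights the predicted class is meaningless; the sketch only shows how the two views are built from the same event data and where the concatenation and classification head sit in the pipeline.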

Original language	English
Pages (from-to)	5615-5623
Number of pages	9
Journal	Proceedings of the AAAI Conference on Artificial Intelligence
Volume	38
Issue number	6
DOIs	https://doi.org/10.1609/aaai.v38i6.28372
Publication status	Published - 25 Mar 2024
Event	38th AAAI Conference on Artificial Intelligence, AAAI 2024 - Vancouver, Canada; Duration: 20 Feb 2024 → 27 Feb 2024

Cite this

Wang, X., Wu, Z., Jiang, B., Bao, Z., Zhu, L., Li, G., Wang, Y., & Tian, Y. (2024). HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 5615-5623. https://doi.org/10.1609/aaai.v38i6.28372

@article{ca68088fdbc341e396ea03693a5fedb6,
title = "HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors",
abstract = "The main streams of human activity recognition (HAR) algorithms are developed based on RGB cameras which usually suffer from illumination, fast motion, privacy preservation, and large energy consumption. Meanwhile, the biologically inspired event cameras attracted great interest due to their unique features, such as high dynamic range, dense temporal but sparse spatial resolution, low latency, low power, etc. As it is a newly arising sensor, even there is no realistic large-scale dataset for HAR. Considering its great practical value, in this paper, we propose a large-scale benchmark dataset to bridge this gap, termed HARDVS, which contains 300 categories and more than 100K event sequences. We evaluate and report the performance of multiple popular HAR algorithms, which provide extensive baselines for future works to compare. More importantly, we propose a novel spatial-temporal feature learning and fusion framework, termed ESTF, for event stream based human activity recognition. It first projects the event streams into spatial and temporal embeddings using StemNet, then, encodes and fuses the dual-view representations using Transformer networks. Finally, the dual features are concatenated and fed into a classification head for activity prediction. Extensive experiments on multiple datasets fully validated the effectiveness of our model. Both the dataset and source code will be released at https://github.com/EventAHU/HARDVS.",
author = "Xiao Wang and Zongzhen Wu and Bo Jiang and Zhimin Bao and Lin Zhu and Guoqi Li and Yaowei Wang and Yonghong Tian",
note = "Publisher Copyright: Copyright {\textcopyright} 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.; 38th AAAI Conference on Artificial Intelligence, AAAI 2024 ; Conference date: 20-02-2024 Through 27-02-2024",
year = "2024",
month = mar,
day = "25",
doi = "10.1609/aaai.v38i6.28372",
language = "English",
volume = "38",
pages = "5615--5623",
journal = "Proceedings of the AAAI Conference on Artificial Intelligence",
issn = "2159-5399",
publisher = "Association for the Advancement of Artificial Intelligence",
number = "6",
}

TY - JOUR
T1 - HARDVS
T2 - 38th AAAI Conference on Artificial Intelligence, AAAI 2024
AU - Wang, Xiao
AU - Wu, Zongzhen
AU - Jiang, Bo
AU - Bao, Zhimin
AU - Zhu, Lin
AU - Li, Guoqi
AU - Wang, Yaowei
AU - Tian, Yonghong
PY - 2024/3/25
Y1 - 2024/3/25
N2 - The main streams of human activity recognition (HAR) algorithms are developed based on RGB cameras which usually suffer from illumination, fast motion, privacy preservation, and large energy consumption. Meanwhile, the biologically inspired event cameras attracted great interest due to their unique features, such as high dynamic range, dense temporal but sparse spatial resolution, low latency, low power, etc. As it is a newly arising sensor, even there is no realistic large-scale dataset for HAR. Considering its great practical value, in this paper, we propose a large-scale benchmark dataset to bridge this gap, termed HARDVS, which contains 300 categories and more than 100K event sequences. We evaluate and report the performance of multiple popular HAR algorithms, which provide extensive baselines for future works to compare. More importantly, we propose a novel spatial-temporal feature learning and fusion framework, termed ESTF, for event stream based human activity recognition. It first projects the event streams into spatial and temporal embeddings using StemNet, then, encodes and fuses the dual-view representations using Transformer networks. Finally, the dual features are concatenated and fed into a classification head for activity prediction. Extensive experiments on multiple datasets fully validated the effectiveness of our model. Both the dataset and source code will be released at https://github.com/EventAHU/HARDVS.
AB - The main streams of human activity recognition (HAR) algorithms are developed based on RGB cameras which usually suffer from illumination, fast motion, privacy preservation, and large energy consumption. Meanwhile, the biologically inspired event cameras attracted great interest due to their unique features, such as high dynamic range, dense temporal but sparse spatial resolution, low latency, low power, etc. As it is a newly arising sensor, even there is no realistic large-scale dataset for HAR. Considering its great practical value, in this paper, we propose a large-scale benchmark dataset to bridge this gap, termed HARDVS, which contains 300 categories and more than 100K event sequences. We evaluate and report the performance of multiple popular HAR algorithms, which provide extensive baselines for future works to compare. More importantly, we propose a novel spatial-temporal feature learning and fusion framework, termed ESTF, for event stream based human activity recognition. It first projects the event streams into spatial and temporal embeddings using StemNet, then, encodes and fuses the dual-view representations using Transformer networks. Finally, the dual features are concatenated and fed into a classification head for activity prediction. Extensive experiments on multiple datasets fully validated the effectiveness of our model. Both the dataset and source code will be released at https://github.com/EventAHU/HARDVS.
UR - http://www.scopus.com/inward/record.url?scp=85180738535&partnerID=8YFLogxK
U2 - 10.1609/aaai.v38i6.28372
DO - 10.1609/aaai.v38i6.28372
M3 - Conference article
AN - SCOPUS:85180738535
SN - 2159-5399
VL - 38
SP - 5615
EP - 5623
JO - Proceedings of the AAAI Conference on Artificial Intelligence
JF - Proceedings of the AAAI Conference on Artificial Intelligence
IS - 6
Y2 - 20 February 2024 through 27 February 2024
ER -
