TY - GEN
T1 - Event-based Few-shot Fine-grained Human Action Recognition
AU - Yang, Zonglin
AU - Yang, Yan
AU - Shi, Yuheng
AU - Yang, Hao
AU - Zhang, Ruikun
AU - Liu, Liu
AU - Wu, Xinxiao
AU - Pan, Liyuan
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Few-shot fine-grained human (FGH) action recognition is crucial for human-robot interaction in open-set real-world environments. Existing works mainly rely on features extracted from RGB frames. However, their performance degrades drastically in challenging scenarios, such as high-dynamic or low-light conditions. Event cameras can independently and sparsely capture brightness changes in a scene at microsecond resolution and with high dynamic range, offering a promising solution. However, the modality gap between events and RGB frames, together with the lack of paired fine-grained data, hinders the development of event-based FGH action recognition. Therefore, in this paper, we introduce the first Event Camera Fine-grained Human Action (E-FAction) dataset. The dataset comprises 3304 paired event streams and RGB sequences, covering 15 coarse action classes and 128 fine-grained actions. We then develop a versatile event feature extractor. Considering the spatial sparsity of event streams, we design two modules that mine temporal motion and semantic features under the guidance of paired RGB frames, providing robust weight initialization for the feature extractor in few-shot FGH action recognition. We conduct extensive experiments on both published datasets and our newly built synthetic and real datasets, and consistently achieve state-of-the-art performance against existing baselines. Code and dataset will be available at link.
UR - http://www.scopus.com/inward/record.url?scp=85209448852&partnerID=8YFLogxK
U2 - 10.1109/IROS58592.2024.10801824
DO - 10.1109/IROS58592.2024.10801824
M3 - Conference contribution
AN - SCOPUS:85209448852
T3 - IEEE International Conference on Intelligent Robots and Systems
SP - 519
EP - 526
BT - 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2024
Y2 - 14 October 2024 through 18 October 2024
ER -