Abstract
Human action recognition based on deep-learning methods has received increasing attention and developed rapidly. However, current methods suffer from drawbacks such as the confusion caused by convolving over time and space independently, the ability to process only short sequences, and modeling restricted to a single temporal scale. The key to precisely classifying actions is to capture appearance and motion throughout entire videos. To this end, a multi-branch spatial-temporal network (MSTN) is proposed, consisting of a multi-branch deep network and a long-term feature (LTF) layer. The benefits of the proposed MSTN include: (a) the multi-branch spatial-temporal network encodes spatial and temporal information simultaneously, and (b) the LTF layer aggregates a video-level representation over multiple temporal scales. Evaluations on two action datasets and comparisons with several state-of-the-art approaches demonstrate the effectiveness of the proposed network.
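The paper itself specifies the LTF layer's design; purely as an illustration of the general idea of aggregating frame features over multiple temporal scales into a single video-level representation, here is a minimal NumPy sketch (the function name, the segment counts, and the mean-pooling scheme are assumptions for illustration, not the authors' implementation):

```python
import numpy as np

def multiscale_temporal_pool(frame_feats, scales=(1, 2, 4)):
    """Aggregate per-frame features into one video-level vector.

    frame_feats: array of shape (T, D), one D-dim feature per frame.
    scales: for each scale s, split the timeline into s equal segments
            and mean-pool each segment (a stand-in for multi-scale
            temporal aggregation; not the paper's actual LTF layer).
    Returns a vector of length D * sum(scales).
    """
    T, D = frame_feats.shape
    parts = []
    for s in scales:
        # Segment boundaries for s equal temporal chunks.
        bounds = np.linspace(0, T, s + 1).astype(int)
        for i in range(s):
            segment = frame_feats[bounds[i]:bounds[i + 1]]
            parts.append(segment.mean(axis=0))
    return np.concatenate(parts)

# Example: 16 frames of 8-dim features -> (1 + 2 + 4) * 8 = 56-dim descriptor
feats = np.random.randn(16, 8)
video_vec = multiscale_temporal_pool(feats)
print(video_vec.shape)  # (56,)
```

At scale 1 this reduces to global average pooling over the whole clip, while the finer scales preserve coarse temporal ordering, which is the intuition behind representing a video at multiple temporal granularities.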
Original language | English
---|---
Article number | 8832232
Pages (from-to) | 1556-1560
Number of pages | 5
Journal | IEEE Signal Processing Letters
Volume | 26
Issue number | 10
DOIs |
Publication status | Published - Oct 2019
Keywords
- Action recognition
- deep learning
- long-term feature layer
- spatial-temporal network