Abstract
CNN-based crowd counting methods have made great progress in recent years. However, most of them do not make full use of contextual information, which combines high-level semantic features and low-level detail features from the different receptive fields of a CNN. Rich contextual information is crucial for handling the scale variation problem in crowd counting, so its absence degrades the accuracy of previous CNN-based methods. To address this problem, we propose an adaptive attention fusion mechanism (AAFM) that effectively exploits multi-scale features from different receptive fields, integrating a convolutional network for feature learning with an attention mechanism for multi-scale feature fusion. We use the first 13 convolution layers of VGG-16 as the encoder module to extract base features, which are then fed into the decoder module. The decoder consists mainly of a Density Regression Branch (DRB) and a Feature Fusion Branch (FFB): the DRB uses multiple convolution layers for feature learning and multi-scale feature extraction, while the FFB uses attention modules to model the multi-scale features and element-wise multiplication to fuse them. AAFM thus brings rich contextual information into the encoder-decoder framework, producing high-quality crowd density maps and accurate counts. Experiments on the ShanghaiTech, UCF-CC-50, and UCF-QNRF datasets show that AAFM achieves promising results.
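The attention-weighted fusion described for the FFB — attention maps modulating multi-scale features via element-wise multiplication before combination — can be illustrated with a minimal NumPy sketch. Note this is an assumption-laden stand-in, not the paper's implementation: the scoring function (a channel mean here) and the softmax over scales are illustrative choices, and the real FFB uses learned attention modules.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_attention_fusion(features):
    """Fuse multi-scale feature maps with per-pixel attention weights.

    features: array of shape (S, H, W, C) holding S feature maps from
    different receptive fields. This is a hypothetical stand-in for the
    FFB inputs; the paper's learned attention modules are replaced by a
    simple hand-crafted scoring function for illustration.
    """
    # Score each scale at every spatial location (mean over channels
    # as a toy scoring function).
    scores = features.mean(axis=-1, keepdims=True)   # (S, H, W, 1)
    # Attention weights sum to 1 across the scale axis.
    attn = softmax(scores, axis=0)                   # (S, H, W, 1)
    # Element-wise multiply each scale by its attention map, then sum.
    fused = (features * attn).sum(axis=0)            # (H, W, C)
    return fused
```

With identical inputs at every scale, the softmax assigns equal weights and the fusion reduces to an identity, which is a quick sanity check that the weights are properly normalized.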
| Original language | English |
| --- | --- |
| Article number | 9151937 |
| Pages (from-to) | 138297-138306 |
| Number of pages | 10 |
| Journal | IEEE Access |
| Volume | 8 |
| DOIs | |
| Publication status | Published - 2020 |
Keywords
- Crowd counting
- adaptive attention fusion mechanism
- density estimation