Full Length Article
Multi-scale and spatial position-based channel attention network for crowd counting
Introduction
Crowd analysis has recently attracted much attention due to its wide applications in public safety, video surveillance, and urban planning. Researchers have tried to analyze crowd scenes from many aspects, such as coherent motion detection [1], crowd behavior analysis [2], tracking [3], and crowd counting [15]. In this paper, we focus on the crowd counting task in crowd scenes.
Crowd counting aims at estimating the number of pedestrians in crowd scenes. Early works mainly adopt detection-based [4], [5], [6] and regression-based [7], [8], [9] methods to estimate pedestrian numbers. The detection-based method deploys a sliding window to detect each person and adds up all detected persons to get the crowd count. This approach counts accurately in sparse crowd scenes, but performs poorly in high-density and severely occluded crowd scenes. In contrast, the regression-based method, which directly learns the mapping function between image features and crowd counts, obtains better results in dense crowd scenes. Nevertheless, this type of method discards the crowd distribution information when regressing the count value, and therefore cannot support the analysis of crowd behaviors (e.g., riots and stampedes) in crowd scenarios. Thus, researchers propose density estimation-based methods [10], [11], which generate a density map for the crowd image and have become the mainstream approach in the crowd counting field. The density map presents the spatial distribution of the crowd, and the crowd count can be obtained by summing over the whole density map. Although the initial density estimation-based methods achieved a meaningful improvement, they rely on handcrafted features, which are insufficient to cope with the large scale variation, non-uniform distribution and complex backgrounds found in real scenes.
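The relation between a density map and the crowd count can be illustrated with a minimal sketch. It assumes the common convention of placing one unit-mass Gaussian at each annotated head position; the kernel size, `sigma`, and the function names are illustrative choices, not values taken from this paper. Plain Python is used so the example runs without dependencies.

```python
import math

def gaussian_kernel(size, sigma):
    """Return a size x size Gaussian kernel normalized to sum to 1."""
    half = size // 2
    kernel = [[math.exp(-((x - half) ** 2 + (y - half) ** 2) / (2 * sigma ** 2))
               for x in range(size)] for y in range(size)]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]

def density_map(height, width, heads, size=5, sigma=1.0):
    """Place one unit-mass Gaussian per annotated head position (row, col)."""
    dmap = [[0.0] * width for _ in range(height)]
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    for r, c in heads:
        # Collect the kernel cells that fall inside the image.
        cells = [(r + dy - half, c + dx - half, kernel[dy][dx])
                 for dy in range(size) for dx in range(size)
                 if 0 <= r + dy - half < height and 0 <= c + dx - half < width]
        # Renormalize so a head near the border still contributes mass 1.
        mass = sum(v for _, _, v in cells)
        for y, x, v in cells:
            dmap[y][x] += v / mass
    return dmap

def count_from_density(dmap):
    """The crowd count is the sum over all pixels of the density map."""
    return sum(sum(row) for row in dmap)

# Three hypothetical head annotations, one of them in the image corner.
heads = [(2, 3), (10, 10), (0, 0)]
dmap = density_map(16, 16, heads)
estimated_count = count_from_density(dmap)
```

Because every head contributes exactly one unit of mass, summing the map recovers the number of annotations, while the map itself retains where the people are.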
Recently, encouraged by the powerful feature learning ability of convolutional neural networks (CNNs) [12], [13], [14], state-of-the-art works mostly adopt CNN architectures for crowd density map estimation [15], [16], [17], [18], [19], [20], [21], [22]. More recently, many researchers have incorporated the attention mechanism [23], [24], [25], [26], [27] into CNN-based crowd density estimation algorithms to improve performance. The methods in [23], [24], [25], [26] adopt a spatial attention model to guide the network to focus on head regions and ignore background noise. These methods improve the spatial representation ability of the feature map, but they do not model the interdependencies between feature channels. In a CNN, the extracted features consist of multiple channel-wise feature maps, and different channel-wise feature maps contribute differently to the feature representation. However, the channel-wise feature maps contain many useless and redundant features, which adversely affect the accuracy of the feature representation. Some works [27], [28], [29] propose the channel attention model (CAM) to solve this problem. CAM calculates a set of one-dimensional channel attention weights, assigning large weights to important channel-wise features and small weights to useless ones. However, such a CAM assigns a single weight value to the entire channel-wise feature map, so useful and useless information within a channel is enhanced or weakened together; for example, the useless information in a highly contributive channel-wise feature map is also enhanced.
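The limitation of one-dimensional channel weights can be seen in a small sketch. It assumes a squeeze-and-excitation style CAM (global average pooling followed by a learned layer and a sigmoid); the single linear layer and the `weight` and `bias` values are hypothetical simplifications for illustration, not the design of any cited method.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(features, weight, bias):
    """features: list of C maps, each H x W. Returns (weights, rescaled features)."""
    # Squeeze: global average pooling gives one descriptor per channel.
    pooled = [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
              for fmap in features]
    # Excite: a single linear layer plus sigmoid gives one weight per channel.
    weights = [sigmoid(sum(w * p for w, p in zip(w_row, pooled)) + b)
               for w_row, b in zip(weight, bias)]
    # Every position of a channel is multiplied by the same scalar.
    rescaled = [[[a * v for v in row] for row in fmap]
                for a, fmap in zip(weights, features)]
    return weights, rescaled

features = [
    [[1.0, 0.0], [0.0, 9.0]],   # channel 0: one strong response, rest weak
    [[0.1, 0.1], [0.1, 0.1]],   # channel 1: uniformly weak
]
weight = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical learned parameters
bias = [0.0, 0.0]
cam_weights, cam_out = channel_attention(features, weight, bias)
```

Channel 0 receives the larger weight because of its single strong response, and that same weight is applied to its weak position `[0][0]` as well, which is the coarse behavior described above.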
To address this issue, we propose a multi-scale and spatial position-based channel attention network (MS-SPCANet) for the crowd counting task, which is characterized by integrating spatial position-based channel attention models (SPCAMs) with multiple scales. SPCAM generates a three-dimensional attention weight map to assign different channel attention weights to different positions of the channel-wise map, thereby maximally extracting useful features and suppressing useless features. Meanwhile, considering the large-scale variation of pedestrians, we use a single-column network to embed SPCAM modules into multiple layers to aggregate features with different receptive fields. Furthermore, we propose a new adaptive loss which introduces a headcount loss to the density map loss to improve the accuracy of the network in sparse crowd scenarios. To strike a balance between the two loss functions, the adaptive loss employs adaptive weights to automatically adjust the ratio between the two kinds of losses. Extensive experimental results show that our approach achieves superior performance on four public counting datasets including ShanghaiTech Part_A, UCF_QNRF, NWPU-Crowd and TRANCOS.
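The idea of a three-dimensional attention weight map can be sketched as follows. This excerpt does not specify how SPCAM computes its weights, so the sketch assumes a stand-in: a 1x1 convolution across channels followed by a sigmoid at every spatial position. The function name, `weight`, and `bias` are hypothetical, and the code illustrates only the shape of the output (C x H x W rather than C), not the authors' module.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def position_wise_channel_attention(features, weight, bias):
    """features: C maps of H x W. Returns a C x H x W weight map and rescaled features."""
    channels, height, width = len(features), len(features[0]), len(features[0][0])
    attn = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for y in range(height):
        for x in range(width):
            # Channel vector at this spatial position.
            vec = [features[c][y][x] for c in range(channels)]
            for c in range(channels):
                # Assumed 1x1 convolution: mixes channels at one position only.
                logit = sum(w * v for w, v in zip(weight[c], vec)) + bias[c]
                attn[c][y][x] = sigmoid(logit)
    # Each position of each channel is rescaled by its own weight.
    out = [[[attn[c][y][x] * features[c][y][x] for x in range(width)]
            for y in range(height)] for c in range(channels)]
    return attn, out

features = [
    [[1.0, 0.0], [0.0, 9.0]],   # channel 0: one strong response, rest weak
    [[0.1, 0.1], [0.1, 0.1]],   # channel 1: uniformly weak
]
weight = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical learned parameters
bias = [-2.0, -2.0]
attn, out = position_wise_channel_attention(features, weight, bias)
```

Within channel 0, the strong position now gets a weight close to 1 while the weak positions get small weights, so a single channel can be enhanced and suppressed at different locations.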
In short, our main contributions can be summarized as follows:
(1) We design a spatial position-based channel attention model (SPCAM) to assign different channel attention weights to different spatial positions of the channel-wise feature maps. Guided by the probability map generated by SPCAM, the network maximally emphasizes informative features and suppresses redundant features, thus improving the feature representation capability of the network.
(2) We propose MS-SPCANet, which incorporates SPCAM modules into multiple layers to gather multi-scale channel-wise attention features, capturing high-contribution information while remaining robust to changes in pedestrian scale.
(3) We train our network with a new adaptive loss that introduces a headcount loss to the density map loss and uses adaptive weights to combine the two losses. The proposed adaptive loss improves the accuracy of crowd counting for sparse crowd samples.
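The combination of a density map loss and a headcount loss described in contribution (3) can be sketched as below. The excerpt does not give the loss formulas or the weighting rule, so everything here is an assumption: the density loss is a per-pixel mean squared error, the headcount loss is the absolute count difference, and the adaptive weights are simply each loss's share of the total. The weighting rule is a stand-in chosen for illustration only.

```python
def density_loss(pred, gt):
    """Assumed density map loss: mean squared error over all pixels."""
    n = len(pred) * len(pred[0])
    return sum((p - g) ** 2 for pr, gr in zip(pred, gt)
               for p, g in zip(pr, gr)) / n

def headcount_loss(pred, gt):
    """Assumed headcount loss: absolute difference between summed maps."""
    return abs(sum(map(sum, pred)) - sum(map(sum, gt)))

def adaptive_loss(pred, gt, eps=1e-12):
    """Combine both losses with weights recomputed from the current loss values."""
    l_den = density_loss(pred, gt)
    l_cnt = headcount_loss(pred, gt)
    total = l_den + l_cnt + eps
    # Stand-in rule (not from the paper): weight each loss by its share of the total.
    # In an autograd framework these weights would be treated as constants.
    w_den, w_cnt = l_den / total, l_cnt / total
    return w_den * l_den + w_cnt * l_cnt, (w_den, w_cnt)

# A sparse hypothetical sample: two people, the prediction finds half of one.
gt = [[1.0, 0.0], [0.0, 1.0]]
pred = [[0.5, 0.0], [0.0, 0.0]]
loss, (w_den, w_cnt) = adaptive_loss(pred, gt)
```

In this sparse example the per-pixel error is small but the count error is large, so the count term dominates the combined loss; this is the kind of case the headcount term is meant to address.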
Section snippets
Related work
Crowd counting has made significant progress in recent years. In the following, we briefly introduce related works in the field of crowd counting.
Proposed approach
In this section, we first propose a novel spatial position-based channel attention model (SPCAM) to generate an attention mask that recalibrates the channel-wise feature response at each spatial position. Then we design the MS-SPCANet, which utilizes the SPCAM with a multi-scale structure to handle the scale variations of people while focusing on more powerful representations. In addition, we adopt a new adaptive loss to improve the estimation accuracy in sparse crowd scenes. Finally, we
Experiments
First, we introduce the evaluation metrics used in the experiments. Then, we compare the proposed MS-SPCANet with state-of-the-art algorithms on four public counting datasets, and conduct several ablation studies to demonstrate the effectiveness of the proposed components. Finally, we compare the explanatory power and estimated density maps of our network and CSRNet.
Conclusion
In this paper, we propose a multi-scale and spatial position-based channel attention network (MS-SPCANet) that captures informative features more effectively and handles the scale changes of the crowd. MS-SPCANet utilizes the spatial position-based channel attention model (SPCAM) to assign different channel attention weights to different spatial positions of the channel-wise feature maps, thus maximally enhancing high-contribution information and suppressing redundant information. By
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by the National Natural Science Foundation of China [61675161, 62275211].
References (55)
- et al., Multi-resolution attention convolutional neural network for crowd counting, Neurocomputing (2019)
- et al., SCAR: Spatial-/channel-wise attention regression networks for crowd counting, Neurocomputing (2019)
- et al., A diffusion and clustering-based approach for finding coherent motions and understanding crowd scenes, IEEE Trans. Image Process. (2016)
- et al., Pedestrian behavior modeling from stationary crowds with applications to intelligent surveillance, IEEE Trans. Image Process. (2016)
- X. Liu, Multi-view 3D human tracking in crowded scenes, in: Proceedings of the AAAI Conference on Artificial...
- et al., Pedestrian detection: an evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
- P. Viola, M. Jones, Robust real-time face detection, in: Proceedings of the IEEE international conference on computer...
- N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the IEEE conference on...
- A.B. Chan, N. Vasconcelos, Bayesian poisson regression for crowd counting, in: Proceedings of the IEEE international...
- H. Idrees, I. Saleemi, C. Seibert, M. Shah, Multi-source multi-scale counting in extremely dense crowd images, in:...
- Disc: Deep image saliency computing via progressive representation learning, IEEE Trans. Neural Netw. Learn. Syst.
- Distortion-aware correlation tracking, IEEE Trans. Image Process.
- Crowd density estimation based on classification activation map and patch density level, Neural Comput. Appl.
- Single-column CNN for crowd counting with pixel-wise attention mechanism, Neural Comput. Appl.
- Crowd counting with crowd attention convolutional neural network, Neurocomputing