Multi-Scale and spatial position-based channel attention network for crowd counting

doi:10.1016/j.jvcir.2022.103718

Journal of Visual Communication and Image Representation

Volume 90, February 2023, 103718

https://doi.org/10.1016/j.jvcir.2022.103718

Highlights

• Channel attention model captures important channels for each feature map position.
• Multi-scale features combined from multiple layers handle different person sizes.
• Adaptive loss improves the accuracy of crowd counting for sparse crowd samples.

Abstract

Crowd counting algorithms have recently made significant progress by incorporating attention mechanisms into convolutional neural networks (CNNs). The channel attention model (CAM), a popular attention mechanism, calculates a set of probability weights to select important channel-wise feature responses. However, most CAMs coarsely assign a single weight to each entire channel-wise map, so useful and useless information within a map are treated indiscriminately, which limits the representational capacity of the network. In this paper, we propose a multi-scale and spatial position-based channel attention network (MS-SPCANet), which integrates spatial position-based channel attention models (SPCAMs) at multiple scales into a CNN. SPCAM assigns different channel attention weights to different positions of the channel-wise maps to capture more informative features. Furthermore, an adaptive loss, which uses adaptive coefficients to combine a density map loss and a headcount loss, is constructed to improve network performance in sparse crowd scenes. Experimental results on four public datasets verify the superiority of the scheme.

Introduction

Crowd analysis has recently attracted much attention due to its wide applications in public safety, video surveillance, and urban planning. Researchers have tried to analyze crowd scenes from many aspects, such as coherent motion detection [1], crowd behavior analysis [2], tracking [3], and crowd counting [15]. In this paper, we focus on the crowd counting task in crowd scenes.

Crowd counting aims to estimate the number of pedestrians in crowd scenes. Early work mainly adopted detection-based [4], [5], [6] and regression-based [7], [8], [9] methods to estimate pedestrian numbers. The detection-based method deploys a sliding window to detect each person and adds up all detected persons to obtain the crowd count. This approach counts accurately in sparse crowd scenes but performs poorly in high-density and severely occluded scenes. In contrast, the regression-based method, which directly learns a mapping function between image features and crowd counts, obtains better results in dense crowd scenes. Nevertheless, this type of method discards the crowd distribution information when regressing the count value and therefore cannot support the analysis of crowd behaviors (e.g., riots and stampede accidents). Thus, researchers proposed density estimation-based methods [10], [11], which generate a density map for the crowd image and are now the mainstream approach in the crowd counting field. The density map presents the spatial distribution of the crowd, and the crowd count can be obtained by summing over the whole density map. Although the initial density estimation-based methods achieved a meaningful improvement, they rely on handcrafted features and cannot adequately cope with the large variations in scale, non-uniform distributions and complex backgrounds found in real scenes.
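The relationship between a density map and the crowd count can be illustrated with a small sketch. It assumes the common convention of placing one unit-mass Gaussian at each annotated head position; the excerpt above does not state how ground-truth maps are built, so the kernel size, `sigma`, and the function names below are illustrative assumptions rather than the authors' settings.

```python
import math

def gaussian_kernel(size, sigma):
    """Return a size x size Gaussian kernel normalized to sum to 1."""
    half = size // 2
    kernel = [[math.exp(-((x - half) ** 2 + (y - half) ** 2) / (2 * sigma ** 2))
               for x in range(size)] for y in range(size)]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]

def density_map(height, width, heads, size=5, sigma=1.0):
    """Place one unit-mass Gaussian at each annotated head position (row, col)."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    dmap = [[0.0] * width for _ in range(height)]
    for r, c in heads:
        # Keep only the kernel cells that fall inside the image, then
        # renormalize so a person near the border still contributes exactly 1.
        cells = [(r + dy - half, c + dx - half, kernel[dy][dx])
                 for dy in range(size) for dx in range(size)
                 if 0 <= r + dy - half < height and 0 <= c + dx - half < width]
        mass = sum(v for _, _, v in cells)
        for y, x, v in cells:
            dmap[y][x] += v / mass
    return dmap

def crowd_count(dmap):
    """The estimated count is the sum over the whole density map."""
    return sum(sum(row) for row in dmap)

# Hypothetical annotations: four heads in a 16 x 20 image.
heads = [(2, 3), (10, 12), (0, 0), (15, 19)]
dmap = density_map(16, 20, heads)
count = crowd_count(dmap)
```

Because every person contributes a mass of one, `count` recovers the number of annotated heads while `dmap` still records where they are, which is the information a pure count regressor discards.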

Recently, encouraged by the powerful feature learning ability of convolutional neural networks (CNNs) [12], [13], [14], state-of-the-art works mostly adopt CNN architectures for crowd density map estimation [15], [16], [17], [18], [19], [20], [21], [22]. More recently, many researchers have incorporated attention mechanisms [23], [24], [25], [26], [27] into CNN-based crowd density estimation algorithms to obtain better performance. The methods in [23], [24], [25], [26] adopt a spatial attention model to guide the network to focus on head regions and ignore background noise. These methods improve the spatial representation ability of the feature map, but they do not model the interdependencies between feature channels. In a CNN, the extracted features $f \in \mathbb{R}^{C \times H \times W}$ contain $C$ channel-wise feature maps of size $H \times W$, and different channel-wise feature maps contribute differently to the feature representation. However, the $C$ channel-wise feature maps contain many useless and redundant features, which adversely affect the accuracy of the representation. Some works [27], [28], [29] propose the channel attention model (CAM) to solve this problem. CAM calculates a set of one-dimensional channel attention weights and assigns large weights to important channel-wise features and small weights to useless ones. However, such a CAM coarsely assigns a single weight value to each entire channel-wise feature map, so useful and useless information in that map are enhanced or weakened together; e.g., the useless information in a highly contributive channel-wise feature map is also enhanced.
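The limitation of a conventional CAM can be made concrete with a minimal sketch. It follows the general squeeze-and-reweight pattern (global average pooling, then one gate per channel); the learned layers that real CAMs place between pooling and gating are replaced here by a plain sigmoid, which is an assumption made only to keep the example short, and the function name is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(features):
    """features: list of C channel maps, each an H x W nested list.

    Returns (weights, reweighted), where weights holds ONE scalar per channel.
    """
    weights = []
    for ch in features:
        h, w = len(ch), len(ch[0])
        mean = sum(sum(row) for row in ch) / (h * w)  # global average pooling
        weights.append(sigmoid(mean))  # stand-in for the learned gating layers
    # Every position in a channel is scaled by the same weight.
    reweighted = [[[v * wgt for v in row] for row in ch]
                  for ch, wgt in zip(features, weights)]
    return weights, reweighted

# Two hypothetical 2 x 2 channel maps.
features = [
    [[4.0, 0.0], [0.0, 0.0]],      # one strong response, rest empty
    [[-1.0, -1.0], [-1.0, -1.0]],  # uniformly weak responses
]
weights, reweighted = channel_attention(features)
```

Here `weights` has length `C`, so the single strong response in the first channel and its three empty positions are all multiplied by the same factor. This is the coarse, whole-map weighting that the text criticizes.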

To address this issue, we propose a multi-scale and spatial position-based channel attention network (MS-SPCANet) for the crowd counting task, which integrates spatial position-based channel attention models (SPCAMs) at multiple scales. SPCAM generates a three-dimensional attention weight map that assigns different channel attention weights to different positions of the channel-wise map, thereby maximally extracting useful features and suppressing useless ones. Meanwhile, to handle the large variation in pedestrian scale, we use a single-column network and embed SPCAM modules into multiple layers to aggregate features with different receptive fields. Furthermore, we propose a new adaptive loss, which adds a headcount loss to the density map loss to improve the accuracy of the network in sparse crowd scenarios. To strike a balance between the two loss functions, the adaptive loss employs adaptive weights that automatically adjust the ratio between the two losses. Extensive experimental results show that our approach achieves superior performance on four public counting datasets: ShanghaiTech Part_A, UCF_QNRF, NWPU-Crowd and TRANCOS.
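The idea of a three-dimensional weight map can be sketched as follows. This excerpt does not specify how SPCAM computes its weights, so the sketch assumes, purely for illustration, a softmax over the channel axis at every spatial position; the function name and the toy input are hypothetical, and the actual module may differ substantially.

```python
import math

def position_channel_attention(features):
    """features: C x H x W nested lists.

    Returns (weights, reweighted), both of shape C x H x W, so each spatial
    position gets its own distribution of weights over the channels.
    """
    C, H, W = len(features), len(features[0]), len(features[0][0])
    weights = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for y in range(H):
        for x in range(W):
            vec = [features[c][y][x] for c in range(C)]
            m = max(vec)  # subtract the max for numerical stability
            exps = [math.exp(v - m) for v in vec]
            z = sum(exps)
            for c in range(C):
                weights[c][y][x] = exps[c] / z  # assumed softmax over channels
    reweighted = [[[features[c][y][x] * weights[c][y][x] for x in range(W)]
                   for y in range(H)] for c in range(C)]
    return weights, reweighted

# Toy input with C=2, H=1, W=2: each channel responds at a different position.
features = [[[2.0, 0.0]], [[0.0, 2.0]]]
weights, reweighted = position_channel_attention(features)
```

Unlike a one-dimensional CAM, the same channel now receives a large weight where it responds strongly and a small weight elsewhere: `weights[0][0][0]` exceeds `weights[0][0][1]` even though both belong to channel 0.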

In short, our main contributions can be summarized as follows:

(1) We design a spatial position-based channel attention model (SPCAM) to assign different channel attention weights to different spatial positions of the channel-wise feature maps. Guided by the probability map generated by SPCAM, the network maximally emphasizes informative features and suppresses redundant features, thus improving the feature representation capability of the network.

(2) We propose the MS-SPCANet, which incorporates the SPCAM modules into multiple layers to gather multi-scale channel-wise attention features to capture high contribution information while being robust to changes in pedestrian scales.

(3) We train our network with a new adaptive loss that introduces a headcount loss to the density map loss and uses adaptive weights to combine the two losses. The proposed adaptive loss improves the accuracy of crowd counting for sparse crowd samples.
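A combined loss of the kind described in contribution (3) can be sketched as below. The excerpt does not give the formula for the adaptive weights, so the balancing rule used here (each term is weighted by the other term's share of the total, so that both contribute equally) is a hypothetical stand-in, not the authors' definition; the choice of mean squared error for the density term and absolute count error for the headcount term is likewise an assumption.

```python
def density_loss(pred, gt):
    """Mean squared error between predicted and ground-truth density maps."""
    n = len(pred) * len(pred[0])
    return sum((p - g) ** 2
               for pred_row, gt_row in zip(pred, gt)
               for p, g in zip(pred_row, gt_row)) / n

def headcount_loss(pred, gt):
    """Absolute difference between predicted and ground-truth counts."""
    return abs(sum(map(sum, pred)) - sum(map(sum, gt)))

def adaptive_loss(pred, gt, eps=1e-12):
    """Combine the two losses with weights recomputed from their magnitudes.

    Hypothetical rule (an assumption, not the paper's formula): weight each
    term by the other term's share so neither dominates the total.
    """
    l_den = density_loss(pred, gt)
    l_cnt = headcount_loss(pred, gt)
    total = l_den + l_cnt + eps
    w_den = l_cnt / total
    w_cnt = l_den / total
    return w_den * l_den + w_cnt * l_cnt, (w_den, w_cnt)

# Sparse toy scene: one person, predicted with half the required mass.
gt = [[1.0, 0.0], [0.0, 0.0]]
pred = [[0.5, 0.0], [0.0, 0.0]]
loss, (w_den, w_cnt) = adaptive_loss(pred, gt)
```

In this sparse example the pixel-wise density error is small because most pixels are zero in both maps, while the count error is comparatively large, which illustrates why an explicit headcount term can matter for sparse crowd samples. In a real training loop the weights would typically be treated as constants when computing gradients.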

Section snippets

Related work

Crowd counting has made significant progress in recent years. In the following, we briefly introduce related works in the field of crowd counting.

Proposed approach

In this section, we first propose a novel spatial position-based channel attention model (SPCAM) to generate an attention mask that recalibrates the channel-wise feature response at each spatial position. Then we design the MS-SPCANet, which utilizes the SPCAM with a multi-scale structure to handle the scale variations of people while focusing on more powerful representations. In addition, we adopt a new adaptive loss to improve the estimation accuracy in sparse crowd scenes. Finally, we

Experiments

First, we introduce the evaluation metrics used in the experiments. Then, we compare the proposed MS-SPCANet with state-of-the-art algorithms on four public counting datasets and conduct several ablation studies to demonstrate the effectiveness of the proposed components. Finally, we compare the explanatory power and estimated density maps of our network and CSRNet.

Conclusion

In this paper, we propose a multi-scale and spatial position-based channel attention network (MS-SPCANet) that captures informative features while handling scale changes in the crowd. MS-SPCANet utilizes the spatial position-based channel attention model (SPCAM) to assign different channel attention weights to different spatial positions of the channel-wise feature maps, thus maximally enhancing high-contribution information and suppressing redundant information. By

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the National Natural Science Foundation of China [61675161, 62275211].

References (55)

Y. Zhang et al., Multi-resolution attention convolutional neural network for crowd counting, Neurocomputing (2019)

J. Gao et al., SCAR: Spatial-/channel-wise attention regression networks for crowd counting, Neurocomputing (2019)

W. Lin et al., A diffusion and clustering-based approach for finding coherent motions and understanding crowd scenes, IEEE Trans. Image Process. (2016)

S. Yi et al., Pedestrian behavior modeling from stationary crowds with applications to intelligent surveillance, IEEE Trans. Image Process. (2016)

X. Liu, Multi-view 3D human tracking in crowded scenes, in: Proceedings of the AAAI Conference on Artificial...

P. Dollár et al., Pedestrian detection: an evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. (2012)

P. Viola, M. Jones, Robust real-time face detection, in: Proceedings of the IEEE international conference on computer...

N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the IEEE conference on...

A.B. Chan, N. Vasconcelos, Bayesian poisson regression for crowd counting, in: Proceedings of the IEEE international...

H. Idrees, I. Saleemi, C. Seibert, M. Shah, Multi-source multi-scale counting in extremely dense crowd images, in:...

K. Chen, C.C. Loy, S. Gong, T. Xiang, Feature mining for localized crowd counting, in: Proceedings of british machine...

V. Lempitsky, A. Zisserman, Learning to count objects in images, in: Proceedings of the international conference on...

L. Fiaschi, U. Koethe, R. Nair, F.A. Hamprecht, Learning to count with regression forest and structured labels, in:...

K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference...

T. Chen et al., Disc: Deep image saliency computing via progressive representation learning, IEEE Trans. Neural Netw. Learn. Syst. (2016)

H. Li et al., Distortion-aware correlation tracking, IEEE Trans. Image Process. (2017)

Y. Zhang, D. Zhou, S. Chen, S. Gao, Y. Ma, Single-image crowd counting via multi-column convolutional neural network,...

V.A. Sindagi, V.M. Patel, Generating high-quality crowd density maps using contextual pyramid cnns, in: Proceedings of...

V.A. Sindagi, V.M. Patel, Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd...

A. Zhang, J. Shen, Z. Xiao, F. Zhu, X. Zhen, X. Cao, L. Shao, Relational attention network for crowd counting, in:...

L. Zhu et al., Crowd density estimation based on classification activation map and patch density level, Neural Comput. Appl. (2020)

B. Wang et al., Single-column CNN for crowd counting with pixel-wise attention mechanism, Neural Comput. Appl. (2020)

Y. Li, X. Zhang, D. Chen, Csrnet: dilated convolutional neural networks for understanding the highly congested scenes,...

N. Liu, Y. Long, C. Zou, Q. Niu, L. Pan, H. Wu, Adcrowdnet: an attention-injective deformable convolutional network for...

L. Zhu, Z. Zhao, C. Lu, Y. Lin, Y. Peng, T. Yao, Dual path multi-scale fusion networks with attention for crowd...

J. Chen et al., Crowd counting with crowd attention convolutional neural network, Neurocomputing (2020)

Y. Hou, C. Li, F. Yang, C. Ma, L. Zhu, Y. Li, H. Jia, X. Xie, Bba-net: A bi-branch attention network for crowd...

