Full Length Article
Multi-Scale and spatial position-based channel attention network for crowd counting

https://doi.org/10.1016/j.jvcir.2022.103718

Highlights

  • Channel attention model captures important channels for each feature map position.

  • Multi-scale features combined from multiple layers handle different person sizes.

  • Adaptive loss improves the accuracy of crowd counting for sparse crowd samples.

Abstract

Crowd counting algorithms have recently incorporated attention mechanisms into convolutional neural networks (CNNs) to achieve significant progress. The channel attention model (CAM), a popular attention mechanism, calculates a set of probability weights to select important channel-wise feature responses. However, most CAMs assign a single weight to an entire channel-wise map, so useful and useless information within that map is treated indiscriminately, which limits the representational capacity of the network. In this paper, we propose a multi-scale and spatial position-based channel attention network (MS-SPCANet), which integrates spatial position-based channel attention models (SPCAMs) with multiple scales into a CNN. SPCAM assigns different channel attention weights to different positions of channel-wise maps to capture more informative features. Furthermore, an adaptive loss, which uses adaptive coefficients to combine a density map loss and a headcount loss, is constructed to improve network performance in sparse crowd scenes. Experimental results on four public datasets verify the superiority of the proposed scheme.

Introduction

Crowd analysis has recently attracted much attention due to its wide applications in public safety, video surveillance, and urban planning. Researchers have tried to analyze crowd scenes from many aspects, such as coherent motion detection [1], crowd behavior analysis [2], tracking [3], and crowd counting [15]. In this paper, we focus on the crowd counting task in crowd scenes.

Crowd counting aims to estimate the number of pedestrians in crowd scenes. Early works mainly adopt detection-based [4], [5], [6] and regression-based [7], [8], [9] methods to estimate pedestrian numbers. The detection-based method deploys a sliding window to detect each person and adds up all detected persons to obtain the crowd count. This approach counts accurately in sparse crowd scenes, but performs poorly in high-density and severely occluded crowd scenes. In contrast, the regression-based method, which directly learns a mapping function between image features and crowd counts, obtains better results in dense crowd scenes. Nevertheless, this type of method discards the crowd distribution information when regressing the count value, and therefore cannot support the analysis of crowd behaviors (e.g., riots and stampede accidents). Thus, researchers proposed density estimation-based methods [10], [11], which generate a density map for the crowd image and have become the mainstream approach in the crowd counting field. The density map presents the spatial distribution of the crowd, and the crowd count can be obtained by summing over the whole density map. Although the initial density estimation-based methods achieved a meaningful improvement, they rely on handcrafted features and are insufficient to cope with the large scale variation, non-uniform distribution, and complex backgrounds found in real scenes.
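The relationship between a density map and the crowd count can be illustrated with a minimal sketch. It assumes head annotations given as (row, column) points and a fixed-size Gaussian kernel; the function names, kernel size, and sigma are illustrative choices and are not taken from the paper.

```python
import math

def gaussian_kernel(size, sigma):
    """Return a size x size Gaussian kernel normalized to sum to 1."""
    half = size // 2
    k = [[math.exp(-((x - half) ** 2 + (y - half) ** 2) / (2 * sigma ** 2))
          for x in range(size)] for y in range(size)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]

def density_map(height, width, heads, size=5, sigma=1.0):
    """Stamp one unit-mass kernel per annotated head position (row, col)."""
    dmap = [[0.0] * width for _ in range(height)]
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    for r, c in heads:
        # Keep only the kernel cells that fall inside the image
        cells = [(r + dy - half, c + dx - half, kernel[dy][dx])
                 for dy in range(size) for dx in range(size)]
        cells = [(y, x, v) for y, x, v in cells
                 if 0 <= y < height and 0 <= x < width]
        # Renormalize so each head still contributes exactly 1 near borders
        mass = sum(v for _, _, v in cells)
        for y, x, v in cells:
            dmap[y][x] += v / mass
    return dmap

def crowd_count(dmap):
    """The estimated count is the sum over the whole density map."""
    return sum(map(sum, dmap))

heads = [(3, 4), (10, 12), (0, 0)]   # hypothetical head annotations
dmap = density_map(16, 16, heads)
count = crowd_count(dmap)
```

Because every head contributes unit mass, the sum of the map recovers the number of annotated heads, while the map itself retains where those heads are located.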

Recently, encouraged by the powerful feature learning ability of convolutional neural networks (CNNs) [12], [13], [14], state-of-the-art works mostly adopt CNN architectures for crowd density map estimation [15], [16], [17], [18], [19], [20], [21], [22]. Very recently, many researchers have incorporated attention mechanisms [23], [24], [25], [26], [27] into CNN-based crowd density estimation algorithms to obtain better performance. The works in [23], [24], [25], [26] adopt a spatial attention model to guide the network to focus on head regions and ignore background noise. These methods improve the spatial representation ability of the feature map, but they do not model the interdependencies between feature channels. In a CNN, the extracted features $f \in \mathbb{R}^{C \times H \times W}$ contain $C$ channel-wise feature maps of size $H \times W$, and different channel-wise feature maps contribute differently to the feature representation. However, many of the $C$ channel-wise feature maps carry useless or redundant features, which adversely affect the accuracy of the feature representation. Some works [27], [28], [29] propose the channel attention model (CAM) to solve this problem. CAM calculates a set of one-dimensional channel attention weights, assigning large weights to important channel-wise features and small weights to useless ones. However, such a CAM gives a single weight to the entire channel-wise feature map, so useful and useless information within the same map is enhanced or weakened together; e.g., the useless information in a highly contributive channel-wise feature map is also enhanced.
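A minimal sketch of a conventional channel attention step may make this limitation concrete. It assumes a squeeze-and-excitation style design (global average pooling, one linear layer, sigmoid), which is a common form of CAM but is not necessarily the exact design of the cited works; the weights and feature values are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(features, weight, bias):
    """features: C x H x W nested lists; weight: C x C; bias: length C."""
    C = len(features)
    # Squeeze: global average pooling gives one descriptor per channel
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]
    # Excite: a single linear layer plus sigmoid yields one weight per channel
    weights = [sigmoid(sum(weight[i][j] * pooled[j] for j in range(C)) + bias[i])
               for i in range(C)]
    # Rescale: every position of channel i is multiplied by the same scalar
    scaled = [[[weights[i] * v for v in row] for row in features[i]]
              for i in range(C)]
    return weights, scaled

# Hypothetical 2-channel, 2x2 feature tensor and layer parameters
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
weight = [[1.0, 0.0], [0.0, -1.0]]
bias = [0.0, 0.0]
weights, scaled = channel_attention(features, weight, bias)
```

The attention output has only C entries, so all H×W positions of a channel share one weight, whether a given position holds head features or background clutter.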

To address this issue, we propose a multi-scale and spatial position-based channel attention network (MS-SPCANet) for the crowd counting task, which is characterized by integrating spatial position-based channel attention models (SPCAMs) with multiple scales. SPCAM generates a three-dimensional attention weight map to assign different channel attention weights to different positions of the channel-wise map, thereby maximally extracting useful features and suppressing useless features. Meanwhile, considering the large-scale variation of pedestrians, we use a single-column network to embed SPCAM modules into multiple layers to aggregate features with different receptive fields. Furthermore, we propose a new adaptive loss which introduces a headcount loss to the density map loss to improve the accuracy of the network in sparse crowd scenarios. To strike a balance between the two loss functions, the adaptive loss employs adaptive weights to automatically adjust the ratio between the two kinds of losses. Extensive experimental results show that our approach achieves superior performance on four public counting datasets including ShanghaiTech Part_A, UCF_QNRF, NWPU-Crowd and TRANCOS.
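The idea of a three-dimensional attention weight map can be sketched as follows. The snippet does not describe how SPCAM computes its weights, so this example assumes a per-position linear map across channels (equivalent to a 1×1 convolution) followed by a sigmoid; that choice, the function name, and all parameter values are hypothetical and serve only to show the shape of the output.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def position_channel_attention(features, weight, bias):
    """features: C x H x W nested lists; returns (attention, reweighted).

    ASSUMPTION: the weights come from a 1x1-convolution-like linear map
    plus sigmoid. This is an illustrative stand-in, not the paper's SPCAM.
    """
    C, H, W = len(features), len(features[0]), len(features[0][0])
    attn = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for h in range(H):
        for w in range(W):
            # Channel vector observed at this spatial position
            x = [features[c][h][w] for c in range(C)]
            for i in range(C):
                z = sum(weight[i][j] * x[j] for j in range(C)) + bias[i]
                attn[i][h][w] = sigmoid(z)
    # Element-wise product: each (channel, position) entry has its own weight
    out = [[[attn[c][h][w] * features[c][h][w] for w in range(W)]
            for h in range(H)] for c in range(C)]
    return attn, out

# Hypothetical 2-channel, 1x2 feature tensor and layer parameters
features = [[[2.0, -2.0]], [[0.0, 0.0]]]
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
attn, out = position_channel_attention(features, weight, bias)
```

Here the attention tensor has the same C×H×W shape as the features, so two positions in the same channel can receive different weights.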

In short, our main contributions can be summarized as follows:

(1) We design a spatial position-based channel attention model (SPCAM) to assign different channel attention weights to different spatial positions of the channel-wise feature maps. Guided by the probability map generated by SPCAM, the network maximally emphasizes informative features and suppresses redundant features, thus improving the feature representation capability of the network.

(2) We propose the MS-SPCANet, which incorporates the SPCAM modules into multiple layers to gather multi-scale channel-wise attention features to capture high contribution information while being robust to changes in pedestrian scales.

(3) We train our network with a new adaptive loss that introduces a headcount loss to the density map loss and uses adaptive weights to combine the two losses. The proposed adaptive loss improves the accuracy of crowd counting for sparse crowd samples.
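The third contribution can be illustrated with a small sketch of a loss that mixes a density map term and a headcount term. The snippet does not give the formula for the adaptive weights, so the coefficient below (which grows as the ground-truth count shrinks) is purely a hypothetical placeholder, as are the function names and the choice of mean squared error and absolute count difference for the two terms.

```python
def density_loss(pred, gt):
    """Mean squared error between predicted and ground-truth density maps."""
    n = len(gt) * len(gt[0])
    return sum((p - g) ** 2
               for pr, gr in zip(pred, gt)
               for p, g in zip(pr, gr)) / n

def count_loss(pred, gt):
    """Absolute difference between predicted and ground-truth headcounts."""
    return abs(sum(map(sum, pred)) - sum(map(sum, gt)))

def adaptive_coefficient(gt_count):
    # ASSUMPTION: hypothetical weighting, not the paper's formula.
    # Sparse scenes (small counts) put more weight on the headcount term.
    return 1.0 / (1.0 + gt_count)

def adaptive_loss(pred, gt):
    lam = adaptive_coefficient(sum(map(sum, gt)))
    return (1.0 - lam) * density_loss(pred, gt) + lam * count_loss(pred, gt)

# Hypothetical sparse sample: one person, predicted with half the mass
sparse_gt = [[1.0, 0.0], [0.0, 0.0]]
sparse_pred = [[0.5, 0.0], [0.0, 0.0]]
loss = adaptive_loss(sparse_pred, sparse_gt)
```

The motivation for such a mix is that a pixel-wise density loss becomes very small when the map is almost empty, so a per-image count term keeps a usable training signal for sparse samples.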

Section snippets

Related work

Crowd counting has made significant progress in recent years. In the following, we briefly introduce related works in the field of crowd counting.

Proposed approach

In this section, we first propose a novel spatial position-based channel attention model (SPCAM) to generate an attention mask that recalibrates the channel-wise feature response at each spatial position. Then we design the MS-SPCANet, which utilizes the SPCAM with a multi-scale structure to handle the scale variations of people while focusing on more powerful representations. In addition, we adopt a new adaptive loss to improve the estimation accuracy in sparse crowd scenes. Finally, we

Experiments

First, we introduce the evaluation metrics used in the experiments. Then, we compare the proposed MS-SPCANet with state-of-the-art algorithms on four public counting datasets, and conduct several ablation studies to demonstrate the effectiveness of the proposed components. Finally, we compare the explanatory power and estimated density maps of our network and CSRNet.
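The snippet does not state which metrics are used. The sketch below assumes mean absolute error (MAE) and root mean squared error (RMSE) over per-image counts, which are the metrics commonly reported in crowd counting work; the sample counts are invented for illustration.

```python
import math

def mae(pred_counts, gt_counts):
    """Mean absolute error over per-image crowd counts."""
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)

def rmse(pred_counts, gt_counts):
    """Root mean squared error over per-image crowd counts."""
    sq = sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts))
    return math.sqrt(sq / len(gt_counts))

# Hypothetical predicted and ground-truth counts for three test images
pred_counts = [10.0, 20.0, 33.0]
gt_counts = [12.0, 20.0, 30.0]
mae_value = mae(pred_counts, gt_counts)
rmse_value = rmse(pred_counts, gt_counts)
```

MAE reflects average counting accuracy, while RMSE penalizes large per-image errors more heavily and is therefore often read as an indicator of robustness.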

Conclusion

In this paper, we propose a multi-scale and spatial position-based channel attention network (MS-SPCANet) that captures informative features while handling scale changes in the crowd. MS-SPCANet utilizes the spatial position-based channel attention model (SPCAM) to assign different channel attention weights to different spatial positions of the channel-wise feature maps, thus maximally enhancing high contribution information and suppressing redundant information. By

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the National Natural Science Foundation of China [61675161, 62275211].

References (55)

  • Y. Zhang et al., Multi-resolution attention convolutional neural network for crowd counting, Neurocomputing (2019)
  • J. Gao et al., SCAR: Spatial-/channel-wise attention regression networks for crowd counting, Neurocomputing (2019)
  • W. Lin et al., A diffusion and clustering-based approach for finding coherent motions and understanding crowd scenes, IEEE Trans. Image Process. (2016)
  • S. Yi et al., Pedestrian behavior modeling from stationary crowds with applications to intelligent surveillance, IEEE Trans. Image Process. (2016)
  • X. Liu, Multi-view 3D human tracking in crowded scenes, in: Proceedings of the AAAI Conference on Artificial...
  • P. Dollár et al., Pedestrian detection: an evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • P. Viola, M. Jones, Robust real-time face detection, in: Proceedings of the IEEE international conference on computer...
  • N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the IEEE conference on...
  • A.B. Chan, N. Vasconcelos, Bayesian poisson regression for crowd counting, in: Proceedings of the IEEE international...
  • H. Idrees, I. Saleemi, C. Seibert, M. Shah, Multi-source multi-scale counting in extremely dense crowd images, in:...
  • K. Chen, C.C. Loy, S. Gong, T. Xiang, Feature mining for localized crowd counting, in: Proceedings of british machine...
  • V. Lempitsky, A. Zisserman, Learning to count objects in images, in: Proceedings of the international conference on...
  • L. Fiaschi, U. Koethe, R. Nair, F.A. Hamprecht, Learning to count with regression forest and structured labels, in:...
  • K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference...
  • T. Chen et al., Disc: Deep image saliency computing via progressive representation learning, IEEE Trans. Neural Netw. Learn. Syst. (2016)
  • H. Li et al., Distortion-aware correlation tracking, IEEE Trans. Image Process. (2017)
  • Y. Zhang, D. Zhou, S. Chen, S. Gao, Y. Ma, Single-image crowd counting via multi-column convolutional neural network,...
  • V.A. Sindagi, V.M. Patel, Generating high-quality crowd density maps using contextual pyramid cnns, in: Proceedings of...
  • V.A. Sindagi, V.M. Patel, Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd...
  • A. Zhang, J. Shen, Z. Xiao, F. Zhu, X. Zhen, X. Cao, L. Shao, Relational attention network for crowd counting, in:...
  • L. Zhu et al., Crowd density estimation based on classification activation map and patch density level, Neural Comput. Appl. (2020)
  • B. Wang et al., Single-column CNN for crowd counting with pixel-wise attention mechanism, Neural Comput. Appl. (2020)
  • Y. Li, X. Zhang, D. Chen, Csrnet: dilated convolutional neural networks for understanding the highly congested scenes,...
  • N. Liu, Y. Long, C. Zou, Q. Niu, L. Pan, H. Wu, Adcrowdnet: an attention-injective deformable convolutional network for...
  • L. Zhu, Z. Zhao, C. Lu, Y. Lin, Y. Peng, T. Yao, Dual path multi-scale fusion networks with attention for crowd...
  • J. Chen et al., Crowd counting with crowd attention convolutional neural network, Neurocomputing (2020)
  • Y. Hou, C. Li, F. Yang, C. Ma, L. Zhu, Y. Li, H. Jia, X. Xie, Bba-net: A bi-branch attention network for crowd...