Abstract
The rapid development of convolutional neural networks (CNNs) has enabled significant progress in crowd counting research. However, the fixed-size convolutional kernels of traditional methods struggle with drastic scale changes and complex background interference. To address these challenges, we propose a hybrid crowd counting model. First, we place a global self-attention module (GAM) after the CNN backbone to capture wider contextual information. Second, because the feature map size is gradually restored during decoding, we employ a local self-attention module (LAM) to reduce computational complexity. With this design, the model fuses features from global and local perspectives to better cope with scale changes. Additionally, to establish the interdependence between the spatial and channel dimensions, we design a novel channel self-attention module (CAM) and combine it with the LAM. Finally, we construct a simple yet effective double-head module that outputs a foreground segmentation map alongside an intermediate density map; the two are multiplied pixel-wise to suppress background interference. Experimental results on several benchmark datasets demonstrate that our method achieves remarkable improvements.
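To make the double-head design concrete, the following is a minimal PyTorch sketch of a two-branch head whose density and foreground-segmentation predictions are fused by a pixel-wise product. The class name, channel width, and layer choices are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class DoubleHead(nn.Module):
    # Illustrative double-head module (an assumption, not the authors' exact
    # architecture): one branch regresses an intermediate density map, the
    # other predicts a foreground probability map; their pixel-wise product
    # suppresses density responses in background regions.
    def __init__(self, in_channels: int = 64):
        super().__init__()
        self.density_head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 1, kernel_size=1),
            nn.ReLU(inplace=True),   # keep the density map non-negative
        )
        self.seg_head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 1, kernel_size=1),
            nn.Sigmoid(),            # foreground probability in [0, 1]
        )

    def forward(self, feats: torch.Tensor):
        density = self.density_head(feats)      # (B, 1, H, W)
        foreground = self.seg_head(feats)       # (B, 1, H, W)
        refined = density * foreground          # pixel-wise background suppression
        return refined, foreground

# Usage: the predicted count is the sum (discrete integral) of the refined map.
head = DoubleHead(in_channels=64)
feats = torch.randn(1, 64, 96, 128)            # dummy decoder features
refined_density, seg_map = head(feats)
count = refined_density.sum(dim=(1, 2, 3))     # per-image crowd count estimate

In this sketch, the sigmoid-activated segmentation branch acts as a soft foreground mask, so spurious density responses on background clutter are attenuated before the final count is read off the refined map.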
This work was supported in part by the National Natural Science Foundation of China under Grant 62133013 and in part by the Chinese Association for Artificial Intelligence (CAAI)-Huawei MindSpore Open Fund.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Li, Y., Yin, B. (2024). HTNet: A Hybrid Model Boosted by Triple Self-attention for Crowd Counting. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14436. Springer, Singapore. https://doi.org/10.1007/978-981-99-8555-5_23
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8554-8
Online ISBN: 978-981-99-8555-5