Abstract
Crowd counting aims to accurately estimate the number of individuals in an image. It has long faced two major difficulties: uneven distribution of crowd density and large variation in head size. For the former, most CNN-based methods divide the image into multiple patches for processing, ignoring the connections between patches. For the latter, multi-scale feature fusion based on a feature pyramid ignores the matching relationship between head size and the pyramid's hierarchical features. To address these issues, we propose a crowd counting network named CCST based on the swin transformer, and tailor a feature adaptive fusion regression head called FAFHead. The swin transformer fully exchanges information within and between patches, effectively alleviating the problem of unevenly distributed crowd density. FAFHead adaptively fuses multi-level features, improving the match between head size and feature pyramid level and relieving the problem of large variation in head size. Experimental results on common datasets show that CCST outperforms all weakly supervised counting works and the great majority of popular density map-based fully supervised works.
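The adaptive fusion idea behind FAFHead can be illustrated with a minimal, framework-free sketch: feature maps from several pyramid levels (assumed already resized to a common resolution) are combined using softmax-normalized per-level weights, so the network can learn which level best matches the head sizes present. The function names (`softmax`, `fuse_levels`) and the scalar-weight formulation are our simplification for illustration, not the paper's exact design.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of per-level logits.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def fuse_levels(features, logits):
    """Adaptively fuse same-sized (flattened) feature maps from
    several pyramid levels using softmax-normalized weights."""
    weights = softmax(logits)
    fused = [0.0] * len(features[0])
    for w, feat in zip(weights, features):
        for i, v in enumerate(feat):
            fused[i] += w * v
    return fused

# Three pyramid levels, each already resized to a common resolution
# and flattened to a short vector for demonstration.
levels = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Equal logits give uniform weights; in a real model the logits
# would be predicted from the features themselves.
fused = fuse_levels(levels, [0.0, 0.0, 0.0])
print(fused)
```

With equal logits the result is simply the per-position mean of the levels; learned logits would instead emphasize the level whose receptive field matches the local head scale.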


Acknowledgements
This research project is partially supported by the National Key R&D Program of China (No. 2021ZD0111902), the National Natural Science Foundation of China (Nos. 62072015, U21B2038, U1811463, U19B2039), and the Beijing Natural Science Foundation (No. 4222021).
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
Cite this article
Li, B., Zhang, Y., Xu, H. et al. CCST: crowd counting with swin transformer. Vis Comput 39, 2671–2682 (2023). https://doi.org/10.1007/s00371-022-02485-3