
CountFormer: Multi-view Crowd Counting Transformer

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15110)

Included in the following conference series: Computer Vision – ECCV 2024 (ECCV 2024)


Abstract

Multi-view counting (MVC) methods have shown their superiority over single-view counterparts, particularly in situations characterized by heavy occlusion and severe perspective distortion. However, hand-crafted heuristic features and identical camera layout requirements in conventional MVC methods limit their applicability and scalability in real-world scenarios. In this work, we propose a concise 3D MVC framework called CountFormer that elevates multi-view image-level features to a scene-level volume representation and estimates the 3D density map from the volume features. By incorporating a camera encoding strategy, CountFormer embeds camera parameters into the volume query and image-level features, enabling it to handle camera layouts with significant differences. Furthermore, we introduce a feature lifting module that capitalizes on the attention mechanism to transform image-level features into a 3D volume representation for each camera view. Subsequently, a multi-view volume aggregation module attentively fuses the per-view volumes into a comprehensive scene-level volume representation, allowing CountFormer to handle images captured by arbitrary, dynamic camera layouts. The proposed method performs favorably against state-of-the-art approaches across widely used datasets, demonstrating its greater suitability for real-world deployment than conventional MVC frameworks.

H. Mo and X. Zhang—Equal contributions.
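
The sketch below illustrates the two attention-based steps described in the abstract: lifting per-view image features into a 3D volume with a learnable volume query and a camera encoding, and attentively aggregating the per-view volumes into a scene-level representation. It is a minimal, hypothetical PyTorch rendering for intuition only; the module names, tensor shapes, the additive camera encoding, and the use of nn.MultiheadAttention are assumptions and do not reproduce the authors' implementation.

# Minimal sketch (assumed names and shapes; not the authors' code) of
# attention-based feature lifting and multi-view volume aggregation.
import torch
import torch.nn as nn

class FeatureLifting(nn.Module):
    # Lift one camera view's image features into a per-view 3D volume
    # by cross-attending a learnable volume query to the image features.
    def __init__(self, dim, num_voxels, num_heads=4):
        super().__init__()
        self.volume_query = nn.Parameter(torch.randn(num_voxels, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_feats, cam_embed):
        # img_feats: (B, P, C) flattened image-level features of one view
        # cam_embed: (B, 1, C) camera-parameter encoding, added to query and features
        b = img_feats.size(0)
        query = self.volume_query.unsqueeze(0).expand(b, -1, -1) + cam_embed
        kv = img_feats + cam_embed
        volume, _ = self.cross_attn(query, kv, kv)    # (B, num_voxels, C)
        return volume

class MultiViewAggregation(nn.Module):
    # Attentively fuse a variable number of per-view volumes into one
    # scene-level volume via per-view softmax weights.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, view_volumes):
        # view_volumes: (B, V, num_voxels, C); V may differ between scenes
        weights = torch.softmax(self.score(view_volumes), dim=1)
        return (weights * view_volumes).sum(dim=1)    # (B, num_voxels, C)

if __name__ == "__main__":
    B, V, P, N, C = 2, 3, 1024, 512, 64               # batch, views, pixels, voxels, channels
    lift, agg = FeatureLifting(C, N), MultiViewAggregation(C)
    feats = torch.randn(B, V, P, C)                   # image-level features per view
    cams = torch.randn(B, V, 1, C)                    # camera encodings per view
    volumes = torch.stack([lift(feats[:, v], cams[:, v]) for v in range(V)], dim=1)
    scene_volume = agg(volumes)                       # input to a 3D density head
    print(scene_volume.shape)                         # torch.Size([2, 512, 64])

Because the aggregation reduces over the view dimension with learned weights, the same module accepts any number of cameras, which is consistent with the abstract's claim of handling arbitrary and dynamic camera layouts.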



Acknowledgements

This research was supported by the Open Project Program of the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (Project No. VRLAB2024C05). The authors gratefully acknowledge this support.

Author information


Corresponding author

Correspondence to Hong Mo.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 421 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Mo, H. et al. (2025). CountFormer: Multi-view Crowd Counting Transformer. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15110. Springer, Cham. https://doi.org/10.1007/978-3-031-72943-0_2


  • DOI: https://doi.org/10.1007/978-3-031-72943-0_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72942-3

  • Online ISBN: 978-3-031-72943-0

  • eBook Packages: Computer Science, Computer Science (R0)
