Abstract
Multi-view counting (MVC) methods have shown their superiority over single-view counterparts, particularly in situations characterized by heavy occlusion and severe perspective distortion. However, hand-crafted heuristic features and the requirement of identical camera layouts limit the applicability and scalability of conventional MVC methods in real-world scenarios. In this work, we propose a concise 3D MVC framework, CountFormer, that elevates multi-view image-level features to a scene-level volume representation and estimates a 3D density map from the volume features. By incorporating a camera encoding strategy, CountFormer embeds camera parameters into both the volume query and the image-level features, enabling it to handle camera layouts that differ significantly. Furthermore, we introduce a feature lifting module that capitalizes on the attention mechanism to transform image-level features into a 3D volume representation for each camera view. A multi-view volume aggregation module then attentively fuses the per-view volumes into a comprehensive scene-level volume representation, allowing CountFormer to handle images captured under arbitrary dynamic camera layouts. The proposed method performs favorably against state-of-the-art approaches across various widely used datasets, demonstrating its greater suitability for real-world deployment than conventional MVC frameworks.
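The feature-lifting step described above can be illustrated with a minimal geometric sketch: each voxel center of the scene volume is projected into a camera view using that camera's intrinsics and extrinsics, and the image-level feature at the projected pixel populates the voxel. This is not the paper's attention-based module; nearest-pixel sampling stands in for the learned cross-attention, and all names (`lift_features`, `vol_bounds`, etc.) are illustrative.

```python
import numpy as np

def lift_features(feat, K, T_wc, vol_shape, vol_bounds):
    """Lift image-level features into a per-camera 3D volume (sketch).

    feat:       (C, H, W) image feature map
    K:          (3, 3) camera intrinsics
    T_wc:       (4, 4) world-to-camera extrinsics
    vol_shape:  (X, Y, Z) voxel grid resolution
    vol_bounds: ((x0, x1), (y0, y1), (z0, z1)) world-space extent
    """
    C, H, W = feat.shape
    X, Y, Z = vol_shape
    (x0, x1), (y0, y1), (z0, z1) = vol_bounds
    # Voxel centers in world coordinates, flattened to (N, 3)
    xs = np.linspace(x0, x1, X)
    ys = np.linspace(y0, y1, Y)
    zs = np.linspace(z0, z1, Z)
    grid = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), -1).reshape(-1, 3)
    # World -> camera -> pixel (pinhole projection)
    cam = T_wc[:3, :3] @ grid.T + T_wc[:3, 3:4]          # (3, N)
    pix = K @ cam
    valid = pix[2] > 1e-6                                # in front of camera
    u = np.round(pix[0] / np.maximum(pix[2], 1e-6)).astype(int)
    v = np.round(pix[1] / np.maximum(pix[2], 1e-6)).astype(int)
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)     # inside the image
    # Copy the sampled feature into each visible voxel
    vol = np.zeros((C, grid.shape[0]))
    vol[:, valid] = feat[:, v[valid], u[valid]]
    return vol.reshape(C, X, Y, Z)
```

In the full model, the per-view volumes produced this way would then be fused by the multi-view aggregation module into a single scene-level volume, with attention weighting each view's contribution per voxel rather than the hard copy used here.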
H. Mo and X. Zhang contributed equally.
Acknowledgements
This research was supported by the Open Project Program of the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (Project No. VRLAB2024C05). The author gratefully acknowledges this support.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mo, H. et al. (2025). CountFormer: Multi-view Crowd Counting Transformer. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15110. Springer, Cham. https://doi.org/10.1007/978-3-031-72943-0_2
Print ISBN: 978-3-031-72942-3
Online ISBN: 978-3-031-72943-0