Abstract
Multi-view counting (MVC) methods have shown their superiority over single-view counterparts, particularly in situations characterized by heavy occlusion and severe perspective distortion. However, hand-crafted heuristic features and the requirement of identical camera layouts limit the applicability and scalability of conventional MVC methods in real-world scenarios. In this work, we propose a concise 3D MVC framework, CountFormer, that elevates multi-view image-level features to a scene-level volume representation and estimates a 3D density map from the volume features. By incorporating a camera encoding strategy, CountFormer embeds camera parameters into both the volume query and the image-level features, enabling it to handle camera layouts that differ significantly. Furthermore, we introduce a feature lifting module that capitalizes on the attention mechanism to transform image-level features into a 3D volume representation for each camera view. A multi-view volume aggregation module then attentively fuses the per-view volumes into a comprehensive scene-level volume representation, allowing CountFormer to handle images captured under arbitrary dynamic camera layouts. The proposed method performs favorably against state-of-the-art approaches across various widely used datasets, demonstrating its greater suitability for real-world deployment than conventional MVC frameworks.
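The feature-lifting step described above can be illustrated with a minimal geometric sketch: each voxel center of the scene volume is projected into a camera view using that camera's intrinsics and extrinsics, and the image-level feature at the projected pixel populates the voxel. This is not the paper's attention-based module; nearest-pixel sampling stands in for the learned cross-attention, and all names (`lift_features`, `vol_bounds`, etc.) are illustrative.

```python
import numpy as np

def lift_features(feat, K, T_wc, vol_shape, vol_bounds):
    """Lift image-level features into a per-camera 3D volume (sketch).

    feat:       (C, H, W) image feature map
    K:          (3, 3) camera intrinsics
    T_wc:       (4, 4) world-to-camera extrinsics
    vol_shape:  (X, Y, Z) voxel grid resolution
    vol_bounds: ((x0, x1), (y0, y1), (z0, z1)) world-space extent
    """
    C, H, W = feat.shape
    X, Y, Z = vol_shape
    (x0, x1), (y0, y1), (z0, z1) = vol_bounds
    # Voxel centers in world coordinates, flattened to (N, 3)
    xs = np.linspace(x0, x1, X)
    ys = np.linspace(y0, y1, Y)
    zs = np.linspace(z0, z1, Z)
    grid = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), -1).reshape(-1, 3)
    # World -> camera -> pixel (pinhole projection)
    cam = T_wc[:3, :3] @ grid.T + T_wc[:3, 3:4]          # (3, N)
    pix = K @ cam
    valid = pix[2] > 1e-6                                # in front of camera
    u = np.round(pix[0] / np.maximum(pix[2], 1e-6)).astype(int)
    v = np.round(pix[1] / np.maximum(pix[2], 1e-6)).astype(int)
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)     # inside the image
    # Copy the sampled feature into each visible voxel
    vol = np.zeros((C, grid.shape[0]))
    vol[:, valid] = feat[:, v[valid], u[valid]]
    return vol.reshape(C, X, Y, Z)
```

In the full model, the per-view volumes produced this way would then be fused by the multi-view aggregation module into a single scene-level volume, with attention weighting each view's contribution per voxel rather than the hard copy used here.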
H. Mo and X. Zhang contributed equally.
Acknowledgements
This research was supported by the Open Project Program of the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (Project No. VRLAB2024C05). The author gratefully acknowledges this support.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mo, H. et al. (2025). CountFormer: Multi-view Crowd Counting Transformer. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15110. Springer, Cham. https://doi.org/10.1007/978-3-031-72943-0_2
Print ISBN: 978-3-031-72942-3
Online ISBN: 978-3-031-72943-0