
3D Crowd Counting via Geometric Attention-Guided Multi-view Fusion

  • Published:
International Journal of Computer Vision

Abstract

Recently, multi-view crowd counting with deep neural networks has been proposed to enable counting in large and wide scenes covered by multiple cameras. Current methods project the camera-view features to the average-height plane of the 3D world, and then fuse the projected multi-view features to predict a 2D scene-level density map on the ground plane (i.e., bird's-eye view). Unlike previous research, we consider the variable height of people in the 3D world and propose to solve the multi-view crowd counting task through 3D feature fusion with 3D scene-level density maps, instead of a 2D density map on the ground plane. Compared to 2D fusion, 3D fusion extracts more information about the people along the z-dimension (height), which helps to address scale variations across the views. The 3D density maps preserve the property of 2D density maps that the sum equals the count, while also providing 3D information about the crowd density. Furthermore, instead of the standard method of copying features along the view ray in the 2D-to-3D projection, we propose an attention module based on a height-estimation network, which forces each 2D pixel to be projected to one 3D voxel along its view ray. We also exploit the projection consistency between the 3D prediction and the ground truth in the 2D views to further enhance counting performance. The proposed method is tested on synthetic and real-world multi-view counting datasets and achieves better or comparable counting performance to the state of the art.
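The attention-guided 2D-to-3D lifting described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes vertical view rays (each pixel lifts straight into a column of Z height voxels), ignores camera calibration, and uses a softmax over hypothetical per-pixel height logits as the attention. Because the attention sums to one along each ray, the total feature mass is preserved, mirroring the sum-equals-count property of the density maps.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def project_2d_to_3d(feat_2d, height_logits):
    """Lift camera-view features into a voxel grid along the height axis.

    The standard projection copies each pixel's feature to every voxel
    on its view ray; here each copy is weighted by a per-pixel softmax
    over Z height bins, so the pixel's feature mass concentrates in
    (ideally) a single voxel along the ray.

    feat_2d       : (H, W, C) camera-view feature map
    height_logits : (H, W, Z) hypothetical per-pixel height scores
    returns       : (H, W, Z, C) voxel features
    """
    attn = softmax(height_logits, axis=-1)           # (H, W, Z), sums to 1 per pixel
    return attn[..., :, None] * feat_2d[:, :, None, :]

# usage: lift a 4x4 feature map into 5 height bins
feat = np.ones((4, 4, 8))
vox = project_2d_to_3d(feat, np.random.randn(4, 4, 5))
# vox.shape == (4, 4, 5, 8), and vox.sum() equals feat.sum()
```

With a one-hot attention (e.g., from a hard height estimate) the weighted copy reduces to placing each pixel's feature in exactly one voxel along its ray.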




Acknowledgements

This work was supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 11212518, CityU 11215820), and by a Strategic Research Grant from City University of Hong Kong (Project No. 7005665).

Author information


Corresponding author

Correspondence to Qi Zhang.

Additional information

Communicated by Akihiro Sugimoto.


Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, Q., Chan, A.B. 3D Crowd Counting via Geometric Attention-Guided Multi-view Fusion. Int J Comput Vis 130, 3123–3139 (2022). https://doi.org/10.1007/s11263-022-01685-7


