
Cross-modal collaborative representation and multi-level supervision for crowd counting

  • Original Paper
  • Published in Signal, Image and Video Processing

Abstract

Crowd features are often extracted from RGB images to perform density estimation and crowd counting. However, RGB images are degraded under particularly poor illumination, making it difficult to identify semantic objects accurately; thermal images can help solve this problem. To exploit optical and thermal imaging information jointly, we propose a crowd counting method based on cross-modal collaborative representation and multi-level supervision. To capture the complementary features of the two modalities, RGB and thermal images are fed into modality-specific streams for cross-modal collaborative learning. Missing modality-specific information is compensated and shared information is enhanced, both through aggregation and distribution computations between the specific streams and a shared stream. Furthermore, to weaken the influence of the background and strengthen the identification of crowd regions, we combine multi-scale crowd feature extraction with region recognition. Multiple output layers are added along the propagation of the multi-modal streams to achieve multi-level supervision. Moreover, we replace the baseline training loss with the Bayesian loss, which supervises the expected count of each annotation point. Finally, comprehensive experiments on the RGBT-CC benchmark demonstrate the effectiveness of the proposed method.
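The aggregation-and-distribution idea described above can be pictured with a minimal PyTorch sketch. Everything below is illustrative rather than taken from the paper: the module name AggregateDistribute, the 1×1-convolution fusion, and the residual updates are assumed simplifications of "aggregation and distribution between specific and shared streams".

```python
import torch
import torch.nn as nn


class AggregateDistribute(nn.Module):
    """Sketch of one aggregation/distribution step between modality streams."""

    def __init__(self, channels: int):
        super().__init__()
        # Aggregation: fuse the two specific streams into the shared stream.
        self.aggregate = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # Distribution: project shared features back to each specific stream.
        self.to_rgb = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_thermal = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, rgb_feat, thermal_feat, shared_feat):
        # Shared stream gathers complementary cues from both modalities.
        shared_feat = shared_feat + self.aggregate(
            torch.cat([rgb_feat, thermal_feat], dim=1))
        # Each specific stream is compensated with the enhanced shared cues.
        rgb_feat = rgb_feat + self.to_rgb(shared_feat)
        thermal_feat = thermal_feat + self.to_thermal(shared_feat)
        return rgb_feat, thermal_feat, shared_feat


# Usage on same-resolution feature maps with 64 channels (illustrative only):
# block = AggregateDistribute(64)
# rgb, thermal, shared = block(rgb, thermal, shared)
```

In this reading, the shared stream enhances its representation by pooling both modalities, while each specific stream recovers information its own modality misses (e.g., crowd cues invisible to RGB in darkness) from the shared features; the actual fusion operators used in the paper may differ.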



Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61771420 and 62001413, the Natural Science Foundation of Hebei Province under Grant No. F2020203064, the China Postdoctoral Science Foundation under Grant No. 2018M641674, the Doctoral Foundation of Yanshan University under Grant No. BL18033, and the Science and Technology Research and Development Program of Qinhuangdao under Grant No. 202101A004.

Author information

Corresponding author

Correspondence to Zhengping Hu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, S., Hu, Z., Zhao, M. et al. Cross-modal collaborative representation and multi-level supervision for crowd counting. SIViP 17, 601–608 (2023). https://doi.org/10.1007/s11760-022-02266-4
