CrowdFusion: Refined Cross-Modal Fusion Network for RGB-T Crowd Counting

  • Conference paper
  • First Online:
Biometric Recognition (CCBR 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14463)

Abstract

Crowd counting is a crucial task in computer vision, with numerous applications in smart security, remote sensing, agriculture, and forestry. While purely image-based models have made significant advances, they tend to perform poorly under low-light and dark conditions. Recent works have partially addressed these challenges by exploring interactions between cross-modal features, such as RGB and thermal, but they often overlook the redundant information present within these features. To address this limitation, we introduce a refined cross-modal fusion network for RGB-T crowd counting. The key design of our method is the refined cross-modal feature fusion module. This module first processes the dual-modal information with a cross attention module, enabling effective interaction between the two modalities. It then leverages adaptively calibrated weights to extract essential features while mitigating the impact of redundant ones. Through this strategy, our method effectively combines the strengths of the dual-path features. Building on this fusion module, our network incorporates hierarchical layers of fused features, which are treated as targets of interest at different scales. This hierarchical perception allows us to capture crowd information from both global and local perspectives, enabling more accurate crowd counting. Extensive experiments demonstrate the superiority of the proposed method.
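
The fusion design described above can be pictured with a short sketch. The following PyTorch code is only a rough illustration of the general idea, not the authors' implementation: cross attention lets the RGB and thermal streams interact, and a squeeze-and-excitation-style gate stands in for the adaptively calibrated weights that down-weight redundant channels. All module and variable names are hypothetical.

```python
# Illustrative sketch of cross-modal fusion for RGB-T features (not the paper's code).
import torch
import torch.nn as nn


class CrossModalFusionSketch(nn.Module):
    """Cross attention between RGB and thermal features, then adaptive channel gating."""

    def __init__(self, channels: int, num_heads: int = 4, reduction: int = 4):
        super().__init__()
        # Each modality queries the other so the two streams interact.
        self.rgb_attends_t = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.t_attends_rgb = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # 1x1 conv merges the two enhanced streams back to a single feature map.
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # SE-style gate: adaptively calibrated channel weights that suppress
        # redundant responses in the fused features.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        b, c, h, w = rgb.shape
        # Flatten spatial positions into token sequences: (B, H*W, C).
        rgb_seq = rgb.flatten(2).transpose(1, 2)
        t_seq = thermal.flatten(2).transpose(1, 2)
        # Cross attention in both directions (query attends to the other modality).
        rgb_enh, _ = self.rgb_attends_t(rgb_seq, t_seq, t_seq)
        t_enh, _ = self.t_attends_rgb(t_seq, rgb_seq, rgb_seq)
        # Back to feature maps, merge, and apply the calibration weights.
        rgb_enh = rgb_enh.transpose(1, 2).reshape(b, c, h, w)
        t_enh = t_enh.transpose(1, 2).reshape(b, c, h, w)
        fused = self.merge(torch.cat([rgb_enh, t_enh], dim=1))
        return fused * self.gate(fused)


# Example: fuse one level of paired RGB and thermal backbone features.
rgb_feat = torch.randn(2, 64, 32, 32)
thermal_feat = torch.randn(2, 64, 32, 32)
fused = CrossModalFusionSketch(64)(rgb_feat, thermal_feat)
print(fused.shape)  # torch.Size([2, 64, 32, 32])
```

In the full network, a block of this kind would presumably be applied at several backbone stages, so that the hierarchy of fused features captures crowds at both global and local scales.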

Acknowledgment

This work is supported by the National Natural Science Foundation of China (No. 62001237), the Joint Funds of the National Natural Science Foundation of China (No. U21B2044), the Jiangsu Planned Projects for Postdoctoral Research Funds (No. 2021K052A), the China Postdoctoral Science Foundation Funded Project (No. 2021M701756), and the Startup Foundation for Introducing Talent of NUIST (No. 2020r084).

Author information

Corresponding author

Correspondence to Shengqin Jiang.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Cai, J., Wang, Q., Jiang, S. (2023). CrowdFusion: Refined Cross-Modal Fusion Network for RGB-T Crowd Counting. In: Jia, W., et al. Biometric Recognition. CCBR 2023. Lecture Notes in Computer Science, vol 14463. Springer, Singapore. https://doi.org/10.1007/978-981-99-8565-4_40

Download citation

  • DOI: https://doi.org/10.1007/978-981-99-8565-4_40

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8564-7

  • Online ISBN: 978-981-99-8565-4

  • eBook Packages: Computer Science, Computer Science (R0)
