CrowdFusion: Refined Cross-Modal Fusion Network for RGB-T Crowd Counting

  • Conference paper
  • First Online:
Biometric Recognition (CCBR 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14463)

Abstract

Crowd counting is a crucial task in computer vision, with numerous applications in smart security, remote sensing, agriculture, and forestry. While purely image-based models have made significant advances, they tend to perform poorly under low-light and dark conditions. Recent works have partially addressed these challenges by exploring interactions between cross-modal features, such as RGB and thermal, but they often overlook the redundant information present within these features. To address this limitation, we introduce a refined cross-modal fusion network for RGB-T crowd counting. The key design of our method is the refined cross-modal feature fusion module. This module first processes the dual-modal information with a cross attention module, enabling effective interaction between the two modalities. It then leverages adaptively calibrated weights to extract essential features while mitigating the impact of redundant ones. Through this strategy, our method effectively combines the strengths of the dual-path features. Building on this fusion module, our network incorporates hierarchical layers of fused features, which are treated as targets of interest at different scales. This hierarchical perception allows us to capture crowd information from both global and local perspectives, enabling more accurate crowd counting. Extensive experiments demonstrate the superiority of the proposed method.
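
The fusion design described above can be pictured with a short sketch. The following PyTorch code is only a rough illustration of the general idea, not the authors' implementation: cross attention lets the RGB and thermal streams interact, and a squeeze-and-excitation-style gate stands in for the adaptively calibrated weights that down-weight redundant channels. All module and variable names are hypothetical.

```python
# Illustrative sketch of cross-modal fusion for RGB-T features (not the paper's code).
import torch
import torch.nn as nn


class CrossModalFusionSketch(nn.Module):
    """Cross attention between RGB and thermal features, then adaptive channel gating."""

    def __init__(self, channels: int, num_heads: int = 4, reduction: int = 4):
        super().__init__()
        # Each modality queries the other so the two streams interact.
        self.rgb_attends_t = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.t_attends_rgb = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # 1x1 conv merges the two enhanced streams back to a single feature map.
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # SE-style gate: adaptively calibrated channel weights that suppress
        # redundant responses in the fused features.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        b, c, h, w = rgb.shape
        # Flatten spatial positions into token sequences: (B, H*W, C).
        rgb_seq = rgb.flatten(2).transpose(1, 2)
        t_seq = thermal.flatten(2).transpose(1, 2)
        # Cross attention in both directions (query attends to the other modality).
        rgb_enh, _ = self.rgb_attends_t(rgb_seq, t_seq, t_seq)
        t_enh, _ = self.t_attends_rgb(t_seq, rgb_seq, rgb_seq)
        # Back to feature maps, merge, and apply the calibration weights.
        rgb_enh = rgb_enh.transpose(1, 2).reshape(b, c, h, w)
        t_enh = t_enh.transpose(1, 2).reshape(b, c, h, w)
        fused = self.merge(torch.cat([rgb_enh, t_enh], dim=1))
        return fused * self.gate(fused)


# Example: fuse one level of paired RGB and thermal backbone features.
rgb_feat = torch.randn(2, 64, 32, 32)
thermal_feat = torch.randn(2, 64, 32, 32)
fused = CrossModalFusionSketch(64)(rgb_feat, thermal_feat)
print(fused.shape)  # torch.Size([2, 64, 32, 32])
```

In the full network, a block of this kind would presumably be applied at several backbone stages, so that the hierarchy of fused features captures crowds at both global and local scales.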

Acknowledgment

This work is supported by the National Natural Science Foundation of China (No. 62001237), the Joint Funds of the National Natural Science Foundation of China (No. U21B2044), the Jiangsu Planned Projects for Postdoctoral Research Funds (No. 2021K052A), the China Postdoctoral Science Foundation Funded Project (No. 2021M701756), and the Startup Foundation for Introducing Talent of NUIST (No. 2020r084).

Author information

Corresponding author

Correspondence to Shengqin Jiang.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Cai, J., Wang, Q., Jiang, S. (2023). CrowdFusion: Refined Cross-Modal Fusion Network for RGB-T Crowd Counting. In: Jia, W., et al. Biometric Recognition. CCBR 2023. Lecture Notes in Computer Science, vol 14463. Springer, Singapore. https://doi.org/10.1007/978-981-99-8565-4_40

Download citation

  • DOI: https://doi.org/10.1007/978-981-99-8565-4_40

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8564-7

  • Online ISBN: 978-981-99-8565-4

  • eBook Packages: Computer Science, Computer Science (R0)
