Research Article · DOI: 10.1145/3652583.3658108

Semantic-guided RGB-Thermal Crowd Counting with Segment Anything Model

Published: 07 June 2024

Abstract

RGB-Thermal (RGB-T) crowd counting leverages the complementary nature of the visible and thermal modalities for accurate counting. However, real-world scenarios introduce challenges such as misidentifying background elements like trees and lampposts as individuals, leading to inaccurate counts. Existing methods use segmentation as a preliminary step and are therefore constrained by segmentation accuracy. In this paper, we propose a novel method that uses the Segment Anything Model (SAM) to distinguish between the foreground and background of images. Specifically, we first use SAM to obtain the semantic map of the original image. We then extract the modality features and semantic features corresponding to the RGB and thermal modalities through multimodal feature extraction. These features are fused by the Semantic-guided Feature Fusion module, and the Multi-level Decoder generates the density map and the final counting result. Our approach achieves state-of-the-art performance on the RGBT-CC dataset.
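The core intuition behind the abstract can be illustrated with a minimal sketch: density-map counting integrates a per-pixel density prediction over the image, so spurious responses on background structures (trees, lampposts) inflate the count, and a foreground mask such as one derived from SAM segments can suppress them. The sketch below is purely illustrative and is not the paper's method; the paper fuses semantic and modality features inside the network rather than hard-masking the output, and the function name `semantic_guided_count` is a hypothetical helper introduced here.

```python
import numpy as np

def semantic_guided_count(density_map: np.ndarray, semantic_mask: np.ndarray) -> float:
    """Suppress background responses in a predicted density map with a
    binary foreground mask (e.g. derived from SAM segments), then integrate.

    density_map  : (H, W) non-negative per-pixel crowd density.
    semantic_mask: (H, W) binary map, 1 = foreground (person), 0 = background.
    """
    gated = density_map * semantic_mask  # zero out background responses
    return float(gated.sum())            # count = integral of the density map

# Toy example: two "person" responses plus one spurious background response
# (e.g. a lamppost the counting head fired on).
density = np.zeros((8, 8))
density[1, 1] = 1.0   # person 1
density[5, 5] = 1.0   # person 2
density[7, 0] = 0.8   # false response on background

mask = np.ones((8, 8))
mask[7, 0] = 0        # mask marks this location as background

print(semantic_guided_count(density, mask))  # 2.0 (vs. 2.8 without the mask)
```

Hard-masking like this is only as good as the segmentation, which is exactly the limitation of segmentation-as-preprocessing that the abstract notes; feature-level fusion lets the network weigh the semantic evidence instead of trusting it absolutely.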



Published In

ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval
May 2024, 1379 pages
ISBN: 9798400706196
DOI: 10.1145/3652583

Publisher

Association for Computing Machinery, New York, NY, United States
Author Tags

1. RGB-T crowd counting
2. Segment Anything Model
3. self-attention
4. transformer


Conference

ICMR '24
Overall Acceptance Rate: 254 of 830 submissions, 31%
