DOI: 10.1145/3508546.3508548

Multi-level Convolutional Transformer with Adaptive Ranking for Semi-supervised Crowd Counting

Published: 25 February 2022

ABSTRACT

Most existing crowd counting methods are supervised algorithms based on pure convolutional neural networks. Although these methods have attained good results on some datasets, they still face several common problems. The cost of labeling annotations for supervised methods is huge, and the shortage of labeled datasets limits the further development of supervised crowd counting algorithms. Meanwhile, pure CNN-based algorithms have certain limitations in modeling the relationships among extracted features. To overcome these problems, we propose a semi-supervised crowd counting algorithm that combines a CNN and a transformer. Specifically, our method consists of two parts: a Multi-Level Convolutional Transformer (MLCT) and an Adaptive Scale Module (ASM). MLCT is the counting branch, with its front end and back end being the CNN and the transformer, respectively. ASM outputs an adaptive scale factor for unlabeled crowd images. We generate a ranking list based on this factor, which is fed into the MLCT, and the loss is computed from the order of the list. Unlike most crowd counting methods, we use a region-level regression target for labeled images, which is a weaker form of supervision than location-level regression. Furthermore, we train the entire model using a novel loss function that combines L1 loss and ranking loss. Experimental results on three challenging datasets, ShanghaiTech Part A, ShanghaiTech Part B, and UCF-QNRF, all demonstrate the effectiveness of the proposed approach.
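The paper's exact loss formulation is not given in the abstract, but the general idea, an L1 count loss on labeled images plus a ranking loss that penalizes order violations in a ranked list of predictions from unlabeled images, can be sketched as follows. This is a minimal, stdlib-only illustration under the assumption that the ranking list is ordered so that each entry should predict a count no smaller than the next; the names `margin` and `lam` are illustrative, not from the paper.

```python
def ranking_loss(ranked_preds, margin=0.0):
    """Hinge-style ranking loss over a list of predicted counts.

    ranked_preds is assumed ordered so that ranked_preds[i] comes from a
    region expected to contain at least as many people as the region of
    ranked_preds[i + 1]; each violation of that order is penalized.
    """
    pairs = list(zip(ranked_preds, ranked_preds[1:]))
    if not pairs:
        return 0.0
    # max(0, smaller - larger + margin): zero when the order is respected.
    total = sum(max(0.0, smaller - larger + margin) for larger, smaller in pairs)
    return total / len(pairs)


def combined_loss(pred_counts, gt_counts, ranked_preds, lam=1.0):
    """L1 count loss on labeled samples plus weighted ranking loss."""
    l1 = sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)
    return l1 + lam * ranking_loss(ranked_preds)
```

For example, a correctly ordered list such as `[5.0, 3.0, 1.0]` incurs zero ranking loss, while an inverted pair such as `[1.0, 3.0]` is penalized by the size of the violation.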


  • Published in

    ACAI '21: Proceedings of the 2021 4th International Conference on Algorithms, Computing and Artificial Intelligence
    December 2021
    699 pages
    ISBN: 9781450385053
    DOI: 10.1145/3508546

    Copyright © 2021 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 173 of 395 submissions, 44%
