research-article

Dilated Convolution-based Feature Refinement Network for Crowd Localization

Authors:
Xingyu Gao, Institute of Microelectronics, Chinese Academy of Sciences, China (ORCID 0000-0002-4660-8092)
Jinyang Xie, School of Information Science and Engineering, Shandong Normal University, China (ORCID 0000-0002-6668-0710)
Zhenyu Chen, Big Data Center, State Grid Corporation of China, and China Electric Power Research Institute, China (ORCID 0000-0002-4989-7109)
An-An Liu, School of Electrical and Information Engineering, Tianjin University, China (ORCID 0000-0001-5755-9145)
Zhenan Sun, Institute of Automation, Chinese Academy of Sciences, and University of Chinese Academy of Sciences, China (ORCID 0000-0003-4029-9935)
Lei Lyu, School of Information Science and Engineering, Shandong Normal University, China (ORCID 0000-0001-9521-6039)

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 6, Article No. 217, pp. 1–16. https://doi.org/10.1145/3571134

Published: 12 July 2023

Abstract

As an emerging computer vision task, crowd localization has received increasing attention due to its ability to produce spatially more accurate predictions. However, continuous scale variations in complex crowd scenes produce tiny individuals near the image edges, which existing methods fail to localize precisely. To alleviate this problem, we propose a novel Dilated Convolution-based Feature Refinement Network (DFRNet) that enhances representation learning. DFRNet is built with three branches that capture the information of each individual in crowd scenes more precisely. We introduce a Feature Perception Module that models long-range contextual information at different scales by adopting multiple dilated convolutions, thus providing sufficient feature information to perceive tiny individuals at the edges of images. A Feature Refinement Module is then deployed at multiple stages of the three branches to facilitate the mutual refinement of feature information across scales, further improving the expressiveness of multi-scale contextual information. By incorporating these modules, DFRNet can locate individuals in complex scenes more precisely. Extensive experiments on multiple datasets demonstrate that the proposed method outperforms existing methods and adapts more accurately to complex crowd scenes.
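
The idea of gathering context at several scales with multiple dilated convolutions can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the function names, the 3×3 box kernel, the dilation rates (1, 2, 3), and the element-wise sum used to fuse the parallel branches are all assumptions made for illustration, since the abstract does not specify them.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv2d(image, kernel, dilation):
    """Same-size 2D dilated convolution (cross-correlation), zero padded."""
    height, width = len(image), len(image[0])
    k = len(kernel)
    center = k // 2
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    # Kernel taps are spread apart by the dilation rate.
                    yy = y + (i - center) * dilation
                    xx = x + (j - center) * dilation
                    if 0 <= yy < height and 0 <= xx < width:
                        total += kernel[i][j] * image[yy][xx]
            output[y][x] = total
    return output


def multi_dilation_features(image, kernel, dilations=(1, 2, 3)):
    """Fuse parallel dilated branches by element-wise sum (assumed rule)."""
    height, width = len(image), len(image[0])
    fused = [[0.0] * width for _ in range(height)]
    for d in dilations:
        branch = dilated_conv2d(image, kernel, d)
        for y in range(height):
            for x in range(width):
                fused[y][x] += branch[y][x]
    return fused


# Hypothetical input: a single bright pixel at the center of a 7x7 image.
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 1.0
box_kernel = [[1.0] * 3 for _ in range(3)]
fused = multi_dilation_features(image, box_kernel)

for d in (1, 2, 3):
    print("dilation", d, "-> effective kernel size", effective_kernel_size(3, d))
```

With a fixed 3×3 kernel, dilation rates 1, 2, and 3 cover spans of 3, 5, and 7 pixels, so the corner position `fused[0][0]` still receives a response from the center pixel through the dilation-3 branch. This is the sense in which larger dilation rates supply longer-range context without adding kernel weights.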

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Interest point and salient region detections

Published in
ACM Transactions on Multimedia Computing, Communications, and Applications Volume 19, Issue 6
November 2023
858 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3599695
Editor:
Abdulmotaleb El Saddik
Mohamed Bin Zayed University of Artificial Intelligence, UAE and University of Ottawa, Canada
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 12 July 2023
- Online AM: 20 December 2022
- Accepted: 1 September 2022
- Revised: 26 July 2022
- Received: 13 April 2022
Author Tags
Dilated convolution
Feature Refinement
crowd localization
contextual information