Abstract
This paper presents a novel method for accurate people counting in highly dense crowd images. The proposed method consists of three modules: foreground region extraction (EF), a pixel-wise attention mechanism (PAM), and a single-column density map estimator (S-DME). EF uses a fully convolutional network to efficiently suppress interference from complex backgrounds; PAM performs pixel-wise classification of crowd images to generate high-quality local crowd density maps; and S-DME is a carefully designed single-column network that learns more representative features with far fewer parameters. In addition, two new evaluation metrics are introduced to give a comprehensive understanding of how the individual modules of the algorithm perform. Experiments demonstrate that the approach achieves state-of-the-art results on several challenging datasets, including our own dataset, which features highly cluttered environments and varied camera perspectives.
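As a rough illustration of how a density-map pipeline of this kind yields a count, the sketch below zeroes out density at background pixels and then sums the remaining density, since in density-map approaches the estimated count is the integral of the map. The function names, the binary mask representation, and the toy values are assumptions made for illustration only; the abstract does not specify how the EF output is combined with the density map, and PAM and S-DME are not modelled here.

```python
from typing import List

Grid = List[List[float]]
Mask = List[List[int]]


def apply_foreground_mask(density: Grid, mask: Mask) -> Grid:
    """Zero the density at background pixels (mask value 0).

    Hypothetical stand-in for the effect of foreground extraction;
    the actual EF module is a fully convolutional network.
    """
    return [
        [d if m else 0.0 for d, m in zip(d_row, m_row)]
        for d_row, m_row in zip(density, mask)
    ]


def crowd_count(density: Grid) -> float:
    """Estimated count = sum of the density map over all pixels."""
    return sum(sum(row) for row in density)


# Toy 2x2 density map and foreground mask (made-up values).
density_map = [[0.2, 0.3],
               [0.5, 0.1]]
foreground = [[1, 1],
              [1, 0]]

masked_map = apply_foreground_mask(density_map, foreground)
estimated_count = crowd_count(masked_map)
```

Masking before summation means that spurious density predicted on background clutter does not inflate the count, which is the intuition behind suppressing the background first.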
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Cite this article
Wang, B., Cao, G., Shang, Y. et al. Single-column CNN for crowd counting with pixel-wise attention mechanism. Neural Comput & Applic 32, 2897–2908 (2020). https://doi.org/10.1007/s00521-018-3810-9
DOI: https://doi.org/10.1007/s00521-018-3810-9