Abstract
The crowded scenario not only contains instances at various scales but also introduces a variety of occlusion patterns ranging from non-occluded situations to heavily occluded cases, making the shapes of the instances different. All of those can result in human detectors being hard to apply to them. Feature pyramid networks (FPN), as an indispensable part of generic object detectors, can significantly boost detection performance involving objects at different scales. As a result, in this paper, we equip FPN with a multi-scale feature fusion technology and attention mechanisms to improve the performance of human detection in crowded scenarios. Firstly, we designed a feature pyramid structure with a refined hierarchical-split block, referred to as Scale-FPN, which can better handle the challenging problem of scale variation across object instances. Secondly, an attention-based lateral connection (ALC) module with spatial and channel attention mechanisms was proposed to replace the lateral connection in the FPN, which enhances the representational ability of feature maps through rich spatial and semantic information and lets detectors be capable of focusing on important features of occlusion patterns. Additionally, a bottom-up path augmentation (BPA) module was adopted to exploit the features of the Scale-FPN and ALC modules. To verify the effectiveness of the proposed method, we combined Scale-FPN, ALC and BPA, namely SA-FPN, and integrated it into the architecture of a crowded human detector. Experiments on the challenging CrowdHuman benchmark sufficiently validate the effectiveness of SA-FPN. Specifically, it improves the state-of-the-art result of CrowdDet from 41.4% to 39.9% \(MR^{-2}\), which indicates that the detector with SA-FPN brings in fewer false positives.
Similar content being viewed by others
References
Nguyen DT, Li W, Ogunbona PO (2016) Human detection from images and videos: A survey. Pattern Recognition 51:148–175
Brunetti A, Buongiorno D, Trotta GF, Bevilacqua V (2018) Computer vision and deep learning techniques for pedestrian detection and tracking: A survey. Neurocomputing 300:17–33
Zhang S, Benenson R, Omran M, Hosang J, Schiele B (2016) How far are we from solving pedestrian detection? In: Proceedings of the iEEE conference on computer vision and pattern recognition, pp 1259–1267
Leal-Taixé L, Milan A, Reid I, Roth S, Schindler K (2015) Motchallenge 2015: Towards a benchmark for multi-target tracking. 150401942
Newell A, Yang K, Deng J (2016) Stacked hourglass networks for human pose estimation. In: European conference on computer vision, Springer, pp 483–499
Xiao B, Wu H, Wei Y (2018) Simple baselines for human pose estimation and tracking. In: Proceedings of the European conference on computer vision (ECCV), pp 466–481
Xiao T, Li S, Wang B, Lin L, Wang X (2017) Joint detection and identification feature learning for person search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3415–3424
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. nature 521(7553):436–444
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115(3):211–252
Wang X, Xiao T, Jiang Y, Shao S, Sun J, Shen C (2018) Repulsion loss: Detecting pedestrians in a crowd. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7774–7783
Wu X, Sahoo D, Hoi SC (2020) Recent advances in deep learning for object detection. Neurocomputing 396:39–64
Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88(2):303–338
Zhang K, Xiong F, Sun P, Hu L, Li B, Yu G (2019) Double anchor r-cnn for human detection in a crowd. 190909998
Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. 150601497
Shao S, Zhao Z, Li B, Xiao T, Yu G, Zhang X, Sun J (2018) Crowdhuman: A benchmark for detecting human in a crowd. 180500123
Chi C, Zhang S, Xing J, Lei Z, Li SZ, Zou X (2019) Relational learning for joint head and human detection. 190910674
Mathias M, Benenson R, Timofte R, Van Gool L (2013) Handling occlusions with franken-classifiers. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1505–1512
Tian Y, Luo P, Wang X, Tang X (2015) Deep learning strong parts for pedestrian detection. In: Proceedings of the IEEE international conference on computer vision, pp 1904–1912
Zhou C, Yuan J (2016) Learning to integrate occlusion-specific detectors for heavily occluded pedestrian detection. In: Asian Conference on Computer Vision, Springer, pp 305–320
Bodla N, Singh B, Chellappa R, Davis LS (2017) Soft-nms–improving object detection with one line of code. In: Proceedings of the IEEE international conference on computer vision, pp 5561–5569
Zhang S, Wen L, Bian X, Lei Z, Li SZ (2018) Occlusion-aware r-cnn: detecting pedestrians in a crowd. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 637–653
Jiang B, Luo R, Mao J, Xiao T, Jiang Y (2018) Acquisition of localization confidence for accurate object detection. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 784–799
He Y, Zhu C, Wang J, Savvides M, Zhang X (2019) Bounding box regression with uncertainty for accurate object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2888–2897
Liu S, Huang D, Wang Y (2019) Adaptive nms: Refining pedestrian detection in a crowd. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6459–6468
Rukhovich D, Sofiiuk K, Galeev D, Barinova O, Konushin A (2020) Iterdet: Iterative scheme for objectdetection in crowded environments. 200505708
Ge Z, Jie Z, Huang X, Xu R, Yoshie O (2020) Ps-rcnn: Detecting secondary human instances in a crowd via primary object suppression. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6
Zhu J, Yuan Z, Zhang C, Chi W, Ling Y, et al (2020) Crowded human detection via an anchor-pair network. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 1391–1399
Yuan P, Lin S, Cui C, Du Y, Guo R, He D, Ding E, Han S (2020) Hs-resnet: Hierarchical-split block on convolutional neural network. 201007621
Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2117–2125
Woo S, Park J, Lee JY, Kweon IS (2018) Cbam: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp 3–19
Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8759–8768
Dollár P, Wojek C, Schiele B, Perona P (2009) Pedestrian detection: A benchmark. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp 304–311
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 580–587
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788
He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence 37(9):1904–1916
Girshick R (2015) Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision, pp 1440–1448
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: Single shot multibox detector. In: European conference on computer vision, Springer, pp 21–37
Lin TY, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp 2980–2988
Law H, Deng J (2018) Cornernet: Detecting objects as paired keypoints. In: Proceedings of the European conference on computer vision (ECCV), pp 734–750
Zhou X, Wang D, Krähenbühl P (2019) Objects as points. 190407850
Zhu C, He Y, Savvides M (2019) Feature selective anchor-free module for single-shot object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 840–849
Tan M, Pang R, Le QV (2020) Efficientdet: Scalable and efficient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10781–10790
Liu S, Huang D, Wang Y (2019) Learning spatial fusion for single-shot object detection. 191109516
Redmon J, Farhadi A (2018) Yolov3: An incremental improvement. 180402767
Zhao Q, Sheng T, Wang Y, Tang Z, Chen Y, Cai L, Ling H (2018) M2det: A single-shot object detector based on multi-level feature pyramid network. arXiv e-prints pp arXiv–1811
Neubeck A, Van Gool L (2006) Efficient non-maximum suppression. In: 18th International Conference on Pattern Recognition (ICPR’06), IEEE, vol 3, pp 850–855
Wang X, Girshick R, Gupta A, He K (2018) Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7794–7803
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141
Misra D, Nalamada T, Arasanipalai AU, Hou Q (2021) Rotate to attend: Convolutional triplet attention module. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 3139–3148
Pang Y, Xie J, Khan MH, Anwer RM, Khan FS, Shao L (2019) Mask-guided attention network for occluded pedestrian detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 4967–4975
Zhang S, Yang J, Schiele B (2018) Occluded pedestrian detection through guided attention in cnns. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6995–7003
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: Common objects in context. In: European conference on computer vision, Springer, pp 740–755
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Robertson S (2008) A new interpretation of average precision. In: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp 689–690
Chu X, Zheng A, Zhang X, Sun J (2020) Detection in crowded scenes: One proposal, multiple predictions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12214–12223
He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision, pp 2961–2969
Huang X, Ge Z, Jie Z, Yoshie O (2020) Nms by representative region: Towards crowded pedestrian detection by proposal pairing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10750–10759
Zhou P, Zhou C, Peng P, Du J, Sun X, Guo X, Huang F (2020) Noh-nms: Improving pedestrian detection by nearby objects hallucination. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1967–1975
Xu Z, Li B, Yuan Y, Dang A (2020) Beta r-cnn: Looking into pedestrian detection from another perspective. Advances in Neural Information Processing Systems
Acknowledgements
This work was supported by “Thirteenth Five-Year Plan” Science and Technology Project of Jilin Provincial Education Department (No. JJKH20190710KJ) and Jilin Science and Technology Innovation Development Program Projects (No. 20190302202).
Author information
Authors and Affiliations
Rights and permissions
About this article
Cite this article
Zhou, X., Zhang, L. SA-FPN: An effective feature pyramid network for crowded human detection. Appl Intell 52, 12556–12568 (2022). https://doi.org/10.1007/s10489-021-03121-8
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10489-021-03121-8