Abstract
With the development of high-resolution camera technology, the coverage of a single shooting scene has reached the square-kilometer level: thousands of people can be observed simultaneously, and faces are clearly recognizable from a hundred meters away. Images captured by such high-resolution cameras differ greatly from those captured by conventional cameras. To address the many detection targets in high-resolution images, the large variation in target scale caused by spatial position, and the difficult feature extraction and poor detection results caused by target overlap and occlusion, this paper proposes SARNet, a multi-target detection method that optimizes feature extraction with spatial attention. Spatial attention is used to optimize the backbone network and enlarge the local receptive field, which strengthens the representation ability of the network and improves feature extraction for small targets; features at different scales from a dilated feature pyramid network are then passed through a deformable region-of-interest pooling operation, which effectively improves detection accuracy across scales. Experimental results show that the proposed method achieves 51.9% mAP on the PANDA dataset, outperforming existing detection algorithms. Pedestrian and vehicle experiments on the COCO2017 dataset further verify the feasibility of the method.
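The paper's exact architecture is not reproduced in this abstract, but the spatial-attention idea it describes, reweighting every spatial location of a backbone feature map through a learned gate with a wide receptive field, can be sketched as follows. This is a minimal illustration in PyTorch assuming a CBAM-style design; the module name `SpatialAttention` and the kernel size are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Illustrative spatial-attention gate (CBAM-style, not SARNet's exact module).

    Pools the channel axis, then learns a per-pixel gate with a
    large-kernel convolution, giving each location a wide receptive field.
    """

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # 2 input channels: channel-wise average pool and max pool maps.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_pool = x.mean(dim=1, keepdim=True)   # (N, 1, H, W)
        max_pool = x.amax(dim=1, keepdim=True)   # (N, 1, H, W)
        gate = torch.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return x * gate                          # reweight every spatial location


# Usage: insert after a backbone stage so small-target regions are emphasised
# before the features enter the pyramid.
feat = torch.randn(1, 256, 64, 64)
out = SpatialAttention()(feat)
assert out.shape == feat.shape
```

Because the gate lies in (0, 1), the module can only attenuate uninformative locations; the backbone's feature dimensions are unchanged, so it drops into an existing detector without altering downstream layers.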
Acknowledgements
This work was supported by the National Science Foundation of China under Grant U1803261; the National Natural Science Foundation of China under Grants 61562086 and 61966035; the Funds for Creative Research Groups of Higher Education of Xinjiang Uygur Autonomous Region under Grant XJEDU2017T002; the Autonomous Region Graduate Innovation Project under Grant XJ2019G072; and the Tianshan Innovation Team Plan Project of Xinjiang Uygur Autonomous Region under Grant 202101642.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Wei, H., Zhang, Q., Han, J. et al. SARNet: Spatial Attention Residual Network for pedestrian and vehicle detection in large scenes. Appl Intell 52, 17718–17733 (2022). https://doi.org/10.1007/s10489-022-03217-9