PVDet: Towards pedestrian and vehicle detection on gigapixel-level images

https://doi.org/10.1016/j.engappai.2022.105705

Abstract

Recently, gigapixel photography has developed considerably and is gradually being applied in remote sensing, video surveillance, and other fields. Gigapixel images cover a field of view at the square-kilometer level (containing thousands of targets) with up to 100 times scale variation. The differences in target pose, scale, and occlusion are huge, and most existing target detection algorithms cannot process such images directly. To solve these problems, we propose a new multi-target pedestrian and vehicle detector, PVDet (Towards Pedestrian and Vehicle Detection on Gigapixel-level images), for gigapixel-level images. First, the DPRNet (Deformable deeP Residual Network) is designed as the backbone network to enlarge the effective receptive field and improve the feature representation of pose-varying and occluded targets. Then, the PAFPN (Path Aggregation Feature Pyramid Network) is adopted to process the multi-scale features extracted by the backbone, boosting the multi-scale target modeling capability and the localization of small targets. Finally, the DyHead module is introduced to enhance the detection head’s scale, spatial, and task awareness, further optimizing pedestrian and vehicle classification and localization. Experimental results on the PANDA dataset show that the proposed method improves pedestrian and vehicle detection on gigapixel-level images by 10.4 AP over the baseline and outperforms the other State-of-the-Art detection algorithms compared. We also conducted experiments on the PASCAL VOC 2012 dataset to further demonstrate the generalization capability and effectiveness of the proposed method.

Introduction

Object detection is one of the most central tasks in computer vision, aiming to distinguish classes of objects and locate positions in images, as well as being the technical support for many practical applications. Pedestrian and vehicle detection are popular research topics in target detection, with rich applications in assisted driving systems, intelligent monitoring, and other fields. The frequent mutual occlusion of vehicles in the city, the large-scale variation of vehicle pictures, and the non-rigid characteristics of pedestrians make the multi-pose and occlusion problems more severe and pose a significant challenge for pedestrian and vehicle detection.

As photographic technology advances quickly, gigapixel-level photography equipment is progressively being integrated into various application scenarios. The COCO 2014 dataset (Lin et al., 2014), commonly used in previous studies, has a resolution of only 640 × 640. Higher-resolution datasets are limited to ones such as VisDrone 2018 (Zhu et al., 2020) (2k × 1.5k) and DOTA 2018 (Xia et al., 2018) (4k × 4k), and each of their images contains only a few to tens of targets.

PANDA (Wang et al., 2020b), a gigapixel-level video and image dataset, was proposed by researchers at Tsinghua University to advance the use of high-resolution images and videos in computer vision. The dataset features a wide FoV and high resolution (26k × 15k), up to 4k targets in a scene, and significant size differences between targets (100× scale variation). Gigapixel-level images bring more challenges to object detection, and several studies based on the PANDA dataset have emerged recently. Li et al. (2022) use a two-step cropping strategy to process the original high-resolution images and then apply the Region NMS algorithm to reduce the impact of targets cut by the crops. The threshold setting directly influences the accuracy of the results. However, the optimal IoU threshold was not found, and the two-step cropping method is slow. To boost the speed of gigapixel-level detection, Chen et al. (2022) propose a real-time detector, GigaDet, which uses PGN (Patch Generation Network) modules to filter out regions unrelated to the targets of interest. However, its training process is not end-to-end, and pedestrian posture and occlusion features are not considered. Wei et al. (2022) extend detection to both people and vehicles. The authors propose SARNet, which uses transformer attention to optimize Faster R-CNN, and obtain practical improvements, but the two-stage detection algorithm gives the model a large computational overhead and parameter count. These methods have achieved good results. However, there is still considerable room for improvement on gigapixel-level images.
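To make the crop-then-merge idea discussed above concrete, the following is a minimal sketch of generic overlapping tiling followed by greedy IoU-based NMS. It is not the Region NMS of Li et al. (2022) nor the pipeline of any cited work; the function names, tile size, overlap, and IoU threshold are assumptions chosen purely for illustration.

```python
# Illustrative sketch only: generic overlapping tiling plus greedy IoU-based NMS.
# This is NOT the Region NMS of Li et al. (2022); the tile size, overlap, and
# IoU threshold are assumed values chosen for the example.

def make_tiles(width, height, tile=2048, overlap=256):
    """Return (x0, y0, x1, y1) crop windows that together cover the image."""
    stride = tile - overlap
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    # Add a final window flush with the right/bottom edge if a strip is left.
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in ys for x in xs]


def to_global(det, tile_box):
    """Shift a tile-local detection (x0, y0, x1, y1, score) to image coordinates."""
    ox, oy = tile_box[0], tile_box[1]
    return (det[0] + ox, det[1] + oy, det[2] + ox, det[3] + oy, det[4])


def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1, ...)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(dets, iou_thr=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    kept = []
    for d in sorted(dets, key=lambda d: d[4], reverse=True):
        if all(iou(d, k) <= iou_thr for k in kept):
            kept.append(d)
    return kept
```

With these assumed settings, a 26,000 × 15,000 image is split into 15 × 9 = 135 overlapping tiles. A target lying in an overlap region is detected in more than one tile, which is why the merged detections need a suppression step whose IoU threshold affects the final accuracy.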

To address the problems mentioned above, this paper proposes a new end-to-end detector PVDet (Pedestrian and Vehicle Detection on Gigapixel-level Images), and the main contributions of this paper are as follows.

  • (1) Firstly, a novel backbone called DPRNet (Deformable deeP Residual Network) is proposed to improve the feature extraction capability for differently shaped and occluded targets.

  • (2) To handle the large inter-target scale differences and the small targets in gigapixel-level images, PAFPN (Path Aggregation Feature Pyramid Network) is used to process the multi-layer features extracted by the backbone, delivering high-resolution feature information through shorter paths. It iteratively fuses the multi-layer features to obtain a high-resolution feature map with richer semantic information and to improve the detection accuracy for targets of different scales and for small targets.

  • (3) To further utilize the information from PAFPN, multiple DyHead modules are introduced, which possess learning and sensing capabilities for scale, space, and task. They effectively enhance the detection head’s ability to classify and localize pedestrians and vehicles in high-resolution images.

  • (4) Extensive experiments show that the proposed method achieves the best performance on the PANDA dataset compared with other State-of-the-Art methods. We also conduct experiments on PASCAL VOC 2007 to verify the generalizability and effectiveness of the proposed method.
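The fusion order behind contribution (2) can be sketched as follows. This is a simplified illustration of a path-aggregation pyramid, not the authors' implementation: the learned lateral and smoothing convolutions of a real PAFPN are omitted, each feature map is a single-channel 2-D list, and all function names are hypothetical.

```python
# Illustrative sketch of the PAFPN fusion order only. Real PAFPN layers apply
# learned convolutions at every step; they are omitted here (an assumption
# made for brevity), and every feature map is a single-channel 2-D list.

def upsample2x(m):
    """Nearest-neighbour 2x upsampling of a 2-D list."""
    out = []
    for row in m:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def downsample2x(m):
    """Stride-2 subsampling, standing in for a stride-2 convolution."""
    return [row[::2] for row in m[::2]]


def add(a, b):
    """Element-wise sum of two equally sized 2-D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def pafpn(feats):
    """feats: backbone maps ordered fine -> coarse, each half the previous size."""
    # Top-down pathway (as in FPN): carry semantics from coarse to fine levels.
    top_down = [None] * len(feats)
    top_down[-1] = feats[-1]
    for i in range(len(feats) - 2, -1, -1):
        top_down[i] = add(feats[i], upsample2x(top_down[i + 1]))
    # Bottom-up pathway (path aggregation): carry high-resolution
    # localization cues back to the coarser levels through a short path.
    outs = [top_down[0]]
    for i in range(1, len(feats)):
        outs.append(add(top_down[i], downsample2x(outs[-1])))
    return outs
```

Given three toy maps of sizes 4 × 4, 2 × 2, and 1 × 1, every output level ends up containing contributions from all three inputs, which is the sense in which the multi-layer features are fused iteratively.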

The rest of the paper is organized as follows: the next section presents the related work of the paper, and the third section describes the method proposed in this paper. The fourth section compares and analyzes the method of this paper with advanced detection methods in a number of experiments. Finally, the fifth section provides a comprehensive summary of the paper.

Section snippets

Traditional methods

The literature (Ren and Li, 2015, Mao et al., 2015, Wang et al., 2019, Zhou and Yu, 2021, Hua et al., 2021, Kim et al., 2015, Ali and Bayoumi, 2016, Satzoda and Trivedi, 2015, Yuan et al., 2016) presented traditional pedestrian and vehicle detection methods. Ren and Li (2015) adopted the LogitBoost algorithm combined with a mapped HOG (Histogram of Oriented Gradients) descriptor to train the classifier, which improves the training efficiency of the pedestrian detector. Considering that

PVDet

In gigapixel-level resolution images, pedestrian and vehicle targets are characterized by large-scale variations, wide distribution, severe target occlusion, and deformation problems, and smaller targets are more challenging to detect. We construct a new pedestrian and vehicle detection model called PVDet, based on the basic idea of adaptive sample selection to cope with these problems. Section 3.1 presents the proposed overall framework for pedestrian and vehicle detection. In Section 3.2, a

Dataset

PANDA-Image (Wang et al., 2020b) is the first human-centric gigapixel-level dataset. The PANDA-Image dataset consists of 600 live images in a variety of scenes at a resolution of approximately 26k × 15k, with a field of view coverage of up to 1 km², allowing thousands of targets to be observed simultaneously over a scale variation of nearly a hundred times. As can be observed in Fig. 5, there is a considerable variation in scale between targets and irregular population distribution. The dataset

Conclusion

This paper proposes a pedestrian-vehicle detector, PVDet, for gigapixel-resolution images. First, we use Deformable ConvNets v2 to improve the backbone's modeling capability for deformed targets and to better extract pose-variant pedestrian features. Then, higher-resolution feature information is aggregated using PAFPN to improve the detection performance for multi-scale and small targets. Subsequent multiple DyHead modules with scale-aware, spatial-aware, and task-aware capabilities are

CRediT authorship contribution statement

Wanghao Mo: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Validation, Writing – original draft, Visualization. Wendong Zhang: Resources, Supervision, Funding acquisition. Hongyang Wei: Methodology, Formal analysis. Ruyi Cao: Validation, Data curation. Yan Ke: Validation. Yiwen Luo: Visualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of Xinjiang Uygur Autonomous Region, China (2020D01C033) and Doctoral Research Fund Project of Xinjiang University, China (202112120001).

Wanghao Mo received the B.E. degree from Wuhan university institute of splendid, Wuhan, China, in 2019. He is currently pursuing the M.S. degree with the Institute of Software in Xinjiang University. His research interests include computer vision, automated driving, and deep learning.

References (59)

  • Chen, Qiang, Wang, Yingming, Yang, Tong, Zhang, Xiangyu, Cheng, Jian, Sun, Jian, 2021. You only look one-level feature....
  • Dai, Xiyang, Chen, Yinpeng, Xiao, Bin, Chen, Dongdong, Liu, Mengchen, Yuan, Lu, Zhang, Lei, 2021. Dynamic head:...
  • Dai, Jifeng, Qi, Haozhi, Xiong, Yuwen, Li, Yi, Zhang, Guodong, Hu, Han, Wei, Yichen, 2017. Deformable convolutional...
  • Everingham, Mark, Winn, John, 2012. The PASCAL visual object classes challenge 2012 (VOC2012) development kit. In:...
  • Feng, Chengjian, et al. TOOD: Task-aligned one-stage object detection.

  • Girshick, Ross, Donahue, Jeff, Darrell, Trevor, Malik, Jitendra, 2014. Rich feature hierarchies for accurate object...
  • He, Kaiming, Gkioxari, Georgia, Dollar, Piotr, Girshick, Ross, 2017. Mask R-CNN. In: Proceedings of the IEEE...
  • Hou, Qibin, Zhou, Daquan, Feng, Jiashi, 2021. Coordinate attention for efficient mobile network design. In: Proceedings...
  • Hsu, Wei-Yen, et al., 2020. Ratio-and-scale-aware YOLO for pedestrian detection. IEEE Trans. Image Process.
  • Hu, Jie, Shen, Li, Sun, Gang, 2018a. Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on...
  • Hu, Xiaowei, et al., 2018. SINet: A scale-insensitive convolutional neural network for fast vehicle detection. IEEE Trans. Intell. Transp. Syst.
  • Hua, Jie, et al., 2021. Pedestrian- and vehicle-detection algorithm based on improved aggregated channel features. IEEE Access.
  • Huang, Xin, Ge, Zheng, Jie, Zequn, Yoshie, Osamu, 2020. Nms by representative region: Towards crowded pedestrian...
  • Kim, Jisu, et al., 2015. A novel on-road vehicle detection method using πHOG. IEEE Trans. Intell. Transp. Syst.
  • Li, Guofa, et al., 2020. Deep learning approaches on pedestrian detection in hazy weather. IEEE Trans. Ind. Electron.
  • Li, Jianxiang, et al., 2021. Target-guided feature super-resolution for vehicle detection in remote sensing images. IEEE Geosci. Remote Sens. Lett.
  • Lin, Tsung-Yi, Goyal, Priya, Girshick, Ross, He, Kaiming, Dollár, Piotr, 2017. Focal loss for dense object detection....
  • Lin, Che-Tsung, et al., 2020. GAN-based day-to-night image style transfer for nighttime vehicle detection. IEEE Trans. Intell. Transp. Syst.
  • Lin, Tsung-Yi, et al. Microsoft COCO: Common objects in context.

    Wendong Zhang received his B.S. and master’s degrees from Xinjiang University, Urumqi, China, in 1998 and 2005 respectively, and Ph.D. degree from Xi’an Jiaotong University, China, in 2019. He is currently working as an Associate Professor with Xinjiang University. His research interests include Edge Computing, IoT technology, Ad Hoc networks and Machine Learning.

    Hongyang Wei graduated from Chongqing University of Technology in Chongqing, China, with a bachelor’s degree in engineering, in 2018. Now he is a postgraduate majoring in software engineering at Xinjiang University, China, and his main research fields are target detection and semantic segmentation.

    Ruyi Cao received the B.S. degree in computer science and technology from Yanshan University, Qinhuangdao, China, in 2021. She is currently pursuing a master’s degree in engineering at the School of Software, Xinjiang University. Her research interests include computer vision and deep learning.

    Yan Ke received the B.B.A. degree from Jiangxi Normal University, Nanchang, China, in 2020. She is currently pursuing the M.S. degree with the Institute of Software in Xinjiang University. Her research interests include computer vision and deep learning.

    Yiwen Luo received the B.E. degree in computer science and technology from Hunan University of Finance and Economics, Changsha, China, in 2021. He is currently pursuing the M.S. degree with the Institute of Software in Xinjiang University. His research interests include machine learning and the vehicle routing problem with drones.
