Full Length Article
Dynamic Dual-Peak Network: A real-time human detection network in crowded scenes

https://doi.org/10.1016/j.jvcir.2021.103195

Abstract

Human detection in crowded scenes is challenging because the objects occlude and overlap each other. Compared to general pedestrian detection, there is also more variation in human posture. This paper proposes a real-time human detection network, Dynamic Dual-Peak Network (DDPNet), which specifically addresses human detection in overlapping and crowded scenes. We design a deep cascade fusion module to enhance the feature extraction capability of the anchor-free model for small objects in crowded scenes. In addition, a head–body dual-peak activation module uses low-occlusion components to raise the prediction score of the central region of an occluded individual. This strategy improves the network’s ability to discriminate individuals in crowded scenes and alleviates the problems caused by individual posture variation. Finally, we propose a novel Exhale–Inhale method that dynamically adjusts the feature mapping ranges for objects of various scales, which reduces the overlap of individual feature information during ground truth mapping. DDPNet achieves competitive performance on the CrowdHuman dataset and runs real-time inference roughly 3x to 7x faster than competitive methods.

Introduction

Human detection, which aims to predict a series of bounding boxes enclosing the human instances in an image, plays a crucial role in many real-world vision applications, such as abnormal behavior analysis in security surveillance, autonomous driving systems, and human pose detection in crowded scenes. Numerous solutions have been presented in recent work. As in general object detection, the past decades have witnessed a technical development from models relying on hand-crafted features [1], [2], [3] to Convolutional Neural Network (CNN) methods [4], [5], [6], [7], [8].

When the applications mentioned above are deployed, human instances in crowded scenes inevitably occlude each other. Occlusion has always been one of the most challenging problems in human detection [5]. Because objects crowd together in dense regions or overlap with objects of other categories, humans often suffer from inter-class or intra-class occlusion. Moreover, detectors may mistakenly treat the crowd as a whole or shift an object’s bounding box to another person. In addition, humans in crowded scenes vary widely in scale and pose.

Recent approaches to occluded pedestrian detection fall into two main types: (1) designing a novel loss function with additional penalties to produce more compact bounding boxes [7], [8]; (2) proposing a dedicated Non-Maximum Suppression (NMS) algorithm better suited to pedestrian detection in crowded scenes [11], [12]. However, there are specific differences between traditional pedestrian detection and the crowded human detection addressed in these articles. According to [13], [14], pedestrian targets vary less in posture than the human category does. In crowded scenes, the human body also covers a wider range of scales than a pedestrian object, as shown in Fig. 1. The figure shows the scale distributions obtained by counting the aspect ratios of all bounding boxes in the training sets of the pedestrian dataset CityPersons [9] and the human dataset CrowdHuman [10], respectively. This difference in scale variation highlights how the human target differs from the pedestrian target. Thus, a pedestrian detector cannot simply be reused for the human detection task. Furthermore, these works are all built on the anchor-based detection pipeline. Network training therefore requires a set of anchor boxes, hyper-parameters carefully designed by experts, and the forward inference of the model is relatively slow. These drawbacks make it difficult to deploy such models directly in real-world applications.
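The kind of statistic plotted in Fig. 1 can be gathered with a few lines of code. The following is a minimal sketch, assuming boxes are stored as (x, y, w, h) tuples and that the aspect ratio is width divided by height; the function name, the bin edges, and the sample boxes are hypothetical and are not taken from the paper or from either dataset.

```python
from bisect import bisect_right

def aspect_ratio_histogram(boxes, bin_edges):
    """Count boxes per aspect-ratio bin (assumed here to be width / height).

    boxes: iterable of (x, y, w, h); boxes with non-positive size are skipped.
    bin_edges: ascending edges; bin i covers [bin_edges[i], bin_edges[i + 1]).
    Ratios outside [bin_edges[0], bin_edges[-1]) are ignored.
    """
    counts = [0] * (len(bin_edges) - 1)
    for _, _, w, h in boxes:
        if w <= 0 or h <= 0:
            continue
        ratio = w / h
        i = bisect_right(bin_edges, ratio) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts

# Hypothetical boxes: two upright figures, one square-ish box, one lying figure.
sample_boxes = [(0, 0, 40, 100), (10, 5, 30, 90), (50, 50, 60, 60), (0, 0, 120, 50)]
edges = [0.0, 0.5, 1.0, 2.0, 4.0]
hist = aspect_ratio_histogram(sample_boxes, edges)
```

Running the same function over the training annotations of two datasets and comparing the resulting histograms would give a comparison in the spirit of Fig. 1, although the exact binning used for the figure is not stated here.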

Branch (A) in Fig. 2 represents the components of a general anchor-free detector head. When the detector head receives image features from the feature extractor, the predicted bounding boxes are produced jointly by the central heatmap and the scale regression. In contrast, our method (branch (B)) replaces the central heatmap and predicts the human center position by combining the human center and the human head via the dual-peak activation module. The yellow bounding box in the image marks a false negative of the human category, missed by the general anchor-free method. Our Dynamic Dual-Peak Network (DDPNet) enhances the responses of the object’s central area and alleviates the inability of anchor-free pipelines to cope with overlapping center points in crowded areas.
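To make the role of the central heatmap and the scale regression concrete, the following is a minimal sketch of how a generic center-based anchor-free head could be decoded into boxes. It illustrates the general pipeline of branch (A) only, not the DDPNet head; the function name, the stride of 4, the 3x3 peak test, and the threshold are assumptions chosen for illustration.

```python
def decode_centers(heatmap, sizes, stride=4, threshold=0.5):
    """Turn a center heatmap and a size map into bounding boxes.

    heatmap: 2D list of center scores in [0, 1].
    sizes:   2D list of (w, h) predictions in input-image pixels.
    Returns a list of (x1, y1, x2, y2, score) in input-image coordinates.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    boxes = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Keep only local peaks: the cell must be the maximum of its 3x3 neighbourhood.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            if score < max(neighbours):
                continue
            cx, cy = c * stride, r * stride  # map the cell back to image pixels
            w, h = sizes[r][c]
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score))
    return boxes

# Hypothetical 3 x 4 heatmap with two peaks and a constant size prediction.
heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.1, 0.6],
]
sizes = [[(8, 16)] * 4 for _ in range(3)]
detections = decode_centers(heatmap, sizes)
```

Because each heatmap cell can yield at most one box in this scheme, two people whose centers fall into the same or adjacent cells tend to collapse into a single detection, which is the overlapping-center problem described above.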

Considering these disadvantages, we propose DDPNet, a real-time one-stage anchor-free detection network that enhances human detection in crowded scenes by incorporating a dual-peak module to precisely locate the centers of partly occluded humans. When a full-body detector fails to identify an occluded person, the visible part may yield higher confidence and guide the detector to discriminate instances. As Fig. 2 illustrates, the human head in real-world images typically has a smaller scale and less overlap than the human body. Hence, it is more robust to pose variations and crowd occlusion, which is especially useful in crowded scenarios. A human detector may fail to distinguish the boundaries between people because instances partially overlap, which results in false positives. In this case, head features may considerably help to discriminate the instances, and false-positive human detections that are inconsistent with the head detections can be eliminated. Our model makes it possible to use the anchor-free method with faster inference and performance not inferior to the two-stage model. It remedies the shortcomings of the anchor-free method in crowded scenes to achieve real-time and concise human detection. The effectiveness of the proposed method is demonstrated by extensive experimental results on anchor-free detection benchmarks and widely used pedestrian detection benchmarks. DDPNet obtains competitive performance on the CrowdHuman dataset and runs roughly 3x to 7x faster than competitive techniques.
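The idea of discarding body detections that are inconsistent with head detections can be illustrated as a simple post-processing step. The sketch below is not the dual-peak activation module, which combines head and body cues inside the detector head; it is a hypothetical box-level filter in which a body box is kept only if it contains the center of a head box that no higher-scoring body has already claimed. The function name, the containment criterion, and the greedy matching are all assumptions.

```python
def filter_bodies_by_heads(bodies, heads):
    """Keep body boxes that can claim an unused head box.

    bodies: list of (x1, y1, x2, y2, score); heads: list of (x1, y1, x2, y2).
    Bodies are visited in descending score order; each head supports one body.
    """
    used = set()
    kept = []
    for body in sorted(bodies, key=lambda b: b[4], reverse=True):
        bx1, by1, bx2, by2, _ = body
        for i, (hx1, hy1, hx2, hy2) in enumerate(heads):
            if i in used:
                continue
            hcx, hcy = (hx1 + hx2) / 2, (hy1 + hy2) / 2
            # Consistency test (assumed): the head center lies inside the body box.
            if bx1 <= hcx <= bx2 and by1 <= hcy <= by2:
                used.add(i)
                kept.append(body)
                break
    return kept

# Hypothetical scene: two people, plus a third box that straddles both of them.
bodies = [(0, 0, 50, 120, 0.9), (40, 0, 90, 120, 0.8), (10, 0, 80, 120, 0.6)]
heads = [(15, 5, 35, 25), (55, 5, 75, 25)]
kept = filter_bodies_by_heads(bodies, heads)
```

In this example the straddling box is dropped because both heads are already claimed by higher-scoring bodies. A filter of this kind would also discard a true person whose head is hidden, which is one reason to treat it as an illustration rather than a recommended design.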

The main contributions of this work are as follows:

  • We propose a real-time human detection network, DDPNet, that explicitly addresses human detection in overlapping and crowded scenes. Exploiting the fact that a one-stage anchor-free network detects over feature maps with a low down-sampling rate, we design a deep cascade fusion module to enhance the small-object feature extraction capability and detection performance of the anchor-free model in crowded scenes.

  • We design a head–body dual-peak activation module to replace the general anchor-free detector head. By strengthening the human’s central response via low-occlusion components, it improves the network’s ability to discriminate human individuals in crowded scenes and alleviates the problems resulting from individual posture variation.

  • We propose a novel Exhale–Inhale method that dynamically adjusts the feature mapping ranges for objects of various scales, reducing the feature boundary blurring caused by crowding when the model extracts human features. Finally, experiments on the challenging CrowdHuman [10] detection dataset demonstrate the superiority of the proposed method.
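The effect of the feature mapping range on neighbouring targets can be seen in a small ground truth rendering experiment. The sketch below is not the Exhale–Inhale rule, whose details are not given in this excerpt; it renders generic Gaussian center targets whose spread is tied to object size through a hypothetical `range_factor`, and compares how much two adjacent targets bleed into each other for a wide and a narrow range.

```python
import math

def render_center_heatmap(objects, rows, cols, range_factor=0.15):
    """Render Gaussian center targets, one peak per object.

    objects: list of (cx, cy, w, h) in heatmap cells.
    range_factor: hypothetical knob; sigma = range_factor * min(w, h).
    Overlapping Gaussians are merged with an element-wise maximum.
    """
    heatmap = [[0.0] * cols for _ in range(rows)]
    for cx, cy, w, h in objects:
        sigma = max(range_factor * min(w, h), 1e-6)
        for r in range(rows):
            for c in range(cols):
                d2 = (c - cx) ** 2 + (r - cy) ** 2
                value = math.exp(-d2 / (2 * sigma ** 2))
                if value > heatmap[r][c]:
                    heatmap[r][c] = value
    return heatmap

# Two hypothetical neighbours whose centers are 4 cells apart on a 9 x 12 grid.
people = [(4, 4, 10, 10), (8, 4, 10, 10)]
wide = render_center_heatmap(people, 9, 12, range_factor=0.3)
narrow = render_center_heatmap(people, 9, 12, range_factor=0.1)

# The cell midway between the two centers (row 4, column 6) shows the bleed.
bleed_wide = wide[4][6]
bleed_narrow = narrow[4][6]
```

With the narrow range the target value between the two centers is much lower than with the wide range, so the two peaks stay separable. A method that adapts this range per object, as the contribution above describes, would trade peak separability against the amount of supervision each object receives.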

Section snippets

Anchor-based detectors

In anchor-based detectors, the anchor boxes can be regarded as predetermined sliding windows or proposals. The object’s location on the input image is regarded as the center of multiple anchor boxes, and these anchor boxes serve as references for the bounding box. The design of anchor boxes was popularized by Faster R-CNN [15] in its RPN, SSD [16], and YOLOv2 [17], and it has become the convention in modern detectors. RetinaNet [18] addresses the imbalance of negative and

Our approach

In this section, we present the proposed DDPNet for human detection in detail. The overall framework of DDPNet is composed of two essential parts: a feature extractor and an anchor-free detector head. First, we briefly introduce the Deep Cascading Fusion (DCF) module used in the extractor and illustrate how it enhances the network’s capacity to extract features of small objects. Then, we introduce the Head–Body Dual-Peak activation module, which is added to the

Experiments

In this section, we evaluate DDPNet on CrowdHuman, a human detection dataset proposed specifically for crowded scenarios, to verify the improvements proposed in this paper. We first introduce the dataset and then describe the implementation details of DDPNet during the training and inference phases. Finally, the performance of DDPNet is compared with state-of-the-art human detection networks.

Conclusion

In this work, we proposed DDPNet, an anchor-free one-stage detector. As shown in the experiments, DDPNet performs favorably against popular anchor-free one-stage detectors in crowded-scene object detection. Thanks to its anchor-free design, our model avoids all computation and hyper-parameters associated with anchor boxes. We hope that this work can contribute to real-time anchor-free detection applications in crowded scenes.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Project 61972321.

References (46)

  • P. Dollár et al., Fast feature pyramids for object detection, IEEE Trans. Pattern Anal. Mach. Intell. (2014)
  • W. Nam et al., Local decorrelation for improved detection (2014)
  • S. Zhang, R. Benenson, B. Schiele, et al. Filtered channel features for pedestrian detection, in: Proceedings of the...
  • L. Zhang et al., Is Faster R-CNN doing well for pedestrian detection?
  • S. Zhang, R. Benenson, M. Omran, J. Hosang, B. Schiele, How far are we from solving pedestrian detection? in:...
  • S. Zhang, J. Yang, B. Schiele, Occluded pedestrian detection through guided attention in CNNs, in: Proceedings of the...
  • X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, C. Shen, Repulsion loss: Detecting pedestrians in a crowd, in: Proceedings...
  • S. Zhang, L. Wen, X. Bian, Z. Lei, S.Z. Li, Occlusion-aware R-CNN: detecting pedestrians in a crowd, in: Proceedings of...
  • S. Zhang, R. Benenson, B. Schiele, CityPersons: A diverse dataset for pedestrian detection, in: Proceedings of the...
  • S. Shao et al., CrowdHuman: A benchmark for detecting human in a crowd (2018)
  • N. Bodla, B. Singh, R. Chellappa, L.S. Davis, Soft-NMS: Improving object detection with one line of code, in:...
  • S. Liu, D. Huang, Y. Wang, Adaptive NMS: Refining pedestrian detection in a crowd, in: Proceedings of the IEEE/CVF...
  • K. Zhang et al., Double anchor R-CNN for human detection in a crowd (2019)
  • C. Lin, J. Lu, G. Wang, J. Zhou, Graininess-aware deep feature learning for pedestrian detection, in: Proceedings of...
  • S. Ren et al., Faster R-CNN: Towards real-time object detection with region proposal networks
  • W. Liu et al., SSD: Single shot multibox detector
  • J. Redmon, A. Farhadi, YOLO9000: better, faster, stronger, in: Proceedings of the IEEE/CVF Conference on Computer...
  • T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the...
  • J. Redmon et al., YOLOv3: An incremental improvement (2018)
  • S. Zhang, L. Wen, X. Bian, Z. Lei, S.Z. Li, Single-shot refinement neural network for object detection, in: Proceedings...
  • Z. Qin, Z. Li, Z. Zhang, Y. Bao, G. Yu, Y. Peng, J. Sun, ThunderNet: Towards real-time generic object detection on...
  • R. Li, Y. Wang, F. Liang, H. Qin, J. Yan, R. Fan, Fully quantized network for object detection, in: Proceedings of the...
  • Z. Wang, Z. Wu, J. Lu, J. Zhou, BiDet: An efficient binarized object detector, in: Proceedings of the IEEE/CVF...

    This paper has been recommended for acceptance by Zicheng Liu.
