Dynamic Dual-Peak Network: A real-time human detection network in crowded scenes

doi:10.1016/j.jvcir.2021.103195

Journal of Visual Communication and Image Representation

Volume 79, August 2021, 103195

Abstract

Human detection in crowded scenes is challenging because people occlude and overlap each other. Compared to general pedestrian detection, there is also more variation in human posture. This paper proposes a real-time human detection network, Dynamic Dual-Peak Network (DDPNet), which specifically addresses human detection in overlapping and crowded scenes. We design a deep cascade fusion module to enhance the feature extraction capability of the anchor-free model for small objects in crowded scenes. In addition, a head–body dual-peak activation module improves the prediction score of the central region of an occluded individual by exploiting components with low occlusion. This strategy enhances the network’s ability to discriminate individuals in crowded scenes and alleviates the problems caused by individual posture variation. Finally, we propose a novel Exhale–Inhale method that dynamically adjusts the feature mapping ranges for objects of various scales, which reduces the overlap of individual feature information during ground truth mapping. DDPNet achieves competitive performance on the CrowdHuman dataset and runs real-time inference roughly 3x to 7x faster than competing methods.

Introduction

Human detection, which aims to predict a series of bounding boxes enclosing human instances in an image, plays a crucial role in many real-world vision applications, such as abnormal behavior analysis in security surveillance, autonomous driving systems, and human pose detection in crowded scenes. Numerous solutions have been presented in recent works. As with general object detection, the past decades have witnessed its technical development from models relying on hand-crafted features [1], [2], [3] to Convolutional Neural Network (CNN) methods [4], [5], [6], [7], [8].

When these applications are deployed, human instances in crowded scenes inevitably occlude each other. Occlusion remains one of the most challenging problems in human detection [5]. Because objects crowd together in dense regions or overlap with objects of other categories, humans often suffer from intra-class or inter-class occlusion. Moreover, detectors may mistakenly treat a crowd as a whole or shift an object’s bounding box to another person. In addition, humans in crowded scenes exhibit large variations in scale and pose.

Recent approaches to occluded pedestrian detection fall into two main types: (1) designing a novel loss function with additional penalties to produce more compact bounding boxes [7], [8]; (2) proposing a dedicated Non-Maximum Suppression (NMS) algorithm better suited to pedestrian detection in crowded scenes [11], [12]. However, traditional pedestrian detection differs from the crowded human detection discussed in these articles. According to [13], [14], pedestrian targets vary less in posture than the human category. In crowded scenes, the scale range of the human body is also wider than that of pedestrian objects, as shown in Fig. 1. The figure shows the scale distributions obtained by counting the aspect ratios of all bounding boxes in the training sets of the pedestrian dataset CityPersons [9] and the human dataset CrowdHuman [10], respectively. This difference in scale variation highlights the particularity of the human target compared to the pedestrian. Thus, a pedestrian detector cannot simply be reused for the human detection task. Furthermore, these works are all built on the anchor-based detection pipeline. Network training therefore requires a set of anchor-box hyper-parameters carefully designed by experts, and the forward inference of the model is relatively slow. These defects make it difficult to deploy such models directly in real-world applications.
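To illustrate why NMS needs special care in crowded scenes, the following minimal Python sketch implements conventional greedy NMS. It is not the method of [11] or [12]; the box coordinates, scores, and the IoU threshold of 0.5 are hypothetical values chosen only for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return the indices kept by standard greedy NMS, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # A box survives only if it does not overlap any kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical scene: boxes 0 and 1 are two different people standing
# close together; box 2 is an isolated person.
boxes = [(10, 10, 60, 160), (25, 10, 75, 160), (200, 20, 250, 170)]
scores = [0.95, 0.90, 0.80]

kept = greedy_nms(boxes, scores, iou_threshold=0.5)
print(kept)  # [0, 2]: the second person is suppressed as a duplicate
```

In this example the two neighbouring people overlap with an IoU of about 0.54, so the lower-scoring one is discarded even though it is a true detection. Raising the threshold would keep it, at the cost of admitting more duplicate boxes elsewhere.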

Branch (A) in Fig. 2 represents the components of a general anchor-free detector head. When the detector head receives image features from the feature extractor, the predicted bounding boxes are produced jointly by the central heatmap and the scale regression. In contrast, our method (branch (B)) replaces the central heatmap and predicts the human center position by combining the human center and the human head via the dual-peak activation module. The yellow bounding box in the image marks a false negative of the human category that the general anchor-free method misses. Our Dynamic Dual-Peak Network (DDPNet) enhances the responses in the object’s central area and alleviates the problem that anchor-free pipelines cannot cope with overlapping central points in crowded areas.
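The general anchor-free head of branch (A) can be sketched as follows, in the style of common center-based detectors. This is an illustrative sketch only and not DDPNet’s head: the function name `decode_centers`, the `stride` of 4, the score threshold, and the 3x3 peak-picking rule are all assumptions.

```python
def decode_centers(heatmap, sizes, stride=4, score_threshold=0.3):
    """Turn a center heatmap and a per-cell size map into boxes.

    heatmap: 2D list of center scores in [0, 1]
    sizes:   2D list of (width, height) in input-image pixels
    Returns a list of (x1, y1, x2, y2, score) tuples.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    detections = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < score_threshold:
                continue
            # Peak picking: keep a cell only if it is the maximum of its
            # 3x3 neighbourhood (clipped at the map border).
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            if score < max(neighbours):
                continue
            # Map the cell center back to input-image coordinates.
            cx, cy = (c + 0.5) * stride, (r + 0.5) * stride
            w, h = sizes[r][c]
            detections.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score))
    return detections


# Hypothetical 4x4 heatmap: the 0.6 response sits right next to the 0.9 peak,
# as would happen for a second person whose center is adjacent to the first.
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.6, 0.0],
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5],
]
sizes = [[(8, 16)] * 4 for _ in range(4)]

dets = decode_centers(heatmap, sizes)
print(len(dets))  # 2
```

Only two boxes are decoded: the 0.6 response is absorbed by the neighbouring 0.9 peak. This is the center-overlap limitation described above, where two crowded individuals with adjacent centers yield a single detection.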

Considering these disadvantages, we propose DDPNet, a real-time one-stage anchor-free detection network that enhances human detection in crowded scenes by incorporating a dual-peak module to precisely locate the centers of partly occluded humans. When a full-body detector fails to identify an occluded person, the visible part may yield higher confidence and guide the detector to discriminate instances. As Fig. 2 shows, the human head in real-world images typically has a smaller scale and less overlap than the human body. Hence, it is more robust to pose variation and crowd occlusion, which is especially effective in crowded scenarios. Because instances partially overlap, a human detector may be unable to distinguish the boundaries between people, resulting in false positives. In this case, head features can considerably help discriminate instances, and false-positive human detections that are inconsistent with the head detections can be eliminated. Our model retains the faster inference of the anchor-free approach while achieving performance non-inferior to two-stage models. It remedies the shortcomings of the anchor-free method in crowded scenes to realize real-time, concise human detection. The effectiveness of the proposed method is demonstrated by extensive experimental results on anchor-free detection benchmarks and widely used pedestrian detection benchmarks. DDPNet obtains competitive performance on the CrowdHuman dataset and runs roughly 3x to 7x faster than competing techniques.
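The idea of rejecting body detections that are inconsistent with head detections can be illustrated with a simple post-processing check. The sketch below is a stand-in under stated assumptions and is not DDPNet’s dual-peak activation module, which operates inside the network: the greedy one-head-per-body matching rule, the `top_fraction` parameter, and all coordinates are hypothetical.

```python
def head_inside_body(head, body, top_fraction=0.5):
    """True if the head box center lies in the upper part of the body box.

    Boxes are (x1, y1, x2, y2) with y increasing downwards.
    """
    hx = (head[0] + head[2]) / 2
    hy = (head[1] + head[3]) / 2
    upper_limit = body[1] + top_fraction * (body[3] - body[1])
    return body[0] <= hx <= body[2] and body[1] <= hy <= upper_limit


def filter_bodies_by_heads(bodies, heads, top_fraction=0.5):
    """Keep each body box that can claim a head not yet claimed by another.

    bodies is assumed to be sorted by descending confidence.
    """
    used = set()
    kept = []
    for body in bodies:
        for j, head in enumerate(heads):
            if j not in used and head_inside_body(head, body, top_fraction):
                used.add(j)
                kept.append(body)
                break
    return kept


# Hypothetical scene: two real people plus a spurious body box that
# straddles both of them; only two heads are visible.
bodies = [(10, 10, 60, 160), (25, 10, 75, 160), (18, 10, 68, 160)]
heads = [(25, 12, 45, 35), (40, 12, 60, 35)]

kept = filter_bodies_by_heads(bodies, heads)
print(len(kept))  # 2: the straddling box has no unclaimed head left
```

The third body box overlaps both people but cannot be paired with its own head, so it is dropped. A greedy pairing like this is order-dependent; a real system might use an optimal assignment instead.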

In summary, the main contributions of this work are as follows:

• We propose a real-time human detection network, DDPNet, that explicitly addresses human detection in overlapping and crowded scenes. Exploiting the fact that a one-stage anchor-free network detects on feature maps with a low down-sampling rate, we design a deep cascade fusion module to enhance the anchor-free model’s feature extraction capability and detection performance for small objects in crowded scenes.
• We design a head–body dual-peak activation module to replace the general anchor-free detector head. By improving the human central response via low-occlusion components, it improves the network’s ability to discriminate human individuals in crowded scenes and alleviates the problems that result from individual posture variation.
• We propose a novel Exhale–Inhale method that dynamically adjusts the feature mapping ranges for objects of various scales, reducing the feature boundary blurring caused by crowding when the model extracts human features. Finally, several experiments on the challenging CrowdHuman [10] detection dataset demonstrate the superiority of the proposed method.
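The notion of a scale-dependent feature mapping range during ground truth mapping can be made concrete with a generic Gaussian center target whose spread depends on object size. This sketch does not reproduce the Exhale–Inhale method, whose details are not given in this excerpt: the function `gaussian_target`, the `sigma_scale` and `min_sigma` parameters, and the example objects are all assumptions.

```python
import math


def gaussian_target(shape, centers, sigma_scale=0.1, min_sigma=0.5):
    """Render a center heatmap in which each object's spread follows its size.

    shape:   (rows, cols) of the output map
    centers: list of (row, col, box_width, box_height) in feature-map cells
    """
    rows, cols = shape
    heatmap = [[0.0] * cols for _ in range(rows)]
    for cr, cc, w, h in centers:
        # Smaller objects get a narrower Gaussian, so their targets
        # spill less into the cells of neighbouring objects.
        sigma = max(min_sigma, sigma_scale * min(w, h))
        for r in range(rows):
            for c in range(cols):
                dist_sq = (r - cr) ** 2 + (c - cc) ** 2
                value = math.exp(-dist_sq / (2 * sigma ** 2))
                # Where objects overlap, keep the stronger response.
                heatmap[r][c] = max(heatmap[r][c], value)
    return heatmap


# Hypothetical pair: a small person at column 2 and a large one at column 7.
hm = gaussian_target((5, 10), [(2, 2, 4, 8), (2, 7, 20, 40)])
print([round(v, 3) for v in hm[2]])
```

Along the center row, the small object’s response falls off within about one cell, while the large object’s response extends several cells. Tying the spread to scale in this way is one simple means of limiting how much neighbouring targets overlap.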

Section snippets

Anchor-based detectors

In anchor-based detectors, the anchor boxes can be regarded as predetermined sliding windows or proposals. The object’s location on the input image is regarded as the center of multiple anchor boxes, and these anchor boxes serve as references for the bounding box. The design of anchor boxes was popularized by Faster R-CNN [15] in its RPNs, SSD [16], and YOLOv2 [17], and it has become the convention in modern detectors. RetinaNet [18] addresses the imbalance of negative and

Our approach

In this section, we present the proposed DDPNet for human detection in detail. The overall framework design of our proposed DDPNet is composed of two essential parts: a feature extractor and an anchor-free detector head. First, we briefly introduce the Deep Cascading Fusion (DCF) module proposed in the extractor, illustrating its ability to enhance the network’s capacity to extract features of small objects. Then, we introduce the Head–Body Dual-Peak activation module, which is added to the

Experiments

In this section, we evaluate DDPNet on CrowdHuman, a human detection dataset proposed specifically for crowded scenarios, to verify the improvements proposed in this paper. We first introduce the dataset, then describe the implementation details of DDPNet during the training and inference phases. Finally, we compare the performance of DDPNet with state-of-the-art human detection networks.

Conclusion

In this work, we proposed DDPNet, an anchor-free one-stage detector. As shown in the experiments, DDPNet performs favorably against popular anchor-free one-stage detectors on object detection in crowded scenes. With its anchor-free design, our model avoids all computation and hyper-parameters associated with anchor boxes. We hope that this work can contribute to real-time anchor-free detection applications in crowded scenes.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Project 61972321.

References (46)

[1] P. Dollár et al., Fast feature pyramids for object detection, IEEE Trans. Pattern Anal. Mach. Intell. (2014)
[2] W. Nam et al., Local decorrelation for improved detection (2014)
[3] S. Zhang, R. Benenson, B. Schiele, et al., Filtered channel features for pedestrian detection, in: Proceedings of the...
[4] L. Zhang et al., Is Faster R-CNN doing well for pedestrian detection?
[5] S. Zhang, R. Benenson, M. Omran, J. Hosang, B. Schiele, How far are we from solving pedestrian detection? in:...
[6] S. Zhang, J. Yang, B. Schiele, Occluded pedestrian detection through guided attention in CNNs, in: Proceedings of the...
[7] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, C. Shen, Repulsion loss: Detecting pedestrians in a crowd, in: Proceedings...
[8] S. Zhang, L. Wen, X. Bian, Z. Lei, S.Z. Li, Occlusion-aware R-CNN: Detecting pedestrians in a crowd, in: Proceedings of...
[9] S. Zhang, R. Benenson, B. Schiele, CityPersons: A diverse dataset for pedestrian detection, in: Proceedings of the...
[10] S. Shao et al., CrowdHuman: A benchmark for detecting human in a crowd (2018)
[11] N. Bodla, B. Singh, R. Chellappa, L.S. Davis, Soft-NMS: Improving object detection with one line of code, in:...
[12] S. Liu, D. Huang, Y. Wang, Adaptive NMS: Refining pedestrian detection in a crowd, in: Proceedings of the IEEE/CVF...
[13] K. Zhang et al., Double anchor R-CNN for human detection in a crowd (2019)
[14] C. Lin, J. Lu, G. Wang, J. Zhou, Graininess-aware deep feature learning for pedestrian detection, in: Proceedings of...
[15] S. Ren et al., Faster R-CNN: Towards real-time object detection with region proposal networks
[16] W. Liu et al., SSD: Single shot multibox detector
[17] J. Redmon, A. Farhadi, YOLO9000: Better, faster, stronger, in: Proceedings of the IEEE/CVF Conference on Computer...
[18] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the...
[19] J. Redmon et al., YOLOv3: An incremental improvement (2018)
[20] S. Zhang, L. Wen, X. Bian, Z. Lei, S.Z. Li, Single-shot refinement neural network for object detection, in: Proceedings...
[21] Z. Qin, Z. Li, Z. Zhang, Y. Bao, G. Yu, Y. Peng, J. Sun, ThunderNet: Towards real-time generic object detection on...
[22] R. Li, Y. Wang, F. Liang, H. Qin, J. Yan, R. Fan, Fully quantized network for object detection, in: Proceedings of the...
[23] Z. Wang, Z. Wu, J. Lu, J. Zhou, BiDet: An efficient binarized object detector, in: Proceedings of the IEEE/CVF...

^☆: This paper has been recommended for acceptance by Zicheng Liu.
