D-NMS: A dynamic NMS network for general object detection
Introduction
Object detection, whose main task is to produce a tight bounding box around each predicted instance, has been widely researched and applied over the past decades. Driven by the progress of Deep Convolutional Neural Networks (DCNNs), many outstanding object detection frameworks have emerged [37], [49], such as Faster R-CNN [26], YOLO-Vx [23], [1], [24], [25], SSD [21], RetinaNet [18] and FCOS [33]. In these detection pipelines, Non-maximum Suppression (NMS) is an essential post-processing step that removes the redundant bounding boxes around a real object. Standard NMS ranks all candidate bounding boxes by confidence score and iteratively removes candidates whose Intersection-over-Union (IoU) with a higher-scoring retained box exceeds a manually chosen threshold. Although this mechanism is ingenious and straightforward, it has two inherent limitations.
The first is the mismatch between the confidence score and the overlap rate of bounding boxes. An inferred box with a high classification confidence score is expected to overlap strongly with the ground-truth box, but this does not always hold. To alleviate this problem, some works design novel loss functions that take the different overlaps of samples into account and are therefore better suited to the bounding box regression task [39], [27], [41]. A prediction branch for the overlap rate is another widely used solution to the mismatch problem [9], [12]. In addition, the correlation between the classification confidence score and the overlap rate of bounding boxes can be improved through learning [28]. However, the mismatch problem is not the main challenge, and its impact can be greatly reduced by using more refined feature extractors [37], [16], [13].
The second limitation is the NMS threshold selection mechanism. A fixed, empirically determined NMS threshold ignores the uniqueness of each input image, so the detector achieves only sub-optimal performance on the test dataset. To solve this problem, some methods cast NMS threshold selection as a convex optimization problem [30], [36], whose optimal solution is obtained by various swarm intelligence optimization algorithms. Moreover, adaptive NMS schemes based on a task-specific complexity metric have been developed for crowded scenes, such as pedestrian detection [20], [4] and vehicle detection [7]. NMS-free mechanisms [3], [48], [8], [31], [10], [40] are another solution that works at the level of network structure, but the performance of NMS-free detectors is still inferior to that of NMS-based ones in practical applications. Although previous works [20], [4], [7], [22] have demonstrated that adaptive NMS strategies guided by complexity achieve better detection performance in crowded scenes, there is no unified scene complexity measurement for the general object detection task. As a result, the scene complexity defined for those specific tasks is not suitable for general object detection.
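For reference, the standard greedy NMS with a single fixed threshold can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation used by any particular detector; the function names `iou` and `nms`, the `(x1, y1, x2, y2)` box format, and the default threshold of 0.5 are assumptions made for the example.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return the indices of kept boxes, highest score first."""
    # Rank all candidates by confidence score (descending).
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining candidate that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores, iou_threshold=0.5))
```

The threshold is the only free parameter here, and it is applied identically to every image, which is the behaviour the second limitation refers to.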
In this paper, we mainly investigate the relationship between the performance of an object detector and the NMS threshold, which we name scene complexity, and we propose an effective dynamic NMS mechanism for general object detection. As shown in Fig. 1, the number of preserved inference bounding boxes increases as the NMS threshold rises. Meanwhile, the best trade-off between Recall and Precision is achieved with a smaller NMS threshold in a simple picture, and with a larger NMS threshold in complicated scenes. Inspired by this observation, we first construct a unified scene complexity measurement based on the relationship between the detection performance and the NMS threshold. Unlike other scene complexity definitions [20], [4], [7], the proposed scene complexity measurement is related only to the NMS threshold. Then, a lightweight regression network embedded in the detection framework is built to predict the NMS threshold for each input image dynamically. To train this prediction branch network, the ground truth of the NMS threshold is constructed according to the proposed scene complexity. Experimental results show that the performance of the object detector is steadily improved by the dynamic NMS.
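The effect of the threshold on the number of preserved boxes, and the idea of choosing the threshold per image, can be illustrated with a small sketch. Everything below is hypothetical: the toy boxes, the helper names `greedy_nms`, `kept_counts` and `dynamic_nms`, and in particular the `predict_threshold` callable, which merely stands in for the regression branch and is not the network described in the paper.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, thr):
    """Standard greedy NMS; returns kept indices, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thr]
    return keep


def kept_counts(boxes, scores, thresholds):
    """Number of boxes that survive NMS at each candidate threshold."""
    return {t: len(greedy_nms(boxes, scores, t)) for t in thresholds}


def dynamic_nms(image, boxes, scores, predict_threshold):
    """Run NMS with a per-image threshold.

    `predict_threshold` is a hypothetical stand-in for a learned regressor
    that maps an image (or its features) to a threshold in (0, 1).
    """
    thr = predict_threshold(image)
    return greedy_nms(boxes, scores, thr), thr


if __name__ == "__main__":
    # Toy crowded scene: two heavily overlapping objects (indices 0 and 2),
    # each with one near-duplicate detection (indices 1 and 3).
    boxes = [(0, 0, 10, 10), (0, 0, 10, 9), (2, 0, 12, 10), (2, 0, 12, 9)]
    scores = [0.9, 0.6, 0.8, 0.5]
    print(kept_counts(boxes, scores, [0.3, 0.5, 0.7, 0.95]))
    print(dynamic_nms(None, boxes, scores, lambda img: 0.7))
```

In this toy scene a low threshold merges the two objects into one detection, a very high threshold keeps the duplicates, and an intermediate value keeps exactly the two true objects, which is the kind of per-image sweet spot a dynamic threshold is meant to find.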
In summary, the main contributions of this paper can be concluded as follows:
- We propose a unified scene complexity measurement for general object detection, which only depends on the NMS threshold.
- We construct a lightweight regression branch network based on the unified scene complexity measurement to predict the NMS threshold for each input image dynamically.
- Extensive experiments conducted on widely used and challenging datasets show that the proposed NMS scheme outperforms the standard NMS and its variants.
The rest of this paper is organized as follows. Section 2 briefly reviews recent work. Section 3 elaborates on the structure of the NMS prediction branch network and its training details. In Section 4, extensive experiments are conducted on the Pascal VOC and MS-COCO datasets to verify the effectiveness of the proposed method. Section 5 concludes the paper.
Section snippets
Related work
In this section, we briefly review relevant works including general object detection and non-maximum suppression.
Methodology
In this section, we first introduce a unified scene complexity measurement based on the relationship between the detection performance and the NMS threshold for general object detection. Then a lightweight regression network, which is parallel to the detection head, is constructed to predict the NMS threshold for each input image dynamically. To train the prediction branch network, the supervision label of the NMS threshold is calculated by the unified scene complexity measurement. The overview
Experiments
In this section, we evaluate the proposed method and compare it with the standard NMS and its state-of-the-art variants. We also embed our method into object detectors and compare the resulting detection performance. The comparative experiments are thoroughly evaluated on the Pascal VOC [6] and MS-COCO [19] datasets. We describe the experimental configuration in Section 4.1. The comparison of D-NMS with the standard NMS and its variants is described in Section 4.2. The experimental results on
Conclusion
In this paper, we have presented a dynamic NMS scheme for general object detection, named D-NMS. The relationship between the detection performance and the NMS threshold is used to define a unified scene complexity. We then construct a lightweight regression network, supervised by a label derived from the scene complexity, to dynamically predict the NMS threshold for each image. In the comparison experiments, our method clearly outperforms the standard NMS and its variants. Meanwhile,
CRediT authorship contribution statement
Hao Zhao: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft. Jikai Wang: Conceptualization, Writing – review & editing. Deyun Dai: Conceptualization, Writing – review & editing. Shiqi Lin: Conceptualization, Writing – review & editing. Zonghai Chen: Resources, Writing – review & editing, Supervision, Project administration, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported by the National Natural Science Foundation of China (Grant No. 91848111).
References (49)
- et al. Improved non-maximum suppression for object detection using harmony search algorithm. Appl. Soft Comput. (2019)
- et al. Recent advances in deep learning for object detection. Neurocomputing (2020)
- et al. Iou-uniform r-cnn: Breaking through the limitations of rpn. Pattern Recogn. (2021)
- A. Bochkovskiy, C.Y. Wang, H.Y.M. Liao, Yolov4: Optimal speed and accuracy of object detection, 2020. arXiv preprint...
- et al. Soft-nms–improving object detection with one line of code
- et al. End-to-end object detection with transformers. European conference on computer vision, Springer (2020)
- et al. Detection in crowded scenes: One proposal, multiple predictions
- et al. Centernet: Keypoint triplets for object detection
- et al. The pascal visual object classes (voc) challenge. International journal of computer vision (2010)
- N. Gählert, N. Hanselmann, U. Franke, J. Denzler, Visibility guided nms: Efficient boosting of amodal object detection...
- Fast convergence of detr with spatially modulated co-attention
- Relation networks for object detection
- Lightweight adversarial network for salient object detection. Neurocomputing
- Acquisition of localization confidence for accurate object detection
- Fine-grained vehicle type detection and recognition based on dense attention network. Neurocomputing
- Cornernet: Detecting objects as paired keypoints
- Novel up-scale feature aggregation for object detection in aerial images. Neurocomputing
- Feature pyramid networks for object detection
- Focal loss for dense object detection
- Microsoft coco: Common objects in context. European conference on computer vision, Springer
- Adaptive nms: Refining pedestrian detection in a crowd
- Ssd: Single shot multibox detector. European conference on computer vision, Springer
Cited by (6)
Field-matching attention network for object detection
2023, Neurocomputing
Unmanned aerial vehicles general aerial person-vehicle recognition based on improved YOLOv8s algorithm
2024, Computers, Materials and Continua
Research on Real-time Detection of Stacked Objects Based on Deep Learning
2023, Journal of Intelligent and Robotic Systems: Theory and Applications
Hao Zhao received his B.S. and M.S. degrees from the Southwest University of Science and Technology (SWUST) in 2014 and 2017. He is now a Ph.D. candidate in the Department of Automation, University of Science and Technology of China (USTC). His research interests include object detection, scene perception, meta-learning, and knowledge representation.
Jikai Wang received his B.S. degree from Yanshan University in 2014 and his Ph.D. degree from the University of Science and Technology of China in 2020. He is now a postdoctoral researcher in the Department of Automation, University of Science and Technology of China (USTC), China. His research interests include knowledge representation, intelligent information processing, robotics, visual SLAM, and machine learning.
Deyun Dai received her B.S. degree from Harbin Engineering University in 2016. She is currently a Ph.D. candidate in Control Science and Engineering in the Department of Automation, University of Science and Technology of China, Hefei, China. Her research interests include computer vision and environment perception in autonomous driving scenarios.
Shiqi Lin received his B.S. degree from Dalian Minzu University, Dalian, China, in 2017. He is now a Ph.D. candidate in the Department of Automation, University of Science and Technology of China (USTC). His research interests include state estimation, visual localization, and semantic scene understanding.
Zonghai Chen was born in Anhui, China, in 1963. He received the B.S. degree in automation and the M.E. degree in control theory and control engineering from the University of Science and Technology of China (USTC), Hefei, China, in 1988 and 1991, respectively. He has been a Professor with the Department of Automation, USTC, since 1998. His research interests include modeling and control of complex systems, intelligent robotics and information processing, energy management technologies for electric vehicles, and smart microgrids. Prof. Chen is a member of the Robotics Technical Committee and the Modelling, Identification and Signal Processing Technical Committee of the International Federation of Automatic Control (IFAC). He was a recipient of special allowances from the State Council of PR China.