Elsevier

Neurocomputing

Volume 512, 1 November 2022, Pages 225-234

D-NMS: A dynamic NMS network for general object detection

https://doi.org/10.1016/j.neucom.2022.09.080

Highlights

  • We propose a unified scene complexity metric that depends only on the NMS threshold.

  • We build a lightweight regression branch to dynamically predict the NMS threshold.

  • Extensive experiments demonstrate the effectiveness of the proposed NMS scheme.

Abstract

Non-maximum Suppression (NMS), which selects the best predictions among all candidate bounding boxes, is an important post-processing step in most state-of-the-art object detectors. The fixed threshold of standard NMS treats every input image identically and therefore ignores the individual characteristics of each image. Recently, several adaptive NMS methods have been proposed and shown to be superior to standard NMS with a fixed threshold. However, the adaptability of these methods is limited because they lack an adequate measure of the complexity of the input image. In this paper, we propose a dynamic NMS network (D-NMS net) that predicts the best NMS threshold for each input image and can be embedded into most state-of-the-art single-stage object detectors. Concretely, we first propose a unified scene complexity definition for a single image based on how the P-R curve changes with the NMS threshold. Secondly, we calculate the optimal NMS threshold for each image according to the proposed definition and use it as the supervision label in the training stage. Lastly, we embed the lightweight regression network, D-NMS net, into mainstream object detectors. Extensive experiments are conducted on challenging datasets. With D-NMS net, both the accuracy and the efficiency of the detectors improve markedly. On Pascal VOC, the mean Average Precision (mAP) of RetinaNet is boosted from 81.60% to 84.74%, and the mAP of FCOS is improved from 79.12% to 84.20%. On MS-COCO, the Average Precision (AP) of RetinaNet is boosted from 36.4% to 38.5%, and the AP of FCOS is improved from 37.2% to 39.1%. Meanwhile, the inference speed is increased by up to 62%.

Introduction

Object detection, whose main task is to obtain a tight bounding box covering each predicted instance, has been widely researched and applied over the past decades. Driven by the great progress of Deep Convolutional Neural Networks (DCNNs), many outstanding object detection frameworks have emerged [37], [49], such as Faster R-CNN [26], YOLO-Vx [23], [1], [24], [25], SSD [21], RetinaNet [18] and FCOS [33]. In these detection pipelines, Non-maximum Suppression (NMS) is an essential post-processing step that removes the redundant bounding boxes around a real object. The standard NMS ranks all candidate bounding boxes by their confidence scores and iteratively removes candidates whose Intersection-over-Union (IoU) with the currently selected box exceeds a manually chosen threshold. Although this mechanism is ingenious and straightforward, it has two inherent limitations.
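The standard NMS procedure described above can be illustrated with a minimal pure-Python sketch. It is not code from the paper: the box format `(x1, y1, x2, y2)` and the names `iou` and `greedy_nms` are assumptions made for this illustration.

```python
from typing import List, Sequence

Box = Sequence[float]  # assumed format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float], iou_threshold: float) -> List[int]:
    """Return the indices of the kept boxes, highest score first."""
    # Rank all candidates by confidence score, best first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Remove every remaining candidate whose overlap with the
        # selected box exceeds the fixed, manually chosen threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

In this sketch the same `iou_threshold` is applied to every image, which is exactly the fixed-threshold behaviour that the second limitation below is concerned with.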

The first is the mismatch between the confidence score and the overlap rate of bounding boxes. A predicted box with a high classification confidence score is expected to have a high overlap rate with the ground-truth box, but this is not always the case. To alleviate this problem, some works design novel loss functions that take the different overlaps of samples into account and are therefore better suited to the bounding box regression task [39], [27], [41]. A prediction branch for the overlap rate is another widely used solution to the mismatch problem [9], [12]. Moreover, the correlation between the classification confidence score and the overlap rate of bounding boxes can be effectively improved by learning-based approaches [28]. However, the mismatch problem is not the main challenge, and its impact can be greatly reduced by using more refined feature extractors [37], [16], [13].

The second limitation is the NMS threshold selection mechanism. A fixed, empirically determined NMS threshold ignores the uniqueness of each input image, so the detector obtains only sub-optimal detection performance on the test dataset. To solve this problem, some methods transform the NMS threshold selection task into a convex optimization problem [30], [36], whose optimal solution is obtained by various swarm intelligence optimization algorithms. Moreover, adaptive NMS schemes based on a task-specific complexity metric have been exploited to deal with crowded scenes, such as pedestrian detection [20], [4] and vehicle detection [7]. The NMS-free mechanism [3], [48], [8], [31], [10], [40] is another solution that modifies the network structure, but the performance of NMS-free detectors is still inferior to that of NMS-based ones in practical applications. Although previous works [20], [4], [7], [22] have demonstrated that complexity-guided adaptive NMS strategies can achieve better detection performance in crowded scenes, there is no unified scene complexity measurement for the general object detection task. As a result, the scene complexity defined for specific tasks is not suitable for general object detection.

In this paper, we mainly investigate the relationship between the performance of an object detector and the NMS threshold, which we name scene complexity, and we propose an effective dynamic NMS mechanism for general object detection. As shown in Fig. 1, the number of preserved predicted bounding boxes increases as the NMS threshold rises. Meanwhile, the best trade-off between Recall and Precision is achieved with a smaller NMS threshold in a simple picture, but with a larger NMS threshold in complicated scenes. Inspired by this observation, we first construct a unified scene complexity measurement based on the relationship between the detection performance and the NMS threshold. Unlike other scene complexity definitions [20], [4], [7], the proposed measurement is related only to the NMS threshold. Then, a lightweight regression network embedded in the detection framework is built to predict the NMS threshold for each input image dynamically. To train this prediction branch, the ground truth of the NMS threshold is constructed according to the proposed scene complexity. Experimental results show that the performance of the object detector is steadily improved by the dynamic NMS.
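The observation that simple and complicated scenes favour different thresholds can be reproduced with a small threshold sweep. The sketch below is only an illustration under stated assumptions: this excerpt does not give the paper's P-R-based label definition, so a per-image F1 score stands in for it, and the helper names (`f1_at_threshold`, `best_threshold`), the matching IoU of 0.5, and the candidate grid are all hypothetical.

```python
from typing import List, Sequence

Box = Sequence[float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float], thr: float) -> List[int]:
    """Standard greedy NMS; returns kept indices, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thr]
    return keep


def f1_at_threshold(boxes: List[Box], scores: List[float], gt_boxes: List[Box],
                    nms_thr: float, match_iou: float = 0.5) -> float:
    """Per-image F1 after NMS (a stand-in for the paper's P-R-based criterion)."""
    kept = greedy_nms(boxes, scores, nms_thr)
    matched = set()
    tp = 0
    for i in kept:  # kept is already sorted by descending score
        best_j, best_overlap = -1, match_iou
        for j, gt in enumerate(gt_boxes):
            overlap = iou(boxes[i], gt)
            if j not in matched and overlap >= best_overlap:
                best_j, best_overlap = j, overlap
        if best_j >= 0:  # each ground-truth box can be matched only once
            matched.add(best_j)
            tp += 1
    fp = len(kept) - tp
    fn = len(gt_boxes) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(boxes: List[Box], scores: List[float], gt_boxes: List[Box],
                   candidates: Sequence[float]) -> float:
    """Smallest candidate threshold that maximizes the per-image F1."""
    return max(sorted(candidates),
               key=lambda t: f1_at_threshold(boxes, scores, gt_boxes, t))


# Two heavily overlapping detections (IoU of about 0.68).
dets = [(0, 0, 10, 10), (1, 1, 11, 11)]
confs = [0.9, 0.8]
grid = [0.3, 0.5, 0.7, 0.9]
simple_gt = [(0, 0, 10, 10)]                    # one object: second detection is a duplicate
crowded_gt = [(0, 0, 10, 10), (1, 1, 11, 11)]   # two overlapping objects: both detections are correct
simple_label = best_threshold(dets, confs, simple_gt, grid)
crowded_label = best_threshold(dets, confs, crowded_gt, grid)
```

With these toy inputs the same pair of detections is best served by a low threshold when only one object is present and by a higher one when two overlapping objects are present, which is the kind of per-image target a threshold-prediction branch could be trained on.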

In summary, the main contributions of this paper can be summarized as follows:

  • We propose a unified scene complexity measurement for general object detection, which only depends on the NMS threshold.

  • We construct a lightweight regression branch network based on the unified scene complexity measurement to predict the NMS threshold for each input image dynamically.

  • Extensive experiments conducted on the widely used and challenging datasets show that the proposed NMS scheme outperforms the standard NMS and its variants.
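To make the second contribution more concrete, a regression branch that maps image features to a single threshold could look roughly like the sketch below. The architecture is not specified in this excerpt, so everything here is an assumption: global average pooling followed by one linear layer and a sigmoid, the output range `[0.3, 0.9]`, and the names `global_average_pool` and `predict_nms_threshold`.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # assumed layout: channels x height x width


def global_average_pool(feature_map: FeatureMap) -> List[float]:
    """Reduce each channel of the feature map to its mean value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def predict_nms_threshold(feature_map: FeatureMap, weights: List[float], bias: float,
                          low: float = 0.3, high: float = 0.9) -> float:
    """Hypothetical head: pooled features -> linear layer -> sigmoid -> [low, high]."""
    pooled = global_average_pool(feature_map)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    gate = 1.0 / (1.0 + math.exp(-logit))  # squashes the logit into (0, 1)
    return low + (high - low) * gate       # rescales to the assumed threshold range
```

Bounding the output keeps the predicted value inside a usable IoU range; in training, such a head would be regressed against a per-image threshold label.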

The rest of this paper is organized as follows. Section 2 briefly reviews recent work. Section 3 elaborates on the structure of the NMS prediction branch network and its training details. Section 4 reports extensive experiments on the Pascal VOC and MS-COCO datasets that verify the effectiveness of the proposed method. Section 5 draws the conclusion.

Section snippets

Related work

In this section, we briefly review relevant works including general object detection and non-maximum suppression.

Methodology

In this section, we first introduce a unified scene complexity measurement based on the relationship between the detection performance and the NMS threshold for general object detection. Then a lightweight regression network, which is parallel to the detection head, is constructed to predict the NMS threshold for each input image dynamically. To train the prediction branch network, the supervision label of the NMS threshold is calculated by the unified scene complexity measurement. The overview

Experiments

In this section, we evaluate the proposed method and compare it with the state-of-the-art NMS and its variants. We also embed our method into object detectors and compare their detection performance. The comparative experiments are conducted on the Pascal VOC [6] and MS-COCO [19] datasets. We describe the experimental configuration in Section 4.1. The comparison of D-NMS with standard NMS and its variants is described in Section 4.2. The experimental results on

Conclusion

In this paper, we have presented a dynamic NMS scheme for general object detection, named D-NMS. The relationship between the detection performance and the NMS threshold is used to define a unified scene complexity. We construct a lightweight regression network, supervised by a label derived from the scene complexity, to dynamically predict the NMS threshold for each image. In the comparison experiments, our method clearly outperforms the standard NMS and its variants. Meanwhile,

CRediT authorship contribution statement

Hao Zhao: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft. Jikai Wang: Conceptualization, Writing – review & editing. Deyun Dai: Conceptualization, Writing – review & editing. Shiqi Lin: Conceptualization, Writing – review & editing. Zonghai Chen: Resources, Writing – review & editing, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (Grant No. 91848111).

Hao Zhao received his B.S. and M.S. degrees from the Southwest University of Science and Technology (SWUST) in 2014 and 2017. He is now a Ph.D. candidate in the Department of Automation, University of Science and Technology of China (USTC). His research interests include object detection, scene perception, meta-learning, and knowledge representation.

References (49)

  • Y. Song et al., Improved non-maximum suppression for object detection using harmony search algorithm, Appl. Soft Comput. (2019)
  • X. Wu et al., Recent advances in deep learning for object detection, Neurocomputing (2020)
  • L. Zhu et al., IoU-uniform R-CNN: Breaking through the limitations of RPN, Pattern Recogn. (2021)
  • A. Bochkovskiy, C.Y. Wang, H.Y.M. Liao, YOLOv4: Optimal speed and accuracy of object detection, 2020. arXiv preprint...
  • N. Bodla et al., Soft-NMS – improving object detection with one line of code
  • N. Carion et al., End-to-end object detection with transformers, European conference on computer vision, Springer (2020)
  • X. Chu et al., Detection in crowded scenes: One proposal, multiple predictions
  • K. Duan et al., CenterNet: Keypoint triplets for object detection
  • M. Everingham et al., The Pascal visual object classes (VOC) challenge, International journal of computer vision (2010)
  • N. Gählert, N. Hanselmann, U. Franke, J. Denzler, Visibility guided NMS: Efficient boosting of amodal object detection...
  • P. Gao et al., Fast convergence of DETR with spatially modulated co-attention
  • Y. He, X. Zhang, M. Savvides, K. Kitani, Softer-NMS: Rethinking bounding box regression for accurate object detection,...
  • H. Hu et al., Relation networks for object detection
  • L. Huang et al., Lightweight adversarial network for salient object detection, Neurocomputing (2019)
  • B. Jiang et al., Acquisition of localization confidence for accurate object detection
  • X. Ke et al., Fine-grained vehicle type detection and recognition based on dense attention network, Neurocomputing (2020)
  • H. Law et al., CornerNet: Detecting objects as paired keypoints
  • Z. Li, F. Zhou, FSSD: feature fusion single shot multibox detector, 2017. arXiv preprint...
  • H. Lin et al., Novel up-scale feature aggregation for object detection in aerial images, Neurocomputing (2020)
  • T.Y. Lin et al., Feature pyramid networks for object detection
  • T.Y. Lin et al., Focal loss for dense object detection
  • T.Y. Lin et al., Microsoft COCO: Common objects in context, European conference on computer vision, Springer (2014)
  • S. Liu et al., Adaptive NMS: Refining pedestrian detection in a crowd
  • W. Liu et al., SSD: Single shot multibox detector, European conference on computer vision, Springer (2016)

Jikai Wang received his B.S. and Ph.D. degrees from Yanshan University in 2014 and the University of Science and Technology of China in 2020, respectively. He is now a post-doctoral researcher in the Department of Automation, University of Science and Technology of China (USTC), China. His research interests include knowledge representation, intelligent information processing, robotics, visual SLAM, and machine learning.

Deyun Dai received her B.S. degree from Harbin Engineering University in 2016. She is currently a Ph.D. candidate in Control Science and Engineering in the Department of Automation, University of Science and Technology of China, Hefei, China. Her research interests include computer vision and environment perception in autonomous driving scenarios.

Shiqi Lin received his B.S. degree from Dalian Minzu University, Dalian, China, in 2017. He is now a Ph.D. candidate in the Department of Automation, University of Science and Technology of China (USTC). His research interests include state estimation, visual localization, and semantic scene understanding.

Zonghai Chen was born in Anhui, China, in 1963. He received the B.S. degree in automation and the M.E. degree in control theory and control engineering from the University of Science and Technology of China (USTC), Hefei, China, in 1988 and 1991, respectively. He has been a Professor with the Department of Automation, USTC, since 1998. His research interests include modeling and control of complex systems, intelligent robotics and information processing, energy management technologies for electric vehicles, and smart microgrids. Prof. Chen is a member of the Robotics Technical Committee and the Modelling, Identification and Signal Processing Technical Committee of the International Federation of Automatic Control (IFAC). He was a recipient of special allowances from the State Council of PR China.
