D-NMS: A dynamic NMS network for general object detection

doi:10.1016/j.neucom.2022.09.080

Neurocomputing

Volume 512, 1 November 2022, Pages 225-234

https://doi.org/10.1016/j.neucom.2022.09.080

Highlights

• We propose a unified scene complexity metric that depends only on the NMS threshold.
• We build a lightweight regression branch to dynamically predict the NMS threshold.
• Extensive experiments demonstrate the effectiveness of the proposed NMS scheme.

Abstract

Non-maximum Suppression (NMS), which selects the best predictions among all candidate bounding boxes, is an important post-processing step in most state-of-the-art object detectors. Standard NMS applies one fixed threshold to every input image and therefore ignores the characteristics of each individual image. Recently, several adaptive NMS methods have been proposed and shown to be superior to standard NMS with a fixed threshold. However, the adaptability of these methods is limited because they lack an adequate measure of the complexity of the input image. In this paper, we propose a dynamic NMS network (D-NMS net) that predicts the best NMS threshold for each input image and can be embedded into most state-of-the-art single-stage object detectors. Concretely, we first propose a unified scene complexity definition for a single image according to the relationship between the P-R curve and the changing NMS threshold. Second, we calculate the optimal NMS threshold for each image according to the proposed definition and use it as the supervision label in the training stage. Lastly, we embed the lightweight regression network, D-NMS net, into mainstream object detectors. Extensive experiments are conducted on challenging datasets. With D-NMS net, both the accuracy and the efficiency of the detectors improve. On Pascal VOC, the mean Average Precision (mAP) of RetinaNet is boosted from 81.60% to 84.74%, and the mAP of FCOS is improved from 79.12% to 84.20%. On MS-COCO, the Average Precision (AP) of RetinaNet is boosted from 36.4% to 38.5%, and the AP of FCOS is improved from 37.2% to 39.1%. Meanwhile, the inference speed of our method is increased by up to 62%.

Introduction

Object detection, whose main task is to obtain a tight bounding box covering each predicted instance, has been widely researched and applied over the past decades. Driven by the great progress of Deep Convolutional Neural Networks (DCNNs), many outstanding object detection frameworks have emerged [37], [49], such as Faster R-CNN [26], YOLO-Vx [23], [1], [24], [25], SSD [21], RetinaNet [18] and FCOS [33]. In these detection pipelines, Non-maximum Suppression (NMS) is an essential post-processing step that removes the redundant bounding boxes around a real object. Standard NMS ranks all candidate bounding boxes by their confidence score and iteratively removes candidates whose Intersection-over-Union (IoU) with the currently selected box exceeds a manually chosen threshold. Although this mechanism is ingenious and straightforward, it has two inherent limitations.
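The standard NMS procedure described above can be sketched as follows. This is a minimal pure-Python illustration, not the implementation evaluated in the paper; the corner box format (x1, y1, x2, y2) and the names `iou` and `standard_nms` are assumptions made for the example.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2); this corner format is an assumption


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def standard_nms(boxes: List[Box], scores: List[float], iou_threshold: float) -> List[int]:
    """Greedy NMS: return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining candidate that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(standard_nms(boxes, scores, iou_threshold=0.5))  # -> [0, 2]
print(standard_nms(boxes, scores, iou_threshold=0.7))  # -> [0, 1, 2]
```

The first two boxes overlap with an IoU of 81/119 (about 0.68), so the second one is suppressed at a threshold of 0.5 but survives at 0.7. The same fixed threshold is applied to every image, which is the behaviour the second limitation below is concerned with.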

The first is the mismatch between the confidence score and the overlap rate of bounding boxes. A predicted box with a high classification confidence score is expected to have a high overlap rate with the ground-truth box, but this does not always hold. To alleviate this problem, some works design novel loss functions that take the different overlaps of samples into account and are therefore better suited to the bounding box regression task [39], [27], [41]. Adding a prediction branch for the overlap rate is another widely used way to handle the mismatch [9], [12]. Moreover, the correlation between the classification confidence score and the overlap rate of bounding boxes can be effectively improved through learning [28]. However, the mismatch problem is not the main challenge, and its impact can be greatly reduced by using more refined feature extractors [37], [16], [13].

The second limitation is the NMS threshold selection mechanism. A fixed NMS threshold chosen by experience ignores the uniqueness of each input image, so the detector obtains only sub-optimal detection performance on the test dataset. To solve this problem, some methods transform the NMS threshold selection task into a convex optimization problem [30], [36], whose optimal solution is obtained by various swarm intelligence optimization algorithms. Moreover, adaptive NMS schemes based on a task-specific complexity metric have been exploited to deal with crowded scenes, such as pedestrian detection [20], [4] and vehicle detection [7]. The NMS-free mechanism [3], [48], [8], [31], [10], [40] is another solution that modifies the network structure, but the performance of NMS-free detectors is still inferior to that of NMS-based ones in practical applications. Although previous works [20], [4], [7], [22] have demonstrated that adaptive NMS strategies guided by complexity can achieve better detection performance in crowded scenes, there is no unified scene complexity measurement for the general object detection task. As a result, the scene complexity defined for those specific tasks is not suitable for general object detection.

In this paper, we mainly investigate the relationship between the performance of an object detector and the NMS threshold, which we name scene complexity, and we propose an effective dynamic NMS mechanism for general object detection. As shown in Fig. 1, the number of preserved bounding boxes increases as the NMS threshold rises. Meanwhile, the trade-off between Recall and Precision is achieved with a smaller NMS threshold in a simple picture and with a larger NMS threshold in complicated scenes. Inspired by this observation, we first construct a unified scene complexity measurement based on the relationship between the detection performance and the NMS threshold. Different from other scene complexity definitions [20], [4], [7], the proposed scene complexity measurement is related only to the NMS threshold. Then, a lightweight regression network embedded in the detection framework is built to predict the NMS threshold for each input image dynamically. To train this prediction branch, the ground truth of the NMS threshold is constructed according to the proposed scene complexity. Experimental results show that the performance of the object detector is steadily improved by the dynamic NMS.
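The observation that simple and complicated scenes favour different NMS thresholds can be illustrated with a small threshold sweep. The sketch below is only a stand-in: the exact scene complexity definition and label construction are not given in this excerpt, so the selection criterion used here (the lowest candidate threshold that maximizes F1 at a matching IoU of 0.5) is an assumption, as are the function names and the toy boxes.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2); this corner format is an assumption


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter = (max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
             * max(0.0, min(a[3], b[3]) - max(a[1], b[1])))
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], thr: float) -> List[int]:
    """Greedy NMS returning kept indices in descending score order."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thr]
    return keep


def f1_after_nms(boxes, scores, gts, thr, match_iou=0.5) -> float:
    """F1 of the boxes surviving NMS at `thr`, matched greedily to ground truth."""
    kept = nms(boxes, scores, thr)
    unmatched = list(range(len(gts)))
    tp = 0
    for i in kept:
        hit = next((g for g in unmatched if iou(boxes[i], gts[g]) >= match_iou), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(kept), tp / len(gts)
    return 2 * precision * recall / (precision + recall)


def best_threshold(boxes, scores, gts, candidates) -> float:
    """Hypothetical label: lowest candidate threshold that reaches the highest F1."""
    return max(candidates, key=lambda t: (f1_after_nms(boxes, scores, gts, t), -t))


dets = [(0, 0, 10, 10), (1, 1, 11, 11)]          # two heavily overlapping detections
det_scores = [0.9, 0.8]
candidates = [0.3, 0.5, 0.7]
simple_gts = [(0, 0, 10, 10)]                    # one object: second box is a duplicate
crowded_gts = [(0, 0, 10, 10), (1, 1, 11, 11)]   # two objects: both boxes are correct

print(best_threshold(dets, det_scores, simple_gts, candidates))   # -> 0.3
print(best_threshold(dets, det_scores, crowded_gts, candidates))  # -> 0.7
```

With identical detections, the single-object scene is served best by a low threshold that removes the duplicate, while the two-object scene needs a high threshold so that the second true positive is not suppressed. A per-image value of this kind is what a regression branch could be trained to predict.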

In summary, the main contributions of this paper are as follows:

• We propose a unified scene complexity measurement for general object detection, which depends only on the NMS threshold.
• We construct a lightweight regression branch network based on the unified scene complexity measurement to predict the NMS threshold for each input image dynamically.
• Extensive experiments conducted on widely used and challenging datasets show that the proposed NMS scheme outperforms the standard NMS and its variants.
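To make the idea of a per-image threshold regression branch concrete, the following sketch shows one possible data flow from a feature map to an NMS threshold. The branch architecture is not described in this excerpt, so everything here is hypothetical: the global average pooling, the single linear layer, the sigmoid, the output range [0.3, 0.7], and the names `global_average_pool` and `predict_nms_threshold` are assumptions chosen only for illustration.

```python
import math
from typing import List, Sequence

FeatureMap = Sequence[Sequence[Sequence[float]]]  # (channels, height, width)


def global_average_pool(feature_map: FeatureMap) -> List[float]:
    """Reduce each channel to its mean value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def predict_nms_threshold(feature_map: FeatureMap, weights: Sequence[float],
                          bias: float, low: float = 0.3, high: float = 0.7) -> float:
    """Hypothetical regression head: pooled features -> linear layer -> sigmoid."""
    pooled = global_average_pool(feature_map)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    gate = 1.0 / (1.0 + math.exp(-logit))
    # Map the sigmoid output into [low, high] so it is always a usable IoU threshold.
    return low + (high - low) * gate


# Toy 2-channel, 2x2 feature map and made-up (untrained) parameters.
feature_map = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
threshold = predict_nms_threshold(feature_map, weights=[2.0, -1.0], bias=0.0)
print(round(threshold, 3))
```

At inference time the predicted value would take the place of the manually chosen IoU threshold in the NMS step, so each image is post-processed with its own threshold.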

The rest of this paper is organized as follows. Section 2 briefly reviews recent work. Section 3 elaborates on the structure of the NMS prediction branch network and its training details. In Section 4, extensive experiments are conducted on the Pascal VOC and MS-COCO datasets to verify the effectiveness of the proposed method. Section 5 concludes the paper.

Section snippets

Related work

In this section, we briefly review relevant works including general object detection and non-maximum suppression.

Methodology

In this section, we first introduce a unified scene complexity measurement based on the relationship between the detection performance and the NMS threshold for general object detection. Then a lightweight regression network, which is parallel to the detection head, is constructed to predict the NMS threshold for each input image dynamically. To train the prediction branch network, the supervision label of the NMS threshold is calculated by the unified scene complexity measurement. The overview

Experiments

In this section, we evaluate the proposed method and compare it with the state-of-the-art NMS and its variants. We also embed our method into object detectors and compare the detection performance. The comparative experiments are thoroughly evaluated on the Pascal VOC [6] and MS-COCO [19] datasets. We describe the experimental configuration in Section 4.1. The comparison results for D-NMS, standard NMS and its variants are described in Section 4.2. The experimental results on

Conclusion

In this paper, we have presented a dynamic NMS scheme for general object detection, named D-NMS. The relationship between the detection performance and the NMS threshold is used to define a unified scene complexity. We construct a lightweight regression network, supervised by a label derived from the scene complexity, to dynamically predict the NMS threshold for each image. In the comparison experiments, our method clearly outperforms the standard NMS and its variants. Meanwhile,

CRediT authorship contribution statement

Hao Zhao: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft. Jikai Wang: Conceptualization, Writing – review & editing. Deyun Dai: Conceptualization, Writing – review & editing. Shiqi Lin: Conceptualization, Writing – review & editing. Zonghai Chen: Resources, Writing – review & editing, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (Grant No. 91848111).

Hao Zhao received his B.S. and M.S. degrees from the Southwest University of Science and Technology (SWUST) in 2014 and 2017. He is now a Ph.D. candidate in the Department of Automation, University of Science and Technology of China (USTC). His research interests include object detection, scene perception, meta-learning, and knowledge representation.

References (49)

Y. Song et al. Improved non-maximum suppression for object detection using harmony search algorithm. Appl. Soft Comput. (2019)
X. Wu et al. Recent advances in deep learning for object detection. Neurocomputing (2020)
L. Zhu et al. IoU-uniform R-CNN: Breaking through the limitations of RPN. Pattern Recogn. (2021)
A. Bochkovskiy, C.Y. Wang, H.Y.M. Liao, YOLOv4: Optimal speed and accuracy of object detection, 2020. arXiv preprint...
N. Bodla et al. Soft-NMS – improving object detection with one line of code
N. Carion et al. End-to-end object detection with transformers. European conference on computer vision, Springer (2020)
X. Chu et al. Detection in crowded scenes: One proposal, multiple predictions
K. Duan et al. CenterNet: Keypoint triplets for object detection
M. Everingham et al. The PASCAL visual object classes (VOC) challenge. International journal of computer vision (2010)
N. Gählert, N. Hanselmann, U. Franke, J. Denzler, Visibility guided NMS: Efficient boosting of amodal object detection...
P. Gao et al. Fast convergence of DETR with spatially modulated co-attention
Y. He, X. Zhang, M. Savvides, K. Kitani, Softer-NMS: Rethinking bounding box regression for accurate object detection,...
H. Hu et al. Relation networks for object detection
L. Huang et al. Lightweight adversarial network for salient object detection. Neurocomputing (2019)
B. Jiang et al. Acquisition of localization confidence for accurate object detection
X. Ke et al. Fine-grained vehicle type detection and recognition based on dense attention network. Neurocomputing (2020)
H. Law et al. CornerNet: Detecting objects as paired keypoints
Z. Li, F. Zhou, FSSD: Feature fusion single shot multibox detector, 2017. arXiv preprint...
H. Lin et al. Novel up-scale feature aggregation for object detection in aerial images. Neurocomputing (2020)
T.Y. Lin et al. Feature pyramid networks for object detection
T.Y. Lin et al. Focal loss for dense object detection
T.Y. Lin et al. Microsoft COCO: Common objects in context. European conference on computer vision, Springer (2014)
S. Liu et al. Adaptive NMS: Refining pedestrian detection in a crowd
W. Liu et al. SSD: Single shot multibox detector. European conference on computer vision, Springer (2016)


Jikai Wang received his B.S. degree from the University of Yanshan in 2014 and his Ph.D. degree from the University of Science and Technology of China in 2020. He is now a postdoctoral researcher in the Department of Automation, University of Science and Technology of China (USTC), China. His research interests include knowledge representation, intelligent information processing, robotics, visual SLAM, and machine learning.

Deyun Dai received her B.S. degree from Harbin Engineering University in 2016. She is currently a Ph.D. candidate in Control Science and Engineering in the Department of Automation, University of Science and Technology of China, Hefei, China. Her research interests include computer vision and environment perception in autonomous driving scenarios.

Shiqi Lin received his B.S. degree from Dalian Minzu University, Dalian, China, in 2017. He is now a Ph.D. candidate in the Department of Automation, University of Science and Technology of China (USTC). His research interests include state estimation, visual localization, and semantic scene understanding.

Zonghai Chen was born in Anhui, China, in 1963. He received the B.S. degree in automation and the M.E. degree in control theory and control engineering from the University of Science and Technology of China (USTC), Hefei, China, in 1988 and 1991, respectively. He has been a Professor with the Department of Automation, USTC, since 1998. His research interests include modeling and control of complex systems, intelligent robotics and information processing, energy management technologies for electric vehicles, and smart microgrids. Prof. Chen is a member of the Robotics Technical Committee and the Modelling, Identification and Signal Processing Technical Committee of the International Federation of Automatic Control (IFAC). He was a recipient of special allowances from the State Council of PR China.
