DrlNet: Blind object proposal quality assessment with discriminative response learning

https://doi.org/10.1016/j.dsp.2020.102810

Abstract

Object proposal quality assessment without ground truth as reference is a challenging task. Some existing methods measure quality with hand-crafted, subjective metrics, such as objectness and foreground confidence. Recently, deep learning has been adopted to directly estimate quantifiable metrics, such as Intersection over Union (IoU). However, we find that IoU, the commonly used quality metric, falls far short of fully describing the quality of an object proposal: proposals with the same IoU score may carry totally different amounts of discriminative information. We introduce a new metric named Discriminative Information Richness (DIR) to characterize the discriminative degree of a given object proposal. DIR is derived from the response intensity of the projected deep feature maps, in which high-response regions indicate the discriminative regions. In addition, we design a convolutional neural network named DrlNet that simultaneously predicts IoU scores and perceives the richness of the identification information. DrlNet is a multi-metric joint deep regression network for both spatial-covering prediction and discriminative-information-richness perception. Compared with IoU-only models, DrlNet provides a more comprehensive quality assessment. We perform comprehensive experiments on both the PASCAL VOC and COCO datasets. The experimental results show that DrlNet performs well on both proposal selection and object detection tasks. In particular, results on the COCO dataset demonstrate the good generalization ability of the proposed model.

Introduction

In the past few decades, academia has witnessed rapid development of object detection [1], especially with the boost of deep learning [2]. In Convolutional Neural Network (CNN) based two-stage object detection approaches, such as R-CNN [3], SPP-Net [4], Fast R-CNN [5] and Faster R-CNN [6], proposal algorithms are widely used for generating object candidates.

Beyond object detection, object proposal algorithms, which aim to provide bounding box candidates with high object-covering confidence, have played important roles in many other high-level computer vision tasks, such as object segmentation [7], [8], visual tracking [9], [10], and action detection [11], [12]. Like the region-wise processing strategies [13], [14], [15], [16], [17] widely used in computer vision, proposal-based processing is another commonly adopted manner. Object proposal algorithms generate massive target hypotheses, which specify the processing targets and greatly narrow the information space to be processed. This also means that the quality of object proposals directly influences the performance of such subsequent tasks: a small candidate pool with high-quality proposals greatly boosts the performance and efficiency of the subsequent steps of the application algorithm, whereas poor proposals may significantly degrade application performance.

Given the ground truth annotations, proposal quality can easily be estimated with the commonly used IoU metric. However, ground truth is usually unavailable for direct quality assessment in real-world applications, which makes this a blind assessment problem. No-reference object proposal quality ranking has been studied within proposal generation methods [18], [19], [20], [21] for suggesting good proposals. However, these built-in ranking modules can only provide relative superiority orders; they can hardly give a metric-based prediction, such as the IoU score. To address this concern, Wu et al. [22] proposed a generic proposal evaluator (GPE), which directly predicts the IoU score of a given object proposal. However, taking IoU as the only metric can hardly distinguish the quality difference between proposals covering different parts of the object.
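For reference, IoU measures the overlap between a proposal box and the ground truth box as the ratio of their intersection area to their union area. A minimal sketch follows; the (x1, y1, x2, y2) corner format is an assumption of this illustration, not a convention stated in the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A perfectly aligned pair scores 1.0, disjoint boxes score 0.0, and partial overlaps fall in between, which is exactly what a metric-based predictor such as GPE regresses.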

Fig. 1 (a) presents a cat image, whose ground truth object box is shown in Fig. 1 (b). Two bounding box candidates, which share the same IoU value and equally cover the right and left halves of the cat's body, are presented in Fig. 1 (c) and Fig. 1 (d), respectively. Apparently, GPE cannot effectively evaluate the relative superiority of the two candidate boxes since they have the same IoU score. Yet we can easily notice that the proposal in Fig. 1 (c) provides more discriminative information, which helps distinguish the category of the target object better than that in Fig. 1 (d); the proposal in Fig. 1 (c) should therefore receive a higher quality evaluation. Richness of discriminative information is a vital factor in proposal quality assessment and has broad application prospects in high-level tasks.

Taking unsupervised and weakly supervised detection tasks as examples, discriminative candidates can provide more non-redundant and valuable information for training. Accordingly, introducing a discriminative judgment helps the detector capture the key attributes efficiently and reduces computation and storage costs.

Hence, screening out region proposals rich in discriminative information from massive candidates is an important and promising task. However, the richness of discriminative information has never been comprehensively explored, especially as a quality indicator. In fact, no existing detection or recognition database annotates samples with quantitative scores for discriminative degree. One main obstacle is how to define a suitable metric to quantitatively compute the discriminativeness of a given proposal; a subsequent problem is how to predict this metric without ground truth as reference. In this paper, our contributions are as follows:

(1) We introduce a new metric named Discriminative Information Richness (DIR) to characterize the discriminative degree of the given proposal.

(2) A blind quality evaluation method within a discriminative response learning framework is proposed, which can simultaneously perceive the richness of the identification information and the target covering of the candidate area.
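The paper's exact DIR formulation is not reproduced in this excerpt; the NumPy sketch below only illustrates the idea stated above, that the response intensity of a deep feature map can score how much of an object's discriminative energy a proposal captures. The function name, the box format, and the normalization against the ground-truth region are all illustrative assumptions, not the paper's definition.

```python
import numpy as np

def dir_score(response_map, proposal, gt_box):
    """Illustrative DIR-style score: fraction of the object's total
    response energy that falls inside the proposal (hypothetical form).
    Boxes are (x1, y1, x2, y2) in response-map coordinates."""
    def crop_sum(box):
        x1, y1, x2, y2 = box
        return float(response_map[y1:y2, x1:x2].sum())
    total = crop_sum(gt_box)
    return crop_sum(proposal) / total if total > 0 else 0.0
```

In the spirit of Fig. 1, two boxes with identical IoU against the ground truth can receive very different scores here when the response map concentrates on one side of the object.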

Section snippets

Blind image and saliency quality assessment

In the computer vision community, blind quality assessment is not a new research topic, and a group of promising Blind Image Quality Assessment (BIQA) methods have been proposed in the last several years [23], [24], [25], [26], [27]. Oszust et al. [23] proposed to extract local features via derivative filters and adopted the support vector regression technique for blind image quality assessment. Liu et al. [24] creatively proposed to extract both low-level and high-level statistical features, and then

Blind discrimination assessment with deep response learning

To fully describe the quality of an object proposal, we propose a proposal quality assessment framework, namely the Discriminative Response Learning Network (DrlNet), which picks out the optimal proposals by inferring two complementary quality metrics. The first metric is IoU, which expresses the spatial consistency between candidate boxes and the ground truth. The second, namely DIR, characterizes the discriminative information richness of the candidate boxes. Intuitively, discriminative information
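The multi-metric joint regression described above amounts to a shared feature trunk with two scalar heads, one per metric. The NumPy sketch below shows only that structural idea; all names, shapes, and the sigmoid output squashing are illustrative assumptions, not DrlNet's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_trunk(x, w):
    # Shared feature extractor (stand-in for a convolutional backbone).
    return np.maximum(0.0, x @ w)  # ReLU(xW)

def two_head_forward(x, params):
    """Joint regression: one shared trunk, two scalar heads
    (an IoU prediction and a DIR prediction), each squashed to [0, 1]."""
    h = shared_trunk(x, params["w_shared"])
    iou_pred = 1.0 / (1.0 + np.exp(-(h @ params["w_iou"])))
    dir_pred = 1.0 / (1.0 + np.exp(-(h @ params["w_dir"])))
    return iou_pred, dir_pred

params = {
    "w_shared": rng.standard_normal((16, 8)),
    "w_iou": rng.standard_normal(8),
    "w_dir": rng.standard_normal(8),
}
x = rng.standard_normal(16)  # stand-in for pooled proposal features
iou_pred, dir_pred = two_head_forward(x, params)
```

Sharing the trunk lets both metrics be trained jointly from the same proposal features, which is the design choice the multi-metric framing implies.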

Experimental results and analysis

In this section, we present experimental results to demonstrate the performance of the proposed DrlNet from three aspects. In Sec. 4.1, we verify the effectiveness of our trained model with both qualitative analysis and quantitative evaluation. In Sec. 4.2, we test the proposal selection performance. Finally, in Sec. 4.3, we give evaluation and analysis about the generalization ability of the trained DrlNet to show whether it is suitable for images outside the training categories.

Conclusion

In this paper, we propose an object proposal quality assessment network, namely DrlNet, within a discriminative response learning framework. The proposed method can simultaneously perceive the richness of the identification information and the target covering of the candidate area without ground truth information. We conduct experiments on publicly available datasets and on images containing categories unseen by the trained models to verify the effectiveness and generalization ability of DrlNet.

CRediT authorship contribution statement

Qi Qi: Investigation, Methodology, Software. Kunqian Li: Conceptualization, Funding acquisition, Methodology, Visualization, Writing - original draft. Xinning Wang: Conceptualization, Writing - review & editing. Xin Luan: Resources, Supervision, Writing - review & editing. Dalei Song: Project administration, Resources, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

The research has been supported in part by the National Natural Science Foundation of China under Grant 61906177, in part by the Natural Science Foundation of Shandong Province under Grant ZR2019BF034, and in part by the Fundamental Research Funds for the Central Universities under Grants 201813022 and 201964013.

Qi Qi received the B.S. and M.S. degrees from University of Jinan, Jinan, China, in 2015 and 2017, respectively. He is currently working toward the Ph.D. degree in College of Information Science and Engineering, Ocean University of China, Qingdao, China. His research interests include image processing and computer vision.

References (45)

  • Y. Zhu et al., The prediction of head and eye movement for 360 degree images, Signal Process. Image Commun. (2018)
  • Y. Zhu et al., The prediction of saliency map for head and eye movements in 360 degree images, IEEE Trans. Multimed. (2019)
  • Z. Zou et al., Object detection in 20 years: a survey
  • L. Jiao et al., A survey of deep learning-based object detection, IEEE Access (2019)
  • R. Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation
  • K. He et al., Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • R. Girshick, Fast R-CNN
  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks
  • K. Li et al., Unsupervised co-segmentation for indefinite number of common foreground objects, IEEE Trans. Image Process. (2016)
  • J. Zhang et al., Multivideo object cosegmentation for irrelevant frames involved videos, IEEE Signal Process. Lett. (2016)
  • Y. Zhao et al., Temporal action detection with structured segment networks
  • W. Tao et al., Unified mean shift segmentation and graph region merging algorithm for infrared ship target segmentation, Opt. Eng. (2007)
Kunqian Li received his B.S. degree from China University of Petroleum (UPC), Qingdao, China, in 2012, and his Ph.D. degree from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2018. He is currently a lecturer in the College of Engineering, Ocean University of China, Qingdao, China. His research interests include image processing and visual recognition.

Xinning Wang received her B.S. and M.E. degrees from Ocean University of China, Qingdao, China, in 2009 and 2012, respectively, and her Ph.D. degree from the Department of Computer Science and Software Engineering, Auburn University, in 2017. She is currently a post-doctoral research fellow at Ocean University of China, Qingdao, China. Her research interests span data mining and analytics, computer architecture and systems, cloud computing, machine learning, and cybersecurity.

Xin Luan received the B.S. and M.S. degrees from the School of Computer Science and Technology, Harbin Engineering University. She has served as a lecturer, associate professor, professor, and doctoral supervisor in the College of Information Science and Engineering, Ocean University of China, Qingdao, China, where she is currently an extramural doctoral supervisor. She is mainly engaged in research on ocean observation technology and artificial intelligence.

Dalei Song received his Ph.D. degree from Harbin Industrial University, Harbin, China, in 1999. From 1999 to 2001, he was a senior engineer with Lucent Technologies. He is currently a full professor in the College of Engineering, Ocean University of China, Qingdao, China. His research interests include machine intelligent perception, ocean observation technology, robot control technology, and computer vision.
