Pattern Recognition (Elsevier)

Volume 110, February 2021, 107593

Multi-scale structural kernel representation for object detection

https://doi.org/10.1016/j.patcog.2020.107593

Highlights

  • The first attempt to integrate high-order statistics into deep CNNs for effective object detection.

  • The proposed high-order statistical module preserves spatial information while taking the special geometry structures of high-order statistics into account.

  • Performing favorably in comparison to the state-of-the-art methods and showing good generalization ability to other dense prediction tasks.

Abstract

Existing high-performance object detection methods benefit greatly from the powerful representation ability of deep convolutional neural networks (CNNs). Recent research shows that integrating high-order statistics remarkably improves the representation ability of deep CNNs. However, exploiting high-order statistics for object detection poses two challenges. First, previous methods insert high-order statistics into deep CNNs as global representations, which lose the spatial information of the input and so are not applicable to object detection. Second, high-order statistics have special geometry structures, which should be taken into account for their proper use. To overcome these challenges, this paper proposes a Multi-scale Structural Kernel Representation (MSKR) for improving the performance of object detection. MSKR is built on polynomial kernel approximation, which not only incorporates high-order statistics but also preserves the spatial information of the input. To account for the geometry structures of high-order representations, a feature power normalization method is introduced before computation of the kernel representation. Compared with the first-order statistics most commonly used in existing CNN-based detectors, MSKR generates more discriminative representations and so can be flexibly integrated into deep CNNs to improve object detection. When adopted in existing object detection methods (i.e., Faster R-CNN, FPN, Mask R-CNN and RetinaNet), MSKR achieves clear improvement on three widely used benchmarks, while obtaining very competitive performance with state-of-the-art methods.

Introduction

Object detection has attracted much attention in the past decades. As one of the fundamental problems in computer vision, it plays a key role in a wide range of applications [1], [2]. From approaches [3] based on traditional handcrafted features to ones [4], [5] based on deep convolutional features, the rapid development of convolutional neural networks (CNNs) [6], [7] has greatly improved the performance of object detection. R-CNN [4] is among the first methods to exploit the powerful representation ability of deep CNNs to characterize object proposals, achieving a significant improvement over traditional methods. Subsequently, the Region of Interest (RoI) pooling layer and the Region Proposal Network (RPN) were introduced by Fast R-CNN [8] and Faster R-CNN [5], respectively, allowing object detection to be designed as an end-to-end architecture. Such methods require no pre-generated proposals [9], thereby leading to better performance and faster training/testing.

Although Faster R-CNN yields promising performance, it obtains representations by simply performing average pooling on the outputs of one single convolution (conv) layer (i.e., the last conv layer), limiting the robustness and accuracy of detection. As illustrated in Fig. 1(a), Faster R-CNN fails to detect birds with large pose changes, blur and similar background. One way to improve detection performance is to extract multi-scale feature maps from different conv layers as representations [10]; such methods fall into two categories: concatenation of multi-scale feature maps [11], [12] and pyramidal feature hierarchies [13], [14]. Generally speaking, feature maps from bottom layers have higher resolutions but weaker semantic information, while feature maps from top layers carry high-level semantic information but have lower resolutions. Concatenation-based methods obtain a coarse-to-fine representation for each object proposal by concatenating the outputs of different convolution layers into a single feature map. Methods based on a pyramidal feature hierarchy employ the outputs of different convolution layers in a pyramid manner; each level gives its own prediction, and all detection results are fused by non-maximum suppression. As shown in Fig. 1(b), methods based on multi-scale feature maps (e.g., RON [13]) are able to improve detection performance by enhancing the representations.
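The concatenation-based fusion described above can be sketched as follows (a hedged illustration in PyTorch; the stage names `c3`/`c4`/`c5` and channel sizes are assumptions, not those of any specific detector): coarser, more semantic maps are upsampled to the finest resolution and stacked along the channel axis.

```python
import torch
import torch.nn.functional as F

def concat_multiscale(feats):
    """Resize feature maps from several conv stages to a common
    resolution and concatenate them along the channel axis."""
    target = feats[0].shape[-2:]  # use the finest (bottom-layer) resolution
    resized = [feats[0]]
    for f in feats[1:]:
        # upsample the coarser, more semantic top-layer maps
        resized.append(F.interpolate(f, size=target, mode="bilinear",
                                     align_corners=False))
    return torch.cat(resized, dim=1)

# toy maps from three hypothetical backbone stages (batch size 1)
c3 = torch.randn(1, 64, 32, 32)
c4 = torch.randn(1, 128, 16, 16)
c5 = torch.randn(1, 256, 8, 8)
fused = concat_multiscale([c3, c4, c5])
print(fused.shape)  # torch.Size([1, 448, 32, 32])
```

The fused map keeps the finest spatial resolution, so each object proposal can draw on both high-resolution and high-level semantic cues.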

All the aforementioned methods focus on improving detection performance by extracting multi-scale feature maps, after which simple first-order pooling (i.e., RoI-Pooling) is performed on the feature maps to generate representations. Recently, several works have shown that integrating high-order statistics can significantly improve the representation ability of deep CNNs [16], [17]. Among them, B-CNN [16] inserts a second-order noncentral moment into deep CNNs, followed by element-wise power normalization and ℓ2-normalization. Wang et al. [17] embed a global Gaussian distribution into deep CNNs. Zhang et al. [18] propose a second-order locality-constrained affine subspace coding method for both image classification and image retrieval. These methods obtain promising improvement over first-order pooling based CNN models on challenging fine-grained visual categorization. Li et al. [19] propose matrix power normalized second-order pooling, showing consistent superiority over various CNN models on large-scale ImageNet classification [20].
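The second-order noncentral moment with power and ℓ2 normalization used by B-CNN [16] can be sketched as below (an illustrative PyTorch implementation, not the authors' code). Note that the result is a single global vector per image, which is exactly why such global high-order pooling discards spatial information.

```python
import torch

def bilinear_pool(X):
    """B-CNN-style second-order pooling (sketch): outer product of
    features averaged over spatial positions, then element-wise signed
    square-root and l2 normalization. The output is a global vector,
    so all spatial layout is lost."""
    N, C, H, W = X.shape
    X = X.reshape(N, C, H * W)
    B = torch.bmm(X, X.transpose(1, 2)) / (H * W)   # (N, C, C) noncentral moment
    z = B.reshape(N, C * C)
    z = torch.sign(z) * torch.sqrt(z.abs() + 1e-12)  # element-wise power norm
    return torch.nn.functional.normalize(z, dim=1)   # l2 normalization

feat = torch.randn(2, 64, 7, 7)
z = bilinear_pool(feat)
print(z.shape)  # torch.Size([2, 4096])
```

The C × C moment grows quadratically in the channel count, which is a further practical motivation for the kernel approximation used in this paper.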

The above discussion clearly encourages us to exploit high-order statistics for improving object detection. However, doing so poses two challenges. First, the aforementioned high-order methods compute global representations for whole images, which completely lose the spatial information of the images and so are not applicable to object detection. Second, high-order statistics have special structures, and previous works [17], [19] have demonstrated that these geometry structures should be considered to achieve favorable performance. To handle the first challenge, in our previous work [15] we introduced a polynomial kernel approximation method inspired by Cai et al. [21], where the weights of the high-order statistics inherent in a polynomial kernel are approximated by rank-1 tensor decomposition [22], and high-order representations are computed by learning the weight parameters. In deep architectures, the weight parameters can be learned with a series of 1 × 1 convolutions and element-wise product operations, all of which preserve spatial information. Therefore, the introduced polynomial kernel method can capture high-order statistics while preserving spatial information, making it able to improve performance on dense prediction tasks (e.g., object detection).
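The combination of 1 × 1 convolutions and element-wise products can be sketched as follows (a minimal illustration; the module name `PolyKernelOrder`, rank D and order r are assumptions, not the paper's exact parameterization): each 1 × 1 convolution realises one factor of a rank-1 decomposition, and multiplying r such projections element-wise yields an order-r term while keeping the H × W layout intact.

```python
import torch
import torch.nn as nn

class PolyKernelOrder(nn.Module):
    """Illustrative order-r term of a polynomial kernel approximation:
    rank-1 tensor factors realised as 1x1 convolutions, with the r-th
    power realised as r element-wise products. Every operation is
    purely spatial, so the H x W layout of the input is preserved."""
    def __init__(self, in_channels, rank_d, order_r):
        super().__init__()
        # one 1x1 convolution per factor of the element-wise product
        self.projs = nn.ModuleList(
            [nn.Conv2d(in_channels, rank_d, kernel_size=1)
             for _ in range(order_r)])

    def forward(self, x):              # x: (N, C, H, W)
        out = self.projs[0](x)
        for proj in self.projs[1:]:
            out = out * proj(x)        # element-wise product
        return out                     # (N, D, H, W): spatial size kept

x = torch.randn(2, 512, 14, 14)
third_order = PolyKernelOrder(512, rank_d=128, order_r=3)(x)
print(third_order.shape)  # torch.Size([2, 128, 14, 14])
```

Because the output is still a feature map rather than a global vector, RoI-based heads can operate on it directly.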

Recent work [19] shows that matrix power normalization can effectively exploit the geometry of second-order statistics in deep CNN architectures to improve classification performance. Given a set of convolutional features X ∈ ℝ^{C×n}, matrix power normalization of the second-order pooling of X (i.e., M = (1/C)XXᵀ) can be computed by shrinking the eigenvalues of M with a power function through eigenvalue decomposition (EIG) or singular value decomposition (SVD). However, such methods cannot be directly applied to approximated kernel representations, where no explicit high-order statistic (i.e., XXᵀ) is computed. Inspired by the success of [19], we introduce a feature power normalization method, which can be regarded as transferring matrix power normalization from the high-order representations to the original convolutional features. In this way, we can effectively account for the geometry of high-order representations through matrix power normalization, while avoiding computation of explicit high-order statistics. Accordingly, our Multi-scale Structural Kernel Representation (MSKR) considers the geometry of high-order kernel representations by performing feature power normalization before the polynomial kernel approximation. Besides, we embed an attention module [23] into our kernel representations to account for the importance of each convolutional feature. The attention module jointly encodes spatial and channel information.
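The transfer from matrix power normalization to the features can be made concrete with one algebraic identity: if M = (1/C)XXᵀ = UΛUᵀ, then X̂ = UΛ^{(p-1)/2}UᵀX satisfies (1/C)X̂X̂ᵀ = Mᵖ. The following numerical check illustrates this construction (a sketch under that assumption, not necessarily the paper's exact formulation; the function name is hypothetical).

```python
import torch

def feature_power_normalize(X, p=0.5, eps=1e-6):
    """Sketch of feature-side power normalization: transform X so that
    the second-order statistic of the result equals the matrix power
    M^p of M = (1/C) X X^T, without forming M^p inside the network."""
    C, n = X.shape
    M = X @ X.t() / C                  # explicit second-order statistic
    lam, U = torch.linalg.eigh(M)      # EIG of the symmetric PSD matrix
    lam = lam.clamp_min(eps)
    # X_hat = U diag(lam^((p-1)/2)) U^T X  =>  (1/C) X_hat X_hat^T = M^p
    return U @ torch.diag(lam.pow((p - 1) / 2)) @ U.t() @ X

torch.manual_seed(0)
X = torch.randn(8, 49)                 # C=8 channels, n=49 spatial positions
Xh = feature_power_normalize(X, p=0.5)

# check: (1/C) Xh Xh^T equals the matrix square root of M
M = X @ X.t() / 8
lam, U = torch.linalg.eigh(M)
M_sqrt = U @ torch.diag(lam.clamp_min(1e-6).sqrt()) @ U.t()
print(torch.allclose(Xh @ Xh.t() / 8, M_sqrt, atol=1e-3))  # True
```

Shrinking the eigenvalues with p < 1 dampens the dominant directions of the feature covariance, which is the geometric effect matrix power normalization is known for.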

This paper is an extension of our previous work [15]. MSKR differs significantly from our previous work (MLKP) in both techniques and experiments. From a technical perspective, MSKR introduces a novel feature power normalization into MLKP, which appropriately exploits the geometry of the high-order statistics captured by MLKP's polynomial kernel approximation. Besides, MSKR extends the location-weight network of MLKP (which considers only spatial information) to an attention network that jointly takes spatial and channel information into consideration. From an experimental perspective, we conduct many more experiments than in the previous work to verify the effectiveness of MSKR across detectors, backbone models and tasks. First, we adopt MSKR in more detectors beyond the Faster R-CNN [5] used in Wang et al. [15]. Second, we employ the light-weight MobileNet [24] as a backbone model to assess the effect of MSKR in mobile settings. Additionally, we evaluate the generalization ability of the proposed MSKR on the instance segmentation task.
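An attention network that jointly re-weights channels and spatial positions can be sketched as below (an illustrative stand-in combining a squeeze-and-excitation-style channel branch with a 1 × 1-conv spatial branch; this is an assumption for exposition, not the exact module of [23]).

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    """Illustrative attention that re-weights a feature map jointly over
    channels (global-pool + bottleneck MLP, sigmoid-gated) and spatial
    positions (1x1 conv producing a single-channel gate)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        # broadcasting applies the (N,C,1,1) channel gate and the
        # (N,1,H,W) spatial gate to every position/channel of x
        return x * self.channel(x) * self.spatial(x)

x = torch.randn(2, 256, 14, 14)
y = SpatialChannelAttention(256)(x)
print(y.shape)  # torch.Size([2, 256, 14, 14])
```

The gating leaves the tensor shape unchanged, so the module can be dropped after the kernel representation without altering the detector head.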

An overview of the proposed MSKR is illustrated in Fig. 2. Given a multi-scale feature map X, MSKR first performs feature power normalization, and then kernel representations are computed using 1 × 1 convolution operations and element-wise products. Finally, an attention module re-weights the kernel representations. As shown in Fig. 1(d), MSKR significantly improves detection performance, especially for objects with complex variations (e.g., large pose changes, blur and similar background). Experiments are conducted on three widely used benchmarks, i.e., PASCAL VOC 2007, PASCAL VOC 2012 [25] and MS COCO [26]. The contributions of this paper are summarized as follows:

  • 1.

    In this paper, we make an attempt to integrate high-order statistics into deep CNNs as representations for effective object detection. To this end, we propose a novel Multi-scale Structural Kernel Representation (MSKR). The proposed MSKR preserves spatial information while taking the geometry structures of high-order statistics into account.

  • 2.

    To account for the geometry of our high-order kernel representations, we introduce a feature power normalization method applied before computation of the kernel representations, which approximately performs matrix power normalization on the high-order representations. It further improves the performance of the kernel representations.

  • 3.

    Extensive experiments are conducted on three widely used object detection benchmarks, and the results show that MSKR clearly improves the performance of many existing deep detectors (e.g., Faster R-CNN [5], FPN [14], Mask R-CNN [27] and RetinaNet [28]), while performing favorably in comparison to state-of-the-art methods. Besides, the results on the instance segmentation task demonstrate that our MSKR has great potential to improve performance on other dense prediction tasks.

Section snippets

Related work

Recently, many advanced detectors based on Faster R-CNN have been proposed in the community of region-based detection methods. R-FCN [29] is among the first to address the dilemma between invariance in classification and variance in detection by replacing the RoI-Pooling layer with a position-sensitive score map. FPN [14] and Mask R-CNN [27] deploy a top-down multi-scale pyramidal hierarchy to leverage features at all scales. Moreover, Mask R-CNN replaces the RoI-Pooling layer with

Proposed method

In this section, we introduce the proposed Multi-scale Structural Kernel Representation (MSKR). Firstly, we present a modified multi-scale feature map to effectively utilize multi-resolution information. Then, a structural kernel representation is proposed to incorporate high-order statistics while maintaining spatial information and considering geometry structure, which is achieved by feature power normalization followed by polynomial kernel approximation. Besides, we use an attention module

Experiments

In this section, we evaluate the performance of our proposed MSKR. Specifically, we first describe the implementation details of our method. Then, we conduct ablation studies on the key components of our method using Faster R-CNN [5] with ResNet-101 [7] on PASCAL VOC 2007 [25]. Additionally, we compare with other methods on three widely used benchmarks (i.e., PASCAL VOC 2007, PASCAL VOC 2012 [25] and MS COCO [26]) using four state-of-the-art detectors (i.e., Faster R-CNN [5], FPN [14], Mask R-CNN [27] and

Conclusion

In this paper, we propose a novel Multi-scale Structural Kernel Representation (MSKR) method to effectively exploit high-order statistics for improving the performance of object detection. The proposed MSKR generates informative representations, which preserve spatial information while considering the geometry structures of high-order statistics, thereby being suitable for dense prediction. Our MSKR can be flexibly integrated into various object detection approaches and the experimental results

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by NSFC grants under nos. 61671182, U19A2073, 61806140, 61971086 and 2019YFB210901, and by the Project of the State Key Laboratory of Robotics and System (HIT) under grant no. SKLRS202004D.

Hao Wang received B.S. and M.S. degrees from Northeastern University, Shenyang, China, in 2012 and 2014, respectively. He is currently working toward the Ph.D. degree in the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China. His research interests include object detection, object segmentation and related problems.

References (38)

  • X. Li et al.

    Accurate object detection using memory-based models in surveillance scenes

    Pattern Recognit.

    (2017)
  • X. Huang et al.

    Water flow driven salient object detection at 180 fps

    Pattern Recognit.

    (2018)
  • B. Zhang et al.

    Locality-constrained affine subspace coding for image classification and retrieval

    Pattern Recognit.

    (2020)
  • P.F. Felzenszwalb et al.

    Object detection with discriminatively trained part-based models

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2010)
  • R. Girshick et al.

    Rich feature hierarchies for accurate object detection and semantic segmentation

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2014)
  • S. Ren et al.

    Faster R-CNN: towards real-time object detection with region proposal networks

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2017)
  • K. Simonyan et al.

    Very deep convolutional networks for large-scale image recognition

    International Conference on Learning Representations

    (2015)
  • K. He et al.

    Deep residual learning for image recognition

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • R. Girshick

    Fast R-CNN

    Proceedings of the IEEE International Conference on Computer Vision

    (2015)
  • J.R. Uijlings et al.

    Selective search for object recognition

    Int. J. Comput. Vis.

    (2013)
  • W. Ma et al.

    MDCN: multi-scale, deep inception convolutional neural networks for efficient object detection

    24th International Conference on Pattern Recognition

    (2018)
  • P. Zhang et al.

    HyperFusion-Net: hyper-densely reflective feature fusion for salient object detection

    Pattern Recognit.

    (2019)
  • H. Wang et al.

    Multi-scale fusion with context-aware network for object detection

    24th International Conference on Pattern Recognition

    (2018)
  • T. Kong et al.

    RON: reverse connection with objectness prior networks for object detection

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • T.-Y. Lin et al.

    Feature pyramid networks for object detection

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • H. Wang et al.

    Multi-scale location-aware kernel representation for object detection

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • T. Lin et al.

    Bilinear convolutional neural networks for fine-grained visual recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2018)
  • Q. Wang et al.

    G2DeNet: global Gaussian distribution embedding network and its application to visual recognition

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • P. Li et al.

    Is second-order information helpful for large-scale visual recognition?

    Proceedings of the IEEE International Conference on Computer Vision

    (2017)


    Qilong Wang received the Ph.D. degree from the School of Information and Communication Engineering, Dalian University of Technology, China, in 2018. He is currently a lecturer at the College of Intelligence and Computing, Tianjin University. His research interests include computer vision and pattern recognition, particularly visual classification and deep probability distribution modeling. He has published more than thirty academic papers in top conferences and refereed journals including ICCV/CVPR/NIPS/ECCV/IJCAI and IEEE TPAMI/TIP/TCSVT.

    Peihua Li received the Ph.D. degree from Harbin Institute of Technology, Harbin, China, in 2003, and then worked for one year as a postdoctoral fellow at INRIA/IRISA, France. He received an honorary nomination for the National Excellent Doctoral Dissertation award in China. He is currently a professor at Dalian University of Technology, Dalian, China. He was supported by the Program for New Century Excellent Talents in University of the Chinese Ministry of Education. His team won 1st place in the large-scale iNaturalist Challenge spanning 8000 species at FGVC5, CVPR 2018, 2nd place in the Alibaba Large-scale Image Search Challenge 2015, and 4th place in Noisy Iris Challenge Evaluation I. His research topics include deep learning, computer vision and pattern recognition, focusing on image/video recognition, object detection and semantic segmentation. He has published papers in top journals such as IEEE TPAMI/TIP/TCSVT and top conferences including ICCV/CVPR/ECCV.

    Wangmeng Zuo is currently a Professor in the School of Computer Science and Technology, Harbin Institute of Technology. His research interests include image/video enhancement, image/video generation, visual tracking, and image classification. He has published over 90 papers in top-tier academic journals and conferences. He has served as an area chair of CVPR 2020 and ICCV 2019, and a tutorial organizer at ECCV 2016. He is an Associate Editor of IET Biometrics, The Visual Computer and the Journal of Electronic Imaging, and a Guest Editor of Neurocomputing, Pattern Recognition, IEEE Transactions on Circuits and Systems for Video Technology, and IEEE Transactions on Neural Networks and Learning Systems.
