Multi-scale structural kernel representation for object detection
Introduction
Object detection has attracted much attention in the past decades. As one of the fundamental problems in computer vision, it plays a key role in a wide range of applications [1], [2]. From approaches [3] based on traditional handcrafted features to ones [4], [5] based on deep convolutional features, the rapid development of convolutional neural networks (CNNs) [6], [7] has greatly improved the performance of object detection. R-CNN [4] is among the first methods to exploit the powerful representation ability of deep CNNs to characterize object proposals, achieving a significant improvement over traditional methods. Subsequently, the Region of Interest (RoI) pooling layer and the Region Proposal Network (RPN) were introduced by Fast R-CNN [8] and Faster R-CNN [5], respectively, allowing object detection to be designed as an end-to-end architecture. Such methods require no pre-generated proposals [9], thereby leading to better performance and faster training/testing.
Although Faster R-CNN yields promising performance, it obtains representations by simply performing average pooling on the output of a single convolution (conv) layer (i.e., the last conv layer), limiting the robustness and accuracy of detection. As illustrated in Fig. 1(a), Faster R-CNN fails to detect birds with large pose changes, blur and similar backgrounds. One solution for improving object detection is to extract multi-scale feature maps from different conv layers as representations [10]; such methods fall into two groups: concatenation of multi-scale feature maps [11], [12] and pyramidal feature hierarchy [13], [14]. Generally speaking, feature maps from bottom layers have higher resolutions but weaker semantic information, while feature maps from top layers carry high-level semantic information but have lower resolutions. The concatenation-based methods obtain a coarse-to-fine representation for each object proposal by concatenating the outputs of different conv layers into a single feature map. The methods based on pyramidal feature hierarchy employ the outputs of different conv layers in a pyramid manner; each level gives its own prediction, and all detection results are fused using non-maximum suppression. As shown in Fig. 1(b), the methods based on pyramidal feature hierarchy (e.g., RON [13]) are able to improve detection performance by enhancing the representations.
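As a rough illustration of the concatenation-based strategy described above, the sketch below fuses feature maps of different resolutions into one coarse-to-fine map. It is a minimal numpy mock-up, not code from the paper: the function names (`upsample_nn`, `concat_multiscale`), nearest-neighbour upsampling, and the assumptions of square maps with integer scale ratios are ours.

```python
import numpy as np

def upsample_nn(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def concat_multiscale(feats):
    """Concatenate conv feature maps from different layers along channels.

    feats: list of (C_i, H_i, W_i) arrays (square maps, resolutions that
    divide the finest one). All maps are upsampled to the finest resolution
    before concatenation, mirroring concatenation-based multi-scale methods.
    """
    target_h = max(f.shape[1] for f in feats)
    ups = [upsample_nn(f, target_h // f.shape[1]) for f in feats]
    return np.concatenate(ups, axis=0)

# toy example: a coarse 256-channel map and a fine 64-channel map
coarse = np.random.rand(256, 7, 7)
fine = np.random.rand(64, 14, 14)
fused = concat_multiscale([coarse, fine])
print(fused.shape)  # (320, 14, 14)
```

Pyramidal-hierarchy methods would instead keep the per-level maps separate and run a detection head on each, fusing only the final boxes with non-maximum suppression.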
All aforementioned methods focus on improving detection performance by extracting multi-scale feature maps, after which simple first-order pooling (i.e., RoI-Pooling) is performed on the feature maps to generate representations. Recently, some researchers have shown that integration of high-order statistics can significantly improve the representation ability of deep CNNs [16], [17]. Among them, B-CNN [16] inserts a second-order noncentral moment into deep CNNs, and performs element-wise power normalization followed by ℓ2-normalization. Wang et al. [17] embed a global Gaussian distribution into deep CNNs. Zhang et al. [18] propose a second-order locality-constrained affine subspace coding method for both image classification and image retrieval. These methods obtain promising improvements over first-order pooling based CNN models on challenging fine-grained visual categorization. Li et al. [19] propose a matrix power normalized second-order pooling, showing consistent superiority over various CNN models on large-scale ImageNet classification [20].
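The B-CNN-style pipeline mentioned above (second-order noncentral moment, element-wise power normalization, then ℓ2-normalization) can be sketched in a few lines. This is a schematic numpy version under our own conventions (rows of `X` are spatial positions, columns are channels), not the authors' implementation:

```python
import numpy as np

def bilinear_pool(X):
    """B-CNN-style second-order pooling of a feature matrix.

    X: (N, C) matrix of N convolutional features with C channels.
    Returns the flattened C*C second-order noncentral moment after
    signed square-root (element-wise power) and l2 normalization.
    """
    M = X.T @ X / X.shape[0]                # second-order noncentral moment
    v = M.reshape(-1)
    v = np.sign(v) * np.sqrt(np.abs(v))     # element-wise power normalization
    return v / (np.linalg.norm(v) + 1e-12)  # l2 normalization

X = np.random.rand(49, 8)  # e.g. a 7x7 map with 8 channels, flattened
z = bilinear_pool(X)
print(z.shape)  # (64,)
```

Note that pooling over all N positions is exactly what discards spatial information, which is the first obstacle to using such statistics for detection.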
The above discussion clearly encourages us to exploit high-order statistics for improving the performance of object detection. However, two challenges arise. First, the aforementioned high-order methods compute global representations over whole images, which completely lose spatial information and so are not applicable to object detection. Second, high-order statistics have special structures, and previous works [17], [19] have demonstrated that geometry structures should be considered to achieve favorable performance. To handle the first challenge, in our previous work [15] we introduced a polynomial kernel approximation method inspired by Cai et al. [21], where the weight of the high-order statistics inherent in a polynomial kernel can be approximated by rank-1 tensor decomposition [22], and high-order representations can be computed by learning the weight parameters. In deep architectures, the weight parameters can be learned using a series of 1 × 1 convolutions and element-wise product operations, all of which preserve spatial information. Therefore, the introduced polynomial kernel method can capture high-order statistics while preserving spatial information, and thus is able to improve the performance of dense prediction tasks (e.g., object detection).
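The key point above is that an order-r statistic can be approximated, per spatial position, as an element-wise product of r linearly projected maps (each projection acting as a 1 × 1 convolution). The sketch below shows this shape-preserving computation in numpy; the names (`conv1x1`, `order_r_term`) and the random untrained weights are our illustration, not the learned parameters of MLKP/MSKR:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """1x1 convolution on a (C, H, W) map: a per-position linear projection."""
    return np.einsum('dc,chw->dhw', w, x)

def order_r_term(x, weights):
    """Rank-1 style approximation of the order-r term of a polynomial kernel.

    weights: list of r projection matrices of shape (D, C). The element-wise
    product of the r projected maps captures order-r statistics at every
    spatial position, so the spatial layout (H, W) is preserved.
    """
    out = conv1x1(x, weights[0])
    for w in weights[1:]:
        out = out * conv1x1(x, w)   # element-wise product across projections
    return out                      # (D, H, W): high-order map, same H and W

x = rng.standard_normal((16, 7, 7))                     # toy feature map
ws = [rng.standard_normal((32, 16)) for _ in range(3)]  # order r = 3, D = 32
y = order_r_term(x, ws)
print(y.shape)  # (32, 7, 7)
```

In a real network the projection matrices are learned end-to-end; here they simply demonstrate that no pooling over positions is involved.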
Recent work [19] shows that matrix power normalization can effectively exploit the geometry of second-order statistics in deep CNN architectures to improve classification performance. Given a set of convolutional features X, matrix power normalization of second-order pooling (i.e., (X⊤X)^α) can be computed by shrinking the eigenvalues of M = X⊤X with a power function through eigenvalue decomposition (EIG) or singular value decomposition (SVD). However, such methods cannot be directly applied to approximated kernel representations, where no explicit high-order statistic (i.e., X⊤X) is computed. Inspired by the success of [19], we introduce a feature power normalization method, which can be regarded as transferring matrix power normalization from the high-order representations to the original convolutional features. In this way, we can effectively consider the geometry of high-order representations based on matrix power normalization, while avoiding computation of explicit high-order statistics. Accordingly, our Multi-scale Structural Kernel Representation (MSKR) considers the geometry of high-order kernel representations by performing feature power normalization before the polynomial kernel approximation. Besides, we embed an attention module [23] into our kernel representations to account for the importance of each convolutional feature. The attention module jointly encodes spatial and channel information.
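The contrast drawn above can be made concrete: matrix power normalization operates on the explicit second-order matrix via its eigenvalues, while feature power normalization acts directly on the features. The following numpy sketch shows both; the feature-level version here is a simple signed element-wise power, which stands in for (but does not exactly reproduce) the method introduced in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def matrix_power_norm(X, alpha=0.5):
    """Matrix power normalization of second-order pooling, (X^T X)^alpha,
    computed by shrinking the eigenvalues with a power function (EIG)."""
    M = X.T @ X
    lam, U = np.linalg.eigh(M)              # M is symmetric positive semi-definite
    lam = np.clip(lam, 0.0, None) ** alpha  # eigenvalue shrinkage
    return U @ np.diag(lam) @ U.T

def feature_power_norm(X, alpha=0.5):
    """Feature-level surrogate applied before kernel approximation:
    power-normalize the features themselves, so no explicit X^T X is needed."""
    return np.sign(X) * np.abs(X) ** alpha

X = rng.standard_normal((49, 8))   # 49 spatial positions, 8 channels
A = matrix_power_norm(X, alpha=0.5)
print(np.allclose(A @ A, X.T @ X))  # True: A is a matrix square root of X^T X
```

The feature-level variant costs only an element-wise operation per position, whereas the matrix variant requires an EIG/SVD of the C × C statistic, which is why avoiding the explicit statistic matters for kernel representations.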
This paper is an extension of our previous work [15]. There are two significant differences between MSKR and our previous work (MLKP) in terms of techniques and experiments. From the technique perspective, MSKR introduces a novel feature power normalization into MLKP, which appropriately exploits the geometry of the high-order statistics captured by the polynomial kernel approximation of MLKP. Besides, MSKR extends the location-weight network of MLKP (which considers only spatial information) to an attention network that jointly takes spatial and channel information into consideration. From the experiment perspective, we conduct many more experiments to verify the effectiveness of MSKR in terms of detectors, backbone models and tasks compared with our previous work. First, we adopt MSKR in more detectors besides the Faster R-CNN [5] used in Wang et al. [15]. Second, we employ the light-weight MobileNet [24] as a backbone model to assess the effect of MSKR in mobile settings. Additionally, we evaluate the generalization ability of the proposed MSKR on the instance segmentation task.
The overview of our proposed MSKR is illustrated in Fig. 2. Given a multi-scale feature map, MSKR first performs feature power normalization, and then kernel representations are computed using 1 × 1 convolution operations and element-wise products. Finally, an attention module is used to re-weight the kernel representations. As shown in Fig. 1(d), MSKR can significantly improve detection performance, especially for objects with complex variations (e.g., large pose changes, blur and similar backgrounds). The experiments are conducted on three widely used benchmarks, i.e., PASCAL VOC 2007, PASCAL VOC 2012 [25] and MS COCO [26]. The contributions of this paper are summarized as follows:
- 1.
In this paper, we make an attempt to integrate high-order statistics into deep CNNs as representations for effective object detection. To this end, we propose a novel Multi-scale Structural Kernel Representation (MSKR). The proposed MSKR can preserve the spatial information while taking geometry structure of high-order statistics into account.
- 2.
To consider the geometry of our high-order kernel representations, we introduce a feature power normalization method before computation of kernel representations, approximately performing matrix power normalization on high-order representations. It can further improve the performance of kernel representations.
- 3.
Extensive experiments are conducted on three widely used object detection benchmarks, and the results show that MSKR clearly improves the performance of many existing deep detectors (e.g., Faster R-CNN [5], FPN [14], Mask R-CNN [27] and RetinaNet [28]), while performing favorably in comparison to the state-of-the-art methods. Besides, the results on the instance segmentation task demonstrate that our MSKR has great potential to improve the performance of other dense prediction tasks.
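Putting the three stages together (feature power normalization, then an order-2 kernel term via 1 × 1 convolutions and element-wise product, then attention re-weighting), the overall flow can be sketched as below. This is a schematic numpy mock-up: the toy sigmoid gates stand in for the actual attention module [23], and the random weights stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def feature_power_norm(x, alpha=0.5):
    """Signed element-wise power normalization of the feature map."""
    return np.sign(x) * np.abs(x) ** alpha

def conv1x1(x, w):
    """1x1 convolution on a (C, H, W) map: per-position linear projection."""
    return np.einsum('dc,chw->dhw', w, x)

def attention(x):
    """Toy spatial-and-channel gating (placeholder for the module of [23])."""
    s = 1.0 / (1.0 + np.exp(-x.mean(axis=0, keepdims=True)))        # spatial gate
    c = 1.0 / (1.0 + np.exp(-x.mean(axis=(1, 2), keepdims=True)))   # channel gate
    return x * s * c

def mskr_sketch(x, w1, w2):
    x = feature_power_norm(x)             # step 1: feature power normalization
    k = conv1x1(x, w1) * conv1x1(x, w2)   # step 2: order-2 kernel term
    return attention(k)                   # step 3: attention re-weighting

x = rng.standard_normal((16, 7, 7))       # toy multi-scale feature map
w1, w2 = rng.standard_normal((32, 16)), rng.standard_normal((32, 16))
r = mskr_sketch(x, w1, w2)
print(r.shape)  # (32, 7, 7): spatial layout preserved for RoI pooling
```

Because every step is position-wise, the output keeps the (H, W) layout of the input map, so standard RoI pooling can still be applied on top of the high-order representation.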
Related work
Recently, many advanced detectors based on Faster R-CNN have been proposed in the community of region-based detection methods. R-FCN [29] is among the first to address the dilemma between invariance in classification and variance in detection by replacing the RoI-Pooling layer with a position-sensitive score map. FPN [14] and Mask R-CNN [27] deploy a top-down multi-scale pyramidal hierarchy to leverage features at all scales. Moreover, Mask R-CNN replaces the RoI-Pooling layer with RoIAlign.
Proposed method
In this section, we introduce the proposed Multi-scale Structural Kernel Representation (MSKR). Firstly, we present a modified multi-scale feature map to effectively utilize multi-resolution information. Then, a structural kernel representation is proposed to incorporate high-order statistics while maintaining spatial information and considering geometry structure, which is achieved by feature power normalization followed by polynomial kernel approximation. Besides, we use an attention module to re-weight the kernel representations.
Experiments
In this section, we evaluate the performance of our proposed MSKR. Specifically, we first describe the implementation details of our method. Then, we conduct ablation studies on key components of our method using Faster R-CNN [5] with ResNet-101 [7] on PASCAL VOC 2007 [25]. Additionally, we compare with other methods on three widely used benchmarks (i.e., PASCAL VOC 2007, PASCAL VOC 2012 [25] and MS COCO [26]) using four state-of-the-art detectors (i.e., Faster R-CNN [5], FPN [14], Mask R-CNN [27] and RetinaNet [28]).
Conclusion
In this paper, we propose a novel Multi-scale Structural Kernel Representation (MSKR) method to effectively exploit high-order statistics for improving the performance of object detection. The proposed MSKR can generate informative representations, which preserve spatial information while considering the geometry structures of high-order statistics, thereby being suitable for dense prediction. Our MSKR can be flexibly integrated into various object detection approaches, and the experimental results demonstrate its effectiveness.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by NSFC grants under nos. 61671182, U19A2073, 61806140, 61971086 and 2019YFB210901, and by the Project of the State Key Laboratory of Robotics and System (HIT) under grant no. SKLRS202004D.
Hao Wang received B.S. and M.S. degree from Northeastern University, Shenyang, China, in 2012 and 2014, respectively. He is currently working toward the Ph.D. degree in the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China. His research interests include object detection, object segmentation and related problems.
References (38)
- Accurate object detection using memory-based models in surveillance scenes, Pattern Recognit. (2017)
- Water flow driven salient object detection at 180 fps, Pattern Recognit. (2018)
- Locality-constrained affine subspace coding for image classification and retrieval, Pattern Recognit. (2020)
- Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
- Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
- Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
- Very deep convolutional networks for large-scale image recognition, International Conference on Learning Representations (2015)
- Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
- Fast R-CNN, Proceedings of the IEEE International Conference on Computer Vision (2015)
- Selective search for object recognition, Int. J. Comput. Vis. (2013)
- MDCN: multi-scale, deep inception convolutional neural networks for efficient object detection, 24th International Conference on Pattern Recognition
- Hyperfusion-net: hyper-densely reflective feature fusion for salient object detection, Pattern Recognit.
- Multi-scale fusion with context-aware network for object detection, 24th International Conference on Pattern Recognition
- RON: reverse connection with objectness prior networks for object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Feature pyramid networks for object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Multi-scale location-aware kernel representation for object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Bilinear convolutional neural networks for fine-grained visual recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- G2DeNet: global Gaussian distribution embedding network and its application to visual recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Is second-order information helpful for large-scale visual recognition?, Proceedings of the IEEE International Conference on Computer Vision
Qilong Wang received the Ph.D. degree from the School of Information and Communication Engineering, Dalian University of Technology, China, in 2018. He is currently a lecturer at the College of Intelligence and Computing, Tianjin University. His research interests include computer vision and pattern recognition, particularly visual classification and deep probability distribution modeling. He has published more than thirty academic papers in top conferences and refereed journals, including ICCV/CVPR/NIPS/ECCV/IJCAI and IEEE TPAMI/TIP/TCSVT.
Peihua Li received the Ph.D. degree from Harbin Institute of Technology, Harbin, China, in 2003, and then worked for one year as a postdoctoral fellow at INRIA/IRISA, France. His dissertation received an honorary nomination for the National Excellent Doctoral Dissertation award of China. He is currently a professor at Dalian University of Technology, Dalian, China. He was supported by the Program for New Century Excellent Talents in University of the Chinese Ministry of Education. His team won 1st place in the large-scale iNaturalist Challenge spanning 8000 species at FGVC5, CVPR 2018, 2nd place in the Alibaba Large-scale Image Search Challenge 2015, and 4th place in Noisy Iris Challenge Evaluation I. His research topics include deep learning, computer vision and pattern recognition, focusing on image/video recognition, object detection and semantic segmentation. He has published papers in top journals such as IEEE TPAMI/TIP/TCSVT and top conferences including ICCV/CVPR/ECCV.
Wangmeng Zuo is currently a Professor in the School of Computer Science and Technology, Harbin Institute of Technology. His research interests include image/video enhancement, image/video generation, visual tracking, and image classification. He has published over 90 papers in top-tier academic journals and conferences. He has served as an area chair of CVPR 2020 and ICCV 2019, and as a tutorial organizer at ECCV 2016. He is also an Associate Editor of IET Biometrics, The Visual Computer, and the Journal of Electronic Imaging, and a Guest Editor of Neurocomputing, Pattern Recognition, IEEE Transactions on Circuits and Systems for Video Technology, and IEEE Transactions on Neural Networks and Learning Systems.