Multi-scale structural kernel representation for object detection
Introduction
Object detection has attracted much attention in the past decades. As one of the fundamental problems in computer vision, it plays a key role in a wide range of applications [1], [2]. From approaches [3] based on traditional handcrafted features to ones [4], [5] based on deep convolutional features, the rapid development of convolutional neural networks (CNNs) [6], [7] has greatly improved the performance of object detection. R-CNN [4] is among the first methods to exploit the powerful representation ability of deep CNNs to characterize object proposals, achieving a significant improvement over traditional methods. Subsequently, the Region of Interest (RoI) pooling layer and the Region Proposal Network (RPN) were introduced by Fast R-CNN [8] and Faster R-CNN [5], respectively, allowing object detection to be designed as an end-to-end architecture. Such methods require no pre-generated proposals [9], thereby leading to better performance and faster training/testing.
Although Faster R-CNN yields promising performance, it obtains representations by simply performing average pooling on the output of a single convolution (conv) layer (i.e., the last conv layer), limiting the robustness and accuracy of detection. As illustrated in Fig. 1(a), Faster R-CNN fails to detect birds with large pose changes, blur and similar backgrounds. One solution for improving object detection is to extract multi-scale feature maps from different conv layers as representations [10]; such methods fall into two groups: concatenation of multi-scale feature maps [11], [12] and pyramidal feature hierarchy [13], [14]. Generally speaking, feature maps from bottom layers have higher resolutions but weaker semantic information, while feature maps from top layers carry high-level semantic information but have lower resolutions. The concatenation-based methods obtain a coarse-to-fine representation for each object proposal by concatenating the outputs of different conv layers into a single feature map. The methods based on pyramidal feature hierarchy employ the outputs of different conv layers in a pyramid manner; each level gives its own prediction, and all detection results are fused using non-maximum suppression. As shown in Fig. 1(b), the methods based on pyramidal feature hierarchy (e.g., RON [13]) are able to improve detection performance by enhancing the representations.
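As a rough illustration of the concatenation-based strategy described above, the sketch below fuses feature maps of different resolutions into one coarse-to-fine map. It is a minimal numpy mock-up, not code from the paper: the function names (`upsample_nn`, `concat_multiscale`), nearest-neighbour upsampling, and the assumptions of square maps with integer scale ratios are ours.

```python
import numpy as np

def upsample_nn(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def concat_multiscale(feats):
    """Concatenate conv feature maps from different layers along channels.

    feats: list of (C_i, H_i, W_i) arrays (square maps, resolutions that
    divide the finest one). All maps are upsampled to the finest resolution
    before concatenation, mirroring concatenation-based multi-scale methods.
    """
    target_h = max(f.shape[1] for f in feats)
    ups = [upsample_nn(f, target_h // f.shape[1]) for f in feats]
    return np.concatenate(ups, axis=0)

# toy example: a coarse 256-channel map and a fine 64-channel map
coarse = np.random.rand(256, 7, 7)
fine = np.random.rand(64, 14, 14)
fused = concat_multiscale([coarse, fine])
print(fused.shape)  # (320, 14, 14)
```

Pyramidal-hierarchy methods would instead keep the per-level maps separate and run a detection head on each, fusing only the final boxes with non-maximum suppression.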
All aforementioned methods focus on improving detection performance by extracting multi-scale feature maps, after which simple first-order pooling (i.e., RoI-Pooling) is performed on the feature maps to generate representations. Recently, some researchers have shown that integration of high-order statistics can significantly improve the representation ability of deep CNNs [16], [17]. Among them, B-CNN [16] inserts a second-order noncentral moment into deep CNNs, and performs element-wise power normalization followed by ℓ2-normalization. Wang et al. [17] embed a global Gaussian distribution into deep CNNs. Zhang et al. [18] propose a second-order locality-constrained affine subspace coding method for both image classification and image retrieval. These methods obtain promising improvements over first-order pooling based CNN models on challenging fine-grained visual categorization. Li et al. [19] propose a matrix power normalized second-order pooling, showing consistent superiority over various CNN models on large-scale ImageNet classification [20].
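The B-CNN-style pipeline mentioned above (second-order noncentral moment, element-wise power normalization, then ℓ2-normalization) can be sketched in a few lines. This is a schematic numpy version under our own conventions (rows of `X` are spatial positions, columns are channels), not the authors' implementation:

```python
import numpy as np

def bilinear_pool(X):
    """B-CNN-style second-order pooling of a feature matrix.

    X: (N, C) matrix of N convolutional features with C channels.
    Returns the flattened C*C second-order noncentral moment after
    signed square-root (element-wise power) and l2 normalization.
    """
    M = X.T @ X / X.shape[0]                # second-order noncentral moment
    v = M.reshape(-1)
    v = np.sign(v) * np.sqrt(np.abs(v))     # element-wise power normalization
    return v / (np.linalg.norm(v) + 1e-12)  # l2 normalization

X = np.random.rand(49, 8)  # e.g. a 7x7 map with 8 channels, flattened
z = bilinear_pool(X)
print(z.shape)  # (64,)
```

Note that pooling over all N positions is exactly what discards spatial information, which is the first obstacle to using such statistics for detection.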
The above discussion clearly encourages us to exploit high-order statistics for improving the performance of object detection. However, two challenges arise. First, the aforementioned high-order methods compute global representations over whole images, which completely lose spatial information and so are not applicable to object detection. Second, high-order statistics have special structures, and previous works [17], [19] have demonstrated that geometry structures should be considered to achieve favorable performance. To handle the first challenge, in our previous work [15] we introduced a polynomial kernel approximation method inspired by Cai et al. [21], where the weight of the high-order statistics inherent in a polynomial kernel can be approximated by rank-1 tensor decomposition [22], and high-order representations can be computed by learning the weight parameters. In deep architectures, the weight parameters can be learned using a series of 1 × 1 convolutions and element-wise product operations, all of which preserve spatial information. Therefore, the introduced polynomial kernel method can capture high-order statistics while preserving spatial information, and thus is able to improve the performance of dense prediction tasks (e.g., object detection).
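The key point above is that an order-r statistic can be approximated, per spatial position, as an element-wise product of r linearly projected maps (each projection acting as a 1 × 1 convolution). The sketch below shows this shape-preserving computation in numpy; the names (`conv1x1`, `order_r_term`) and the random untrained weights are our illustration, not the learned parameters of MLKP/MSKR:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """1x1 convolution on a (C, H, W) map: a per-position linear projection."""
    return np.einsum('dc,chw->dhw', w, x)

def order_r_term(x, weights):
    """Rank-1 style approximation of the order-r term of a polynomial kernel.

    weights: list of r projection matrices of shape (D, C). The element-wise
    product of the r projected maps captures order-r statistics at every
    spatial position, so the spatial layout (H, W) is preserved.
    """
    out = conv1x1(x, weights[0])
    for w in weights[1:]:
        out = out * conv1x1(x, w)   # element-wise product across projections
    return out                      # (D, H, W): high-order map, same H and W

x = rng.standard_normal((16, 7, 7))                     # toy feature map
ws = [rng.standard_normal((32, 16)) for _ in range(3)]  # order r = 3, D = 32
y = order_r_term(x, ws)
print(y.shape)  # (32, 7, 7)
```

In a real network the projection matrices are learned end-to-end; here they simply demonstrate that no pooling over positions is involved.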
Recent work [19] shows that matrix power normalization can effectively exploit the geometry of second-order statistics in deep CNN architectures to improve classification performance. Given a set of convolutional features X, matrix power normalization of second-order pooling (i.e., (X⊤X)^α) can be computed by shrinking the eigenvalues of M = X⊤X with a power function through eigenvalue decomposition (EIG) or singular value decomposition (SVD). However, such methods cannot be directly applied to approximated kernel representations, where no explicit high-order statistic (i.e., X⊤X) is computed. Inspired by the success of [19], we introduce a feature power normalization method, which can be regarded as transferring matrix power normalization from the high-order representations to the original convolutional features. In this way, we can effectively consider the geometry of high-order representations based on matrix power normalization, while avoiding computation of explicit high-order statistics. Accordingly, our Multi-scale Structural Kernel Representation (MSKR) considers the geometry of high-order kernel representations by performing feature power normalization before the polynomial kernel approximation. Besides, we embed an attention module [23] into our kernel representations to account for the importance of each convolutional feature. The attention module jointly encodes spatial and channel information.
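The contrast drawn above can be made concrete: matrix power normalization operates on the explicit second-order matrix via its eigenvalues, while feature power normalization acts directly on the features. The following numpy sketch shows both; the feature-level version here is a simple signed element-wise power, which stands in for (but does not exactly reproduce) the method introduced in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def matrix_power_norm(X, alpha=0.5):
    """Matrix power normalization of second-order pooling, (X^T X)^alpha,
    computed by shrinking the eigenvalues with a power function (EIG)."""
    M = X.T @ X
    lam, U = np.linalg.eigh(M)              # M is symmetric positive semi-definite
    lam = np.clip(lam, 0.0, None) ** alpha  # eigenvalue shrinkage
    return U @ np.diag(lam) @ U.T

def feature_power_norm(X, alpha=0.5):
    """Feature-level surrogate applied before kernel approximation:
    power-normalize the features themselves, so no explicit X^T X is needed."""
    return np.sign(X) * np.abs(X) ** alpha

X = rng.standard_normal((49, 8))   # 49 spatial positions, 8 channels
A = matrix_power_norm(X, alpha=0.5)
print(np.allclose(A @ A, X.T @ X))  # True: A is a matrix square root of X^T X
```

The feature-level variant costs only an element-wise operation per position, whereas the matrix variant requires an EIG/SVD of the C × C statistic, which is why avoiding the explicit statistic matters for kernel representations.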
This paper is an extension of our previous work [15]. There are two significant differences between MSKR and our previous work (MLKP) in terms of techniques and experiments. From the technique perspective, MSKR introduces a novel feature power normalization into MLKP, which appropriately exploits the geometry of the high-order statistics captured by the polynomial kernel approximation of MLKP. Besides, MSKR extends the location-weight network of MLKP (which considers only spatial information) to an attention network that jointly takes spatial and channel information into consideration. From the experiment perspective, we conduct many more experiments to verify the effectiveness of MSKR in terms of detectors, backbone models and tasks compared with our previous work. First, we adopt MSKR in more detectors besides the Faster R-CNN [5] used in Wang et al. [15]. Second, we employ the light-weight MobileNet [24] as a backbone model to assess the effect of MSKR in mobile settings. Additionally, we evaluate the generalization ability of the proposed MSKR on the instance segmentation task.
The overview of our proposed MSKR is illustrated in Fig. 2. Given a multi-scale feature map, MSKR first performs feature power normalization, and then kernel representations are computed using 1 × 1 convolution operations and element-wise products. Finally, an attention module is used to re-weight the kernel representations. As shown in Fig. 1(d), MSKR can significantly improve detection performance, especially for objects with complex variations (e.g., large pose changes, blur and similar backgrounds). The experiments are conducted on three widely used benchmarks, i.e., PASCAL VOC 2007, PASCAL VOC 2012 [25] and MS COCO [26]. The contributions of this paper are summarized as follows:
- 1.
In this paper, we make an attempt to integrate high-order statistics into deep CNNs as representations for effective object detection. To this end, we propose a novel Multi-scale Structural Kernel Representation (MSKR). The proposed MSKR can preserve the spatial information while taking geometry structure of high-order statistics into account.
- 2.
To consider the geometry of our high-order kernel representations, we introduce a feature power normalization method before computation of kernel representations, approximately performing matrix power normalization on high-order representations. It can further improve the performance of kernel representations.
- 3.
Extensive experiments are conducted on three widely used object detection benchmarks, and the results show that MSKR clearly improves the performance of many existing deep detectors (e.g., Faster R-CNN [5], FPN [14], Mask R-CNN [27] and RetinaNet [28]), while performing favorably in comparison to the state-of-the-art methods. Besides, the results on the instance segmentation task demonstrate that our MSKR has great potential to improve the performance of other dense prediction tasks.
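Putting the three stages together (feature power normalization, then an order-2 kernel term via 1 × 1 convolutions and element-wise product, then attention re-weighting), the overall flow can be sketched as below. This is a schematic numpy mock-up: the toy sigmoid gates stand in for the actual attention module [23], and the random weights stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def feature_power_norm(x, alpha=0.5):
    """Signed element-wise power normalization of the feature map."""
    return np.sign(x) * np.abs(x) ** alpha

def conv1x1(x, w):
    """1x1 convolution on a (C, H, W) map: per-position linear projection."""
    return np.einsum('dc,chw->dhw', w, x)

def attention(x):
    """Toy spatial-and-channel gating (placeholder for the module of [23])."""
    s = 1.0 / (1.0 + np.exp(-x.mean(axis=0, keepdims=True)))        # spatial gate
    c = 1.0 / (1.0 + np.exp(-x.mean(axis=(1, 2), keepdims=True)))   # channel gate
    return x * s * c

def mskr_sketch(x, w1, w2):
    x = feature_power_norm(x)             # step 1: feature power normalization
    k = conv1x1(x, w1) * conv1x1(x, w2)   # step 2: order-2 kernel term
    return attention(k)                   # step 3: attention re-weighting

x = rng.standard_normal((16, 7, 7))       # toy multi-scale feature map
w1, w2 = rng.standard_normal((32, 16)), rng.standard_normal((32, 16))
r = mskr_sketch(x, w1, w2)
print(r.shape)  # (32, 7, 7): spatial layout preserved for RoI pooling
```

Because every step is position-wise, the output keeps the (H, W) layout of the input map, so standard RoI pooling can still be applied on top of the high-order representation.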
Related work
Recently, many advanced detectors based on Faster R-CNN have been proposed in the community of region-based detection methods. R-FCN [29] is among the first to address the dilemma between invariance in classification and variance in detection by replacing the RoI-Pooling layer with a position-sensitive score map. FPN [14] and Mask R-CNN [27] deploy a top-down multi-scale pyramidal hierarchy to leverage features at all scales. Moreover, Mask R-CNN replaces the RoI-Pooling layer with RoIAlign.
Proposed method
In this section, we introduce the proposed Multi-scale Structural Kernel Representation (MSKR). Firstly, we present a modified multi-scale feature map to effectively utilize multi-resolution information. Then, a structural kernel representation is proposed to incorporate high-order statistics while maintaining spatial information and considering geometry structure, which is achieved by feature power normalization followed by polynomial kernel approximation. Besides, we use an attention module to re-weight the kernel representations.
Experiments
In this section, we evaluate the performance of our proposed MSKR. Specifically, we first describe the implementation details of our method. Then, we conduct ablation studies on key components of our method using Faster R-CNN [5] with ResNet-101 [7] on PASCAL VOC 2007 [25]. Additionally, we compare with other methods on three widely used benchmarks (i.e., PASCAL VOC 2007, PASCAL VOC 2012 [25] and MS COCO [26]) using four state-of-the-art detectors (i.e., Faster R-CNN [5], FPN [14], Mask R-CNN [27] and RetinaNet [28]).
Conclusion
In this paper, we propose a novel Multi-scale Structural Kernel Representation (MSKR) method to effectively exploit high-order statistics for improving the performance of object detection. The proposed MSKR can generate informative representations, which preserve spatial information while considering the geometry structures of high-order statistics, thereby being suitable for dense prediction. Our MSKR can be flexibly integrated into various object detection approaches, and the experimental results demonstrate its effectiveness.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by NSFC grants under nos. 61671182, U19A2073, 61806140, 61971086 and 2019YFB210901, and by the Project of the State Key Laboratory of Robotics and System (HIT) under grant no. SKLRS202004D.
Hao Wang received B.S. and M.S. degree from Northeastern University, Shenyang, China, in 2012 and 2014, respectively. He is currently working toward the Ph.D. degree in the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China. His research interests include object detection, object segmentation and related problems.
References (38)
- Accurate object detection using memory-based models in surveillance scenes, Pattern Recognit. (2017)
- Water flow driven salient object detection at 180 fps, Pattern Recognit. (2018)
- Locality-constrained affine subspace coding for image classification and retrieval, Pattern Recognit. (2020)
- Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
- Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
- Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
- Very deep convolutional networks for large-scale image recognition, International Conference on Learning Representations (2015)
- Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
- Fast R-CNN, Proceedings of the IEEE International Conference on Computer Vision (2015)
- Selective search for object recognition, Int. J. Comput. Vis. (2013)
- MDCN: multi-scale, deep inception convolutional neural networks for efficient object detection, 24th International Conference on Pattern Recognition
- Hyperfusion-net: hyper-densely reflective feature fusion for salient object detection, Pattern Recognit.
- Multi-scale fusion with context-aware network for object detection, 24th International Conference on Pattern Recognition
- RON: reverse connection with objectness prior networks for object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Feature pyramid networks for object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Multi-scale location-aware kernel representation for object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Bilinear convolutional neural networks for fine-grained visual recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- G2DeNet: global Gaussian distribution embedding network and its application to visual recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Is second-order information helpful for large-scale visual recognition?, Proceedings of the IEEE International Conference on Computer Vision
Qilong Wang received the Ph.D. degree from the School of Information and Communication Engineering, Dalian University of Technology, China, in 2018. He is currently a lecturer at the College of Intelligence and Computing, Tianjin University. His research interests include computer vision and pattern recognition, particularly visual classification and deep probability distribution modeling. He has published more than thirty academic papers in top conferences and refereed journals, including ICCV/CVPR/NIPS/ECCV/IJCAI and IEEE TPAMI/TIP/TCSVT.
Peihua Li received the Ph.D. degree from Harbin Institute of Technology, Harbin, China, in 2003, and then worked for one year as a postdoctoral fellow at INRIA/IRISA, France. His dissertation received an honorary nomination for the National Excellent Doctoral Dissertation award of China. He is currently a professor at Dalian University of Technology, Dalian, China. He was supported by the Program for New Century Excellent Talents in University of the Chinese Ministry of Education. His team won 1st place in the large-scale iNaturalist Challenge spanning 8000 species at FGVC5, CVPR 2018, 2nd place in the Alibaba Large-scale Image Search Challenge 2015, and 4th place in Noisy Iris Challenge Evaluation I. His research topics include deep learning, computer vision and pattern recognition, focusing on image/video recognition, object detection and semantic segmentation. He has published papers in top journals such as IEEE TPAMI/TIP/TCSVT and top conferences including ICCV/CVPR/ECCV.
Wangmeng Zuo is currently a Professor in the School of Computer Science and Technology, Harbin Institute of Technology. His research interests include image/video enhancement, image/video generation, visual tracking, and image classification. He has published over 90 papers in top-tier academic journals and conferences. He has served as an area chair of CVPR 2020 and ICCV 2019, and as a tutorial organizer at ECCV 2016. He is also an Associate Editor of IET Biometrics, The Visual Computer, and the Journal of Electronic Imaging, and a Guest Editor of Neurocomputing, Pattern Recognition, IEEE Transactions on Circuits and Systems for Video Technology, and IEEE Transactions on Neural Networks and Learning Systems.