Neural Networks

Volume 147, March 2022, Pages 25-41

Feature Correlation-Steered Capsule Network for object detection

https://doi.org/10.1016/j.neunet.2021.12.003

Abstract

Although Convolutional Neural Network (CNN)-based approaches have been successful in object detection, they predominantly focus on locating discriminative regions while overlooking the holistic part-whole associations within objects. As a result, they neglect the feature relationships between an object and its parts, as well as among those parts, both of which are highly informative for detecting discriminative parts. In this paper, we propose to "look inside the objects" by mining part-whole feature correlations, and we attempt to leverage the correlations endowed by the Capsule Network (CapsNet) for robust object detection. Highly correlated capsules across adjacent layers share high familiarity and are therefore more likely to be routed together. In light of this, we use the correlations between different capsules observed on preceding training samples as an awareness to constrain the candidate voting scope during subsequent routing, and propose a Feature Correlation-Steered CapsNet (FCS-CapsNet) with a Locally-Constrained Expectation-Maximization (EM) Routing Agreement (LCEMRA). Unlike conventional EM routing, LCEMRA stipulates that only those relevant low-level capsules (parts) meeting a quantified intra-object cohesiveness requirement can be clustered into high-level capsules (objects). In doing so, part-object associations can be mined through the transformation matrices between capsule layers during this "part backtracking" procedure. LCEMRA enables high-level capsules to selectively gather projections from a non-spatially-fixed set of low-level capsules. Experiments on VOC2007, VOC2012, HKU-IS, DUTS, and COCO show that FCS-CapsNet achieves promising object detection results across multiple evaluation metrics, on par with the state of the art.

Introduction

Existing Convolutional Neural Network (CNN) (Krizhevsky, Sutskever, & Hinton, 2017) based object detectors, such as one-stage frameworks (e.g., YOLO (Redmon, Divvala, Girshick, & Farhadi, 2016), SSD (Liu et al., 2016), and FCOS (Tian, Shen, Chen, & He, 2019)) with relatively higher efficiency and two-stage frameworks (e.g., SPP-Net (He, Zhang, Ren, & Sun, 2015), Fast R-CNN (Girshick, 2015), and FPN (Lin, Dollár, Girshick, He, Hariharan, & Belongie, 2017)) with relatively higher accuracy, have made substantial progress (Carion et al., 2020, Fan et al., 2020, Liu et al., 2020a, Luo et al., 2017) and compensated for the defects of hand-crafted visual features (e.g., HOG, SIFT, ORB, SURF). However, despite automatically mining rich deep features within local receptive fields, the pooling operation discards much informative description, such as the relative spatial constraints and potential poses of objects, which are meaningful for localizing regions of interest and recognizing targets (Xiang, Zhang, Tang, Zou, & Xu, 2018). CNN-based methods thereby end up learning only limited spatial associations among abstracted entities (Liu, Zhang, Zhang, & Han, 2019).

In fact, there are always intrinsic correlations between an object and its parts, which have long been ignored by current CNN-based object detectors. As a result, most CNN-based detectors are easily deceived in many specific object detection tasks, e.g., distinguishing the left leg of a pedestrian from the right, or locating the limbs of a cat among bushes. This is mainly due to a lack of cognitive ability regarding the inner structural composition of objects. It is therefore crucial to learn the structure of objects beyond simple visual patterns. Given diverse visual appearances and poses, it is not always reliable to recognize objects purely from discriminative foreground areas, even with adequate human-labeled training samples. These limitations of CNNs motivated the birth of the capsule network (CapsNet) and its novel mode of perception (Chao et al., 2019, Liu et al., 2021, Yin et al., 2019, Zhang et al., 2019).

Aware of the aforementioned drawbacks of CNNs, Sabour, Frosst, and Hinton (2017) presented CapsNet, which pays particular attention to spatial associations. CapsNet achieves positional equivariance, as opposed to positional invariance, between entities through the routing agreement between capsule encoding units. This attention-like routing logic allows capsule encoders in a given layer to quantify the contributions from capsules in the preceding layer, encouraging CapsNet to shape a richer representation of the abstracted entities in an image along with their spatial relevance. Benefiting from this mechanism, CapsNet can find optimal routes among capsules and the credit attribution between nodes of lower and higher layers, i.e., cluster the extracted entities for each target class (Sabour et al., 2017). Intuitively, this characteristic makes CapsNet well suited to robust object detection via the extraction of part-object feature relationships.
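To make the routing-by-agreement idea concrete, the NumPy sketch below shows how each low-level capsule casts a vote for every high-level capsule via a per-pair transformation matrix, and how coupling coefficients are iteratively refined by vote agreement. This is a simplified dynamic-routing-style illustration with hypothetical shapes, not the paper's EM routing or LCEMRA:

```python
import numpy as np

def votes_and_routing(poses, W, n_iters=3):
    """Simplified routing-by-agreement sketch (illustrative only).

    poses : (n_low, d) pose vectors of low-level capsules
    W     : (n_low, n_high, d, d) transformation matrices between layers
    Returns coupling coefficients c (n_low, n_high) and parent poses v (n_high, d).
    """
    n_low, n_high = W.shape[0], W.shape[1]
    # Every low-level capsule votes for every high-level capsule:
    # vote_ij = pose_i @ W_ij  (the fully-connected scheme LCEMRA constrains).
    votes = np.einsum('id,ijde->ije', poses, W)
    logits = np.zeros((n_low, n_high))          # routing logits b_ij
    for _ in range(n_iters):
        # Softmax over parents gives coupling coefficients per child.
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        # Weighted sum of votes forms each parent capsule's pose.
        s = np.einsum('ij,ije->je', c, votes)
        v = s / (1.0 + np.linalg.norm(s, axis=1, keepdims=True))  # squash-like
        # Agreement between a vote and the parent pose raises the logit.
        logits += np.einsum('ije,je->ij', votes, v)
    return c, v
```

In the baseline CapsNet this voting is fully connected (every `i` votes for every `j`); the locally-constrained routing proposed in this paper restricts which `(i, j)` pairs are allowed to vote at all.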

Despite its promising benefits over CNNs, CapsNet is confined by its blind fully-connected voting scheme. As shown in Fig. 1(a), the baseline CapsNet (Sabour et al., 2017) allows each low-level capsule to vote for all high-level ones regardless of feature relevance (e.g., similarity of color and shape). Such voting easily results in noisy assignments and hence performance degradation (Lalonde et al., 2021, Liu et al., 2020a, Liu et al., 2019). A naive adaptation of CapsNet to object detection may therefore fail. Prior works have attempted to address this issue with convolutional capsules (Lalonde et al., 2021, Verma and Zhang, 2018). Although some progress has been achieved, the representability of CapsNet remains fundamentally limited by the imposed local constraints (Zhang, Yang, Han, Wang, & Gao, 2013). The convolution operation is, by design, bound to a fixed set of geometric patterns (Dai et al., 2017), and this geometric restriction significantly inhibits the ability to learn part-whole relationships, since it requires the parts of a foreground object to fall within a fixed, domain-specific grid. It is thus unrealistic to expect convolutional capsules to precisely model the pose and deformation of an object when the descriptions of that object's parts are locked into a fixed spatial topology.

To mitigate the above issues and effectively learn object-part feature correlations for object detection, we propose to steer the trend of "part backtracking" between low-level and high-level capsules by feature correlation, yielding a Feature Correlation-Steered Capsule Network (FCS-CapsNet) for object detection. This is inspired by the fact that regions belonging to one common object exhibit high correlations, owing to the commonality among images of the same category. Specifically, after the visual features of input images are preliminarily extracted by the proposed Cascade Dilation Network (CDN), low-level capsules refine the parts of objects based on inter-part diversity (cf. Fig. 1(b)). Such diversity is reflected in the distinctions between parts of one common object, which may vary in texture, material, shape, size, or color, such as the difference between the "head" and "trunk" of a dog. Next comes the routing agreement between adjacent capsule layers. The following three feature correlation-aware procedures are customized and embedded into the traditional Expectation-Maximization (EM) routing procedure to implement a Locally-Constrained Expectation-Maximization Routing Agreement (LCEMRA), where the relationships/similarities between different types of capsules across adjacent layers are used to correct the capsule assignments.

  • (1)

    The first procedure, the Intra-Object Cohesiveness Quantification (IOCQ) phase, determines whether it is necessary to calculate the votes (products of capsule pose matrices and the transformation matrices between adjacent capsule layers (Hinton, Sabour, & Frosst, 2018)) between two low-level capsules and one high-level capsule, according to the interrelation of the Frobenius norms and traces of their pose and transformation matrices. This step makes a preliminary judgment on whether there is a minimum degree of correlation between two capsules, from the perspective of similarity between their pose matrices, which encode the inherent attributes of the perceived entities.

  • (2)

    Once two capsules meet the IOCQ requirement, the votes between them are verified in the second stage, Part Backtracking (PB). As its name suggests, this step decides whether to merge two similar low-level capsules into one high-level capsule, mainly according to whether the votes calculated in the IOCQ stage satisfy the three feature similarity inequalities of the PB stage.

  • (3)

    Finally, once a new high-level capsule is generated by merging two low-level capsules, the feature similarity between this new capsule and its adjacent capsules is reevaluated by a criterion in the third phase, Feature Correlation Quantification and Appending (FCQA). At the end of each training iteration, the measured feature correlation information is used to update the assignment probabilities of LCEMRA via a residual formulation.
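The three phases above can be sketched in code. The snippet below is a hedged illustration only: this excerpt does not state the paper's exact IOCQ criterion, PB inequalities, or FCQA residual formula, so the cosine-style pose similarity, the threshold `tau`, and the mixing weight `alpha` are hypothetical placeholders, and the PB inequalities are omitted entirely:

```python
import numpy as np

def iocq_gate(P_i, P_j, tau=0.9):
    """IOCQ sketch: gate a capsule pair by pose-matrix similarity.

    Uses the trace and Frobenius norms mentioned in the text, combined here
    as a cosine-style similarity in [-1, 1]; `tau` is a hypothetical threshold,
    not the paper's criterion. Only pairs passing the gate get their votes
    computed (and then checked by the PB inequalities, not reproduced here).
    """
    sim = np.trace(P_i @ P_j.T) / (
        np.linalg.norm(P_i) * np.linalg.norm(P_j) + 1e-8)
    return sim >= tau

def fcqa_update(r_ij, corr, alpha=0.1):
    """FCQA sketch: residual update of an assignment probability r_ij with
    the measured feature correlation `corr`; `alpha` is a hypothetical
    mixing weight standing in for the paper's residual formulation."""
    return (1.0 - alpha) * r_ij + alpha * corr
```

For example, two capsules with identical pose matrices pass the gate, while a strongly rotated pose (near-zero trace overlap) is rejected, which is the locally-constrained behavior the routing relies on.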

In this way, only those relevant low-level capsules (denoting parts) that meet the intra-object cohesiveness requirement are permitted to cluster into high-level capsules (denoting objects) (cf. Fig. 1(b)), casting a locally-constrained voting mechanism in LCEMRA. Since such cohesiveness is reflected in the distinctions between categories, the clustered high-level capsules can comprehensively collect and refine the discriminative descriptions of objects belonging to the same class.

More importantly, the feature relationships between parts and objects can also be uncovered through the transformation matrices between adjacent capsule layers during the voting process, which is what most CNN-based methods lack. In doing so, the structural composition of object parts can be sensed; the parts can be naturally linked to the object they belong to, and the target regions become predictable and distinguishable from the background. From a psychological point of view, the detection pipeline of FCS-CapsNet can therefore be naturally separated into two stages:

  • (1)

    Roughly positioning the object extent (the whole extent of an object rather than its specific components) in imagery.

  • (2)

    Parsing the structure among parts within the object.

Evidently, detecting the discriminative regions in a well-parsed part-object topology is much easier than in a raw feature map.

In summary, LCEMRA allows high-level capsules to selectively gather projections from a non-spatially-fixed set of low-level capsules, so FCS-CapsNet can effectively and efficiently predict the part-object correlations and potential poses of instances for accurate object detection. This self-supervised voting scheme helps FCS-CapsNet understand the holistic internal structure of objects through end-to-end fine-tuning. Furthermore, the feature correlation-steered voting also addresses the redundancy of the traditional blind fully-connected voting scheme (cf. Fig. 1(a)) of the classic CapsNet.

The contributions of the paper are four-fold:

  • (1)

    A novel Feature Correlation-Steered Capsule Network (FCS-CapsNet) is proposed to implement robust object detection. It can capture a diverse set of part-whole feature associations to boost the overall performance.

  • (2)

    A novel feature extractor denoted as Cascade Dilation Network (CDN) is proposed at the beginning of FCS-CapsNet to improve the quality of learned visual features.

  • (3)

    Three feature correlation-aware routing procedures, i.e., the Intra-Object Cohesiveness Quantification (IOCQ) phase, the Part Backtracking (PB) stage, and the Feature Correlation Quantification and Appending (FCQA) phase, are proposed to cast a novel Locally-Constrained Expectation-Maximization Routing Agreement (LCEMRA), which steers the voting tendency of low-level capsules according to the measured intra-object cohesiveness.

  • (4)

    Quantitative experiments with state-of-the-art methods have been conducted on five widely-used benchmark datasets. Experimental results show our superiority on multiple evaluation criteria.

The remainder of the manuscript is organized as below. Section 2 reviews prior works about object detection and CapsNet. Section 3 illustrates the methodology of proposed FCS-CapsNet, with the emphasis on the derivation process of LCEMRA. In Section 4, we compare our model with other rivals on several benchmarks and analyze the experimental results, followed by the extended discussion in Section 5. Section 6 concludes this paper.

Section snippets

Related work

In this subsection, we introduce previous works in two aspects: object detection and CapsNet.

Feature Correlation-Steered Capsule Network (FCS-CapsNet)

To remedy the redundancy caused by the blind fully-connected voting scheme of conventional CapsNet, we design three feature correlation-aware routing procedures, i.e., IOCQ, PB, and FCQA, to cast the LCEMRA in FCS-CapsNet (cf. Fig. 2). It guides the voting tendency of low-level capsules while effectively sensing part-whole feature associations for precise object detection. We show that even without object proposals, FCS-CapsNet can still deliver state-of-the-art performance

Experimental evaluation

We conduct quantitative experiments to answer the following questions. (1) Can FCS-CapsNet outperform other state-of-the-art methods on multiple benchmarks in terms of object detection? (2) Are the modified CDN, LCEMRA, and decoder layer advantageous compared with classic methods for object detection?

Discussion

The main argument of this manuscript is that the presented FCS-CapsNet with LCEMRA is an effective paradigm for object detection. As feature relationships are essential for object detection, the spirit of this paper, i.e., refining part-object feature correlations, can help boost the detectability and representability of various computer vision applications, especially object-centric ones. Taking the spatial part-whole topology into consideration, not only the redundancy of conventional

Conclusion

To implement accurate object detection by refining part-whole feature associations, we have formulated our pipeline as a Feature Correlation-Steered CapsNet (FCS-CapsNet), which is a carrier of the customized Locally-Constrained Expectation-Maximization Routing Agreement (LCEMRA). Its three feature relevance-aware procedures (i.e., the Intra-Object Cohesiveness Quantification (IOCQ) phase, the Part Backtracking (PB) stage, and the Feature Correlation Quantification and Appending (FCQA) phase) can also alleviate the

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the Project of Scientific Operating Expenses, Ministry of Education of China, under Grant 2017PT19; in part by the National Natural Science Foundation of China under Grant No. 12075315; in part by the National Natural Science Foundation of China under Grant No. 11675261; in part by the National Natural Science Foundation for the Youth of China under Grant ZR2018QF002; and in part by the Department of Science and Technology of Shandong Province, China,

References (65)

  • Cong, R., et al. Co-saliency detection for RGBD images based on multi-constraint feature matching and cross label propagation. IEEE Transactions on Image Processing (2018)
  • Dai, J., et al. R-FCN: Object detection via region-based fully convolutional networks
  • Dai, J., et al. Deformable convolutional networks
  • Deliège, A., et al. HitNet: A neural network with capsules embedded in a Hit-or-Miss layer, extended with hybrid data augmentation and ghost capsules (2018)
  • Duarte, K., et al. VideoCapsuleNet: A simplified network for action detection (2018)
  • Everingham, M., et al. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision (2010)
  • Fan, D.-P., et al. Camouflaged object detection
  • Girshick, R. Fast R-CNN
  • Girshick, R., et al. Rich feature hierarchies for accurate object detection and semantic segmentation
  • Grauman, K., et al. Visual object recognition. Synthesis Lectures on Artificial Intelligence and Machine Learning (2011)
  • He, K., et al. Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020)
  • He, K., et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2015)
  • Hinton, G. E., et al. Transforming auto-encoders
  • Hinton, G. E., Sabour, S., & Frosst, N. (2018). Matrix capsules with EM routing. In International conference on...
  • Hu, H., et al. Relation networks for object detection
  • Huang, C.-T., et al. eCNN: A block-based and highly-parallel CNN accelerator for edge inference
  • Jaiswal, A., et al. CapsuleGAN: Generative adversarial capsule network (2018)
  • Ke, W., et al. Multiple anchor learning for visual object detection
  • Kong, T., et al. FoveaBox: Beyound anchor-based object detection. IEEE Transactions on Image Processing (2020)
  • Krizhevsky, A., et al. ImageNet classification with deep convolutional neural networks. Communications of the ACM (2017)
  • LaLonde, R., et al. Deformable capsules for object detection (2021)
  • Lan, S., et al. SaccadeNet: A fast and accurate object detector