Feature Correlation-Steered Capsule Network for object detection
Introduction
Existing Convolutional Neural Network (CNN) (Krizhevsky, Sutskever, & Hinton, 2017) based object detectors, such as one-stage frameworks (e.g., YOLO (Redmon, Divvala, Girshick, & Farhadi, 2016), SSD (Liu et al., 2016), and FCOS (Tian, Shen, Chen, & He, 2019)) with relatively higher efficiency and two-stage frameworks (e.g., SPP-Net (He, Zhang, Ren, & Sun, 2015), Fast R-CNN (Girshick, 2015), and FPN (Lin, Dollár, Girshick, He, Hariharan, & Belongie, 2017)) with relatively higher accuracy, have made substantial progress (Carion et al., 2020, Fan et al., 2020, Liu et al., 2020a, Luo et al., 2017) and overcome the shortcomings of hand-crafted visual features (e.g., HOG, SIFT, ORB, SURF). Despite automatically mining rich deep features within local receptive fields, the pooling operation discards much informative description, such as the relative spatial constraints and the potential poses of objects, which are meaningful for region-of-interest localization and target recognition (Xiang, Zhang, Tang, Zou, & Xu, 2018). Thereby, CNN-based methods end up learning only limited spatial associations among abstracted entities (Liu, Zhang, Zhang, & Han, 2019).
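To make the pooling argument concrete, the following minimal NumPy sketch (our illustration, not part of the original pipeline) builds two feature maps whose part responses sit at different positions within each pooling window, yet whose max-pooled outputs are identical, i.e., the spatial layout of the parts is discarded:

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling with stride 2 over a single-channel map."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two 4x4 activation maps: the same "part" responses appear at
# different positions inside each 2x2 pooling window.
a = np.array([[9, 0, 0, 0],
              [0, 0, 0, 7],
              [0, 5, 0, 0],
              [0, 0, 8, 0]], dtype=float)
b = np.array([[0, 0, 7, 0],
              [0, 9, 0, 0],
              [0, 0, 0, 8],
              [5, 0, 0, 0]], dtype=float)

# Different spatial arrangements, identical pooled descriptions:
print(max_pool2x2(a))  # [[9. 7.] [5. 8.]]
print(max_pool2x2(b))  # [[9. 7.] [5. 8.]]
```

Both maps pool to the same 2x2 output, so any detector head reading the pooled features cannot tell the two part arrangements apart.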
Actually, there always exist intrinsic correlations between an object and its parts, which have long been ignored by current CNN-based object detectors. Because of this, most CNN-based detectors can be easily deceived in many specific detection tasks, e.g., distinguishing the left leg of a pedestrian from the right one, or locating the limbs of a cat among the bushes. This is mainly due to the lack of cognitive ability regarding the inner structural composition of objects. Therefore, it is crucial to learn the structure of objects beyond simple visual patterns. Suffering from diverse visual appearances and poses, it is not always reliable to recognize objects purely from discriminative foreground areas, even with adequate human-labeled training samples. These legacy limitations of CNNs motivated the birth of the capsule network (CapsNet) as an innovation in perception mode (Chao et al., 2019, Liu et al., 2021, Yin et al., 2019, Zhang et al., 2019).
Aware of the aforementioned drawbacks of CNNs, Sabour, Frosst, and Hinton (2017) presented CapsNet, which pays special attention to spatial associations. CapsNet achieves positional equivariance, as opposed to positional invariance, between entities through the routing agreement between capsule encoding units. The attention-like routing logic allows capsule encoders in a given layer to quantify the contributions from adjacent capsules in the previous layer. This encourages CapsNet to shape a richer representation of the abstracted entities in images along with their spatial relevance. Benefiting from this mechanism, CapsNet can find optimal routes among capsules and the credit attribution between nodes of lower and higher layers, namely, clustering the extracted entities for each target class (Sabour et al., 2017). Intuitively, this characteristic makes CapsNet well suited to robust object detection by extracting part-object feature relationships.
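The voting step underlying routing agreement can be sketched as follows. In matrix-capsule routing (Hinton et al., 2018), a low-level capsule's vote for a high-level capsule is the product of its pose matrix and a learned transformation matrix; the capsule counts, random poses, and uniform initial assignments below are illustrative assumptions, not the trained quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
n_low, n_high, d = 8, 3, 4          # illustrative sizes (assumed)

# Each low-level capsule i carries a d x d pose matrix M_i; a learned
# transformation matrix W_ij maps it into the frame of high-level capsule j.
M = rng.standard_normal((n_low, d, d))
W = rng.standard_normal((n_low, n_high, d, d))

# Vote of capsule i for capsule j: V_ij = M_i @ W_ij
V = np.einsum('iab,ijbc->ijac', M, W)

# The assignment r_ij (uniform here for illustration) weights the votes;
# EM routing iteratively refines r so that each high-level pose becomes
# the weighted mean of the votes that agree with it.
r = np.full((n_low, n_high), 1.0 / n_high)
pose_high = np.einsum('ij,ijab->jab', r, V) / r.sum(axis=0)[:, None, None]
print(pose_high.shape)  # (3, 4, 4)
```

With uniform assignments the high-level pose reduces to the plain mean of the votes; routing iterations then sharpen `r` toward the capsules whose votes cluster together.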
Despite the promising benefits over CNNs, CapsNet is confined by its blind fully-connected voting scheme. As shown in Fig. 1(a), the baseline CapsNet (Sabour et al., 2017) allows each low-level capsule to vote for all high-level ones regardless of feature relevance (e.g., similarity of color and shape). Such a voting scheme easily results in noisy assignments, giving rise to performance degradation (Lalonde et al., 2021, Liu et al., 2020a, Liu et al., 2019). Therefore, a naive adaptation of CapsNet to object detection may not work because of this shackle. Prior works have attempted to address this issue by proposing convolutional capsules (Lalonde et al., 2021, Verma and Zhang, 2018). Although some progress has been achieved, the representability of CapsNet remains fundamentally limited by the imposed local constraints (Zhang, Yang, Han, Wang, & Gao, 2013). This is because the convolution operation is, by design, bound to a set of geometric patterns (Dai et al., 2017), and such geometric restriction significantly inhibits the ability to learn part-whole relationships, as it relies on parts of the foreground falling within a fixed domain-specific grid. It is therefore unrealistic to expect convolutional capsules to precisely model the pose and deformation of an object when the descriptions of that object's parts are locked into a fixed spatial topology.
To mitigate the above issues and effectively learn object-part feature correlations for object detection, we propose to steer the trend of “part backtracking” between low-level and high-level capsules by feature correlation, with which a Feature Correlation-Steered Capsule Network (FCS-CapsNet) is presented for object detection. It is inspired by the fact that regions belonging to one common object have high correlations, owing to the commonality among images from the same category. Specifically, after the visual features of input images are preliminarily extracted by a proposed Cascade Dilation Network (CDN), low-level capsules refine the parts of objects based on inter-part diversity (cf. Fig. 1(b)). Such diversity is reflected in the distinctions between parts sharing one common object (cf. Fig. 1(b)), which may vary in texture, material, shape, size, or color, such as the difference between the “head” and “trunk” of a dog. Next comes the routing agreement between adjacent capsule layers. The following three feature correlation-aware procedures are customized and embedded into the traditional Expectation–Maximization (EM) routing procedure to implement a Locally-Constrained Expectation–Maximization Routing Agreement (LCEMRA), where the relationships/similarities between different types of capsules across adjacent layers are utilized to correct the capsule assignments.
- (1)
The first procedure is the Intra-Object Cohesiveness Quantification (IOCQ) phase, which determines whether it is necessary to calculate the votes (the product of a capsule's pose matrix and the transformation matrix between adjacent capsule layers (Hinton, Sabour, & Frosst, 2018)) between two low-level capsules and one high-level capsule, according to the interrelation of the Frobenius norms and traces of their pose and transformation matrices. This step preliminarily judges whether there is a minimum degree of correlation between two capsules from the perspective of the similarity between their pose matrices, which contain the inherent attributes of the perceived entities.
- (2)
Once two capsules meet the requirement of IOCQ, the votes between them are verified in the second stage, Part Backtracking (PB). As its name suggests, this stage decides whether to merge two similar low-level capsules into one high-level capsule, mainly according to whether the votes calculated in the IOCQ stage satisfy the three feature similarity inequalities of the PB stage.
- (3)
Finally, once a new high-level capsule is generated by merging two low-level capsules, the feature similarity between this new capsule and its adjacent capsules is reevaluated by a criterion in the third phase, Feature Correlation Quantification and Appending (FCQA). At the end of each training iteration, this measured feature correlation information is utilized to update the assignment probabilities of LCEMRA via a residual formulation.
In this way, only those relevant low-level capsules (denoting parts) meeting the requirement of intra-object cohesiveness are permitted to cluster into high-level capsules (denoting objects) (cf. Fig. 1(b)), thus casting a locally constrained voting mechanism in LCEMRA. Since such cohesiveness is reflected in the distinctions corresponding to different categories, the clustered high-level capsules can comprehensively collect and refine the discriminative descriptions of objects belonging to the same class.
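The three procedures above might be sketched as the following gating pipeline. The paper's exact norm/trace inequalities, the three PB similarity inequalities, and the residual coefficient are not reproduced here, so the criteria and thresholds (`tau`, `eps`, `alpha`) below are placeholders for illustration only:

```python
import numpy as np

def iocq_gate(M_a, M_b, tau=0.5):
    """IOCQ (sketch): admit a pair of low-level capsules only if their pose
    matrices are sufficiently similar in Frobenius norm and trace. The exact
    criterion and threshold tau are illustrative assumptions."""
    norm_gap = abs(np.linalg.norm(M_a) - np.linalg.norm(M_b))
    trace_gap = abs(np.trace(M_a) - np.trace(M_b))
    return norm_gap < tau and trace_gap < tau

def pb_merge(v_a, v_b, eps=0.5):
    """PB (sketch): merge two low-level capsules into one high-level capsule
    when their votes nearly agree -- a stand-in for the three feature
    similarity inequalities of the PB stage."""
    return np.linalg.norm(v_a - v_b) < eps

def fcqa_update(r, corr, alpha=0.1):
    """FCQA (sketch): residual update of the assignment probabilities r with
    the measured feature correlation, applied once per training iteration."""
    return r + alpha * corr
```

A pair of capsules would first pass `iocq_gate` before its votes are computed, `pb_merge` would then decide the merge, and `fcqa_update` would append the measured correlation back into the routing assignments.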
More importantly, the feature relationships between parts and objects can also be uncovered through the transformation weighting matrices between adjacent capsule layers during the voting process, which is what most CNN-based methods lack. In doing so, the structural composition of object parts can be sensed, each object part can be naturally linked to the object it belongs to, and the target regions become predictable and distinguishable from the background. Therefore, from a psychological point of view, the detection pipeline of FCS-CapsNet can be naturally separated into two stages:
- (1)
Roughly positioning the object extent (the whole extent of an object rather than its specific components) in imagery.
- (2)
Parsing the structure among parts within the object.
Evidently, detecting the discriminative regions in a well-parsed part-object topology is much easier than in a raw feature map.
In summary, LCEMRA allows high-level capsules to selectively gather projections from a non-spatially-fixed set of low-level capsules, so FCS-CapsNet can effectively and efficiently predict the part-object correlations and potential poses of instances for accurate object detection. Such a self-supervised voting scheme benefits FCS-CapsNet in understanding the holistic internal structure of objects via end-to-end fine-tuning. Furthermore, this feature correlation-steered voting scheme also addresses the redundancy of the traditional blind fully-connected voting scheme (cf. Fig. 1(a)) of the classic CapsNet.
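As a minimal sketch of how locally constrained voting differs from the fully-connected baseline, one can mask the assignment distribution so that each low-level capsule only routes to correlated high-level capsules; the random mask here is a hypothetical stand-in for the correlation criteria of LCEMRA:

```python
import numpy as np

def masked_routing_weights(logits, mask):
    """Locally-constrained assignment (sketch): low-level capsule i may only
    vote for high-level capsule j where mask[i, j] is True; disallowed routes
    receive exactly zero probability, unlike the fully-connected softmax of
    the baseline CapsNet, which spreads mass over every route."""
    logits = np.where(mask, logits, -np.inf)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
logits = rng.standard_normal((6, 3))
mask = rng.random((6, 3)) < 0.6          # correlation-derived gate (assumed)
mask[:, 0] = True                        # ensure every capsule keeps a route
r = masked_routing_weights(logits, mask)
print(np.allclose(r.sum(axis=1), 1.0))   # rows remain probability distributions
print((r[~mask] == 0).all())             # pruned routes carry no mass
```

Pruned routes contribute nothing to the high-level pose estimate, which is the redundancy reduction the locally constrained scheme targets.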
The contributions of the paper are four-fold:
- (1)
A novel Feature Correlation-Steered Capsule Network (FCS-CapsNet) is proposed to implement robust object detection. It can capture a diverse set of part-whole feature associations to boost the overall performance.
- (2)
A novel feature extractor denoted as Cascade Dilation Network (CDN) is proposed at the beginning of FCS-CapsNet to improve the quality of learned visual features.
- (3)
Three feature correlation-aware routing procedures, i.e., the Intra-Object Cohesiveness Quantification (IOCQ) phase, the Part Backtracking (PB) stage, and the Feature Correlation Quantification and Appending (FCQA) phase, are proposed to cast a novel Locally-Constrained Expectation–Maximization Routing Agreement (LCEMRA). It can steer the voting trend of low-level capsules according to the measured intra-object cohesiveness.
- (4)
Quantitative comparisons with state-of-the-art methods have been conducted on five widely-used benchmark datasets. The experimental results show our superiority on multiple evaluation criteria.
The remainder of the manuscript is organized as follows. Section 2 reviews prior works on object detection and CapsNet. Section 3 illustrates the methodology of the proposed FCS-CapsNet, with emphasis on the derivation of LCEMRA. In Section 4, we compare our model with other rivals on several benchmarks and analyze the experimental results, followed by an extended discussion in Section 5. Section 6 concludes this paper.
Related work
In this section, we review previous works in two aspects: object detection and CapsNet.
Feature Correlation-Steered Capsule Network (FCS-CapsNet)
To remedy the redundancy caused by the blind fully-connected voting scheme of the conventional CapsNet, we design three feature correlation-aware routing procedures, i.e., IOCQ, PB, and FCQA, to cast the LCEMRA in FCS-CapsNet (cf. Fig. 2). It can reliably guide the voting tendency of low-level capsules whilst effectively sensing part-whole feature associations for precise object detection. We show that even with no object proposals, FCS-CapsNet can still deliver state-of-the-art performance
Experimental evaluation
We conduct quantitative experiments to answer the following questions. (1) Can FCS-CapsNet outperform other state-of-the-art methods on multiple benchmarks in terms of object detection? (2) Are the modified CDN, LCEMRA, and decoder layer advantageous compared with classic methods for object detection?
Discussion
The main argument of this manuscript is that the presented FCS-CapsNet with LCEMRA is an effective paradigm for object detection. As feature relationships are essential for object detection, the spirit of this paper, i.e., refining part-object feature correlations, can help boost the detectability and representability of various computer vision applications, especially object-centric ones. Taking the spatial part-whole topology into consideration, not only the redundancy of conventional
Conclusion
To implement accurate object detection by refining part-whole feature associations, we have formulated our pipeline as a Feature Correlation-Steered CapsNet (FCS-CapsNet), a carrier of the customized Locally-Constrained Expectation–Maximization Routing Agreement (LCEMRA). Its three feature relevance-aware procedures (i.e., the Intra-Object Cohesiveness Quantification (IOCQ) phase, the Part Backtracking (PB) stage, and the Feature Correlation Quantification and Appending (FCQA) phase) can also alleviate the
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the Project of Scientific Operating Expenses, Ministry of Education of China, under Grant 2017PT19; in part by the National Natural Science Foundation of China under Grant 12075315; in part by the National Natural Science Foundation of China under Grant 11675261; in part by the National Natural Science Foundation for the Youth of China under Grant ZR2018QF002; and in part by the Department of Science and Technology of Shandong Province, China,
References (65)
- et al. Fine-grained visual categorization of butterfly specimens at sub-species level via a convolutional neural network with skip-connections. Neurocomputing (2020)
- et al. A novel quadruple generative adversarial network for semi-supervised categorization of low-resolution images. Neurocomputing (2020)
- et al. Part-object relational visual saliency. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021)
- et al. Frequency-tuned salient region detection
- et al. Salient object detection: A benchmark. IEEE Transactions on Image Processing (2015)
- et al. D2Det: Towards high quality object detection and instance segmentation
- et al. End-to-end object detection with transformers
- et al. Emotion recognition from multiband EEG signals using CapsNet. Sensors (2019)
- et al. Recursive context routing for object detection. International Journal of Computer Vision (2021)
- et al. Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (2015)
- Co-saliency detection for RGBD images based on multi-constraint feature matching and cross label propagation. IEEE Transactions on Image Processing
- R-FCN: Object detection via region-based fully convolutional networks
- Deformable convolutional networks
- HitNet: A neural network with capsules embedded in a Hit-or-Miss layer, extended with hybrid data augmentation and ghost capsules
- VideoCapsuleNet: A simplified network for action detection
- The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision
- Camouflaged object detection
- Fast R-CNN
- Rich feature hierarchies for accurate object detection and semantic segmentation
- Visual object recognition. Synthesis Lectures on Artificial Intelligence and Machine Learning
- Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Transforming auto-encoders
- Relation networks for object detection
- ECNN: A block-based and highly-parallel CNN accelerator for edge inference
- CapsuleGAN: Generative adversarial capsule network
- Multiple anchor learning for visual object detection
- FoveaBox: Beyound anchor-based object detection. IEEE Transactions on Image Processing
- ImageNet classification with deep convolutional neural networks. Communications of the ACM
- Deformable capsules for object detection
- SaccadeNet: A fast and accurate object detector