
1 Introduction

Neural networks play increasingly critical roles in computer vision tasks such as image classification, object detection, object localization, VQA, and object tracking. These methods perform much better than traditional methods.

The CNN is one of the main kinds of neural networks. It uses convolution layers to extract image features and to address the translation invariance problem. Therefore, CNNs achieve high accuracy in most computer vision tasks, and the most efficient network architectures in recent work are mainly based on CNNs, such as ResNet [4], DenseNet [6], DeepLab [1], and SSD [12].

However, CNNs have inherent flaws. First, they are insensitive to changes in object location within images, which is an advantage for image classification but is unsuitable for other tasks such as semantic segmentation and object localization. Second, it is difficult for CNNs to deal with complex conditions. For example, in object detection tasks, if the object changes its viewing angle in the image, e.g., seen from the top or the bottom, network performance decreases. These two shortcomings reduce output accuracy and restrict the application of CNNs.

Some studies have been conducted to solve these problems. To locate objects in images, R-CNN [3] takes advantage of a proposal method: it adopts Selective Search [17] to propose candidate boxes. Hinton's CapsNet [15] is another attempt to address CNN's shortcomings. CapsNet uses a set of neurons rather than a single neuron as the basic unit of a layer. This basic unit, named a capsule, contains a vector whose length represents the probability that the feature it encodes is present. CapsNet produces richer output information and can solve the problem of translation invariance, because the direction of the vector in a capsule can encode rich characteristic information, instead of carrying only location information as in traditional networks.

However, the above works are imperfect. The proposal method in R-CNN does not solve the problem but bypasses it. The problem with CapsNet is that the network is slower and more cumbersome than other networks. Using vectors instead of scalars as the basic unit means that, as the vector length grows, the amount of computation and memory used increases dramatically, because the capsule length cannot be too short and must grow with network depth to ensure the capacity of the capsule layers. For example, on the VOC or COCO datasets, CapsNet would require dozens or even hundreds of GPUs, which is intolerable. If we downsample the input image instead, network performance drops severely.

We propose a new network based on CapsNet, adopting two parameter reduction methods and introducing convolution layers appropriately, with the goal of applying the capsule network to more complex object classification datasets. Our contributions are as follows. First, we test a variety of capsule network structures, comparing the depth and performance of feature extraction, and design a suitable network named CapsNetPr, i.e., a capsule network designed for parameter reduction. Second, we analyze the main source of the parameter count and the primary constraints on the parameters of the capsule network, and adopt a transformation matrix decomposition method to save space and time at a limited cost in precision. Third, based on the place-coded property of the lower capsule layers, which account for most of the parameters, we adopt a method of sharing the transformation matrix across the same position of different channels, reducing the storage space the network needs. We test the above methods on the MNIST, CIFAR10, and Princeton CAD datasets and achieve fast and accurate results.

2 Related Work

Much follow-up work has been built on CapsNet. Some research focuses on CapsNet on complex datasets [20], confirming that CapsNet does not perform well on real-scene images; due to the parameter limitation, that work only tested the performance of CapsNet on the CIFAR10 dataset. [10] reduces the number of connections between capsules by changing the routing rules and introduces a feedback connection, achieving good results on complex datasets. In contrast, our work changes the transformation matrix and reduces the number of parameters required without changing the routing rules. [13] adopts a complex feature extraction network to outperform CNNs on specific metrics, while our CapsNetPr uses a simple feature extraction network and performs better than a CNN of the same complexity. [5] optimizes the capsule format and the routing algorithm to improve network performance but does not address the problem of excessive parameters; unlike that, our work optimizes the transformation matrix and the network structure to reduce the number of parameters. [13, 21] use a weight sharing method similar to ours, but our method is more productive and more efficient.

Other works [2, 7, 8, 9, 11, 14, 18, 19, 21] apply the capsule network to various practical problems, replacing the traditional convolutional network with the capsule network and effectively improving performance on those specific tasks, but they offer little or no optimization of the capsule layer itself, which is different from our work.

3 Method

We first examine why the parameter size of the capsule network is much larger than that of a CNN. By comparing the two networks, we find that the parameters of the capsule network are mainly concentrated in the transformation matrices between capsule layers. Since the basic unit of the capsule network changes from a scalar to a vector, the transformation matrix between two capsule layers must gain an extra dimension. Therefore, the total size of the transformation matrices grows roughly with the square of the capsule length. Our main work is to design a new network structure and to modify the capsule layer to target this part of the parameters, so that the parameter count can be significantly reduced while the performance of the capsule network is nearly unchanged.
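To make the gap concrete, the following minimal sketch counts parameters using the layer sizes of the original CapsNet [15] (1152 primary capsules of length 8 routed to 10 class capsules of length 16); the numbers are illustrative, not the configuration of CapsNetPr.

```python
# Illustrative parameter count: capsule transformation matrices vs. a scalar fully
# connected layer with the same fan-in/fan-out (sizes follow the original CapsNet [15]).
in_caps, out_caps = 1152, 10     # lower-layer and upper-layer capsule counts
in_dim, out_dim = 8, 16          # capsule (vector) lengths

scalar_fc = in_caps * out_caps                        # 11,520 weights for scalar units
capsule_tm = in_caps * out_caps * in_dim * out_dim    # 1,474,560 weights: one 8x16 matrix per pair

print(capsule_tm // scalar_fc)   # 128 = 8 * 16: the extra factor grows with the product of capsule lengths
```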

3.1 CapsNetPr

CapsNet [15] is a simple three-layer network, with one convolution layer and two capsule layers. This network is not suitable for VOC, COCO, and other complex datasets because its parameter count is too large. We need to design a new capsule network model with an acceptable number of parameters that performs better on complex datasets than a CNN of the same complexity.

We test the most straightforward approach first: design a classification network, extract features by convolution and pooling operations, and then attach capsule layers to process the extracted features and predict the classification result. Experiments show that the modified network neither gains nor loses performance compared to the original network, because the convolution and pooling operations lose low-level semantic information while extracting high-level semantic information, and the former is precisely what the capsule layers need. Experimental details: we use a resnet20 model and replace its fully connected layers with capsule layers. Experiments are run on the CIFAR10 dataset, and the results are in Table 1. After the above experiments and discussion, we find that to achieve our goal we need to modify both the network structure and the capsule layer.

Table 1. Test results for the feature extraction module and the capsule layer. ResNet (we use) is the resnet20 network we implemented. ResNet+capsule is resnet20 with its last fully connected layers removed and replaced by capsule layers. ResNet (in paper [4]) is the result reported in ResNet's original paper.
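For reference, the sketch below illustrates the kind of replacement described above: a backbone's fully connected classifier is swapped for a single capsule layer with dynamic routing. The backbone output shape, layer sizes, and routing hyperparameters are assumptions for illustration, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(s, dim=-1):
    # squashing non-linearity from CapsNet [15]
    n2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + 1e-8)

class CapsuleHead(nn.Module):
    """Capsule classification head replacing a fully connected layer (illustrative)."""
    def __init__(self, in_caps, in_dim, num_classes=10, out_dim=16, routing_iters=3):
        super().__init__()
        self.routing_iters = routing_iters
        # one (out_dim x in_dim) transformation matrix per (input capsule, class capsule) pair
        self.W = nn.Parameter(0.01 * torch.randn(in_caps, num_classes, out_dim, in_dim))

    def forward(self, u):                                   # u: (B, in_caps, in_dim)
        u_hat = torch.einsum('injd,bid->binj', self.W, u)   # prediction vectors (B, in_caps, classes, out_dim)
        b = torch.zeros(u.size(0), *u_hat.shape[1:3], device=u.device)
        for _ in range(self.routing_iters):                 # dynamic routing by agreement
            c = F.softmax(b, dim=2)
            v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))
            b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)
        return v.norm(dim=-1)                               # capsule lengths act as class probabilities

# Hypothetical usage: assume the backbone (final FC removed) yields a (B, 64, 8, 8) feature map;
# the 64 channels are regrouped into 8 capsule channels of 8-dimensional capsules per position.
feat = torch.randn(4, 64, 8, 8)                             # placeholder for backbone output
primary = feat.view(4, 8, 8, 8, 8).permute(0, 1, 3, 4, 2).reshape(4, -1, 8)   # (B, 512, 8)
class_probs = CapsuleHead(in_caps=512, in_dim=8)(squash(primary))             # (B, 10)
```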

The network structure we designed is shown in Fig. 1. First, there are two convolutional layers, each followed by a pooling layer. Behind these are three capsule layers. Each capsule layer is modified by the Matrix Decomposition and Channel Sharing methods described in this paper, and the parameter count is significantly reduced.

Fig. 1. Overview of our work: the CapsNetPr network and the modified capsule layer. The yellow box shows the CapsNetPr network structure we designed, using three convolution layers followed by two capsule layers. The blue box shows the optimization methods for the capsule layer; details are given in Sects. 3.2 and 3.3. The CapsNetPr network and the modified capsule layer are combined in the black box. (Color figure online)

3.2 Matrix Decomposition

The matrix decomposition method reduces the number of parameters by breaking a large matrix into two small matrices. An N by M matrix can be decomposed into an N by 1 matrix and a 1 by M matrix, so the size of a transformation matrix shrinks from N * M parameters to N + M. Moreover, in matrix multiplication, this method reduces the amount of computation by the same ratio. However, the method has a strict prerequisite: the decomposition is exactly equivalent only if the original matrix has rank 1, a condition that the transformation matrices generally do not meet.
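A minimal numpy sketch of the decomposition (sizes are illustrative): the best rank-1 factors are taken from the SVD, the parameter count drops from N * M to N + M, and the product W u can be computed from the factors at the same reduced cost. For a general matrix the result is only an approximation; it is exact only when the matrix has rank 1.

```python
import numpy as np

N, M = 8, 16                                   # illustrative transformation matrix shape
rng = np.random.default_rng(0)
W = rng.standard_normal((N, M))

# best rank-1 approximation via SVD: W ~ a @ b, with a of shape (N, 1) and b of shape (1, M)
U, S, Vt = np.linalg.svd(W, full_matrices=False)
a = U[:, :1] * S[0]
b = Vt[:1, :]

print("parameters:", N * M, "->", N + M)       # 128 -> 24

u = rng.standard_normal(M)
full = W @ u                                   # ~N*M multiply-adds
fast = a[:, 0] * (b[0] @ u)                    # ~N + M multiply-adds
print("approximation error:", np.abs(full - fast).max())   # zero only if W has rank 1
```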

A sparse matrix can also be decomposed in this way, with an acceptable difference before and after decomposition. So we train a capsule network on the MNIST dataset and inspect the transformation matrices to check whether they satisfy sparsity. Direct observation shows that the matrices are approximately sparse. The performance of the network after applying the matrix decomposition method also confirms that this method is feasible for the transformation matrices.

Implementation details: as shown in Fig. 2, we perform matrix decomposition on each two-dimensional matrix within the transformation tensor, i.e., on each matrix through which a capsule in the lower layer passes its features to a capsule in the next layer. Decomposing these transformation matrices reduces the number of parameters and the amount of computation, while network performance is nearly unchanged.
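Inside a capsule layer, the decomposition is applied per capsule pair: each full (out_dim x in_dim) matrix is replaced by a column factor and a row factor, so computing a prediction vector becomes a scalar projection followed by a scaling. The sketch below uses illustrative sizes (those of the original CapsNet [15]), not our exact layer configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
in_caps, out_caps, in_dim, out_dim = 1152, 10, 8, 16      # illustrative layer sizes

# decomposed transformation: a column factor a_ij (length out_dim) and
# a row factor b_ij (length in_dim) for each (lower capsule i, upper capsule j) pair
a = rng.standard_normal((in_caps, out_caps, out_dim))
b = rng.standard_normal((in_caps, out_caps, in_dim))
u = rng.standard_normal((in_caps, in_dim))                # lower-layer capsule outputs

s = np.einsum('ijd,id->ij', b, u)                         # scalar projections b_ij . u_i
u_hat = a * s[..., None]                                  # prediction vectors (in_caps, out_caps, out_dim)

full_params = in_caps * out_caps * in_dim * out_dim       # 1,474,560
decomposed_params = in_caps * out_caps * (in_dim + out_dim)   # 276,480
print(full_params, "->", decomposed_params)
```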

Fig. 2. Example of Matrix Decomposition: the transformation matrix between two capsules is decomposed into the product of two vectors.

3.3 Channel Sharing

The channel sharing method reduces the number of parameters by letting multiple channels reuse the same transformation matrix. In the capsule layer, the place-coded property means that the position of a lower-layer capsule within the overall capsule matrix can represent some feature of the input image, such as location, color, or texture. Because this feature is tied to the position of the capsule, it is called place-coded. Given that place-coding is by location, we assume that capsules at the same location in different channels of the lower layers have similar encoding characteristics. The transformation matrices between capsules with similar encoding characteristics and the same capsule in the next layer should then also be similar, and this reasoning gives the assumed similarity considerable credibility. Since these transformation matrices are similar, can they be merged and replaced by a single transformation matrix?

Fig. 3. Example of Channel Sharing. Rectangles of different colors represent different transformation matrices. In the original capsule network, any two transformation matrices are different, as shown in the upper left of the figure. The channel sharing method shares the transformation matrix between the capsules at the same position of different channels in a layer and a given capsule of the next layer, as shown in the bottom left of the figure. All channels in a layer can share one matrix, or only K adjacent channels can share parameters, where K must be a factor of the number of channels, as shown on the right: each row is an example of a different parameter sharing factor (CS2, CS4, ...). (Color figure online)

As shown in Fig. 3, the transformation matrices between the capsules at the same position of multiple channels and a given capsule in the next layer are originally all different, but we can replace them with a single shared matrix. The new capsule layers then need fewer parameters than the original ones, reduced by a factor equal to the number of channels sharing the same transformation matrix. The best layer for the channel sharing method is the lowest capsule layer, which is easy to understand: the lower the capsule layer, the more pronounced its place-coded property. Meanwhile, the lowest capsule layer also has the most parameters, so modifying it contributes the most to network optimization.
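A sketch of the channel-sharing (CS-K) indexing, under assumed illustrative sizes: capsules at the same grid position in every K adjacent channels look up the same transformation matrix, so the number of matrices, and hence the parameter count of the layer, shrinks by a factor of K.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, positions, out_caps = 32, 36, 10    # e.g. 32 capsule channels on a 6x6 grid (illustrative)
in_dim, out_dim, K = 8, 16, 4                 # CS4: every 4 adjacent channels share one matrix

full_params = channels * positions * out_caps * in_dim * out_dim
shared_params = (channels // K) * positions * out_caps * in_dim * out_dim
print(full_params, "->", shared_params)       # reduced by a factor of K

# shared matrices are indexed by channel group; lower capsules are indexed by (channel, position)
W = rng.standard_normal((channels // K, positions, out_caps, out_dim, in_dim))
u = rng.standard_normal((channels, positions, in_dim))

group = np.arange(channels) // K              # channel -> channel group
u_hat = np.einsum('cpjod,cpd->cpjo', W[group], u)   # prediction vectors (channels, positions, out_caps, out_dim)
```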

4 Experiments

4.1 Datasets

We evaluate the proposed methods on three datasets: MNIST, CIFAR10, and the Princeton CAD dataset.

The MNIST dataset contains 70,000 handwritten digit images in 10 categories; every image is grayscale with size 28 * 28. Since the images in MNIST are small, have few categories, and are simple in content, high accuracy can be achieved without the CapsNetPr network, so we only tested the effects of the transformation matrix decomposition and channel sharing methods on this dataset.

The CIFAR10 dataset consists of 60,000 color images of size 32 * 32 in ten categories. The images are all natural scenes, with animals or vehicles as the main subject. We tested the CapsNetPr network, transformation matrix decomposition, and channel sharing methods on this dataset. For comparison, we also tested CapsNet, CapsuleNet [20], and a CNN with the same number of layers.

Fig. 4. Examples from the Princeton CAD dataset, projected from CAD models of 20 common objects. Each object class has multiple CAD models, and each model is projected into 224 * 224 images from seven directions.

The Princeton CAD dataset comes from a public CAD library of Princeton University. We project each CAD model in 7 directions, and the resulting images are saved at 224 * 224 size, in 20 categories, totaling 25,000 images. Some examples are shown in Fig. 4. We tested the CapsNetPr network, transformation matrix decomposition, and channel sharing methods on this dataset. For comparison, we examined the performance of VGG16 and a CNN on this dataset. CapsNet cannot be used on this dataset because of its parameter limitation.

4.2 Results

MNIST is a common benchmark; the results are in Table 2. They show that the MD method is more effective than the CS method relative to its parameter reduction: the MD method reduces the parameter size of CapsNet to 0.1875 times the original, while CS2 reduces it to 0.5 times, and the two achieve similar accuracy.

The CIFAR10 dataset has ten classes of 32 * 32 natural color images and is more complex than MNIST. On this dataset we compare a CNN with our CapsNetPr; our network has the highest accuracy. As on MNIST, the MD method loses less accuracy than the CS method. There is also a surprising finding: the results of CS4 are higher than those of CS2.

The Princeton CAD dataset has 20 classes of 224 * 224 grayscale images. The images in this dataset are too large for CapsNet, so we only test a CNN, VGG16, and CapsNetPr on it. CapsNetPr achieves higher results than the unpretrained VGG16 network, which shows that CapsNetPr does have excellent performance. The behavior of the CS and MD methods is similar to that on the other datasets (Tables 3 and 4).

Table 2. MNIST dataset results. Classification accuracy (%). Trained 30 epochs. CS means Channel Sharing, MD means Matrix Decomposition.
Table 3. CIFAR10 dataset results. Classification accuracy (%). Trained 30 epochs. CS means Channel Sharing, MD means Matrix Decomposition.
Table 4. Princeton CAD dataset results. VGG16 (pretrained) has been pretrained on IMAGENET. Classification accuracy (%). Trained 30 epochs. CS means Channel Sharing, MD means Matrix Decomposition.

5 Discussion

Our CapsNetPr network model and the MD and CS methods achieve significant results across different datasets. Especially on the Princeton CAD dataset, our network performs better than the VGG16 network, which is much more complicated than CapsNetPr. The reason CapsNetPr performs well on the Princeton CAD dataset is related to the characteristics of the dataset and of the capsule layer. This dataset is composed of images projected from objects at different angles and contains a large number of unfamiliar perspectives, whereas the other datasets come from real photos taken from a few fixed perspectives. Therefore, on the Princeton CAD dataset, the high-level information obtained by the feature extraction layers may be entirely different for images of the same category. The capsule network can keep the underlying information in the capsules, giving it better rotation invariance and excellent performance on this dataset. This result shows that our optimized capsule network can perform excellently under complex conditions.

Observing the relationship between the CS method's parameter reduction ratio and the final accuracy, we find two characteristics. First, as the CS parameter reduction ratio increases, the network results change in two stages. In the first stage, each doubling of the CS ratio (for example, from CS2 to CS4 or from CS4 to CS8) changes the accuracy only slightly, by less than 0.3%. In the second stage, after crossing a certain CS ratio (CS8 for CIFAR10, CS16 for CAD), the accuracy of the network drops sharply. From this we can infer the relationship between the number of network parameters and the accuracy: the parameter count has a threshold. When the parameter count is below the threshold, increasing it significantly improves the result, but above the threshold the effect is insignificant. This threshold is the appropriate target for our task. Second, when the CS ratio is 4, the network result is better than when it is 2. With fewer parameters, the results improve, which suggests that sharing some of the parameters helps the network find more general feature relationships in the image.