
1 Introduction

Neural networks play increasingly critical roles in computer vision tasks such as image classification, object detection, object localization, VQA, and object tracking. These methods perform much better than traditional methods.

The CNN is one of the main kinds of neural networks. It uses convolution layers to extract image features and to address the translation invariance problem. Therefore, CNNs achieve high accuracy in most computer vision tasks, and the most efficient network architectures in recent work are mainly based on CNNs, such as ResNet [4], DenseNet [6], DeepLab [1], and SSD [12].

However, CNNs have inherent flaws. First, they are insensitive to changes in object location within images, which is an advantage for image classification but is unsuitable for other tasks such as semantic segmentation and object localization. Second, it is difficult for CNNs to deal with complex conditions. For example, in object detection tasks, if the object changes its viewing angle in the image, e.g., seen from the top or the bottom, network performance decreases. These two shortcomings reduce output accuracy and restrict the application of CNNs.

Some studies have been conducted to solve these problems. To locate objects in images, R-CNN [3] takes advantage of a proposal method: it adopts Selective Search [17] to propose candidate boxes. Hinton's CapsNet [15] is another attempt to address CNN's shortcomings. CapsNet uses a set of neurons rather than a single neuron as the basic unit of a layer. This basic unit, named a capsule, contains a vector whose length represents the probability that the feature it encodes is present. CapsNet produces richer output information and can solve the problem of translation invariance, because the direction of the vector in a capsule can encode rich characteristic information, instead of carrying only location information as in traditional networks.

However, the above works are imperfect. The proposal method in R-CNN does not solve the problem but bypasses it. The problem with CapsNet is that the network is slower and more cumbersome than other networks. Using vectors instead of scalars as the basic unit means that, as the vector length grows, the amount of computation and memory used increases dramatically, because the capsule length cannot be too short and must grow with network depth to ensure the capacity of the capsule layers. For example, on the VOC or COCO datasets, CapsNet would require dozens or even hundreds of GPUs, which is intolerable. If we downsample the input image instead, network performance drops severely.

We propose a new network based on CapsNet, adopting two parameter reduction methods and introducing convolution layers appropriately, with the goal of applying the capsule network to more complex object classification datasets. Our contributions are as follows. First, we test a variety of capsule network structures, comparing the depth and performance of feature extraction, and design a suitable network named CapsNetPr, i.e., a capsule network designed for parameter reduction. Second, we analyze the main source of the parameter count and the primary constraints on the parameters of the capsule network, and adopt a transformation matrix decomposition method to save space and time at a limited cost in precision. Third, based on the place-coded property of the lower capsule layers, which account for most of the parameters, we adopt a method of sharing the transformation matrix across the same position of different channels, reducing the storage space the network needs. We test the above methods on the MNIST, CIFAR10, and Princeton CAD datasets and achieve fast and accurate results.

2 Related Work

Much follow-up work has been built on CapsNet. Some research focuses on CapsNet on complex datasets [20], confirming that CapsNet does not perform well on real-scene images; due to the parameter limitation, that work only tested the performance of CapsNet on the CIFAR10 dataset. [10] reduces the number of connections between capsules by changing the routing rules and introduces a feedback connection, achieving good results on complex datasets. In contrast, our work changes the transformation matrix and reduces the number of parameters required without changing the routing rules. [13] adopts a complex feature extraction network to outperform CNNs on specific metrics, while our CapsNetPr uses a simple feature extraction network and performs better than a CNN of the same complexity. [5] optimizes the capsule format and the routing algorithm to improve network performance but does not address the problem of excessive parameters; unlike that, our work optimizes the transformation matrix and the network structure to reduce the number of parameters. [13, 21] use a weight sharing method similar to ours, but our method is more productive and more efficient.

Other works [2, 7, 8, 9, 11, 14, 18, 19, 21] apply the capsule network to various practical problems, replacing the traditional convolutional network with the capsule network and effectively improving performance on those specific tasks, but they offer little or no optimization of the capsule layer itself, which is different from our work.

3 Method

We first examine why the parameter size of the capsule network is much larger than that of a CNN. By comparing the two networks, we find that the parameters of the capsule network are mainly concentrated in the transformation matrices between capsule layers. Since the basic unit of the capsule network changes from a scalar to a vector, the transformation matrix between two capsule layers must gain an extra dimension. Therefore, the total size of the transformation matrices grows roughly with the square of the capsule length. Our main work is to design a new network structure and to modify the capsule layer to target this part of the parameters, so that the parameter count can be significantly reduced while the performance of the capsule network is nearly unchanged.
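To make the gap concrete, the following minimal sketch counts parameters using the layer sizes of the original CapsNet [15] (1152 primary capsules of length 8 routed to 10 class capsules of length 16); the numbers are illustrative, not the configuration of CapsNetPr.

```python
# Illustrative parameter count: capsule transformation matrices vs. a scalar fully
# connected layer with the same fan-in/fan-out (sizes follow the original CapsNet [15]).
in_caps, out_caps = 1152, 10     # lower-layer and upper-layer capsule counts
in_dim, out_dim = 8, 16          # capsule (vector) lengths

scalar_fc = in_caps * out_caps                        # 11,520 weights for scalar units
capsule_tm = in_caps * out_caps * in_dim * out_dim    # 1,474,560 weights: one 8x16 matrix per pair

print(capsule_tm // scalar_fc)   # 128 = 8 * 16: the extra factor grows with the product of capsule lengths
```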

3.1 CapsNetPr

CapsNet [15] is a simple three-layer network, with one convolution layer and two capsule layers. This network is not suitable for VOC, COCO, and other complex datasets because its parameter count is too large. We need to design a new capsule network model with an acceptable number of parameters that performs better on complex datasets than a CNN of the same complexity.

We test the most straightforward approach first: design a classification network, extract features by convolution and pooling operations, and then attach capsule layers to process the extracted features and predict the classification result. Experiments show that the modified network neither gains nor loses performance compared to the original network, because the convolution and pooling operations lose low-level semantic information while extracting high-level semantic information, and the former is precisely what the capsule layers need. Experimental details: we use a resnet20 model and replace its fully connected layers with capsule layers. Experiments are run on the CIFAR10 dataset, and the results are in Table 1. After the above experiments and discussion, we find that to achieve our goal we need to modify both the network structure and the capsule layer.

Table 1. Test results for the feature extraction module and the capsule layer. ResNet (we use) is the resnet20 network we implemented. ResNet+capsule is resnet20 with its last fully connected layers removed and replaced by capsule layers. ResNet (in paper [4]) is the result reported in ResNet's original paper.
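For reference, the sketch below illustrates the kind of replacement described above: a backbone's fully connected classifier is swapped for a single capsule layer with dynamic routing. The backbone output shape, layer sizes, and routing hyperparameters are assumptions for illustration, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(s, dim=-1):
    # squashing non-linearity from CapsNet [15]
    n2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + 1e-8)

class CapsuleHead(nn.Module):
    """Capsule classification head replacing a fully connected layer (illustrative)."""
    def __init__(self, in_caps, in_dim, num_classes=10, out_dim=16, routing_iters=3):
        super().__init__()
        self.routing_iters = routing_iters
        # one (out_dim x in_dim) transformation matrix per (input capsule, class capsule) pair
        self.W = nn.Parameter(0.01 * torch.randn(in_caps, num_classes, out_dim, in_dim))

    def forward(self, u):                                   # u: (B, in_caps, in_dim)
        u_hat = torch.einsum('injd,bid->binj', self.W, u)   # prediction vectors (B, in_caps, classes, out_dim)
        b = torch.zeros(u.size(0), *u_hat.shape[1:3], device=u.device)
        for _ in range(self.routing_iters):                 # dynamic routing by agreement
            c = F.softmax(b, dim=2)
            v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))
            b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)
        return v.norm(dim=-1)                               # capsule lengths act as class probabilities

# Hypothetical usage: assume the backbone (final FC removed) yields a (B, 64, 8, 8) feature map;
# the 64 channels are regrouped into 8 capsule channels of 8-dimensional capsules per position.
feat = torch.randn(4, 64, 8, 8)                             # placeholder for backbone output
primary = feat.view(4, 8, 8, 8, 8).permute(0, 1, 3, 4, 2).reshape(4, -1, 8)   # (B, 512, 8)
class_probs = CapsuleHead(in_caps=512, in_dim=8)(squash(primary))             # (B, 10)
```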

The network structure we designed is shown in Fig. 1. First, there are two convolutional layers, each followed by a pooling layer. Behind these are three capsule layers. Each capsule layer is modified by the Matrix Decomposition and Channel Sharing methods described in this paper, and the parameter count is significantly reduced.

Fig. 1. Overview of our work: the CapsNetPr network and the modified capsule layer. The yellow box shows the CapsNetPr network structure we designed, using three convolution layers followed by two capsule layers. The blue box shows the optimization methods for the capsule layer; details are given in Sects. 3.2 and 3.3. The CapsNetPr network and the modified capsule layer are combined in the black box. (Color figure online)

3.2 Matrix Decomposition

The matrix decomposition method reduces the number of parameters by breaking a large matrix into two small matrices. An N by M matrix can be decomposed into an N by 1 matrix and a 1 by M matrix, so the size of a transformation matrix shrinks from N * M parameters to N + M. Moreover, in matrix multiplication, this method reduces the amount of computation by the same ratio. However, the method has a strict prerequisite: the decomposition is exactly equivalent only if the original matrix has rank 1, a condition that the transformation matrices generally do not meet.
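A minimal numpy sketch of the decomposition (sizes are illustrative): the best rank-1 factors are taken from the SVD, the parameter count drops from N * M to N + M, and the product W u can be computed from the factors at the same reduced cost. For a general matrix the result is only an approximation; it is exact only when the matrix has rank 1.

```python
import numpy as np

N, M = 8, 16                                   # illustrative transformation matrix shape
rng = np.random.default_rng(0)
W = rng.standard_normal((N, M))

# best rank-1 approximation via SVD: W ~ a @ b, with a of shape (N, 1) and b of shape (1, M)
U, S, Vt = np.linalg.svd(W, full_matrices=False)
a = U[:, :1] * S[0]
b = Vt[:1, :]

print("parameters:", N * M, "->", N + M)       # 128 -> 24

u = rng.standard_normal(M)
full = W @ u                                   # ~N*M multiply-adds
fast = a[:, 0] * (b[0] @ u)                    # ~N + M multiply-adds
print("approximation error:", np.abs(full - fast).max())   # zero only if W has rank 1
```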

A sparse matrix can also be decomposed in this way, with an acceptable difference before and after decomposition. So we train a capsule network on the MNIST dataset and inspect the transformation matrices to check whether they satisfy sparsity. Direct observation shows that the matrices are approximately sparse. The performance of the network after applying the matrix decomposition method also confirms that this method is feasible for the transformation matrices.

Implementation details: as shown in Fig. 2, we perform matrix decomposition on each two-dimensional matrix within the transformation tensor, i.e., on each matrix through which a capsule in the lower layer passes its features to a capsule in the next layer. Decomposing these transformation matrices reduces the number of parameters and the amount of computation, while network performance is nearly unchanged.
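Inside a capsule layer, the decomposition is applied per capsule pair: each full (out_dim x in_dim) matrix is replaced by a column factor and a row factor, so computing a prediction vector becomes a scalar projection followed by a scaling. The sketch below uses illustrative sizes (those of the original CapsNet [15]), not our exact layer configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
in_caps, out_caps, in_dim, out_dim = 1152, 10, 8, 16      # illustrative layer sizes

# decomposed transformation: a column factor a_ij (length out_dim) and
# a row factor b_ij (length in_dim) for each (lower capsule i, upper capsule j) pair
a = rng.standard_normal((in_caps, out_caps, out_dim))
b = rng.standard_normal((in_caps, out_caps, in_dim))
u = rng.standard_normal((in_caps, in_dim))                # lower-layer capsule outputs

s = np.einsum('ijd,id->ij', b, u)                         # scalar projections b_ij . u_i
u_hat = a * s[..., None]                                  # prediction vectors (in_caps, out_caps, out_dim)

full_params = in_caps * out_caps * in_dim * out_dim       # 1,474,560
decomposed_params = in_caps * out_caps * (in_dim + out_dim)   # 276,480
print(full_params, "->", decomposed_params)
```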

Fig. 2. Example of Matrix Decomposition: the transformation matrix between two capsules is decomposed into the product of two vectors.

3.3 Channel Sharing

The channel sharing method reduces the number of parameters by letting multiple channels reuse the same transformation matrix. In the capsule layer, the place-coded property means that the position of a lower-layer capsule within the overall capsule matrix can represent some feature of the input image, such as location, color, or texture. Because this feature is tied to the position of the capsule, it is called place-coded. Given that place-coding is by location, we assume that capsules at the same location in different channels of the lower layers have similar encoding characteristics. The transformation matrices between capsules with similar encoding characteristics and the same capsule in the next layer should then also be similar, and this reasoning gives the assumed similarity considerable credibility. Since these transformation matrices are similar, can they be merged and replaced by a single transformation matrix?

Fig. 3. Example of Channel Sharing. Rectangles of different colors represent different transformation matrices. In the original capsule network, any two transformation matrices are different, as shown in the upper left of the figure. The channel sharing method shares the transformation matrix between the capsules at the same position of different channels in a layer and a given capsule of the next layer, as shown in the bottom left of the figure. All channels in a layer can share one matrix, or only K adjacent channels can share parameters, where K must be a factor of the number of channels, as shown on the right: each row is an example of a different parameter sharing factor (CS2, CS4, ...). (Color figure online)

As shown in Fig. 3, the transformation matrices between the capsules at the same position of multiple channels and a given capsule in the next layer are originally all different, but we can replace them with a single shared matrix. The new capsule layers then need fewer parameters than the original ones, reduced by a factor equal to the number of channels sharing the same transformation matrix. The best layer for the channel sharing method is the lowest capsule layer, which is easy to understand: the lower the capsule layer, the more pronounced its place-coded property. Meanwhile, the lowest capsule layer also has the most parameters, so modifying it contributes the most to network optimization.
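A sketch of the channel-sharing (CS-K) indexing, under assumed illustrative sizes: capsules at the same grid position in every K adjacent channels look up the same transformation matrix, so the number of matrices, and hence the parameter count of the layer, shrinks by a factor of K.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, positions, out_caps = 32, 36, 10    # e.g. 32 capsule channels on a 6x6 grid (illustrative)
in_dim, out_dim, K = 8, 16, 4                 # CS4: every 4 adjacent channels share one matrix

full_params = channels * positions * out_caps * in_dim * out_dim
shared_params = (channels // K) * positions * out_caps * in_dim * out_dim
print(full_params, "->", shared_params)       # reduced by a factor of K

# shared matrices are indexed by channel group; lower capsules are indexed by (channel, position)
W = rng.standard_normal((channels // K, positions, out_caps, out_dim, in_dim))
u = rng.standard_normal((channels, positions, in_dim))

group = np.arange(channels) // K              # channel -> channel group
u_hat = np.einsum('cpjod,cpd->cpjo', W[group], u)   # prediction vectors (channels, positions, out_caps, out_dim)
```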

4 Experiments

4.1 Datasets

We evaluate the proposed methods on three datasets: MNIST, CIFAR10, and the Princeton CAD dataset.

The MNIST dataset contains 70,000 handwritten digit images in 10 categories; every image is grayscale with size 28 * 28. Since the images in MNIST are small, have few categories, and are simple in content, high accuracy can be achieved without the CapsNetPr network, so we only tested the effects of the transformation matrix decomposition and channel sharing methods on this dataset.

The CIFAR10 dataset consists of 60,000 color images of size 32 * 32 in ten categories. The images are all natural scenes, with animals or vehicles as the main subject. We tested the CapsNetPr network, transformation matrix decomposition, and channel sharing methods on this dataset. For comparison, we also tested CapsNet, CapsuleNet [20], and a CNN with the same number of layers.

Fig. 4. Examples from the Princeton CAD dataset, projected from CAD models of 20 common objects. Each object class has multiple CAD models, and each model is projected into 224 * 224 images from seven directions.

The Princeton CAD dataset comes from a public CAD library of Princeton University. We project each CAD model in 7 directions, and the resulting images are saved at 224 * 224 size, in 20 categories, totaling 25,000 images. Some examples are shown in Fig. 4. We tested the CapsNetPr network, transformation matrix decomposition, and channel sharing methods on this dataset. For comparison, we examined the performance of VGG16 and a CNN on this dataset. CapsNet cannot be used on this dataset because of its parameter limitation.

4.2 Results

MNIST is a common benchmark; the results are in Table 2. They show that the MD method is more effective than the CS method relative to its parameter reduction: the MD method reduces the parameter size of CapsNet to 0.1875 times the original, while CS2 reduces it to 0.5 times, and the two achieve similar accuracy.

The CIFAR10 dataset has ten classes of 32 * 32 natural color images and is more complex than MNIST. On this dataset we compare a CNN with our CapsNetPr; our network has the highest accuracy. As on MNIST, the MD method loses less accuracy than the CS method. There is also a surprising finding: the results of CS4 are higher than those of CS2.

The Princeton CAD dataset has 20 classes of 224 * 224 grayscale images. The images in this dataset are too large for CapsNet, so we only test a CNN, VGG16, and CapsNetPr on it. CapsNetPr achieves higher results than the unpretrained VGG16 network, which shows that CapsNetPr does have excellent performance. The behavior of the CS and MD methods is similar to that on the other datasets (Tables 3 and 4).

Table 2. MNIST dataset results. Classification accuracy (%). Trained 30 epochs. CS means Channel Sharing, MD means Matrix Decomposition.
Table 3. CIFAR10 dataset results. Classification accuracy (%). Trained 30 epochs. CS means Channel Sharing, MD means Matrix Decomposition.
Table 4. Princeton CAD dataset results. VGG16 (pretrained) has been pretrained on IMAGENET. Classification accuracy (%). Trained 30 epochs. CS means Channel Sharing, MD means Matrix Decomposition.

5 Discussion

Our CapsNetPr network model and the MD and CS methods achieve significant results across different datasets. Especially on the Princeton CAD dataset, our network performs better than the VGG16 network, which is much more complicated than CapsNetPr. The reason CapsNetPr performs well on the Princeton CAD dataset is related to the characteristics of the dataset and of the capsule layer. This dataset is composed of images projected from objects at different angles and contains a large number of unfamiliar perspectives, whereas the other datasets come from real photos taken from a few fixed perspectives. Therefore, on the Princeton CAD dataset, the high-level information obtained by the feature extraction layers may be entirely different for images of the same category. The capsule network can keep the underlying information in the capsules, giving it better rotation invariance and excellent performance on this dataset. This result shows that our optimized capsule network can perform excellently under complex conditions.

Observing the relationship between the CS method's parameter reduction ratio and the final accuracy, we find two characteristics. First, as the CS parameter reduction ratio increases, the network results change in two stages. In the first stage, each doubling of the CS ratio (for example, from CS2 to CS4 or from CS4 to CS8) changes the accuracy only slightly, by less than 0.3%. In the second stage, after crossing a certain CS ratio (CS8 for CIFAR10, CS16 for CAD), the accuracy of the network drops sharply. From this we can infer the relationship between the number of network parameters and the accuracy: the parameter count has a threshold. When the parameter count is below the threshold, increasing it significantly improves the result, but above the threshold the effect is insignificant. This threshold is the appropriate target for our task. Second, when the CS ratio is 4, the network result is better than when it is 2. With fewer parameters, the results improve, which suggests that sharing some of the parameters helps the network find more general feature relationships in the image.