
1 Introduction

In recent years, convolutional neural networks (CNNs) have achieved great success in various computer vision tasks [1], including object detection [2,3,4], object classification [5, 6], semantic segmentation [7, 8], and many others. CNNs have achieved state-of-the-art performance in these fields compared with traditional methods based on manually designed visual features [9]. However, as CNNs grow deeper and wider, they require higher computational overhead and larger memory, so it is difficult to deploy CNN models on resource-limited devices such as mobile phones and embedded devices. As a result, the application of convolutional neural networks in practical scenarios is limited by hardware resources such as storage space, computing power, and battery power. For instance, the AlexNet [10] network contains about \(6\times {10}^{7}\) parameters, while larger networks such as VGG [11] contain about \(1.38\times {10}^{8}\) parameters. Even for relatively simple image recognition tasks, the VGG network requires more than 500 MB of memory and \(1.56\times {10}^{10}\) floating point operations (FLOPs). The over-parameterization [12] of deep learning models is a major obstacle to deployment on mobile devices.

Thus, network compression has drawn a significant amount of interest from both academia and industry. In recent years, numerous efficient compression methods have been proposed, including low-rank approximation [12, 13], parameter quantization [14, 15], and binarization [16]. Among them, network pruning [17,18,19,20] has excellent performance in reducing the redundancy of CNNs, and it offers better model deployment ability than parameter quantization. Pruning can be applied simultaneously to different elements of CNNs, such as weights, filters, and layers. Early works in network pruning mainly resort to removing unimportant weight connections from a well-trained network with negligible impact on network performance.

In this paper, a pruning strategy based on the Feature Abstraction Capability (FAC) of filters is proposed. Zhou et al. [21] have shown that the features extracted by convolution kernels, across layers or even within the same layer, differ significantly in their contribution to the final prediction. The more information a feature map represents, the more important it is to the network. Therefore, we believe that the Feature Abstraction Capabilities of different filters differ, and we use this criterion to guide the pruning of network elements.

Fig. 1. The framework of our method. In the original convolutional layer, the input feature maps are convolved with the 3D filters to obtain the output feature maps. In our FAC method, we first obtain the sparsity of each 3D filter with the L1-norm [24] and the dispersion of its batch feature maps with the data-driven method [18]. The quantized FAC value, combining sparsity and dispersion, is then used as the pruning index, and filters with lower FAC values in the convolutional layer are pruned to generate a more compact network.

As shown in Fig. 1, the Feature Abstraction Capability (FAC) of a filter is obtained by evaluating the sparsity of its convolution kernels and the information richness of the feature maps it activates. Our main insight is that feature maps activated by filters with lower FAC are redundant. We prune unimportant filters and fine-tune the network to restore its generalization capability. As a result, the CNN model is accelerated and compressed in both the training and testing phases, transforming the cumbersome network into a smaller model with only a slight performance degradation. At the same time, we propose to normalize the quantized FAC value of each filter, so that the proposed pruning strategy can be extended to all layers of a deep CNN and threshold sensitivity analysis for each layer is no longer needed.

We evaluated our pruning framework on two commonly used CNN models: VGG-16 [22] and ResNet-110 [23]. These two models are pruned on two representative benchmark datasets, CIFAR10 and CUB_200_2011. On the CIFAR10 dataset, our method achieves \(4.9{\times } \) compression and \(1.77{\times } \) acceleration on VGG-16 with about a \(0.3\%\) top-1 accuracy drop. Similarly, on the CUB_200_2011 dataset, a fine-grained classification task with 200 categories, our method achieves \(4.2{\times } \) acceleration on VGG-16 with roughly a \(0.6\%\) top-1 accuracy drop, which is better than most comparable pruning algorithms.

2 Related Work

In this section, we will briefly introduce some popular network pruning methods in CNN compression, which can be divided into structured pruning and unstructured pruning.

Unstructured pruning sets individual weights below a certain threshold to zero. A notable example is the work of Han et al. [17], who proposed pruning small-magnitude weight connections in the AlexNet and VGG networks and then retraining, which effectively reduces the number of parameters without affecting the overall accuracy. However, this pruning operation generates an unstructured sparse model that requires sparse BLAS libraries or even specialized hardware to achieve acceleration.

Structured pruning reduces computational complexity and memory overhead by directly removing structured parts, such as kernels, filters, or layers, and is well supported by a variety of off-the-shelf deep learning platforms. One pruning criterion is the sparsity of activations produced by the non-linear ReLU mapping. Hu et al. [18] proposed a data-driven neuron pruning approach to remove unimportant neurons: they argue that a neuron is unimportant with high probability if most of its activated feature map values are zero. Their criterion measures the importance of neurons by calculating the average percentage of zeros (APoZ) in the activated feature map. However, the APoZ criterion requires a different threshold parameter for each convolutional layer, and these thresholds are difficult to determine accurately. Li et al. [24] proposed removing unimportant filters based on the L1-norm. Molchanov et al. [19] estimated the influence of each filter on the network loss function using a Taylor expansion; according to this criterion, a filter that has little influence on the loss function can be safely removed. He et al. [23] proposed a channel selection method based on LASSO regression, which uses least squares reconstruction to eliminate redundant filters. Similar to our study, Luo et al. [20] proposed calculating the entropy of filters to measure the information richness of the convolution kernel. However, that method only considers the information richness of the filter, and the strategy can only compare entropy values within the same convolutional layer. Most of these methods need to determine an accurate pruning threshold for each convolutional layer, which is difficult to achieve, and pruning with a fixed compression rate may lead to irreparable accuracy reduction.

In addition to network pruning, other CNN compression methods have been introduced, such as designing more compact architectures. For example, since most parameters of a CNN model exist in the fully-connected layers, global average pooling was proposed to replace the fully-connected layers in Network-In-Network [26]. Son et al. [27] reconstructed the network by a unified representation of similar convolutions to achieve effective compression; however, this method is only effective for \(3\times 3\) convolution kernels. Sandler et al. [28] proposed the use of depthwise separable convolutions to build lightweight networks, which have been widely used on mobile devices. It is important to note that our approach can be combined with these strategies to achieve a more compact and optimized network. As for ResNet-110, which has less redundancy than classic CNN models, we can still obtain \(1.63{\times } \) acceleration and a \(2.48{\times } \) reduction in FLOPs and parameters with a 0.007 decrease in accuracy.

3 Pruning Method

In this section, we describe in detail our pruning method based on the Feature Abstraction Capability of filters. First, the general framework is given. Our main idea is to quantify the FAC of all convolutional-layer filters, discard the filters with the poorest scores in each pruning step, and restore performance by fine-tuning. The implementation details are given later in this section. Finally, we introduce the training and pruning scheduling strategy used in our experiments, which has less impact on the final prediction accuracy than previous strategies.

3.1 Framework

Figure 1 illustrates the overall framework of our proposed FAC pruning method. We first obtain the weight values of all 2D kernels in each 3D filter and compute the sparsity of the 3D filter as the sum of the L1-norms [24] of its 2D kernels. At the same time, we obtain the batch feature maps of the filter with the data-driven method [18] and then calculate the discreteness of these batch feature maps. We use the discreteness of the filter's batch feature maps to evaluate the richness of the information contained in the activated feature maps, because if the filter's feature maps differ little from batch to batch, we have enough reason to believe that the filter's Feature Abstraction Capability is weak. Finally, we combine the sparsity of the convolution kernels with the information dispersion of the activated feature maps to make the estimated Feature Abstraction Capability more accurate and robust.

Then, all weak filters are pruned from the original model to achieve a more optimized network architecture. Note that the corresponding input channels of filters in the next layer should be removed. Finally, the network is fine-tuned to restore its generalization performance.

3.2 Filter Sparsity

In convolutional layer i, the input tensor \({I}_{i}\in \mathbb {R}^{{C}\times {H}_{in}\times {W}_{in}}\) is convolved with a set of filter weights \({W}_{i}\in \mathbb {R}^{{N}\times {C}\times {K}_{h}\times {K}_{w}}\) to get the output tensor \({Y}_{i}\in \mathbb {R}^{{N}\times {H}_{out}\times {W}_{out}}\). Here, C is the number of input feature maps, \({H}_{in}\) and \({W}_{in}\) are the height and width of the input feature maps, N is the number of filters, \({H}_{out}\) and \({W}_{out}\) are the height and width of the output feature maps, and \({K}_{h}\) and \({K}_{w}\) are the height and width of a filter. The convolution operation can be expressed by the following formula:

$$\begin{aligned} {Y}_{n} = {I}_{i} * {W}_{n} \end{aligned}$$
(1)

where \(*\) denotes the convolution operation, \({W}_{n}\in \mathbb {R}^{{C}\times {K}_{h}\times {K}_{w}}\) is the weight of the n-th filter in convolutional layer i, and \({Y}_{n}\in \mathbb {R}^{{H}_{out}\times {W}_{out}}\) is the feature map produced by the n-th filter.

We evaluate the sparsity of a filter by calculating the L1-norm of \({W}_{n}\): as Eq. (1) shows, if most of the weights in \({W}_{n}\) have absolute values close to zero, the L1-norm will be small and the values in the output feature map of the filter will also be close to zero. We regard such a feature map as approximately sparse, indicating that the filter's Feature Abstraction Capability is weak. Therefore, for the n-th filter, we define its sparsity as:

$$\begin{aligned} {S}_{n} = \sum _{c=0}^{C}\sum _{i=0}^{{K}_{h}}\sum _{j=0}^{{K}_{w}} \vert {W}_{n, c, i, j}\vert ,~~~~n = 0, 1, 2,\ldots , N \end{aligned}$$
(2)
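For concreteness, the sparsity score of Eq. (2) can be computed directly from the weight tensor of a convolutional layer. The following PyTorch sketch is our own illustration (the name `filter_sparsity` and the example layer are not from the original implementation) and assumes a standard `nn.Conv2d` layer:

```python
import torch
import torch.nn as nn

def filter_sparsity(conv: nn.Conv2d) -> torch.Tensor:
    """Eq. (2): sparsity S_n of each 3D filter, i.e. the L1-norm of its
    C x K_h x K_w weights. Returns a tensor of shape (N,)."""
    weight = conv.weight.data                # shape: (N, C, K_h, K_w)
    return weight.abs().sum(dim=(1, 2, 3))   # one L1-norm per filter

# Example: sparsity scores for a 3x3 convolution with 64 filters.
conv = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3)
print(filter_sparsity(conv).shape)           # torch.Size([64])
```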

3.3 Discreteness of Feature Maps

In this paper, we propose a criterion based on the FAC of the filter to evaluate the importance of each filter. We believe that the greater the differences among the feature maps a filter outputs, the stronger its Feature Abstraction Capability.

As shown in Fig. 2, we apply global average pooling [26] to the activated feature maps output by layer i; in this way, an \(N\times {H}_{out}\times {W}_{out}\) output tensor is converted into a \(1\times N\) vector, so that each filter's output feature map receives a corresponding score. To calculate the dispersion of each filter's output scores, more outputs are needed, so we compute a score for each batch of the training set, finally obtaining a matrix \(M\in \mathbb {R}^{{B}\times {N}}\), where B is the number of batches in the training set and N is the number of output channels.

Fig. 2. How Global Average Pooling (GAP) is used to calculate the score of the output feature maps in convolutional layer i. Note that the feature maps are activated by the ReLU function before GAP, because negative values in the feature maps are filtered out in the network.

Let \({M}_{:,j}\) denote the column of M containing the output scores of the j-th filter, and let \({\mu }_{j}\) be the average of the scores of the j-th filter:

$$\begin{aligned} {\mu }_{j} = \frac{1}{B}\sum _{i=0}^{B} {M}_{i,j},~~j = 0, 1, 2,\ldots , N \end{aligned}$$
(3)

Then, the feature-map dispersion of the j-th filter is:

$$\begin{aligned} {D}_{j}=\left\{ \begin{array}{rcl} 0, &{} &{}{{\mu }_{j} = 0}\\ \sqrt{\frac{\sum _{i=0}^{B}({M}_{i,j} - {\mu }_{j})^{2}}{B}}, &{} &{} {otherwise} \end{array}\right. \end{aligned}$$
(4)
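A minimal PyTorch sketch of Eqs. (3) and (4), under our own assumption that a filter's score for one batch is its GAP value averaged over the images in that batch; `gap_scores`, `dispersion`, and `outputs` are illustrative names, not taken from the original code:

```python
import torch
import torch.nn.functional as F

def gap_scores(activated: torch.Tensor) -> torch.Tensor:
    """One GAP score per filter for a ReLU-activated output tensor of shape
    (batch_size, N, H_out, W_out); the per-image scores are averaged over
    the batch, yielding one (N,)-shaped row of the matrix M."""
    return F.adaptive_avg_pool2d(activated, 1).flatten(1).mean(dim=0)

def dispersion(score_matrix: torch.Tensor) -> torch.Tensor:
    """Eqs. (3)-(4): mean and population standard deviation of each filter's
    scores over the B batches; score_matrix has shape (B, N)."""
    mu = score_matrix.mean(dim=0)                       # Eq. (3)
    d = ((score_matrix - mu) ** 2).mean(dim=0).sqrt()   # Eq. (4)
    d[mu == 0] = 0.0                                    # D_j = 0 when mu_j = 0
    return d

# Usage sketch:
# M = torch.stack([gap_scores(F.relu(out_b)) for out_b in outputs])  # (B, N)
# D = dispersion(M)                                                  # (N,)
```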

3.4 Definition of the FAC

From the above discussion, the importance of a filter depends on two parts: the sparsity of its convolution kernels and the discreteness of its feature maps. Therefore, we combine the two parts and define the Feature Abstraction Capability (FAC) of the n-th filter as:

$$\begin{aligned} {FAC}_{n} = {S}_{n} * {D}_{n} \end{aligned}$$
(5)

3.5 Normalization

In many papers, the pruning criterion is only applicable to comparisons of convolution kernels within the same layer, and scale inconsistency occurs when it is applied across layers. Therefore, our method uses layer-wise L2-normalization to achieve reasonable rescaling:

$$\begin{aligned} {\varTheta }^{(i)} = \frac{{\varTheta }^{(i)}}{\sqrt{\sum _{j=0}^{N}({\varTheta }_{j}^{(i)})^2}} \end{aligned}$$
(6)

where \({\varTheta }^{(i)}\) refers to the vector of FAC values of all filters in layer i, and \({\varTheta }_{j}^{(i)}\) refers to the FAC of the j-th filter in layer i.
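Putting Eqs. (5) and (6) together, the FAC values of one layer can be computed and then L2-normalized so that filters from different layers become comparable in a single global ranking. A minimal sketch under the same assumptions as the snippets above (the small `eps` is a numerical safeguard we add; it is not part of Eq. (6)):

```python
import torch

def fac(sparsity: torch.Tensor, dispersion: torch.Tensor) -> torch.Tensor:
    """Eq. (5): element-wise product of per-filter sparsity and dispersion."""
    return sparsity * dispersion

def normalize_layer(fac_values: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Eq. (6): layer-wise L2 normalization of one layer's FAC vector."""
    return fac_values / (fac_values.norm(p=2) + eps)

# Global ranking sketch: gather (layer index, filter index, normalized FAC)
# triples over all convolutional layers and prune the smallest values first.
```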

3.6 Pruning and Fine-Tuning Strategy

There are two main types of network architectures: traditional convolutional plus fully-connected architectures, and their structural variants; VGG and ResNet are typical representatives, and we mainly introduce the pruning methods for these two networks. As shown in Table 1, more than \(39\%\) of the parameters of VGG-16 exist in the fully-connected layers. Some papers [20] use global average pooling instead of the fully-connected layers, which greatly reduces the number of parameters but also greatly slows the convergence of the model, which may make it difficult to train the model back to its original accuracy. Therefore, we reduce the parameters of the fully-connected layers by pruning filters of the last convolutional layer, which reduces the number of input channels of the FC1 layer, as sketched below.
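The following sketch shows the bookkeeping this implies: after pruning filters of the last convolutional layer, the first fully-connected layer is rebuilt so that its input features match the surviving channels. The layer names, the `keep_idx` argument, and the assumption that features are flattened in (channel, height, width) order are ours, not from the original code:

```python
import torch
import torch.nn as nn

def shrink_fc1(fc1: nn.Linear, keep_idx: torch.Tensor, spatial: int) -> nn.Linear:
    """Rebuild FC1 after pruning the last conv layer.
    keep_idx: indices of the filters kept in the last conv layer.
    spatial:  H*W of the feature map that is flattened into FC1."""
    new_fc1 = nn.Linear(keep_idx.numel() * spatial, fc1.out_features,
                        bias=fc1.bias is not None)
    # Keep only the weight columns that correspond to the surviving channels.
    w = fc1.weight.data.view(fc1.out_features, -1, spatial)  # (out, C_old, H*W)
    new_fc1.weight.data = w[:, keep_idx, :].reshape(fc1.out_features, -1).clone()
    if fc1.bias is not None:
        new_fc1.bias.data = fc1.bias.data.clone()
    return new_fc1

# Example: keep 256 of 512 filters, 4x4 spatial map feeding FC1.
# new_fc1 = shrink_fc1(nn.Linear(512 * 16, 4096), torch.arange(256), spatial=16)
```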

Fig. 3. Our pruning strategy for ResNet. For each residual block, the filters of the final convolutional layer cannot be pruned; its input channels are reduced by pruning the previous layer.

For ResNet, the so-called “identity shortcut connections” impose some restrictions on the pruning process. For example, the summation operation requires that the number of output channels of each block in the same group be consistent (see Fig. 3). The ResNet architecture mainly uses two kinds of residual modules: one connects two \(3\times 3\) convolutional layers in series, and the other concatenates \(1\times 1\), \(3\times 3\), and \(1\times 1\) convolutional layers as one residual module.
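A small sketch of how this constraint can be enforced in code: only convolutional layers whose output stays inside a residual block are candidates for filter pruning, while the last convolution of each block keeps its output channels to match the shortcut. The attribute names `conv1`/`conv2` assume a two-layer basic block and are illustrative:

```python
import torch.nn as nn

def prunable_convs(model: nn.Module):
    """Collect conv layers whose output filters may be pruned. In a basic
    residual block (conv1 -> conv2 + shortcut), only conv1 is a candidate:
    conv2 must keep its output channels to match the identity shortcut."""
    candidates = []
    for module in model.modules():
        if hasattr(module, "conv1") and hasattr(module, "conv2"):
            candidates.append(module.conv1)
    return candidates
```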

The final question is how to fine-tune the entire network during the pruning process. Our strategy is to prune and retrain iteratively. We found that most pruning methods prune the filters of each layer at a fixed pruning rate and then fine-tune the model a few times, but if the pruning rate is too high, the filter structure of a layer may be damaged because too many of its filters are pruned at once; this problem becomes more apparent in networks for more complex tasks.

Instead, our method presets a compression ratio \(\alpha \) for the whole network. We need to prune \(N\times (1-\alpha )\) filters, but we only prune \(\beta \) filters at a time and then fine-tune the whole network for a few epochs to recover its performance. In this way, we perform \(N\times (1-\alpha )/\beta \) pruning steps, and the \(\beta \) filters pruned in each step are distributed across all convolutional layers of the network rather than concentrated in a single layer. In our experiments \(\beta \) is set to 256. This fine-tuning scheme obtained better results in our experiments than the alternatives.
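The schedule described above can be summarized as the following loop. The helper callables are placeholders for the steps already described (global FAC ranking with layer-wise normalization, physical filter removal, and short fine-tuning); they are assumptions of this sketch, not functions from the released code, and the epoch counts follow the settings reported in Sect. 4:

```python
def iterative_pruning(model, total_filters, rank_filters_by_fac, prune_filters,
                      finetune, alpha=0.2, beta=256):
    """Prune N*(1-alpha) filters in steps of beta filters, fine-tuning briefly
    after every step and for longer after the final step."""
    to_remove = int(total_filters * (1 - alpha))
    removed = 0
    while removed < to_remove:
        ranking = rank_filters_by_fac(model)      # globally sorted, lowest FAC first
        step = min(beta, to_remove - removed)
        prune_filters(model, ranking[:step])      # remove the weakest filters
        removed += step
        epochs = 20 if removed >= to_remove else 8
        finetune(model, epochs=epochs)            # recover generalization
    return model
```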

Table 1. Overall performance of our approach in reducing FLOPs and parameters of the VGG-16 model. The experiment is based on the CIFAR10 dataset. Note that we resize the \(32\times 32\) CIFAR images to \(128\times 128\).

4 Experiments

We applied our pruning method to two typical CNN models: VGG-16 and ResNet-110, implemented in the deep learning framework PyTorch. The validity of the algorithm is verified on two datasets, CIFAR-10 and CUB_200_2011. The CIFAR-10 dataset consists of 60000 images of size \(32\times 32\) in 10 categories, with 6000 images per category. During training, images are resized to \(128\times 128\), because if the images are too small, the FLOPs of the network itself are very small and the acceleration effect after pruning is not obvious. CUB_200_2011 is a bird dataset for fine-grained classification that contains 11788 images of 200 bird species; it presents a significant challenge for pruning algorithms that aim to maximize model compression and acceleration without reducing accuracy too much. During training, all images of CUB_200_2011 are resized to \(320\times 320\). After each pruning step, we fine-tune the whole network for 8 epochs with the learning rate varying from \({10}^{-3}\) to \({10}^{-5}\). After the last pruning step, the network is fine-tuned for 20 epochs with the learning rate varying from \({10}^{-3}\) to \({10}^{-8}\). All experiments are run on a computer equipped with an Nvidia GTX 1080Ti GPU.

4.1 VGG-16 Pruning

The detailed distribution of FLOPs and parameters across the layers of VGG-16 is shown in Table 1. As we can see, the 2nd-12th convolutional layers contain 90% of the FLOPs, and our pruning method is mainly aimed at layers 3-12. For the first two convolutional layers, there is no large-scale pruning; we think the filters in the first two layers contain rich feature information and therefore have stronger Feature Abstraction Capability than other filters, which indirectly shows that our method has a certain degree of interpretability. The pruning rate is set to 80%, that is, 80% of the filters in the model are pruned. Finally, we compare our method with the following baselines on the VGG-16 model:

Taylor Expansion [19]: The effect of the filter on the network loss function is calculated based on the Taylor expansion method. According to this criterion, if the filter has little effect on the loss function, the filter can be safely removed.

APoZ [18]: The criterion measures the importance of filters by calculating the average percentage of zeros (APoZ) in the activated feature map.

Entropy [20]: Calculates the entropy of filters to measure the information richness of the convolution kernel.

Table 2. Comparison of different model compression methods for VGG-16 network on CIFAR10.
Table 3. Comparison of different model compression methods for VGG-16 network on CUB_200_2011.

As shown in Tables 2 and 3, we used different algorithms to prune the VGG-16 network on the CIFAR10 and CUB_200_2011 datasets; the APoZ method prunes the filters of each layer at a fixed rate. We can see that when the pruning rate reaches 80%, the accuracy of the APoZ-pruned model drops very seriously. In the Entropy method, Luo et al. used GAP instead of the fully-connected layers, which greatly reduces the parameters of the model but has a greater impact on prediction accuracy (and greatly increases the difficulty of convergence). The Taylor method uses a pruning strategy similar to ours; it achieves better model size compression, but at the same pruning rate our method has less accuracy loss and better acceleration. The two tables also show that the larger the input image, the greater the acceleration from pruning. On the CUB_200_2011 dataset, where the input image size is \(320\times 320\), our method achieves about a \(4.7{\times } \) reduction in FLOPs and parameters with a 0.006 decrease in accuracy. When the pruning rate is 50%, the accuracy of the pruned model is even higher than that of the original model, and higher than that of the other pruning methods.

By comparison, we can see that our FAC-based pruning method has better overall performance, and there is a better balance between model compression and model acceleration at the same pruning rate.

4.2 ResNet-110 Pruning

The network structure of ResNet-110 is divided into three stages by the residual blocks, with corresponding feature map sizes of \(32\times 32\), \(16\times 16\), and \(8\times 8\), respectively. Following the pruning process for ResNets in Sect. 3.6, the pruned model for ResNet-110 is obtained on CIFAR-10. During training, the images are randomly cropped to \(32\times 32\).

Table 4. The pruned models for ResNet-110 on CIFAR-10 with different pruning rates.

The overall performance of our method on pruning ResNet-110 is shown in Table 4. We prune this model with two different pruning rates (pruning 20% and 30% of the filters, respectively). The best pruned model achieves a \(2.48{\times } \) reduction in FLOPs and parameters with a 0.007 decrease in accuracy. Unlike traditional CNN architectures, ResNet is more compact; there is less redundancy than in the VGG-16 model, so it is more difficult to remove a large number of filters. However, with a small pruning rate, our method can even improve the performance of ResNet-110.

5 Conclusion

In this paper, we propose a pruning framework based on the Feature Abstraction Capability of filters to accelerate and compress CNN models simultaneously in the training and inference phases. Compared with previous pruning strategies, the pruned model has better performance. Our approach does not depend on any proprietary libraries, so it can be widely used in practical applications built on current deep learning frameworks.

In the future, we want to further explore the interpretability of model pruning and design pruning strategies that are better suited to different visual tasks (such as semantic segmentation, object detection, and image restoration). The pruned networks will greatly accelerate these visual tasks.