Signal Processing

Volume 156, March 2019, Pages 84-91

Channel pruning based on mean gradient for accelerating Convolutional Neural Networks

https://doi.org/10.1016/j.sigpro.2018.10.019

Highlights

  • Channel pruning is applied to reduce the huge memory consumption and high computational complexity of convolutional neural networks.

  • A new pruning criterion based on the mean gradient effectively measures the importance of channels to network performance.

  • A hierarchical global pruning strategy, which improves on the global pruning strategy, achieves a significant reduction in the Floating Point Operations (FLOPs) of networks.

Abstract

Convolutional Neural Networks (CNNs) are getting deeper and wider to improve their performance, which in turn increases their computational complexity. We apply channel pruning to accelerate CNNs and reduce their computational cost. A new pruning criterion based on the mean gradient is proposed for convolutional kernels. To significantly reduce the Floating Point Operations (FLOPs) of CNNs, a hierarchical global pruning strategy is introduced. In each pruning step, the importance of convolutional kernels is evaluated by the mean gradient criterion, and the hierarchical global pruning strategy removes the less important kernels, yielding a smaller CNN model. Finally, we fine-tune the model to restore network performance. Experimental results show that a VGG-16 network pruned by channel pruning on CIFAR-10 achieves a 5.64× reduction in FLOPs with less than a 1% decrease in accuracy, while a ResNet-110 network pruned on CIFAR-10 achieves a 2.48× reduction in FLOPs and parameters with only a 0.08% decrease in accuracy.

Introduction

Convolutional Neural Networks (CNNs) have achieved remarkable success in various recognition tasks [1], [2], [3], especially in computer vision [4], [5], [6]. CNNs achieve state-of-the-art performance in these fields compared with traditional methods based on manually designed visual features [7]. However, these deep neural networks have a huge number of parameters. For example, the AlexNet [4] network contains about 6 × 10⁷ parameters, while a better-performing network such as VGG [6] contains about 1.44 × 10⁸ parameters, which leads to higher memory and computational costs. For instance, the VGG-16 model takes up more than 500 MB of storage and needs 1.56 × 10¹⁰ Floating Point Operations (FLOPs) to classify a single image. The huge memory and high computational costs of CNNs restrict the application of deep learning on mobile devices with limited resources [8]. Moreover, deep learning models are known to be over-parameterized [9]. Denil et al. [10] pointed out that deep neural networks can be reconstructed from a subset of their parameters without affecting network performance, which means that neural network models contain a huge number of redundant connections, and we can reduce memory and computational costs by pruning and compressing such connections [11], [12].
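
As a rough illustration of where such FLOPs figures come from, the following Python sketch counts the operations of a single convolutional layer under the common convention that a multiply and an add are counted separately; the layer shape used (the first 3 × 3 convolution of VGG-16 on a 224 × 224 image) is only an example, not the full network.

```python
def conv_flops(c_in, c_out, k, h_out, w_out):
    """FLOPs of one convolutional layer, counting a multiply-accumulate as 2 operations."""
    return 2 * c_in * k * k * c_out * h_out * w_out

# First VGG-16 convolutional layer on a 224x224 RGB image: 3 -> 64 channels, 3x3 kernels.
print(conv_flops(3, 64, 3, 224, 224))  # ~1.7e8 FLOPs for this single layer alone
```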

The huge memory consumption and high computational complexity of deep neural networks drive research on compression [13], [14] and acceleration algorithms [15], [16], and pruning [17] is one of the effective methods. In the 1990s, LeCun et al. [18] introduced the Optimal Brain Damage pruning strategy; they observed that several unimportant weight connections could be safely removed from a well-trained network with negligible impact on network performance. Hassibi et al. [19] proposed the similar Optimal Brain Surgeon pruning strategy and pointed out that the importance of a weight is determined by the second derivative. However, these two methods need to calculate the Hessian matrix, which increases the memory consumption and computational complexity of the network model. Recently, Han et al. [20], [21] reported impressive compression rates and an effective decrease in the number of parameters of the AlexNet and VGG networks by pruning weight connections with small magnitudes and then retraining without hurting overall accuracy. The decrease in parameters was mainly concentrated in the fully connected layers, which achieved a 3∼4× speedup of those layers during inference. However, this pruning operation generates an unstructured [22] sparse model, which additionally requires sparse BLAS libraries [23] or even specialized hardware to achieve acceleration [16]. Similar to our study, Li et al. [24] measured the relative importance of a convolutional kernel in each layer by calculating the sum of its absolute weights, i.e., its ℓ1 norm. Compared with this minimum weight criterion [24], our criterion is based on the mean gradient of the feature maps in each layer, which more directly reflects the importance of the features extracted by the convolutional kernels. Another pruning criterion exploits the sparsity of activations after the non-linear ReLU [25] mapping. Hu et al. [26] argued that if most outputs of a neuron after the non-linearity are zero, that neuron is more likely to be redundant; their criterion measures the importance of a neuron by its Average Percentage of Zeros (APoZ). However, the APoZ criterion requires threshold parameters, which vary from layer to layer. These two criteria reflect the importance of channels simply and intuitively through convolutional kernels or feature maps, but they do not directly consider the final loss after pruning. In this paper, the pruning algorithm is based on the importance of the feature map in each channel and takes into account the effect of pruning a channel on network performance. Meanwhile, a hierarchical global pruning strategy and a FLOPs constraint are introduced to significantly reduce the network FLOPs.
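
To make the criteria discussed above concrete, the following PyTorch sketch gives per-channel importance scores for the minimum weight (ℓ1 norm) criterion [24], the APoZ criterion [26], and a mean-gradient style score. The exact form of the paper's mean-gradient criterion is defined in Section 3; the version below, like all function names here, is only an illustrative assumption.

```python
import torch

def l1_norm_score(conv_weight):
    # Minimum weight criterion [24]: sum of absolute kernel weights per output channel.
    # conv_weight has shape (C_out, C_in, K, K).
    return conv_weight.abs().sum(dim=(1, 2, 3))

def apoz_score(relu_output):
    # APoZ [26]: average percentage of zeros per channel after the ReLU.
    # relu_output has shape (N, C, H, W); a higher score suggests a more redundant channel.
    return (relu_output == 0).float().mean(dim=(0, 2, 3))

def mean_gradient_score(feature_map_grad):
    # Sketch of a mean-gradient style score (an assumption; see Section 3 for the paper's
    # definition): mean absolute gradient of the loss w.r.t. each channel's feature map,
    # averaged over the batch and spatial positions.
    # feature_map_grad has shape (N, C, H, W), e.g. collected with backward hooks.
    return feature_map_grad.abs().mean(dim=(0, 2, 3))
```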

The rest of the paper is organized as follows. Section 2 describes channel pruning for CNNs with different structures. Section 3 proposes the pruning criterion based on the mean gradient and the hierarchical global pruning strategy. Section 4 demonstrates the effectiveness of the algorithm through experimental comparisons. Section 5 concludes the paper.

Section snippets

Pruning channels and corresponding feature maps

This paper mainly studies the effect of channel pruning on reducing network FLOPs. Convolutional layers account for more than 90% of the FLOPs of common CNNs [27]. Therefore, we only prune convolutional layers; Sections 2.1 and 2.2 implement the pruning of channels and their corresponding feature maps for different network structures, respectively.
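
As an illustration of what pruning a channel and its corresponding feature map involves, the following PyTorch sketch removes one output channel (kernel) of a convolutional layer together with the matching input channel of the next layer in a plain, VGG-like network; the function and its signature are hypothetical, and networks with shortcut connections (Section 2.2) need additional handling.

```python
import torch
import torch.nn as nn

def prune_channel(conv_i: nn.Conv2d, conv_next: nn.Conv2d, channel: int):
    """Remove output channel `channel` of conv_i and the matching input channel of conv_next."""
    keep = torch.tensor([c for c in range(conv_i.out_channels) if c != channel])

    # Rebuild layer i without the pruned kernel.
    new_i = nn.Conv2d(conv_i.in_channels, len(keep), conv_i.kernel_size,
                      conv_i.stride, conv_i.padding, bias=conv_i.bias is not None)
    new_i.weight.data = conv_i.weight.data[keep].clone()
    if conv_i.bias is not None:
        new_i.bias.data = conv_i.bias.data[keep].clone()

    # Rebuild layer i+1 without the corresponding input channel.
    new_next = nn.Conv2d(len(keep), conv_next.out_channels, conv_next.kernel_size,
                         conv_next.stride, conv_next.padding,
                         bias=conv_next.bias is not None)
    new_next.weight.data = conv_next.weight.data[:, keep].clone()
    if conv_next.bias is not None:
        new_next.bias.data = conv_next.bias.data.clone()
    return new_i, new_next
```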

Channel pruning strategy

The proposed channel pruning strategy consists of the following steps: (1) start from a pre-trained network model; (2) evaluate the importance of the feature map on each channel by the mean gradient criterion; (3) adopt a hierarchical global pruning strategy to prune the less important channels and their corresponding feature maps; (4) alternate iterations of pruning and fine-tuning; (5) stop pruning once the desired pruning target is achieved. The flow chart is depicted in Fig. 3. Our desired …
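
A minimal sketch of the control flow in steps (1)-(5) is given below; the callables score_fn, prune_fn, finetune_fn and flops_fn are hypothetical placeholders for the mean gradient criterion, the hierarchical global pruning step, fine-tuning and FLOPs counting, and are not part of the paper.

```python
def channel_pruning(model, score_fn, prune_fn, finetune_fn, flops_fn,
                    target_flops, channels_per_step=64):
    """Iterative channel pruning loop (sketch of steps (1)-(5))."""
    # Step (1): `model` is a pre-trained network.
    while flops_fn(model) > target_flops:                   # step (5): stop at the FLOPs target
        scores = score_fn(model)                            # step (2): mean-gradient importance
        model = prune_fn(model, scores, channels_per_step)  # step (3): hierarchical global pruning
        model = finetune_fn(model)                          # step (4): fine-tune to recover accuracy
    return model
```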

Experiments

To verify the validity of our algorithm, the following experiments are conducted. The effect on network accuracy of removing channels in different orders of mean gradient is examined in Section 4.1, which indicates that channels with a larger mean gradient are more important to network performance. The comparison between our strategy and the global pruning strategy is shown in Section 4.2. Comparisons of different pruning criteria are given in Section 4.3, which show that our algorithm can …

Conclusion

In this paper, we apply channel pruning to accelerate CNNs, introduce a new criterion based on the mean gradient of feature maps, and propose a hierarchical global pruning strategy to effectively reduce network FLOPs. In each pruning step, we measure the importance of the feature map on each channel by its mean gradient and use the hierarchical global pruning strategy to remove the less important feature maps, obtaining a smaller network model. We focus on the effect of removing feature maps on …

Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant Number: 61801325), the Natural Science Foundation of Tianjin City (Grant Number: 18JCQNJC00600) and the Huawei Innovation Research Program (Grant Number: HO2018085138).

References (29)

  • R. Girshick

    Fast R-CNN

    IEEE International Conference on Computer Vision

    (2015)
  • H. Noh et al.

    Learning deconvolution network for semantic segmentation

    IEEE International Conference on Computer Vision

    (2016)
  • X. Jia et al.

    Guiding the long-short term memory model for image caption generation

    IEEE International Conference on Computer Vision

    (2016)
  • A. Krizhevsky et al.

    ImageNet classification with deep convolutional neural networks

    International Conference on Neural Information Processing Systems

    (2012)
  • K. Simonyan et al.

    Very deep convolutional networks for large-scale image recognition

    arXiv preprint arXiv:1409.1556

    (2014)
  • K. He et al.

    Deep residual learning for image recognition

    Computer Vision and Pattern Recognition

    (2016)
  • M. Kuhn et al.

    An introduction to feature selection

    Applied Predictive Modeling

    (2013)
  • C. Szegedy et al.

    Rethinking the inception architecture for computer vision

    Computer Vision and Pattern Recognition

    (2016)
  • Y.D. Kim et al.

    Compression of deep convolutional neural networks for fast and low power mobile applications

    Comput. Sci.

    (2015)
  • M. Denil et al.

    Predicting parameters in deep learning

    Advances in Neural Information Processing Systems

    (2013)
  • E.L. Denton et al.

    Exploiting linear structure within convolutional networks for efficient evaluation

    Advances in Neural Information Processing Systems

    (2014)
  • G.E. Hinton et al.

    Improving neural networks by preventing co-adaptation of feature detectors

    Comput. Sci.

    (2012)
  • H. Zhou et al.

    Less is more: towards compact CNNs

    European Conference on Computer Vision

    (2016)
  • A. Novikov et al.

    Tensorizing neural networks

    Advances in Neural Information Processing Systems

    (2015)