Abstract
Deep neural networks achieve state-of-the-art results on several tasks while increasing in complexity. It has been shown that neural networks can be pruned during training by imposing sparsity-inducing regularizers. In this paper, we investigate two techniques for group-wise pruning during training in order to improve network efficiency. We propose a gating factor after every convolutional layer to induce channel-level sparsity, encouraging insignificant channels to become exactly zero. Further, we introduce and analyse a bounded variant of the \(\ell _1\) regularizer, which interpolates between the \(\ell _1\) and \(\ell _0\)-norms to retain the performance of the network at higher pruning rates. To underline the effectiveness of the proposed methods, we show that the number of parameters of ResNet-164, DenseNet-40 and MobileNetV2 can be reduced by \(30\%\), \(69\%\), and \(75\%\) on CIFAR100 respectively without a significant drop in accuracy. We achieve state-of-the-art pruning results for ResNet-50 with higher accuracy on ImageNet. Furthermore, we show that the lightweight MobileNetV2 can be further compressed on ImageNet without a significant drop in performance.
T. Genewein is currently at DeepMind.
Notes
1. We changed the average pooling kernel size from \(7\times 7\) to \(4\times 4\) and the stride from 2 to 1 in the first convolutional layer and also in the second block of the bottleneck structure of the network.
References
Achterhold, J., Koehler, J.M., Schmeink, A., Genewein, T.: Variational network quantization. In: ICLR 2018 (2018)
Alvarez, J.M., Salzmann, M.: Learning the number of neurons in deep networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 2270–2278 (2016)
Chen, W., Wilson, J., Tyree, S., Weinberger, K., Chen, Y.: Compressing neural networks with the hashing trick. In: International Conference on Machine Learning (ICML), pp. 2285–2294 (2015)
Cheng, Y., Wang, D., Zhou, P., Zhang, T.: A survey of model compression and acceleration for deep neural networks. arXiv:1710.09282 (2017)
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: Training deep neural networks with weights and activations constrained to \(+1\) or \(-1\). arXiv:1602.02830 (2016)
Federici, M., Ullrich, K., Welling, M.: Improved Bayesian compression. arXiv:1711.06494 (2017)
Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding small, trainable neural networks. arXiv:1803.03635 (2018)
Ghosh, S., Yao, J., Doshi-Velez, F.: Structured variational learning of Bayesian neural networks with horseshoe priors. arXiv:1806.05975 (2018)
Gong, Y., Liu, L., Yang, M., Bourdev, L.: Compressing deep convolutional networks using vector quantization. arXiv:1412.6115 (2014)
Guo, Y., Yao, A., Chen, Y.: Dynamic network surgery for efficient DNNs. In: Advances In Neural Information Processing Systems (NIPS), pp. 1379–1387 (2016)
Gysel, P., Pimentel, J., Motamedi, M., Ghiasi, S.: Ristretto: a framework for empirical study of resource-efficient inference in convolutional neural networks. IEEE Trans. Neural Netw. Learn. Syst. 29(11), 5784–5789 (2018)
Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In: International Conference on Learning Representations (ICLR) (2016)
Han, S., et al.: DSD: regularizing deep neural networks with dense-sparse-dense training flow. In: International Conference on Learning Representations (ICLR) (2017)
Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems (NIPS), pp. 1135–1143 (2015)
Hanson, S.J., Pratt, L.Y.: Comparing biases for minimal network construction with back-propagation. In: Advances in Neural Information Processing Systems (NIPS), pp. 177–185 (1989)
He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: International Conference on Computer Vision (ICCV), vol. 2 (2017)
Howard, A.G., et al.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 (2017)
Hu, H., Peng, R., Tai, Y.W., Tang, C.K.: Network trimming: a data-driven neuron pruning approach towards efficient deep architectures. arXiv:1607.03250 (2016)
Huang, Z., Wang, N.: Data-driven sparse structure selection for deep neural networks. arXiv:1707.01213 (2017)
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. (JMLR) 18(1), 6869–6898 (2017)
Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and \(<\)0.5 MB model size. arXiv:1602.07360 (2016)
Karaletsos, T., Rätsch, G.: Automatic relevance determination for deep generative models. arXiv:1505.07765 (2015)
Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. In: International Conference on Learning Representations (ICLR) (2017)
Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2755–2763. IEEE (2017)
Louizos, C., Ullrich, K., Welling, M.: Bayesian compression for deep learning. In: Advances in Neural Information Processing Systems (2017)
Louizos, C., Welling, M., Kingma, D.P.: Learning sparse neural networks through \(L_0\) regularization. In: ICLR 2018 (2018)
Luo, J.H., Wu, J., Lin, W.: ThiNet: a filter level pruning method for deep neural network compression. In: ICCV 2017 (2017)
MacKay, D.J.: Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Netw. Comput. Neural Syst. 6(3), 469–505 (1995)
Molchanov, D., Ashukha, A., Vetrov, D.: Variational dropout sparsifies deep neural networks. In: ICML 2017 (2017)
Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient inference. In: ICLR 2017 (2017)
Neal, R.M.: Bayesian Learning for Neural Networks. Ph.D. thesis, University of Toronto (1995)
Neklyudov, K., Molchanov, D., Ashukha, A., Vetrov, D.: Structured Bayesian pruning via log-normal multiplicative noise. arXiv:1705.07283 (2017)
Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-net: imagenet classification using binary convolutional neural networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 525–542. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_32
Sze, V., Chen, Y.H., Yang, T.J., Emer, J.: Efficient processing of deep neural networks: A tutorial and survey. arXiv:1703.09039 (2017)
Ullrich, K., Meeds, E., Welling, M.: Soft weight-sharing for neural network compression. In: ICLR 2017 (2017)
Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems, pp. 2074–2082 (2016)
Weston, J., Elisseeff, A., Schölkopf, B., Tipping, M.: Use of the zero-norm with linear models and kernel methods. J. Mach. Learn. Res. (JMLR) 3, 1439–1461 (2003)
Wu, S., Li, G., Chen, F., Shi, L.: Training and inference with integers in deep neural networks. arXiv:1802.04680 (2018)
Ye, J., Lu, X., Lin, Z., Wang, J.Z.: Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv:1802.00124 (2018)
Zhou, A., Yao, A., Guo, Y., Xu, L., Chen, Y.: Incremental network quantization: Towards lossless CNNs with low-precision weights. arXiv:1702.03044 (2017)
Zhou, H., Alvarez, J.M., Porikli, F.: Less is more: towards compact CNNs. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 662–677. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_40
Appendices
A1 Proof of Lemma 1
To improve readability, we will restate Lemma 1 from the main text:
The mapping \(\Vert .\Vert _{\text {bound-}p,\sigma }\) has the following properties:

- For \(\sigma \rightarrow 0^{+}\) the bounded-norm converges towards the \(0\)-norm:
  $$\begin{aligned} \lim \limits _{\sigma \rightarrow 0^{+}} \Vert x\Vert _{\text {bound-}p,\sigma } = \Vert x\Vert _{0}. \end{aligned}$$ (1)
- In case \(|x_i| \approx 0\) for all coefficients of \(x\), the bounded-norm of \(x\) is approximately equal to the \(p\)-norm of \(x\) weighted by \(1/\sigma \):
  $$\begin{aligned} \Vert x\Vert _{\text {bound-}p,\sigma } \approx \left\Vert \frac{x}{\sigma }\right\Vert _{p}^{p} \end{aligned}$$ (2)
Proof
The first statement Eq. (1) can easily be seen from the definition \(\Vert x\Vert _{\text {bound-}p,\sigma } = \sum _i \left( 1 - \exp \left( -\frac{|x_i|^p}{\sigma ^p}\right) \right) \) together with the pointwise limit
$$\begin{aligned} \lim \limits _{\sigma \rightarrow 0^{+}} \left( 1 - \exp \left( -\frac{|x_i|^p}{\sigma ^p}\right) \right) = {\left\{ \begin{array}{ll} 0 &{} \text {if } x_i = 0,\\ 1 &{} \text {if } x_i \ne 0, \end{array}\right. } \end{aligned}$$
so that summing over all coefficients recovers \(\Vert x\Vert _{0}\).
For the second statement Eq. (2) we use the Taylor expansion of \(\exp \) around zero to get:
$$\begin{aligned} \Vert x\Vert _{\text {bound-}p,\sigma } = \sum _i \left( 1 - \exp \left( -\frac{|x_i|^p}{\sigma ^p}\right) \right) = \sum _i \sum _{j=1}^{\infty } \frac{(-1)^{j+1}}{j!} \left( \frac{|x_i|}{\sigma }\right) ^{pj}. \end{aligned}$$
For \(|x_i| \approx 0\) we keep only the leading coefficient \(j = 1\), yielding
$$\begin{aligned} \Vert x\Vert _{\text {bound-}p,\sigma } \approx \sum _i \frac{|x_i|^p}{\sigma ^p} = \left\Vert \frac{x}{\sigma }\right\Vert _{p}^{p}. \end{aligned}$$    \(\square \)
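Both statements of the lemma can be checked numerically, assuming the definition \(\Vert x\Vert _{\text {bound-}p,\sigma } = \sum _i (1 - \exp (-|x_i|^p/\sigma ^p))\); the function name `bounded_norm` below is ours, not from any released code:

```python
import math

def bounded_norm(x, p=1.0, sigma=1.0):
    # Bounded-p norm: sum_i (1 - exp(-|x_i|^p / sigma^p)); each term lies in [0, 1).
    return sum(1.0 - math.exp(-abs(xi) ** p / sigma ** p) for xi in x)

x = [0.0, 0.5, -2.0, 0.0, 3.0]

# Statement 1: as sigma -> 0+, the bounded norm counts the non-zero entries.
l0 = sum(1 for xi in x if xi != 0)
print(bounded_norm(x, p=1, sigma=1e-3), l0)  # close to 3.0 vs 3

# Statement 2: for small |x_i|, bounded-l1 is close to the l1-norm scaled by 1/sigma.
small = [1e-4, -2e-4, 3e-4]
print(bounded_norm(small, p=1, sigma=1.0), sum(abs(xi) for xi in small))
```

Note that `math.exp` underflows gracefully to 0.0 for large negative arguments, so very small `sigma` values are numerically harmless here.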
A2 Experiment Details
Both the CIFAR100 and ImageNet datasets are augmented with standard techniques such as random horizontal flips and random crops of the zero-padded input image, and are further processed with mean-std normalization. The MobileNetV2 architecture was originally designed for classification on ImageNet. We adapt the network (see Note 1) to fit the input resolution \(32\times 32\) of CIFAR100. ResNet-164 is a pre-activation ResNet architecture containing 164 layers with bottleneck structure, while for DenseNet we use a 40-layer network with growth rate 12. All networks are trained from scratch (weights are randomly initialized and biases are disabled for all convolutional and fully connected layers) with a hyperparameter search over the regularization strength \(\lambda _1\) for the \(\ell _1\) or bounded-\(\ell _1\) regularizers and the weight decay \(\lambda _2\) on each dataset. The scaling factor \(\gamma \) of BN is initialized with 1.0 in the case of the exponential gate, while it is initialized with 0.5 for the linear gate as described in [24]; the bias \(\beta \) is set to zero. The hyperparameter \(\sigma \) in the bounded-\(\ell _1\) regularizer is set to 1.0 when scheduling of this parameter is disabled. All gating parameters g are initialized with 1.0.
We use the standard categorical cross-entropy loss, and an additional penalty is added to the objective in the form of weight decay and the sparsity-inducing \(\ell _1\) or bounded-\(\ell _1\) regularizers. Note that \(\ell _1\) and bounded-\(\ell _1\) regularization act only on the gating parameters g, whereas weight decay regularizes all network parameters including the gating parameters g. We reimplemented the technique proposed in [24], which imposes \(\ell _1\) regularization on the scaling factor \(\gamma \) of the Batch Normalization layers to induce channel-level sparsity. We refer to this method as \(\ell _1\) on linear gate and compare it against our methods bounded-\(\ell _1\) on linear gate, \(\ell _1\) on exponential gate and bounded-\(\ell _1\) on exponential gate. We train ResNet-164, DenseNet-40 and ResNet-50 for \(240\), \(130\) and \(100\) epochs respectively, and drop their learning rates by a factor of \(10\) after \((120, 200, 220)\), \((100, 110, 120)\) and \((30, 60, 90)\) epochs respectively. The networks are trained with batch size 128 using the SGD optimizer with initial learning rate 0.1 and momentum 0.9 unless specified otherwise. Below, we present the training details of each architecture individually.
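The composition of the training objective described above can be sketched as follows. This is a minimal scalar illustration (actual training operates on weight tensors with automatic differentiation), and the function names are ours:

```python
import math

def bounded_l1(gates, sigma=1.0):
    # Bounded-l1 regularizer acting only on the gating parameters g:
    # each gate contributes 1 - exp(-|g_i| / sigma), which is bounded by 1.
    return sum(1.0 - math.exp(-abs(g) / sigma) for g in gates)

def total_loss(ce_loss, gates, all_params, lam1=1e-4, lam2=5e-4, sigma=1.0):
    # Cross-entropy + sparsity penalty on the gates only
    # + weight decay on all parameters (gating parameters included).
    sparsity = lam1 * bounded_l1(gates, sigma)
    weight_decay = lam2 * sum(w * w for w in all_params)
    return ce_loss + sparsity + weight_decay
```

Setting `lam1 = 0` recovers plain weight-decay training, and replacing `bounded_l1` with `sum(abs(g) for g in gates)` gives the plain \(\ell _1\) variant.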
LeNet5-Caffe: Since this architecture does not contain Batch Normalization layers, we do not compare our results with the method \(\ell _1\) on linear gate. We train the network with exponential gating layers added after every convolutional/fully connected layer except the output layer, and apply the different regularizers (\(\ell _1\), bounded-\(\ell _1\) and weight decay) separately to evaluate their pruning results. We set the weight decay to zero when training with the \(\ell _1\) or bounded-\(\ell _1\) regularizers. The network is trained for 200 epochs with weight decay only and for 60 epochs with the other regularizers.
ResNet-50: We train the network on ImageNet with exponential gating layers added after every convolutional layer. We evaluate the performance of the network for different values of the regularization strength \(\lambda _1\): \(10^{-5}\), \(5\times 10^{-5}\) and \(10^{-4}\). The weight decay \(\lambda _2\) is enabled for all settings of \(\lambda _1\) and set to \(10^{-4}\). We analyze the influence of the exponential gate and compare against existing methods.
ResNet-164: We use a dropout rate of \(0.1\) after the first Batch Normalization layer in every Bottleneck structure. Here, every convolutional layer in the network is followed by an exponential gating layer.
DenseNet-40: We use a dropout of 0.05 after every convolutional layer in the Dense block. Here, the exponential gating layer is added after every convolutional layer in the network except the first convolutional layer.
MobileNetV2: On CIFAR100, we train the network for 240 epochs, dropping the learning rate by a factor of 10 after 200 and 220 epochs. A dropout of 0.3 is applied after the global average pooling layer. On ImageNet, we train this network for 100 epochs, in contrast to the standard training of 400 epochs. We start with a learning rate of 0.045 and reduce it by a factor of 10 after 30, 60 and 90 epochs. We evaluate the performance of the exponential gate against the linear gate with the \(\ell _1\) regularizer and also test the significance of bounded-\(\ell _1\) on the linear gate. An exponential gating layer is added after every standard convolutional/depthwise separable convolutional layer in the network.
On CIFAR100, we investigate the influence of the weight decay, \(\ell _1\) and bounded-\(\ell _1\) regularizers and the role of the linear and exponential gates on every architecture. We also study the influence of scheduling \(\sigma \) in the bounded-\(\ell _1\) regularizer on this dataset. For MobileNetV2, we initialize \(\sigma \) with 2.0 and decay it at a rate of 0.99 after every epoch. In the case of ResNet-164 and DenseNet-40, we initialize the hyperparameters \(\lambda _1\) and \(\lambda _2\) with \(10^{-4}\) and \(5\times 10^{-4}\) respectively, and \(\sigma \) with 2.0. We increase \(\lambda _1\) to \(5\times 10^{-4}\) after 120 epochs; \(\sigma \) drops by 0.02 after every epoch until it reaches 0.2 and later decays at a rate of 0.99.
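The \(\sigma \) and \(\lambda _1\) schedules described above for ResNet-164 and DenseNet-40 can be sketched as follows. The exact per-epoch boundaries are our reading of the text, and the function names are illustrative:

```python
def sigma_schedule(epoch, sigma0=2.0, step=0.02, floor=0.2, decay=0.99):
    # Linear phase: sigma drops by `step` each epoch until it reaches `floor`
    # (90 epochs with the defaults); afterwards it decays by `decay` per epoch.
    linear_epochs = int(round((sigma0 - floor) / step))
    if epoch <= linear_epochs:
        return sigma0 - step * epoch
    return floor * decay ** (epoch - linear_epochs)

def lambda1_schedule(epoch, lam_init=1e-4, lam_late=5e-4, switch_epoch=120):
    # The regularization strength is raised after `switch_epoch` epochs.
    return lam_late if epoch > switch_epoch else lam_init
```

With the defaults, \(\sigma \) reaches the floor of 0.2 at epoch 90 and decays multiplicatively from then on.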
© 2019 Springer Nature Switzerland AG
Cite this paper
Mummadi, C.K., Genewein, T., Zhang, D., Brox, T., Fischer, V. (2019). Group Pruning Using a Bounded-\(\ell _p\) Norm for Group Gating and Regularization. In: Fink, G., Frintrop, S., Jiang, X. (eds) Pattern Recognition. DAGM GCPR 2019. Lecture Notes in Computer Science, vol. 11824. Springer, Cham. https://doi.org/10.1007/978-3-030-33676-9_10
Print ISBN: 978-3-030-33675-2
Online ISBN: 978-3-030-33676-9