Group Pruning Using a Bounded-\(\ell _p\) Norm for Group Gating and Regularization

Conference paper: Pattern Recognition (DAGM GCPR 2019)

Abstract

Deep neural networks achieve state-of-the-art results on several tasks while increasing in complexity. It has been shown that neural networks can be pruned during training by imposing sparsity-inducing regularizers. In this paper, we investigate two techniques for group-wise pruning during training in order to improve network efficiency. We propose a gating factor after every convolutional layer to induce channel-level sparsity, encouraging insignificant channels to become exactly zero. Further, we introduce and analyse a bounded variant of the \(\ell _1\) regularizer, which interpolates between the \(\ell _1\)- and \(\ell _0\)-norms to retain the performance of the network at higher pruning rates. To underline the effectiveness of the proposed methods, we show that the number of parameters of ResNet-164, DenseNet-40 and MobileNetV2 can be reduced by \(30\%\), \(69\%\), and \(75\%\) on CIFAR100 respectively without a significant drop in accuracy. We achieve state-of-the-art pruning results for ResNet-50 with higher accuracy on ImageNet. Furthermore, we show that the lightweight MobileNetV2 can be compressed further on ImageNet without a significant drop in performance.

T. Genewein—Currently at DeepMind.


Notes

  1. We changed the average pooling kernel size from \(7\times 7\) to \(4\times 4\) and the stride from 2 to 1 in the first convolutional layer and also in the second block of the bottleneck structure of the network.

References

  1. Achterhold, J., Koehler, J.M., Schmeink, A., Genewein, T.: Variational network quantization. In: ICLR 2018 (2018)

  2. Alvarez, J.M., Salzmann, M.: Learning the number of neurons in deep networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 2270–2278 (2016)

  3. Chen, W., Wilson, J., Tyree, S., Weinberger, K., Chen, Y.: Compressing neural networks with the hashing trick. In: International Conference on Machine Learning (ICML), pp. 2285–2294 (2015)

  4. Cheng, Y., Wang, D., Zhou, P., Zhang, T.: A survey of model compression and acceleration for deep neural networks. arXiv:1710.09282 (2017)

  5. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: Training deep neural networks with weights and activations constrained to + 1 or \(-\)1. arXiv:1602.02830 (2016)

  6. Federici, M., Ullrich, K., Welling, M.: Improved Bayesian compression. arXiv:1711.06494 (2017)

  7. Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding small, trainable neural networks. arXiv:1803.03635 (2018)

  8. Ghosh, S., Yao, J., Doshi-Velez, F.: Structured variational learning of Bayesian neural networks with horseshoe priors. arXiv:1806.05975 (2018)

  9. Gong, Y., Liu, L., Yang, M., Bourdev, L.: Compressing deep convolutional networks using vector quantization. arXiv:1412.6115 (2014)

  10. Guo, Y., Yao, A., Chen, Y.: Dynamic network surgery for efficient DNNs. In: Advances In Neural Information Processing Systems (NIPS), pp. 1379–1387 (2016)

  11. Gysel, P., Pimentel, J., Motamedi, M., Ghiasi, S.: Ristretto: a framework for empirical study of resource-efficient inference in convolutional neural networks. IEEE Trans. Neural Netw. Learn. Syst. 29(11), 5784–5789 (2018)

  12. Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In: International Conference on Learning Representations (ICLR) (2016)

  13. Han, S., et al.: DSD: regularizing deep neural networks with dense-sparse-dense training flow. In: International Conference on Learning Representations (ICLR) (2017)

  14. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems (NIPS), pp. 1135–1143 (2015)

  15. Hanson, S.J., Pratt, L.Y.: Comparing biases for minimal network construction with back-propagation. In: Advances in Neural Information Processing Systems (NIPS), pp. 177–185 (1989)

  16. He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: International Conference on Computer Vision (ICCV), vol. 2 (2017)

  17. Howard, A.G., et al.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 (2017)

  18. Hu, H., Peng, R., Tai, Y.W., Tang, C.K.: Network trimming: a data-driven neuron pruning approach towards efficient deep architectures. arXiv:1607.03250 (2016)

  19. Huang, Z., Wang, N.: Data-driven sparse structure selection for deep neural networks. arXiv:1707.01213 (2017)

  20. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. (JMLR) 18(1), 6869–6898 (2017)

  21. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: Squeezenet: Alexnet-level accuracy with 50x fewer parameters and \(<\)0.5 mb model size. arXiv:1602.07360 (2016)

  22. Karaletsos, T., Rätsch, G.: Automatic relevance determination for deep generative models. arXiv:1505.07765 (2015)

  23. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. In: International Conference on Learning Representations (ICLR) (2017)

  24. Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2755–2763. IEEE (2017)

  25. Louizos, C., Ullrich, K., Welling, M.: Bayesian compression for deep learning. In: Advances in Neural Information Processing Systems (2017)

  26. Louizos, C., Welling, M., Kingma, D.P.: Learning sparse neural networks through \(L_0\) regularization. In: ICLR 2018 (2018)

  27. Luo, J.H., Wu, J., Lin, W.: Thinet: a filter level pruning method for deep neural network compression. In: ICCV 2017 (2017)

  28. MacKay, D.J.: Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Netw. Comput. Neural Syst. 6(3), 469–505 (1995)

  29. Molchanov, D., Ashukha, A., Vetrov, D.: Variational dropout sparsifies deep neural networks. In: ICML 2017 (2017)

  30. Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient inference. In: ICLR 2017 (2017)

  31. Neal, R.M.: Bayesian Learning for Neural Networks. Ph.D. thesis, University of Toronto (1995)

  32. Neklyudov, K., Molchanov, D., Ashukha, A., Vetrov, D.: Structured Bayesian pruning via log-normal multiplicative noise. arXiv:1705.07283 (2017)

  33. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-net: imagenet classification using binary convolutional neural networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 525–542. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_32

  34. Sze, V., Chen, Y.H., Yang, T.J., Emer, J.: Efficient processing of deep neural networks: A tutorial and survey. arXiv:1703.09039 (2017)

  35. Ullrich, K., Meeds, E., Welling, M.: Soft weight-sharing for neural network compression. In: ICLR 2017 (2017)

  36. Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems, pp. 2074–2082 (2016)

  37. Weston, J., Elisseeff, A., Schölkopf, B., Tipping, M.: Use of the zero-norm with linear models and kernel methods. J. Mach. Learn. Res. (JMLR) 3, 1439–1461 (2003)

  38. Wu, S., Li, G., Chen, F., Shi, L.: Training and inference with integers in deep neural networks. arXiv:1802.04680 (2018)

  39. Ye, J., Lu, X., Lin, Z., Wang, J.Z.: Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv:1802.00124 (2018)

  40. Zhou, A., Yao, A., Guo, Y., Xu, L., Chen, Y.: Incremental network quantization: Towards lossless CNNs with low-precision weights. arXiv:1702.03044 (2017)

  41. Zhou, H., Alvarez, J.M., Porikli, F.: Less is more: towards compact CNNs. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 662–677. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_40

Author information

Correspondence to Chaithanya Kumar Mummadi.

Appendices

A1 Proof of Lemma 1

To improve readability, we will restate Lemma 1 from the main text:

The mapping \(\Vert .\Vert _{\text {bound-}p,\sigma }\) has the following properties:

  • For \(\sigma \rightarrow 0^{+}\) the bounded-norm converges towards the \(\ell _0\)-norm:

    $$\begin{aligned} \lim \limits _{\sigma \rightarrow 0^{+}} \Vert x\Vert _{\text {bound-}p,\sigma } = \Vert x\Vert _{0}. \end{aligned}$$
    (1)
  • In case \(|x_i| \approx 0\) for all coefficients of \(x\), the bounded-norm of \(x\) is approximately equal to the \(p\)-th power of the \(p\)-norm of \(x/\sigma \):

    $$\begin{aligned} \Vert x\Vert _{\text {bound-}p,\sigma } \approx \left\Vert \frac{x}{\sigma }\right\Vert _{p}^{p} \end{aligned}$$
    (2)

Proof

The first statement Eq. (1) can easily be seen using:

$$ \lim \limits _{\sigma \rightarrow 0^{+}} \text {exp}\left( -\frac{|x_i|^p}{\sigma ^p}\right) = \mathbf 1 _0(x_i) $$

For the second statement, Eq. (2), we use the Taylor expansion of \(\text {exp}\) around zero to get:

$$\begin{aligned} \begin{aligned} \Vert x\Vert _{\text {bound-}p,\sigma }&= \sum \limits _{i=1}^{n} 1 - \text {exp}\left( -\frac{|x_i|^p}{\sigma ^p}\right) \\&= \sum \limits _{i=1}^{n} 1 - \sum \limits _{j=0}^{\infty }\left( -\frac{|x_i|^p}{\sigma ^p}\right) ^j \frac{1}{j!} \end{aligned} \end{aligned}$$
(3)

For \(|x_i| \approx 0\), the \(j = 0\) term cancels the constant 1 and we keep only the leading term \(j = 1\), yielding:

$$ \Vert x\Vert _{\text {bound-}p,\sigma } \approx \sum \limits _{i=1}^{n} \frac{|x_i|^p}{\sigma ^p} = \left\Vert \frac{x}{\sigma }\right\Vert _{p}^{p}. $$
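
As a quick sanity check of both properties, the following snippet (a minimal sketch in NumPy; the function name bounded_lp_norm and the test vectors are illustrative choices, not from the paper) evaluates \(\Vert x\Vert _{\text {bound-}p,\sigma } = \sum _{i=1}^{n}\left( 1 - \exp \left( -|x_i|^p/\sigma ^p\right) \right) \) for shrinking \(\sigma \) and for small-magnitude \(x\):

```python
import numpy as np

def bounded_lp_norm(x, p=1.0, sigma=1.0):
    # Bounded-lp norm used in Lemma 1: sum_i (1 - exp(-|x_i|^p / sigma^p)).
    return float(np.sum(1.0 - np.exp(-np.abs(x) ** p / sigma ** p)))

x = np.array([0.0, 0.5, -2.0, 3.0])

# Property (1): as sigma -> 0+, the value approaches the l0-norm (number of non-zeros).
for sigma in (1.0, 0.1, 0.01, 0.001):
    print(sigma, bounded_lp_norm(x, p=1.0, sigma=sigma))   # tends to 3.0
print("l0-norm:", np.count_nonzero(x))                     # 3

# Property (2): for |x_i| close to 0, the value is close to ||x / sigma||_p^p.
x_small = np.array([0.01, -0.02, 0.005])
print(bounded_lp_norm(x_small, p=1.0, sigma=1.0))          # ~0.0347
print(np.sum(np.abs(x_small / 1.0) ** 1.0))                # 0.035
```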

A2 Experiment Details

Both the CIFAR100 and ImageNet datasets are augmented with standard techniques such as random horizontal flips and random crops of the zero-padded input image, and are further processed with mean-std normalization. MobileNetV2 was originally designed for classification on the ImageNet dataset; we adapt the network (see footnote 1) to fit the \(32\times 32\) input resolution of CIFAR100. ResNet-164 is a pre-activation ResNet architecture with 164 layers and bottleneck structure, and we use a DenseNet with 40 layers and growth rate 12. All networks are trained from scratch (randomly initialized weights, with biases disabled for all convolutional and fully connected layers), with a hyperparameter search over the regularization strength \(\lambda _1\) for the \(\ell _1\) or bounded-\(\ell _1\) regularizer and the weight decay \(\lambda _2\) on each dataset. The scaling factor \(\gamma \) of Batch Normalization is initialized to 1.0 for the exponential gate and to 0.5 for the linear gate, as described in [24], and the bias \(\beta \) is initialized to zero. The hyperparameter \(\sigma \) of the bounded-\(\ell _1\) regularizer is set to 1.0 when scheduling of this parameter is disabled. All gating parameters g are initialized to 1.0.

We use the standard categorical cross-entropy loss and add a penalty to the objective in the form of weight decay and a sparsity-inducing \(\ell _1\) or bounded-\(\ell _1\) regularizer. Note that the \(\ell _1\) and bounded-\(\ell _1\) regularizers act only on the gating parameters g, whereas weight decay regularizes all network parameters, including the gating parameters g. We reimplemented the technique proposed in [24], which imposes \(\ell _1\) regularization on the scaling factor \(\gamma \) of the Batch Normalization layers to induce channel-level sparsity. We refer to this method as \(\ell _1\) on linear gate and compare it against our methods: bounded-\(\ell _1\) on linear gate, \(\ell _1\) on exponential gate and bounded-\(\ell _1\) on exponential gate. We train ResNet-164, DenseNet-40 and ResNet-50 for 240, 130 and 100 epochs respectively. The learning rate is dropped by a factor of 10 after epochs (120, 200, 220) for ResNet-164, (100, 110, 120) for DenseNet-40, and (30, 60, 90) for ResNet-50. The networks are trained with batch size 128 using the SGD optimizer with initial learning rate 0.1 and momentum 0.9 unless specified otherwise. Below, we present the training details of each architecture individually.
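
To illustrate how the penalty enters the objective, here is a minimal PyTorch sketch of a channel-wise multiplicative gate and the bounded-\(\ell _1\) penalty applied only to the gating parameters g. The class and function names are ours, and the plain multiplicative parameterization stands in for the exponential gate, whose exact form is not restated in this appendix; weight decay on all parameters would normally be handled by the optimizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelGate(nn.Module):
    # One learnable scalar per output channel, initialized to 1.0 (as in the experiments).
    def __init__(self, num_channels):
        super().__init__()
        self.g = nn.Parameter(torch.ones(num_channels))

    def forward(self, x):                      # x: (N, C, H, W)
        return x * self.g.view(1, -1, 1, 1)

def bounded_l1(g, sigma=1.0):
    # Bounded-l1 regularizer: sum_i (1 - exp(-|g_i| / sigma)).
    return torch.sum(1.0 - torch.exp(-g.abs() / sigma))

# Toy forward/backward pass: conv -> BN -> ReLU -> gate, then a linear classifier.
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    ChannelGate(16),
)
classifier = nn.Linear(16, 100)

x = torch.randn(8, 3, 32, 32)
targets = torch.randint(0, 100, (8,))
logits = classifier(features(x).mean(dim=(2, 3)))   # global average pooling

lambda1 = 1e-4                                       # sparsity strength on the gates only
loss = F.cross_entropy(logits, targets) + lambda1 * bounded_l1(features[3].g, sigma=1.0)
loss.backward()
```

Channels whose gate is driven to exactly zero can then be removed together with the corresponding filters.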

LeNet5-Caffe: Since this architecture does not contain Batch Normalization layers, we do not compare our results with the \(\ell _1\) on linear gate method. We train the network with exponential gating layers added after every convolutional/fully connected layer except the output layer, and apply the \(\ell _1\), bounded-\(\ell _1\) and weight decay regularizers separately to evaluate their pruning results. We set the weight decay to zero when training with the \(\ell _1\) or bounded-\(\ell _1\) regularizers. The network is trained for 200 epochs with weight decay and for 60 epochs with the other regularizers.

ResNet-50: We train the network on ImageNet with exponential gating layers added after every convolutional layer. We evaluate the performance of the network for different values of the regularization strength \(\lambda _1\), namely \(10^{-5}\), \(5\times 10^{-5}\) and \(10^{-4}\). The weight decay \(\lambda _2\) is enabled for all settings of \(\lambda _1\) and set to \(10^{-4}\). We analyse the influence of the exponential gate and compare it against existing methods.

ResNet-164: We use a dropout rate of \(0.1\) after the first Batch Normalization layer in every Bottleneck structure. Here, every convolutional layer in the network is followed by an exponential gating layer.

DenseNet-40: We use a dropout rate of 0.05 after every convolutional layer in the dense blocks. An exponential gating layer is added after every convolutional layer in the network except the first one.

MobileNetV2: On CIFAR100, we train the network for 240 epochs, with the learning rate dropped by a factor of 10 after epochs 200 and 220. A dropout rate of 0.3 is applied after the global average pooling layer. On ImageNet, we train this network for 100 epochs, in contrast to the standard training of 400 epochs. We start with a learning rate of 0.045 and reduce it by a factor of 10 after epochs 30, 60 and 90. We evaluate the performance of the exponential gate against the linear gate with the \(\ell _1\) regularizer, and also test the significance of bounded-\(\ell _1\) on the linear gate. An exponential gating layer is added after every standard convolutional/depthwise separable convolutional layer in the network.

On CIFAR100, we investigate the influence of the weight decay, \(\ell _1\) and bounded-\(\ell _1\) regularizers, and the role of the linear and exponential gates for every architecture. We also study the influence of scheduling \(\sigma \) in both the \(\ell _1\) and bounded-\(\ell _1\) regularizers on this dataset. For MobileNetV2, we initialize \(\sigma \) with 2.0 and decay it at a rate of 0.99 after every epoch. For ResNet-164 and DenseNet-40, we initialize the hyperparameters \(\lambda _1\) and \(\lambda _2\) with \(10^{-4}\) and \(5\times 10^{-4}\) respectively, and \(\sigma \) with 2.0. We increase \(\lambda _1\) to \(5\times 10^{-4}\) after 120 epochs; \(\sigma \) drops by 0.02 after every epoch until it reaches 0.2, and then decays at a rate of 0.99.
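
Read as a recipe, the \(\sigma \) schedules above correspond to the following sketch (function names and the exact epoch indexing are our interpretation of the description):

```python
def sigma_mobilenetv2(epoch, sigma0=2.0, decay=0.99):
    # Multiplicative decay by 0.99 after every epoch, starting from 2.0.
    return sigma0 * decay ** epoch

def sigma_resnet164_densenet40(epoch, sigma0=2.0, step=0.02, floor=0.2, decay=0.99):
    # Linear drop of 0.02 per epoch from 2.0 until sigma reaches 0.2,
    # then multiplicative decay by 0.99 per epoch.
    linear_epochs = int(round((sigma0 - floor) / step))   # 90 epochs from 2.0 down to 0.2
    if epoch <= linear_epochs:
        return sigma0 - step * epoch
    return floor * decay ** (epoch - linear_epochs)
```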

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Mummadi, C.K., Genewein, T., Zhang, D., Brox, T., Fischer, V. (2019). Group Pruning Using a Bounded-\(\ell _p\) Norm for Group Gating and Regularization. In: Fink, G., Frintrop, S., Jiang, X. (eds) Pattern Recognition. DAGM GCPR 2019. Lecture Notes in Computer Science, vol. 11824. Springer, Cham. https://doi.org/10.1007/978-3-030-33676-9_10

  • DOI: https://doi.org/10.1007/978-3-030-33676-9_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-33675-2

  • Online ISBN: 978-3-030-33676-9

  • eBook Packages: Computer Science, Computer Science (R0)
