Group Pruning Using a Bounded-\(\ell _p\) Norm for Group Gating and Regularization

Conference paper: Pattern Recognition (DAGM GCPR 2019)

Abstract

Deep neural networks achieve state-of-the-art results on several tasks while increasing in complexity. It has been shown that neural networks can be pruned during training by imposing sparsity-inducing regularizers. In this paper, we investigate two techniques for group-wise pruning during training in order to improve network efficiency. We propose a gating factor after every convolutional layer to induce channel-level sparsity, encouraging insignificant channels to become exactly zero. Further, we introduce and analyse a bounded variant of the \(\ell _1\) regularizer, which interpolates between the \(\ell _1\)- and \(\ell _0\)-norms to retain the performance of the network at higher pruning rates. To underline the effectiveness of the proposed methods, we show that the number of parameters of ResNet-164, DenseNet-40 and MobileNetV2 can be reduced by \(30\%\), \(69\%\), and \(75\%\) on CIFAR100 respectively without a significant drop in accuracy. We achieve state-of-the-art pruning results for ResNet-50 with higher accuracy on ImageNet. Furthermore, we show that the lightweight MobileNetV2 can be compressed further on ImageNet without a significant drop in performance.

T. Genewein—Currently at DeepMind.


Notes

  1. We changed the average pooling kernel size from \(7\times 7\) to \(4\times 4\) and the stride from 2 to 1 in the first convolutional layer and also in the second block of the bottleneck structure of the network.

References

  1. Achterhold, J., Koehler, J.M., Schmeink, A., Genewein, T.: Variational network quantization. In: ICLR 2018 (2018)

  2. Alvarez, J.M., Salzmann, M.: Learning the number of neurons in deep networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 2270–2278 (2016)

  3. Chen, W., Wilson, J., Tyree, S., Weinberger, K., Chen, Y.: Compressing neural networks with the hashing trick. In: International Conference on Machine Learning (ICML), pp. 2285–2294 (2015)

  4. Cheng, Y., Wang, D., Zhou, P., Zhang, T.: A survey of model compression and acceleration for deep neural networks. arXiv:1710.09282 (2017)

  5. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: Training deep neural networks with weights and activations constrained to + 1 or \(-\)1. arXiv:1602.02830 (2016)

  6. Federici, M., Ullrich, K., Welling, M.: Improved Bayesian compression. arXiv:1711.06494 (2017)

  7. Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding small, trainable neural networks. arXiv:1803.03635 (2018)

  8. Ghosh, S., Yao, J., Doshi-Velez, F.: Structured variational learning of Bayesian neural networks with horseshoe priors. arXiv:1806.05975 (2018)

  9. Gong, Y., Liu, L., Yang, M., Bourdev, L.: Compressing deep convolutional networks using vector quantization. arXiv:1412.6115 (2014)

  10. Guo, Y., Yao, A., Chen, Y.: Dynamic network surgery for efficient DNNs. In: Advances In Neural Information Processing Systems (NIPS), pp. 1379–1387 (2016)

  11. Gysel, P., Pimentel, J., Motamedi, M., Ghiasi, S.: Ristretto: a framework for empirical study of resource-efficient inference in convolutional neural networks. IEEE Trans. Neural Netw. Learn. Syst. 29(11), 5784–5789 (2018)

  12. Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In: International Conference on Learning Representations (ICLR) (2016)

  13. Han, S., et al.: DSD: regularizing deep neural networks with dense-sparse-dense training flow. In: International Conference on Learning Representations (ICLR) (2017)

  14. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems (NIPS), pp. 1135–1143 (2015)

  15. Hanson, S.J., Pratt, L.Y.: Comparing biases for minimal network construction with back-propagation. In: Advances in Neural Information Processing Systems (NIPS), pp. 177–185 (1989)

  16. He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: International Conference on Computer Vision (ICCV), vol. 2 (2017)

  17. Howard, A.G., et al.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 (2017)

  18. Hu, H., Peng, R., Tai, Y.W., Tang, C.K.: Network trimming: a data-driven neuron pruning approach towards efficient deep architectures. arXiv:1607.03250 (2016)

  19. Huang, Z., Wang, N.: Data-driven sparse structure selection for deep neural networks. arXiv:1707.01213 (2017)

  20. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. (JMLR) 18(1), 6869–6898 (2017)

  21. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: Squeezenet: Alexnet-level accuracy with 50x fewer parameters and \(<\)0.5 mb model size. arXiv:1602.07360 (2016)

  22. Karaletsos, T., Rätsch, G.: Automatic relevance determination for deep generative models. arXiv:1505.07765 (2015)

  23. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. In: International Conference on Learning Representations (ICLR) (2017)

  24. Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2755–2763. IEEE (2017)

  25. Louizos, C., Ullrich, K., Welling, M.: Bayesian compression for deep learning. In: Advances in Neural Information Processing Systems (2017)

  26. Louizos, C., Welling, M., Kingma, D.P.: Learning sparse neural networks through \(L_0\) regularization. In: ICLR 2018 (2018)

  27. Luo, J.H., Wu, J., Lin, W.: Thinet: a filter level pruning method for deep neural network compression. In: ICCV 2017 (2017)

  28. MacKay, D.J.: Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Netw. Comput. Neural Syst. 6(3), 469–505 (1995)

  29. Molchanov, D., Ashukha, A., Vetrov, D.: Variational dropout sparsifies deep neural networks. In: ICML 2017 (2017)

  30. Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient inference. In: ICLR 2017 (2017)

  31. Neal, R.M.: Bayesian Learning for Neural Networks. Ph.D. thesis, University of Toronto (1995)

  32. Neklyudov, K., Molchanov, D., Ashukha, A., Vetrov, D.: Structured Bayesian pruning via log-normal multiplicative noise. arXiv:1705.07283 (2017)

  33. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-net: imagenet classification using binary convolutional neural networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 525–542. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_32

  34. Sze, V., Chen, Y.H., Yang, T.J., Emer, J.: Efficient processing of deep neural networks: A tutorial and survey. arXiv:1703.09039 (2017)

  35. Ullrich, K., Meeds, E., Welling, M.: Soft weight-sharing for neural network compression. In: ICLR 2017 (2017)

  36. Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems, pp. 2074–2082 (2016)

  37. Weston, J., Elisseeff, A., Schölkopf, B., Tipping, M.: Use of the zero-norm with linear models and kernel methods. J. Mach. Learn. Res. (JMLR) 3, 1439–1461 (2003)

  38. Wu, S., Li, G., Chen, F., Shi, L.: Training and inference with integers in deep neural networks. arXiv:1802.04680 (2018)

  39. Ye, J., Lu, X., Lin, Z., Wang, J.Z.: Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv:1802.00124 (2018)

  40. Zhou, A., Yao, A., Guo, Y., Xu, L., Chen, Y.: Incremental network quantization: Towards lossless CNNs with low-precision weights. arXiv:1702.03044 (2017)

  41. Zhou, H., Alvarez, J.M., Porikli, F.: Less is more: towards compact CNNs. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 662–677. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_40

Author information

Correspondence to Chaithanya Kumar Mummadi.

Appendices

A1 Proof of Lemma 1

To improve readability, we will restate Lemma 1 from the main text:

The mapping \(\Vert .\Vert _{\text {bound-}p,\sigma }\) has the following properties:

  • For \(\sigma \rightarrow 0^{+}\) the bounded-norm converges towards the \(\ell _0\)-norm:

    $$\begin{aligned} \lim \limits _{\sigma \rightarrow 0^{+}} \Vert x\Vert _{\text {bound-}p,\sigma } = \Vert x\Vert _{0}. \end{aligned}$$
    (1)
  • In case \(|x_i| \approx 0\) for all coefficients of \(x\), the bounded-norm of \(x\) is approximately equal to the \(p\)-th power of the \(p\)-norm of \(x/\sigma \):

    $$\begin{aligned} \Vert x\Vert _{\text {bound-}p,\sigma } \approx \left\Vert \frac{x}{\sigma }\right\Vert _{p}^{p} \end{aligned}$$
    (2)

Proof

The first statement Eq. (1) can easily be seen using:

$$ \lim \limits _{\sigma \rightarrow 0^{+}} \text {exp}\left( -\frac{|x_i|^p}{\sigma ^p}\right) = \mathbf 1 _0(x_i) $$

For the second statement, Eq. (2), we use the Taylor expansion of \(\text {exp}\) around zero to get:

$$\begin{aligned} \begin{aligned} \Vert x\Vert _{\text {bound-}p,\sigma }&= \sum \limits _{i=1}^{n} 1 - \text {exp}\left( -\frac{|x_i|^p}{\sigma ^p}\right) \\&= \sum \limits _{i=1}^{n} 1 - \sum \limits _{j=0}^{\infty }\left( -\frac{|x_i|^p}{\sigma ^p}\right) ^j \frac{1}{j!} \end{aligned} \end{aligned}$$
(3)

For \(|x_i| \approx 0\), the \(j = 0\) term cancels the constant 1 and we keep only the leading term \(j = 1\), yielding:

$$ \Vert x\Vert _{\text {bound-}p,\sigma } \approx \sum \limits _{i=1}^{n} \frac{|x_i|^p}{\sigma ^p} = \left\Vert \frac{x}{\sigma }\right\Vert _{p}^{p}. $$
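
As a quick sanity check of both properties, the following snippet (a minimal sketch in NumPy; the function name bounded_lp_norm and the test vectors are illustrative choices, not from the paper) evaluates \(\Vert x\Vert _{\text {bound-}p,\sigma } = \sum _{i=1}^{n}\left( 1 - \exp \left( -|x_i|^p/\sigma ^p\right) \right) \) for shrinking \(\sigma \) and for small-magnitude \(x\):

```python
import numpy as np

def bounded_lp_norm(x, p=1.0, sigma=1.0):
    # Bounded-lp norm used in Lemma 1: sum_i (1 - exp(-|x_i|^p / sigma^p)).
    return float(np.sum(1.0 - np.exp(-np.abs(x) ** p / sigma ** p)))

x = np.array([0.0, 0.5, -2.0, 3.0])

# Property (1): as sigma -> 0+, the value approaches the l0-norm (number of non-zeros).
for sigma in (1.0, 0.1, 0.01, 0.001):
    print(sigma, bounded_lp_norm(x, p=1.0, sigma=sigma))   # tends to 3.0
print("l0-norm:", np.count_nonzero(x))                     # 3

# Property (2): for |x_i| close to 0, the value is close to ||x / sigma||_p^p.
x_small = np.array([0.01, -0.02, 0.005])
print(bounded_lp_norm(x_small, p=1.0, sigma=1.0))          # ~0.0347
print(np.sum(np.abs(x_small / 1.0) ** 1.0))                # 0.035
```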

A2 Experiment Details

Both the CIFAR100 and ImageNet datasets are augmented with standard techniques such as random horizontal flips and random crops of the zero-padded input image, and are further processed with mean-std normalization. MobileNetV2 was originally designed for classification on the ImageNet dataset; we adapt the network (see footnote 1) to fit the \(32\times 32\) input resolution of CIFAR100. ResNet-164 is a pre-activation ResNet architecture with 164 layers and bottleneck structure, and we use a DenseNet with 40 layers and growth rate 12. All networks are trained from scratch (randomly initialized weights, with biases disabled for all convolutional and fully connected layers), with a hyperparameter search over the regularization strength \(\lambda _1\) for the \(\ell _1\) or bounded-\(\ell _1\) regularizer and the weight decay \(\lambda _2\) on each dataset. The scaling factor \(\gamma \) of Batch Normalization is initialized to 1.0 for the exponential gate and to 0.5 for the linear gate, as described in [24], and the bias \(\beta \) is initialized to zero. The hyperparameter \(\sigma \) of the bounded-\(\ell _1\) regularizer is set to 1.0 when scheduling of this parameter is disabled. All gating parameters g are initialized to 1.0.

We use the standard categorical cross-entropy loss and add a penalty to the objective in the form of weight decay and a sparsity-inducing \(\ell _1\) or bounded-\(\ell _1\) regularizer. Note that the \(\ell _1\) and bounded-\(\ell _1\) regularizers act only on the gating parameters g, whereas weight decay regularizes all network parameters, including the gating parameters g. We reimplemented the technique proposed in [24], which imposes \(\ell _1\) regularization on the scaling factor \(\gamma \) of the Batch Normalization layers to induce channel-level sparsity. We refer to this method as \(\ell _1\) on linear gate and compare it against our methods: bounded-\(\ell _1\) on linear gate, \(\ell _1\) on exponential gate and bounded-\(\ell _1\) on exponential gate. We train ResNet-164, DenseNet-40 and ResNet-50 for 240, 130 and 100 epochs respectively. The learning rate is dropped by a factor of 10 after epochs (120, 200, 220) for ResNet-164, (100, 110, 120) for DenseNet-40, and (30, 60, 90) for ResNet-50. The networks are trained with batch size 128 using the SGD optimizer with initial learning rate 0.1 and momentum 0.9 unless specified otherwise. Below, we present the training details of each architecture individually.
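
To illustrate how the penalty enters the objective, here is a minimal PyTorch sketch of a channel-wise multiplicative gate and the bounded-\(\ell _1\) penalty applied only to the gating parameters g. The class and function names are ours, and the plain multiplicative parameterization stands in for the exponential gate, whose exact form is not restated in this appendix; weight decay on all parameters would normally be handled by the optimizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelGate(nn.Module):
    # One learnable scalar per output channel, initialized to 1.0 (as in the experiments).
    def __init__(self, num_channels):
        super().__init__()
        self.g = nn.Parameter(torch.ones(num_channels))

    def forward(self, x):                      # x: (N, C, H, W)
        return x * self.g.view(1, -1, 1, 1)

def bounded_l1(g, sigma=1.0):
    # Bounded-l1 regularizer: sum_i (1 - exp(-|g_i| / sigma)).
    return torch.sum(1.0 - torch.exp(-g.abs() / sigma))

# Toy forward/backward pass: conv -> BN -> ReLU -> gate, then a linear classifier.
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    ChannelGate(16),
)
classifier = nn.Linear(16, 100)

x = torch.randn(8, 3, 32, 32)
targets = torch.randint(0, 100, (8,))
logits = classifier(features(x).mean(dim=(2, 3)))   # global average pooling

lambda1 = 1e-4                                       # sparsity strength on the gates only
loss = F.cross_entropy(logits, targets) + lambda1 * bounded_l1(features[3].g, sigma=1.0)
loss.backward()
```

Channels whose gate is driven to exactly zero can then be removed together with the corresponding filters.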

LeNet5-Caffe: Since this architecture does not contain Batch Normalization layers, we do not compare our results with the \(\ell _1\) on linear gate method. We train the network with exponential gating layers added after every convolutional/fully connected layer except the output layer, and apply the \(\ell _1\), bounded-\(\ell _1\) and weight decay regularizers separately to evaluate their pruning results. We set the weight decay to zero when training with the \(\ell _1\) or bounded-\(\ell _1\) regularizers. The network is trained for 200 epochs with weight decay and for 60 epochs with the other regularizers.

ResNet-50: We train the network on ImageNet with exponential gating layers added after every convolutional layer. We evaluate the performance of the network for different values of the regularization strength \(\lambda _1\), namely \(10^{-5}\), \(5\times 10^{-5}\) and \(10^{-4}\). The weight decay \(\lambda _2\) is enabled for all settings of \(\lambda _1\) and set to \(10^{-4}\). We analyse the influence of the exponential gate and compare it against existing methods.

ResNet-164: We use a dropout rate of \(0.1\) after the first Batch Normalization layer in every Bottleneck structure. Here, every convolutional layer in the network is followed by an exponential gating layer.

DenseNet-40: We use a dropout rate of 0.05 after every convolutional layer in the dense blocks. An exponential gating layer is added after every convolutional layer in the network except the first one.

MobileNetV2: On CIFAR100, we train the network for 240 epochs, with the learning rate dropped by a factor of 10 after epochs 200 and 220. A dropout rate of 0.3 is applied after the global average pooling layer. On ImageNet, we train this network for 100 epochs, in contrast to the standard training of 400 epochs. We start with a learning rate of 0.045 and reduce it by a factor of 10 after epochs 30, 60 and 90. We evaluate the performance of the exponential gate against the linear gate with the \(\ell _1\) regularizer, and also test the significance of bounded-\(\ell _1\) on the linear gate. An exponential gating layer is added after every standard convolutional/depthwise separable convolutional layer in the network.

On CIFAR100, we investigate the influence of the weight decay, \(\ell _1\) and bounded-\(\ell _1\) regularizers, and the role of the linear and exponential gates for every architecture. We also study the influence of scheduling \(\sigma \) in both the \(\ell _1\) and bounded-\(\ell _1\) regularizers on this dataset. For MobileNetV2, we initialize \(\sigma \) with 2.0 and decay it at a rate of 0.99 after every epoch. For ResNet-164 and DenseNet-40, we initialize the hyperparameters \(\lambda _1\) and \(\lambda _2\) with \(10^{-4}\) and \(5\times 10^{-4}\) respectively, and \(\sigma \) with 2.0. We increase \(\lambda _1\) to \(5\times 10^{-4}\) after 120 epochs; \(\sigma \) drops by 0.02 after every epoch until it reaches 0.2, and then decays at a rate of 0.99.
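
Read as a recipe, the \(\sigma \) schedules above correspond to the following sketch (function names and the exact epoch indexing are our interpretation of the description):

```python
def sigma_mobilenetv2(epoch, sigma0=2.0, decay=0.99):
    # Multiplicative decay by 0.99 after every epoch, starting from 2.0.
    return sigma0 * decay ** epoch

def sigma_resnet164_densenet40(epoch, sigma0=2.0, step=0.02, floor=0.2, decay=0.99):
    # Linear drop of 0.02 per epoch from 2.0 until sigma reaches 0.2,
    # then multiplicative decay by 0.99 per epoch.
    linear_epochs = int(round((sigma0 - floor) / step))   # 90 epochs from 2.0 down to 0.2
    if epoch <= linear_epochs:
        return sigma0 - step * epoch
    return floor * decay ** (epoch - linear_epochs)
```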

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Mummadi, C.K., Genewein, T., Zhang, D., Brox, T., Fischer, V. (2019). Group Pruning Using a Bounded-\(\ell _p\) Norm for Group Gating and Regularization. In: Fink, G., Frintrop, S., Jiang, X. (eds) Pattern Recognition. DAGM GCPR 2019. Lecture Notes in Computer Science, vol. 11824. Springer, Cham. https://doi.org/10.1007/978-3-030-33676-9_10

  • DOI: https://doi.org/10.1007/978-3-030-33676-9_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-33675-2

  • Online ISBN: 978-3-030-33676-9

  • eBook Packages: Computer Science, Computer Science (R0)
