
Neural Networks

Volume 119, November 2019, Pages 286-298

Transformed ℓ1 regularization for learning sparse deep neural networks

https://doi.org/10.1016/j.neunet.2019.08.015

Abstract

Deep Neural Networks (DNNs) have achieved extraordinary success in numerous areas. However, DNNs often carry a large number of weight parameters, which leads to heavy memory and computation costs. Overfitting is another challenge for DNNs when the training data are insufficient. These challenges severely hinder the application of DNNs on resource-constrained platforms. In fact, many network weights are redundant and can be removed from the network without much loss of performance. In this paper, we introduce a new non-convex integrated transformed ℓ1 regularizer to promote sparsity in DNNs, which removes redundant connections and unnecessary neurons simultaneously. Specifically, we apply the transformed ℓ1 regularizer to the matrix space of network weights and use it to remove redundant connections. In addition, group sparsity is integrated to remove unnecessary neurons. An efficient stochastic proximal gradient algorithm is presented to solve the new model. To the best of our knowledge, this is the first work to develop a non-convex regularizer in a sparse optimization based method to simultaneously promote connection-level and neuron-level sparsity in DNNs. Experiments on public datasets demonstrate the effectiveness of the proposed method.

Introduction

Recently, Deep Neural Networks (DNNs) have achieved remarkable success in many fields (Deng et al., 2014, Goodfellow et al., 2016, Lecun et al., 2015, Schmidhuber, 2015, Yu et al., 0000). One of the key factors behind this success is their expressive power, which heavily relies on the large number of parameters (Alvarez and Salzmann, 2016, Yoon and Hwang, 2017, Zhou et al., 2016). For example, VGG (Simonyan & Zisserman, 0000), a convolutional neural network that achieved top results in the ImageNet Large Scale Visual Recognition Challenge 2014, consists of 15M neurons and up to 144M parameters. Such a large number of parameters places a heavy burden on both memory and computation, which makes DNNs costly to train and inapplicable to resource-limited platforms (Alvarez & Salzmann, 2016). Moreover, models with massive numbers of parameters are more prone to overfitting when the training data are insufficient (Scardapane et al., 2017, Yoon and Hwang, 2017). These challenges seriously hinder the application of DNNs (Alvarez & Salzmann, 2016). At the same time, DNNs are known to have many redundant parameters (Alvarez and Salzmann, 2016, Cheng et al., 2015, Denil et al., 2013, Scardapane et al., 2017, Yoon and Hwang, 2017). For example, Denil et al. (2013) show that in some networks only five percent of the parameters are enough to achieve acceptable models. A number of works have therefore focused on compressing and accelerating DNNs (Alvarez and Salzmann, 2016, Cheng et al., 0000, Cheng et al., 2018, Han et al., 2015, Hinton et al., 0000). Among these techniques, one line of research is to promote sparsity in DNNs.

We classify the existing works on sparsity promotion for DNNs into three categories: pruning, dropout, and sparse optimization based methods. Pruning removes weight parameters to which the performance of an established dense network is insensitive. The seminal work is the Biased Weight Decay (Hanson & Pratt, 1989). Later works (Cun et al., 1989, Hassibi and Stork, 1993, Hassibi et al., 1993) use the Hessian of the loss function to decide which connections to remove. In a recent work (Han et al., 2015), connections with little effect are removed to obtain sparse networks. There are also methods using various criteria to determine which parameters or connections are unnecessary (Anwar et al., 2017, Narang et al., 0000). However, in these approaches the pruning criteria require manual setup of per-layer sensitivities, and heuristic assumptions are needed during the pruning phase (Cheng et al., 0000).

Dropout reduces the size of networks during training by randomly dropping units along with their connections from DNNs (Hinton et al., 0000, Srivastava et al., 2014, Wan et al., 2013). Biased dropout and crossmap dropout (Poernomo & Kang, 2018) implement dropout on hidden units and on convolutional layers, respectively. These methods can reduce overfitting effectively and improve performance. Nonetheless, training a dropout network usually takes more time than training a standard neural network, even when the two share the same architecture (Srivastava et al., 2014). In addition, dropout only simplifies the network during training; the full-sized network is still needed in the prediction phase. Recently, Shakeout (Kang, Li, & Tao, 2017) was proposed to randomly enhance or reverse the contribution of each unit to the next layer, and dropout can be viewed as a special case of it.
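To make this limitation concrete, the minimal sketch below (PyTorch is assumed here purely for illustration; the cited works use various frameworks) shows that dropout only thins the network stochastically during training, while the full-sized network is still evaluated at prediction time:

```python
import torch
import torch.nn as nn

# Minimal sketch (PyTorch assumed): dropout randomly zeroes hidden units
# during training, but the full-sized network is used at prediction time.
net = nn.Sequential(
    nn.Linear(784, 300), nn.ReLU(),
    nn.Dropout(p=0.5),          # each hidden unit is dropped with prob. 0.5
    nn.Linear(300, 10),
)

x = torch.randn(8, 784)

net.train()                      # training mode: units are dropped at random
out_train = net(x)

net.eval()                       # prediction mode: dropout is a no-op,
out_test = net(x)                # so the full network is still required here
```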

Sparse optimization based methods promote sparsity in networks by introducing a structured sparse regularization term into the optimization model of DNNs and zeroing out the redundant parameters during training. Compared with pruning, this type of approach does not rely on manual setup. In contrast to dropout, the simplified networks obtained by sparse optimization can also be used in the prediction stage. Moreover, unlike most existing methods, which compress networks at the cost of a (negligible) drop in accuracy, experiments show that some sparse optimization based methods can even achieve better performance than their original networks (Alvarez and Salzmann, 2016, Scardapane et al., 2017, Yoon and Hwang, 2017). Considering these merits, we construct a sparse network in the framework of optimization with sparse regularizers.

The sparse optimization method can be applied to various tasks to produce sparse solutions. The key challenge of this approach is the design of the regularization function (Candes et al., 2008, Donoho, 2006, Esser et al., 2013, Fan and Li, 0000, Gong et al., 2016, Xu, 2010, Zhang, 2009, Zhang and Xin, 2017). The ℓ0 norm, which counts the number of non-zero elements, is the most intuitive sparse regularizer and promotes the sparsest solution. However, the ℓ0-regularized problem is combinatorial and in general NP-hard (Natarajan, 1995). The ℓ1 norm is the most commonly used surrogate (Candes et al., 2008, Donoho, 2006, Yu et al., 2014); it is convex and can be minimized easily. Although the ℓ1 norm enjoys several good properties, it is sensitive to outliers and may cause serious bias in estimation (Fan and Li, 0000, Fan and Li, 2001). To overcome this defect, many non-convex surrogates have been proposed and analyzed, including the smoothly clipped absolute deviation (SCAD) (Fan & Li, 0000), the log penalty (Candes et al., 2008, Mazumder et al., 2011), capped ℓ1 (Zhang, 2009, Zhang, 2010), the minimax concave penalty (MCP) (Zhang et al., 2010), the ℓp penalty with p ∈ (0,1) (Krishnan and Fergus, 2009, Xu, 2010, Xu et al., 2012), the difference of the ℓ1 and ℓ2 norms (Esser et al., 2013, Lou et al., 2015, Yin et al., 2015), and the transformed ℓ1 penalty (Nikolova, 2000, Zhang and Xin, 2017, Zhang and Xin, 2018). A growing body of work has shown the superior performance of non-convex regularizers in both theoretical analysis and real-world applications. Generally speaking, non-convex regularizers are more likely than convex ones to produce unbiased models with sparser solutions (Fan and Li, 0000, Fan and Li, 2001, Xu, 2010, Xu et al., 2012).
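As an illustration, the short sketch below evaluates the transformed ℓ1 penalty in its standard form ρ_a(x) = (a+1)|x|/(a+|x|), a > 0 (Nikolova, 2000; Zhang & Xin, 2017), and shows how the parameter a interpolates between ℓ0-like and ℓ1-like behavior. Python with PyTorch is assumed here and in the later sketches; it is not the authors' stated implementation.

```python
import torch

def tl1_penalty(w: torch.Tensor, a: float) -> torch.Tensor:
    """Transformed l1 penalty rho_a(w) = (a+1)|w| / (a+|w|), summed over entries."""
    abs_w = w.abs()
    return ((a + 1.0) * abs_w / (a + abs_w)).sum()

w = torch.tensor([0.0, 0.01, 0.5, -2.0, 3.0])

# a -> 0: behaves like the l0 "norm" (counts the non-zero entries);
# a -> infinity: approaches the l1 norm.
print(tl1_penalty(w, a=1e-4))    # ~4.0, close to the number of non-zeros
print(tl1_penalty(w, a=1e4))     # ~5.51, close to the l1 norm of w
print(w.abs().sum())             # l1 norm = 5.51, for comparison
```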

When applied to DNNs, a sparse regularizer is expected to zero out redundant weights and thus remove unnecessary connections. Removing a large number of connections can greatly reduce computation and memory requirements. Since the variables in DNNs are weights, usually modeled as matrices or tensors, we aim to design a regularizer that does not add excessive computational complexity. In general, solving a non-convex optimization problem is much harder than solving a convex one. Given that DNNs are themselves complicated and their training already requires a great deal of computation, we suspect this is the main reason why non-convex regularizers have rarely been used to sparsify DNNs. In this work, we develop a non-convex regularizer that avoids these difficulties when applied to DNNs. After comparing the properties of commonly used non-convex regularizers, we choose the transformed ℓ1 penalty for our model. It satisfies the three properties that a good penalty should induce in the resulting estimator, namely unbiasedness, sparsity and continuity (Fan & Li, 2001). Besides, it has a simple formula and a tunable parameter that adapts it to different tasks. In addition, although it is non-convex and non-smooth, its thresholding function has a closed-form solution, which allows a proximal gradient scheme to be used and thereby avoids most of the extra computational cost introduced by the non-convex regularizer. To further reduce the size of the network, we also employ group sparsity as an auxiliary to the transformed ℓ1 penalty to remove unnecessary neurons, owing to its remarkable performance in promoting neuron-level sparsity (Fang et al., 2015, Lebedev and Lempitsky, 2016, Scardapane et al., 2017, Simon et al., 2013, Yuan and Lin, 2006, Zhou et al., 2016). By combining the transformed ℓ1 penalty and group sparsity, we propose a new integrated transformed ℓ1 regularizer. Although there have been attempts to apply sparse optimization methods with non-convex regularization to train sparse DNNs, to the best of our knowledge our work is the first to use a non-convex regularizer in a sparse optimization based method to promote neuron-level and connection-level sparsity in DNNs simultaneously. Extensive experiments are carried out to show the effectiveness of our method. The contribution of this paper is three-fold:

  • To obtain sparse DNNs, a new model with a non-convex regularizer is proposed. The regularizer integrates the transformed ℓ1 penalty and group sparsity. To the best of our knowledge, this is the first work that uses a non-convex regularizer to induce sparsity in DNNs at the neuron level and the connection level at the same time.

  • To train the new model, an algorithm based on proximal gradient descent is proposed. Although the transformed ℓ1 regularizer is non-convex, the proximal operators in our algorithm have closed-form solutions and can be computed easily, so most of the extra computational cost introduced by the non-convex regularizer is avoided (a sketch of this thresholding step is given after this list).

  • Extensive computer vision experiments are conducted on several public datasets. Compared with popular baselines, the experimental results show the effectiveness of the proposed regularizer.
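The closed-form thresholding referred to in the second contribution can be sketched as follows. The formula reproduces the closed-form representation of the transformed ℓ1 proximal operator derived by Zhang and Xin (2017); it is given here only as an illustrative sketch and should be verified against the original derivation before use.

```python
import math
import torch

def tl1_prox(y: torch.Tensor, lam: float, a: float) -> torch.Tensor:
    """Proximal (thresholding) operator of lam * rho_a(.), applied elementwise.

    Solves argmin_x 0.5*(x - y)^2 + lam*(a+1)*|x|/(a+|x|) in closed form,
    following the representation of Zhang & Xin (2017) (reproduced as an
    assumption; verify against the original paper)."""
    abs_y = y.abs()
    # trigonometric representation of the non-zero stationary point
    # (arccos argument clipped to [-1, 1] for numerical safety)
    phi = torch.arccos(torch.clamp(
        1.0 - 27.0 * lam * a * (a + 1.0) / (2.0 * (a + abs_y) ** 3),
        -1.0, 1.0))
    g = torch.sign(y) * ((2.0 / 3.0) * (a + abs_y) * torch.cos(phi / 3.0)
                         - 2.0 * a / 3.0 + abs_y / 3.0)
    # threshold below which the minimizer is exactly zero
    if lam <= a * a / (2.0 * (a + 1.0)):
        t = lam * (a + 1.0) / a
    else:
        t = math.sqrt(2.0 * lam * (a + 1.0)) - a / 2.0
    return torch.where(abs_y > t, g, torch.zeros_like(y))
```

Because this operator acts elementwise and in closed form, applying it after each gradient step adds only a negligible overhead compared with the forward and backward passes of the network.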

The rest of the paper is organized as follows. Section 2 surveys existing sparse optimization based works that aim to promote sparsity in DNNs, as well as some popular non-convex regularizers. Section 3 introduces the new integrated transformed ℓ1 regularizer and presents a proximal gradient algorithm for the resulting model. Experiments on several public classification datasets are reported in Section 4. We conclude the paper in Section 5.

Section snippets

Sparse optimization for DNNs

Sparse optimization based approaches achieve sparsity in DNNs by introducing a sparse regularization term into the objective function and turning the training process into a regularized optimization problem. Some pruning methods are also equipped with an objective function regularized by certain norms. However, these two categories of methods are inherently different. Pruning methods do not aim to learn the final values of the weights, but rather to learn which connections are significant. In contrast, the

DNNs with transformed ℓ1 regularizer

Our objective is to construct a sparse neural network with fewer parameters and comparable or even better performance than its dense counterpart. In a network with L layers, let W^(l) denote the weight matrix of the l-th layer. Regularizing the weights of each layer separately, the training objective for supervised learning can be formulated as

$$\min_{\{W^{(l)}\}} \; \mathcal{L}\big(\{W^{(l)}\}, \mathcal{T}\big) + \lambda \sum_{l=1}^{L} \Omega\big(W^{(l)}\big),$$

where $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{N}$ is a training dataset with $N$ instances, in which $x_i \in \mathbb{R}^p$ is a $p$-dimensional input
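To illustrate how such a regularized objective might be optimized with a stochastic proximal gradient scheme, the sketch below takes one SGD step on the data loss and then applies proximal steps for a transformed ℓ1 term (reusing the tl1_prox sketch above) and a row-wise group sparsity term. The architecture, the coefficients lam1, lam2, a, the learning rate, the row-wise grouping, and the sequential application of the two proximal maps are all illustrative assumptions, not the paper's exact algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative settings only (not the paper's setup); tl1_prox is the
# closed-form thresholding sketch given earlier.
lam1, lam2, a, lr = 1e-4, 1e-4, 1.0, 0.1
net = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))

def group_prox(W: torch.Tensor, lam: float) -> torch.Tensor:
    """Row-wise group soft-thresholding: prox of lam * sum_i ||W[i, :]||_2.

    A row of an nn.Linear weight holds the incoming weights of one output
    neuron, so zeroing a whole row effectively removes that neuron."""
    norms = W.norm(dim=1, keepdim=True)
    return W * torch.clamp(1.0 - lam / (norms + 1e-12), min=0.0)

def prox_sgd_step(x: torch.Tensor, y: torch.Tensor) -> float:
    # (1) stochastic gradient step on the data loss L({W}, T) only
    loss = F.cross_entropy(net(x), y)
    net.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in net.parameters():
            p -= lr * p.grad
        # (2) proximal step on the regularizer for each weight matrix
        #     (biases left unregularized); applying the two proximal maps
        #     one after the other is a simplification, not the exact prox
        #     of the combined regularizer.
        for m in net:
            if isinstance(m, nn.Linear):
                W = tl1_prox(m.weight, lr * lam1, a)      # connection level
                m.weight.copy_(group_prox(W, lr * lam2))  # neuron level
    return loss.item()
```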

Experiments

In this section, we evaluate the proposed combined regularizer on several real-world datasets. The regularizer is applied to all layers of the network, except the bias term.
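In a PyTorch-style implementation (an assumed framework, reusing the tl1_penalty sketch and the hypothetical net, lam1 and a from the earlier sketches), restricting the regularizer to weight matrices while skipping the bias terms can be done by filtering parameters by name, for example:

```python
# Sketch: regularize only weight matrices, skipping every bias vector.
weights = [p for name, p in net.named_parameters() if not name.endswith("bias")]
reg_term = lam1 * sum(tl1_penalty(W, a) for W in weights)
```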

Conclusion

In this work, we introduce a new sparsity-inducing regularization called the integrated transformed ℓ1 regularizer, in which a group sparsity regularizer exploits the structural information of neural networks and removes redundant neurons, while a transformed ℓ1 norm enforces sparsity on the network connections. We verify the performance of our regularizer on several public datasets. Experimental results demonstrate the effectiveness of the proposed regularizer, when comparing it with five prominent

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant No. 11671379. We thank the anonymous referees for their constructive suggestions.

References (65)

  • Cheng, Y., et al. (2018). Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Processing Magazine.
  • Cheng, Y., et al. (2015). An exploration of parameter redundancy in deep networks with circulant projections.
  • Collins, M. D., & Kohli, P. (0000). Memory bounded deep convolutional networks. ArXiv preprint....
  • Cun, Y. L., Denker, J. S., & Solla, S. A. (1989). Optimal brain damage. In International conference on neural...
  • Deng, L., et al. (2014). Deep learning: methods and applications. Foundations and Trends® in Signal Processing.
  • Denil, M., et al. Predicting parameters in deep learning.
  • Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics.
  • Esser, E., et al. (2013). A method for finding structured sparse solutions to nonnegative least squares problems with applications. SIAM Journal on Imaging Sciences.
  • Fan, J., & Li, R. (0000). Variable selection via penalized likelihood, Department of Statistics...
  • Fan, J., et al. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association.
  • Fang, Y., et al. (2015). Graph-based learning via auto-grouped sparse regularization and kernelized extension. IEEE Transactions on Knowledge and Data Engineering.
  • Gong, C., et al. (2017). Ensemble teaching for hybrid label propagation. IEEE Transactions on Cybernetics.
  • Gong, C., et al. (2016). Multi-modal curriculum learning for semi-supervised image classification. IEEE Transactions on Image Processing.
  • Goodfellow, I., et al. (2016). Deep learning, vol. 1.
  • Gui, J., et al. (2017). Feature selection based on structured sparsity: A comprehensive study. IEEE Transactions on Neural Networks and Learning Systems.
  • Han, S., et al. Learning both weights and connections for efficient neural network.
  • Hanson, S. J., et al. Comparing biases for minimal network construction with back-propagation.
  • Hassibi, B., et al. Second order derivatives for network pruning: Optimal brain surgeon.
  • Hassibi, B., et al. Optimal brain surgeon and general network pruning.
  • Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (0000). Improving neural networks by...
  • Kang, G., et al. (2017). Shakeout: A new approach to regularized deep neural network training. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Krishnan, D., & Fergus, R. (2009). Fast image deconvolution using hyper-laplacian priors. In International conference...