
SRS-DNN: a deep neural network with strengthening response sparsity

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Inspired by the sparse mechanism of biological neural systems, an approach that strengthens response sparsity for deep learning is presented in this paper. First, an unsupervised sparse pre-training process is implemented, from which a sparse deep network begins to take shape. To prevent all of the network connections from being readjusted during the subsequent fine-tuning process, regularization terms that strengthen sparse responsiveness are added to the fine-tuning loss function. More importantly, unified and concise residual formulae for updating the network are deduced, which ensure that the backpropagation algorithm performs successfully. These residual formulae significantly improve existing sparse fine-tuning methods, such as the one used in the sparse autoencoder of Andrew Ng. In this way, the sparse structure obtained during pre-training can be maintained, and sparse abstract features of the data can be extracted effectively. Numerical experiments show that with this sparsity-strengthened learning method, the sparse deep neural network achieves the best classification performance among several classical classifiers; moreover, its sparse learning ability and time complexity are better than those of traditional deep learning methods.



References

  1. Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
  2. LeCun Y, Bengio Y, Hinton GE (2015) Deep learning. Nature 521(7553):436–444
  3. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
  4. Olshausen BA, Field DJ (2004) Sparse coding of sensory inputs. Curr Opin Neurobiol 14(4):481–487
  5. Morris G, Nevet A, Bergman H (2003) Anatomical funneling, sparse connectivity and redundancy reduction in the neural networks of the basal ganglia. J Physiol Paris 97(4–6):581–589
  6. Ji N, Zhang J, Zhang C et al (2014) Enhancing performance of restricted Boltzmann machines via log-sum regularization. Knowl Based Syst 63:82–96
  7. Banino A, Barry C et al (2018) Vector-based navigation using grid-like representations in artificial agents. Nature. https://doi.org/10.1038/s41586-018-0102-6
  8. Zhang H, Wang S, Xu X et al (2018) Tree2Vector: learning a vectorial representation for tree-structured data. IEEE Trans Neural Netw Learn Syst 29:1–15
  9. Zhang H, Wang S, Zhao M et al (2018) Locality reconstruction models for book representation. IEEE Trans Knowl Data Eng 30:873–1886
  10. Barlow HB (1972) Single units and sensation: a neuron doctrine for perceptual psychology. Perception 38(4):795–798
  11. Nair V, Hinton GE (2009) 3D object recognition with deep belief nets. In: International conference on neural information processing systems, pp 1339–1347
  12. Lee H, Ekanadham C, Ng AY (2008) Sparse deep belief net model for visual area V2. Adv Neural Inf Process Syst 20:873–880
  13. Lee H, Grosse R, Ranganath R et al (2011) Unsupervised learning of hierarchical representations with convolutional deep belief networks. Commun ACM 54(10):95–103
  14. Ranzato MA, Poultney C, Chopra S, LeCun Y (2006) Efficient learning of sparse representations with an energy-based model. Adv Neural Inf Process Syst 19:1137–1144
  15. Thom M, Palm G (2013) Sparse activity and sparse connectivity in supervised learning. J Mach Learn Res 14(1):1091–1143
  16. Wan W, Mabu S, Shimada K et al (2009) Enhancing the generalization ability of neural networks through controlling the hidden layers. Appl Soft Comput 9(1):404–414
  17. Jones M, Poggio T (1995) Regularization theory and neural networks architectures. Neural Comput 7(2):219–269
  18. Williams PM (1995) Bayesian regularization and pruning using a Laplace prior. Neural Comput 7(1):117–143
  19. Weigend AS, Rumelhart DE, Huberman BA (1990) Generalization by weight elimination with application to forecasting. In: Advances in neural information processing systems, pp 875–882
  20. Nowlan SJ, Hinton GE (1992) Simplifying neural networks by soft weight-sharing. Neural Comput 4(4):473–493
  21. Zhang J, Ji N, Liu J et al (2015) Enhancing performance of the backpropagation algorithm via sparse response regularization. Neurocomputing 153:20–40
  22. Ng A (2011) Sparse autoencoder. CS294A lecture notes, Stanford University
  23. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
  24. Bengio Y, Lamblin P, Popovici D, Larochelle H (2006) Greedy layer-wise training of deep networks. In: Advances in neural information processing systems, vol 19, pp 153–160
  25. Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14:1771–1800
  26. Hinton GE (2010) A practical guide to training restricted Boltzmann machines. Momentum 9(1):599–619
  27. Fischer A, Igel C (2014) Training restricted Boltzmann machines: an introduction. Pattern Recognit 47(1):25–39
  28. Donoho DL (2006) Compressed sensing. IEEE Trans Inf Theory 52(4):1289–1306
  29. Xiao H, Rasul K, Vollgraf R (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747v1
  30. van der Maaten L, Hinton GE (2008) Visualizing data using t-SNE. J Mach Learn Res 9(11):2579–2605


Acknowledgements

This research was funded by NSFC Nos. 11471006 and 11101327, the Fundamental Research Funds for the Central Universities (No. xjj2017126), the Science and Technology Project of Xi’an (No. 201809164CX5JC6) and the HPC Platform of Xi’an Jiaotong University.

Author information


Corresponding author

Correspondence to Chen Qiao.

Ethics declarations

Conflict of interest

The authors declare that they have no financial or other relationships that might lead to a conflict of interest regarding the present article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: The derivation of the KL divergence with respect to the parameters in Sect. 2.2

$$\begin{aligned} \frac{\partial }{\partial W_{ij}}\sum ^{N_h}_{j=1}KL(\rho \parallel p_j)= & {} \frac{\partial }{\partial W_{ij}}\sum ^{N_h}_{j=1}(\rho \log \frac{\rho }{p_j}+(1-\rho )\log \frac{1-\rho }{1-p_j})\\= & {} \left( -\frac{\rho }{p_j}+\frac{1-\rho }{1-p_j}\right) \frac{\partial p_j}{\partial W_{ij}}\\= & {} \left( -\frac{\rho }{p_j}+\frac{1-\rho }{1-p_j}\right) \frac{1}{N_s}\sum ^{N_s}_{q=1}\sigma ^{(q)}_{j}(1-\sigma ^{(q)}_{j})v^{(q)}_{i}\\= & {} \frac{1}{N_s}\left( -\frac{\rho }{p_j}+\frac{1-\rho }{1-p_j}\right) \sum ^{N_s}_{q=1}\sigma ^{(q)}_{j}(1-\sigma ^{(q)}_{j})v^{(q)}_{i}\\ \frac{\partial }{\partial \beta _{j}}\sum ^{N_h}_{j=1}KL(\rho \parallel p_j)= & {} \frac{\partial }{\partial \beta _{j}}\sum ^{N_h}_{j=1}(\rho \log \frac{\rho }{p_j}+(1-\rho )\log \frac{1-\rho }{1-p_j})\\= & {} \left( -\frac{\rho }{p_j}+\frac{1-\rho }{1-p_j}\right) \frac{\partial p_j}{\partial \beta _{j}}\\= & {} \left( -\frac{\rho }{p_j}+\frac{1-\rho }{1-p_j}\right) \frac{1}{N_s}\sum ^{N_s}_{q=1}\sigma ^{(q)}_{j}(1-\sigma ^{(q)}_{j})\\= & {} \frac{1}{N_s}\left( -\frac{\rho }{p_j}+\frac{1-\rho }{1-p_j}\right) \sum ^{N_s}_{q=1}\sigma ^{(q)}_{j}(1-\sigma ^{(q)}_{j}) \end{aligned}$$

here \(\sigma ^{(q)}_{j}=\sigma (\sum ^{N_v}_{i=1}v^{(q)}_{i}W_{ij}+\beta _j)=\frac{1}{1+e^{-\sum ^{N_v}_{i=1}v^{(q)}_{i}W_{ij}-\beta _j}}\).
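
As a concrete illustration of these formulae (our own sketch, not part of the original paper), the following NumPy code computes the batch-averaged hidden activation \(p_j\) and the resulting gradients of \(\sum _j KL(\rho \parallel p_j)\) with respect to \(W_{ij}\) and \(\beta _j\); the names `V`, `W`, `beta` and `rho` are our own.

```python
import numpy as np

def kl_sparsity_gradients(V, W, beta, rho):
    """Gradients of sum_j KL(rho || p_j) w.r.t. W and beta.

    V    : (N_s, N_v) batch of visible vectors v^(q)
    W    : (N_v, N_h) visible-to-hidden weights W_ij
    beta : (N_h,)     hidden biases beta_j
    rho  : float      target sparsity level
    """
    sigma = 1.0 / (1.0 + np.exp(-(V @ W + beta)))   # sigma_j^(q), shape (N_s, N_h)
    p = sigma.mean(axis=0)                          # p_j = (1/N_s) sum_q sigma_j^(q)
    coef = -rho / p + (1.0 - rho) / (1.0 - p)       # common factor of both gradients
    d_sigma = sigma * (1.0 - sigma)                 # sigma_j^(q) (1 - sigma_j^(q))
    N_s = V.shape[0]
    grad_W = coef * (V.T @ d_sigma) / N_s           # (N_v, N_h): derivative w.r.t. W_ij
    grad_beta = coef * d_sigma.sum(axis=0) / N_s    # (N_h,):     derivative w.r.t. beta_j
    return grad_W, grad_beta
```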

Appendix 2: The derivation of the updating formula for the traditional BP in Sect. 3.1

For traditional BP, the total error of the network in the backpropagation process, i.e., the loss function, is

$$\begin{aligned} J(W)=\frac{1}{2N}\sum ^{N}_{q=1}\sum ^{n_L}_{j=1}(a^{(L)}_{qj}-y_{qj})^2 \end{aligned}$$

where N is the training sample size, \(y_{qj}\) is the target output of the j-th neuron in the output layer corresponding to the q-th sample, and \(a^{(L)}_{qj}\) is its actual output. For simplicity, we first give the parameter updating formula for one sample. Consider \(J(W)=\frac{1}{2}\sum ^{n_L}_{j=1}(a^{(L)}_{j}-y_j)^2\) as the error of the network for one sample. Let \(\eta _1\) be the learning rate and \(W^{(l)}_{ij}\) be the connection weight between the i-th node in the l-th layer and the j-th node in the \((l+1)\)-th layer (\(1\le i\le n_l+1\), \(1\le j\le n_{l+1}\)); then we have the following update formula for the network parameters

$$\begin{aligned} W^{(l)}_{ij}= & {} W^{(l)}_{ij}-\eta _1\frac{\partial J(W)}{\partial W^{(l)}_{ij}}=W^{(l)}_{ij}-\eta _1\frac{\partial J(W)}{\partial z^{(l+1)}_{j}}\cdot \frac{\partial z^{(l+1)}_{j}}{\partial W^{(l)}_{ij}} \\= & {} W^{(l)}_{ij}-\eta _1\delta ^{(l+1)}_{j} a^{(l)}_{i} \end{aligned}$$

where \(\delta ^{(l+1)}_{j}=\frac{\partial J(W)}{\partial z^{(l+1)}_{j}}\) is the residual of the j-th node in the \((l+1)\)-th layer. For the L-th layer, i.e., the output layer, the residual of the j-th node is

$$\begin{aligned} \delta ^{(L)}_{j}= & {} \frac{\partial J(W)}{\partial z^{(L)}_{j}}=\frac{\partial }{\partial z^{(L)}_{j}}\frac{1}{2}\sum ^{n_L}_{k=1}(a^{(L)}_{k}-y_k)^2 \\= & {} \frac{\partial }{\partial z^{(L)}_{j}}\frac{1}{2}\sum ^{n_L}_{k=1}(f(z^{(L)}_{k})-y_k)^2\\= & {} (f(z^{(L)}_{j})-y_j) f^{'}(z^{(L)}_{j}) \end{aligned}$$

Suppose

$$\begin{aligned} \delta ^{(l)}= & {} (\delta ^{(l)}_{1},\delta ^{(l)}_{2},\ldots ,\delta ^{(l)}_{n_l})\\ f^{'}(z^{(l)})= & {} (f^{'}(z^{(l)}_{1}),f^{'}(z^{(l)}_{2}),\ldots ,f^{'}(z^{(l)}_{n_l})) \end{aligned}$$

thus the residual vector of the L-th layer is

$$\begin{aligned} \delta ^{(L)}=(a^{(L)}-y)_{\cdot }*(f^{'}(z^{(L)})) \end{aligned}$$
(20)

where \(_{\cdot }*\) denotes the Hadamard (element-wise) product, i.e., the product of corresponding elements of two vectors or matrices.
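
As a minimal numerical illustration of (20) (our own sketch, assuming sigmoid activations and toy values for `z_L` and `y`):

```python
import numpy as np

def f(z):          # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):    # its derivative: f'(z) = f(z)(1 - f(z))
    s = f(z)
    return s * (1.0 - s)

# Eq. (20): delta^(L) = (a^(L) - y) .* f'(z^(L)), '.*' being the Hadamard product
z_L = np.array([0.2, -1.0, 0.5])    # pre-activations of the output layer (toy values)
a_L = f(z_L)                        # actual outputs a^(L)
y   = np.array([0.0, 1.0, 0.0])     # target outputs
delta_L = (a_L - y) * f_prime(z_L)
```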

The residual of the j-th node for the \((L-1)\)-th layer is

$$\begin{aligned} \delta ^{(L-1)}_{j}= & {} \frac{\partial J(W)}{\partial z^{(L-1)}_{j}}=\frac{\partial }{\partial z^{(L-1)}_{j}}\frac{1}{2}\sum ^{n_L}_{k=1}(a^{(L)}_{k}-y_k)^2 \\= & {} \frac{\partial }{\partial z^{(L-1)}_{j}}\frac{1}{2}\sum ^{n_L}_{k=1}(f(z^{(L)}_{k})-y_k)^2\\= & {} \frac{1}{2}\sum ^{n_L}_{k=1}\frac{\partial }{\partial z^{(L-1)}_{j}}(f(z^{(L)}_{k})-y_k)^2 \\= & {} \sum ^{n_L}_{k=1}(f(z^{(L)}_{k})-y_k)\cdot f^{'}(z^{(L)}_{k})\cdot \frac{\partial z^{(L)}_{k}}{\partial z^{(L-1)}_{j}}\\= & {} \sum ^{n_L}_{k=1}\delta ^{(L)}_{k}\cdot \frac{\partial z^{(L)}_{k}}{\partial z^{(L-1)}_{j}} \\= & {} \sum ^{n_L}_{k=1}\delta ^{(L)}_{k}\cdot \frac{\partial }{\partial z^{(L-1)}_{j}}\sum ^{n_{L-1}}_{s=1}a^{(L-1)}_{s}\cdot W^{(L-1)}_{sk}\\= & {} \sum ^{n_L}_{k=1}W^{(L-1)}_{jk}\delta ^{(L)}_{k} f^{'}(z^{(L-1)}_{j}) \end{aligned}$$

The residual of the j-th node for the l-th layer \((l=L-1,\ldots ,2,1)\) is \(\delta ^{(l)}_{j}=(\sum ^{n_{l+1}}_{k=1}W^{(l)}_{jk}\delta ^{(l+1)}_{k})f^{'}(z^{(l)}_{j})\), thus the vector form of the residual for the l-th layer is

$$\begin{aligned} \delta ^{(l)}=((\delta ^{(l+1)})\cdot ({\bar{W}}^{(l)})^{\mathrm{T}})_{\cdot }*f^{'}(z^{(l)}) \end{aligned}$$
(21)

where \(\cdot\) denotes the matrix product and \({\bar{W}}^{(l)}\) consists of the first \(n_l\) rows of \(W^{(l)}\). Let

$$\begin{aligned} \Delta W^{(l)}= & {} (a^{(l)})^{\mathrm{T}} \cdot \delta ^{(l+1)} \end{aligned}$$
(22)

in which \(\delta ^{(l+1)}\) is defined by (20)–(21) (\(l=L-1,\ldots ,2,1\)).

For the case of N samples, by (22) we have \(\Delta W^{(l)}_{q,J}=(a^{(l)}_{q})^{\mathrm{T}} \cdot \delta ^{(l+1)}_{q}\) for each sample \(q=1,2,\ldots ,N\). Thus, the update formula for the network parameters in matrix form is

$$\begin{aligned} W^{(l)}= & {} W^{(l)}-\eta _1 \cdot \frac{1}{N} \sum ^{N}_{q=1}\Delta W^{(l)}_{q,J} \end{aligned}$$
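
Putting (20)–(22) together, the following sketch (our own illustration, not the authors' code) performs one plain-BP update, with the residuals backpropagated layer by layer and the per-sample gradients averaged over the N samples; biases and the sparsity-strengthening terms of SRS-DNN are omitted, and sigmoid activations are assumed throughout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_update(weights, X, Y, eta1):
    """One traditional-BP update of the weight list [W^(1), ..., W^(L-1)].

    X : (N, n_1) input batch, Y : (N, n_L) targets, eta1 : learning rate eta_1.
    """
    fp = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # f'(z) for the sigmoid

    # forward pass: store activations a^(l) and pre-activations z^(l)
    a, zs = [X], []
    for W in weights:
        zs.append(a[-1] @ W)
        a.append(sigmoid(zs[-1]))

    # Eq. (20): residual of the output layer (Hadamard product)
    delta = (a[-1] - Y) * fp(zs[-1])

    N = X.shape[0]
    new_weights = list(weights)
    for l in range(len(weights) - 1, -1, -1):
        # Eq. (22), averaged over the N samples: (1/N) sum_q (a_q^(l))^T delta_q^(l+1)
        new_weights[l] = weights[l] - eta1 * (a[l].T @ delta) / N
        if l > 0:
            # Eq. (21): delta^(l) = (delta^(l+1) W^(l)^T) .* f'(z^(l))
            delta = (delta @ weights[l].T) * fp(zs[l - 1])
    return new_weights
```

In the SRS-DNN fine-tuning described in the paper, the gradients of the sparsity-strengthening regularization terms would be added to the averaged gradient before the weight update.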

Appendix 3: AR results of SRS-DNN with different parameter sets

Table 6 lists different sparse parameter sets for the KL and \(L_1\) sparsity penalty terms in the RBM and in BP, i.e., \(\lambda _2\), \(\lambda _3\), \(\tau _1\) and \(\tau _2\), together with the corresponding classification accuracy rates. Based on these results, in this paper we select \(\lambda _2\), \(\lambda _3\), \(\tau _1\) and \(\tau _2\) to be 0.005, 0.0001, 0.0001 and 0.0002, respectively.

Table 6 AR results of SRS-DNN with different parameter sets
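
The selection described above amounts to a grid search over the four penalty coefficients. A hedged sketch of such a search is given below; `train_and_evaluate` is a hypothetical routine standing in for training the SRS-DNN with a given parameter set and returning its classification accuracy rate (AR), and the candidate values are illustrative rather than the exact grid of Table 6.

```python
from itertools import product

lambda2_grid = [0.001, 0.005, 0.01]          # candidates for lambda_2 (illustrative)
lambda3_grid = [0.00005, 0.0001, 0.0005]     # candidates for lambda_3
tau1_grid    = [0.00005, 0.0001, 0.0005]     # candidates for tau_1
tau2_grid    = [0.0001, 0.0002, 0.0005]      # candidates for tau_2

best_ar, best_params = -1.0, None
for lam2, lam3, tau1, tau2 in product(lambda2_grid, lambda3_grid, tau1_grid, tau2_grid):
    ar = train_and_evaluate(lam2, lam3, tau1, tau2)   # hypothetical training/evaluation routine
    if ar > best_ar:
        best_ar, best_params = ar, (lam2, lam3, tau1, tau2)

print("best AR:", best_ar, "with (lambda_2, lambda_3, tau_1, tau_2) =", best_params)
```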


About this article


Cite this article

Qiao, C., Gao, B. & Shi, Y. SRS-DNN: a deep neural network with strengthening response sparsity. Neural Comput & Applic 32, 8127–8142 (2020). https://doi.org/10.1007/s00521-019-04309-3

