
SRS-DNN: a deep neural network with strengthening response sparsity

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Inspired by the sparse mechanism of biological neural systems, an approach that strengthens response sparsity for deep learning is presented in this paper. First, an unsupervised sparse pre-training process is implemented, from which a sparse deep network begins to take shape. To prevent all of the network connections from being readjusted during the subsequent fine-tuning process, regularization terms that strengthen sparse responsiveness are added to the fine-tuning loss function. More importantly, unified and concise residual formulae for updating the network are deduced, which ensure that the backpropagation algorithm performs successfully. These residual formulae significantly improve existing sparse fine-tuning methods, such as the one used in the sparse autoencoder of Andrew Ng. In this way, the sparse structure obtained during pre-training can be maintained, and sparse abstract features of the data can be extracted effectively. Numerical experiments show that with this sparsity-strengthened learning method, the sparse deep neural network achieves the best classification performance among several classical classifiers; moreover, its sparse learning ability and time complexity are better than those of traditional deep learning methods.



References

  1. Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
  2. LeCun Y, Bengio Y, Hinton GE (2015) Deep learning. Nature 521(7553):436–444
  3. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
  4. Olshausen BA, Field DJ (2004) Sparse coding of sensory inputs. Curr Opin Neurobiol 14(4):481–487
  5. Morris G, Nevet A, Bergman H (2003) Anatomical funneling, sparse connectivity and redundancy reduction in the neural networks of the basal ganglia. J Physiol Paris 97(4–6):581–589
  6. Ji N, Zhang J, Zhang C et al (2014) Enhancing performance of restricted Boltzmann machines via log-sum regularization. Knowl Based Syst 63:82–96
  7. Banino A, Barry C et al (2018) Vector-based navigation using grid-like representations in artificial agents. Nature. https://doi.org/10.1038/s41586-018-0102-6
  8. Zhang H, Wang S, Xu X et al (2018) Tree2Vector: learning a vectorial representation for tree-structured data. IEEE Trans Neural Netw Learn Syst 29:1–15
  9. Zhang H, Wang S, Zhao M et al (2018) Locality reconstruction models for book representation. IEEE Trans Knowl Data Eng 30:873–1886
  10. Barlow HB (1972) Single units and sensation: a neuron doctrine for perceptual psychology. Perception 38(4):795–798
  11. Nair V, Hinton GE (2009) 3D object recognition with deep belief nets. In: International conference on neural information processing systems, pp 1339–1347
  12. Lee H, Ekanadham C, Ng AY (2008) Sparse deep belief net model for visual area V2. Adv Neural Inf Process Syst 20:873–880
  13. Lee H, Grosse R, Ranganath R et al (2011) Unsupervised learning of hierarchical representations with convolutional deep belief networks. Commun ACM 54(10):95–103
  14. Ranzato MA, Poultney C, Chopra S, LeCun Y (2006) Efficient learning of sparse representations with an energy-based model. Adv Neural Inf Process Syst 19:1137–1144
  15. Thom M, Palm G (2013) Sparse activity and sparse connectivity in supervised learning. J Mach Learn Res 14(1):1091–1143
  16. Wan W, Mabu S, Shimada K et al (2009) Enhancing the generalization ability of neural networks through controlling the hidden layers. Appl Soft Comput 9(1):404–414
  17. Jones M, Poggio T (1995) Regularization theory and neural networks architectures. Neural Comput 7(2):219–269
  18. Williams PM (1995) Bayesian regularization and pruning using a Laplace prior. Neural Comput 7(1):117–143
  19. Weigend AS, Rumelhart DE, Huberman BA (1990) Generalization by weight elimination with application to forecasting. In: Advances in neural information processing systems, pp 875–882
  20. Nowlan SJ, Hinton GE (1992) Simplifying neural networks by soft weight-sharing. Neural Comput 4(4):473–493
  21. Zhang J, Ji N, Liu J et al (2015) Enhancing performance of the backpropagation algorithm via sparse response regularization. Neurocomputing 153:20–40
  22. Ng A (2011) Sparse autoencoder. CS294A lecture notes, Stanford University
  23. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
  24. Bengio Y, Lamblin P, Popovici D, Larochelle H (2006) Greedy layer-wise training of deep networks. In: Advances in neural information processing systems, vol 19, pp 153–160
  25. Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14:1771–1800
  26. Hinton GE (2010) A practical guide to training restricted Boltzmann machines. Momentum 9(1):599–619
  27. Fischer A, Igel C (2014) Training restricted Boltzmann machines: an introduction. Pattern Recognit 47(1):25–39
  28. Donoho DL (2006) Compressed sensing. IEEE Trans Inf Theory 52(4):1289–1306
  29. Xiao H, Rasul K, Vollgraf R (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747v1
  30. van der Maaten L, Hinton GE (2008) Visualizing data using t-SNE. J Mach Learn Res 9(11):2579–2605


Acknowledgements

This research was funded by NSFC Nos. 11471006 and 11101327, the Fundamental Research Funds for the Central Universities (No. xjj2017126), the Science and Technology Project of Xi’an (No. 201809164CX5JC6) and the HPC Platform of Xi’an Jiaotong University.

Author information


Corresponding author

Correspondence to Chen Qiao.

Ethics declarations

Conflict of interest

The authors declare that they have no financial or other relationships that might lead to a conflict of interest regarding the present article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: The derivation of the KL divergence with respect to the parameters in Sect. 2.2

$$\begin{aligned} \frac{\partial }{\partial W_{ij}}\sum ^{N_h}_{j=1}KL(\rho \parallel p_j)= & {} \frac{\partial }{\partial W_{ij}}\sum ^{N_h}_{j=1}(\rho \log \frac{\rho }{p_j}+(1-\rho )\log \frac{1-\rho }{1-p_j})\\= & {} \left( -\frac{\rho }{p_j}+\frac{1-\rho }{1-p_j}\right) \frac{\partial p_j}{\partial W_{ij}}\\= & {} \left( -\frac{\rho }{p_j}+\frac{1-\rho }{1-p_j}\right) \frac{1}{N_s}\sum ^{N_s}_{q=1}\sigma ^{(q)}_{j}(1-\sigma ^{(q)}_{j})v^{(q)}_{i}\\= & {} \frac{1}{N_s}\left( -\frac{\rho }{p_j}+\frac{1-\rho }{1-p_j}\right) \sum ^{N_s}_{q=1}\sigma ^{(q)}_{j}(1-\sigma ^{(q)}_{j})v^{(q)}_{i}\\ \frac{\partial }{\partial \beta _{j}}\sum ^{N_h}_{j=1}KL(\rho \parallel p_j)= & {} \frac{\partial }{\partial \beta _{j}}\sum ^{N_h}_{j=1}(\rho \log \frac{\rho }{p_j}+(1-\rho )\log \frac{1-\rho }{1-p_j})\\= & {} \left( -\frac{\rho }{p_j}+\frac{1-\rho }{1-p_j}\right) \frac{\partial p_j}{\partial \beta _{j}}\\= & {} \left( -\frac{\rho }{p_j}+\frac{1-\rho }{1-p_j}\right) \frac{1}{N_s}\sum ^{N_s}_{q=1}\sigma ^{(q)}_{j}(1-\sigma ^{(q)}_{j})\\= & {} \frac{1}{N_s}\left( -\frac{\rho }{p_j}+\frac{1-\rho }{1-p_j}\right) \sum ^{N_s}_{q=1}\sigma ^{(q)}_{j}(1-\sigma ^{(q)}_{j}) \end{aligned}$$

here \(\sigma ^{(q)}_{j}=\sigma (\sum ^{N_v}_{i=1}v^{(q)}_{i}W_{ij}+\beta _j)=\frac{1}{1+e^{-\sum ^{N_v}_{i=1}v^{(q)}_{i}W_{ij}-\beta _j}}\).
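
As a concrete illustration of these formulae (our own sketch, not part of the original paper), the following NumPy code computes the batch-averaged hidden activation \(p_j\) and the resulting gradients of \(\sum _j KL(\rho \parallel p_j)\) with respect to \(W_{ij}\) and \(\beta _j\); the names `V`, `W`, `beta` and `rho` are our own.

```python
import numpy as np

def kl_sparsity_gradients(V, W, beta, rho):
    """Gradients of sum_j KL(rho || p_j) w.r.t. W and beta.

    V    : (N_s, N_v) batch of visible vectors v^(q)
    W    : (N_v, N_h) visible-to-hidden weights W_ij
    beta : (N_h,)     hidden biases beta_j
    rho  : float      target sparsity level
    """
    sigma = 1.0 / (1.0 + np.exp(-(V @ W + beta)))   # sigma_j^(q), shape (N_s, N_h)
    p = sigma.mean(axis=0)                          # p_j = (1/N_s) sum_q sigma_j^(q)
    coef = -rho / p + (1.0 - rho) / (1.0 - p)       # common factor of both gradients
    d_sigma = sigma * (1.0 - sigma)                 # sigma_j^(q) (1 - sigma_j^(q))
    N_s = V.shape[0]
    grad_W = coef * (V.T @ d_sigma) / N_s           # (N_v, N_h): derivative w.r.t. W_ij
    grad_beta = coef * d_sigma.sum(axis=0) / N_s    # (N_h,):     derivative w.r.t. beta_j
    return grad_W, grad_beta
```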

Appendix 2: The derivation of the updating formula for the traditional BP in Sect. 3.1

For traditional BP, the total error of the network in the backpropagation process, i.e., the loss function, is

$$\begin{aligned} J(W)=\frac{1}{2N}\sum ^{N}_{q=1}\sum ^{n_L}_{j=1}(a^{(L)}_{qj}-y_{qj})^2 \end{aligned}$$

where N is the training sample size, \(y_{qj}\) is the target output of the j-th neuron in the output layer corresponding to the q-th sample, and \(a^{(L)}_{qj}\) is its actual output. For simplicity, we first give the parameter updating formula for one sample. Consider \(J(W)=\frac{1}{2}\sum ^{n_L}_{j=1}(a^{(L)}_{j}-y_j)^2\) as the error of the network for one sample. Let \(\eta _1\) be the learning rate and \(W^{(l)}_{ij}\) be the connection weight between the i-th node in the l-th layer and the j-th node in the \((l+1)\)-th layer (\(1\le i\le n_l+1\), \(1\le j\le n_{l+1}\)); then we have the following update formula for the network parameters

$$\begin{aligned} W^{(l)}_{ij}= & {} W^{(l)}_{ij}-\eta _1\frac{\partial J(W)}{\partial W^{(l)}_{ij}}=W^{(l)}_{ij}-\eta _1\frac{\partial J(W)}{\partial z^{(l+1)}_{j}}\cdot \frac{\partial z^{(l+1)}_{j}}{\partial W^{(l)}_{ij}} \\= & {} W^{(l)}_{ij}-\eta _1\delta ^{(l+1)}_{j} a^{(l)}_{i} \end{aligned}$$

where \(\delta ^{(l+1)}_{j}=\frac{\partial J(W)}{\partial z^{(l+1)}_{j}}\) is the residual of the j-th node in the \((l+1)\)-th layer. For the L-th layer, i.e., the output layer, the residual of the j-th node is

$$\begin{aligned} \delta ^{(L)}_{j}= & {} \frac{\partial J(W)}{\partial z^{(L)}_{j}}=\frac{\partial }{\partial z^{(L)}_{j}}\frac{1}{2}\sum ^{n_L}_{k=1}(a^{(L)}_{k}-y_k)^2 \\= & {} \frac{\partial }{\partial z^{(L)}_{j}}\frac{1}{2}\sum ^{n_L}_{k=1}(f(z^{(L)}_{k})-y_k)^2\\= & {} (f(z^{(L)}_{j})-y_j) f^{'}(z^{(L)}_{j}) \end{aligned}$$

Suppose

$$\begin{aligned} \delta ^{(l)}= & {} (\delta ^{(l)}_{1},\delta ^{(l)}_{2},\ldots ,\delta ^{(l)}_{n_l})\\ f^{'}(z^{(l)})= & {} (f^{'}(z^{(l)}_{1}),f^{'}(z^{(l)}_{2}),\ldots ,f^{'}(z^{(l)}_{n_l})) \end{aligned}$$

thus the residual vector of the L-th layer is

$$\begin{aligned} \delta ^{(L)}=(a^{(L)}-y)_{\cdot }*(f^{'}(z^{(L)})) \end{aligned}$$
(20)

where \(_{\cdot }*\) denotes the Hadamard (element-wise) product, i.e., the product of corresponding elements of two vectors or matrices.
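
As a minimal numerical illustration of (20) (our own sketch, assuming sigmoid activations and toy values for `z_L` and `y`):

```python
import numpy as np

def f(z):          # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):    # its derivative: f'(z) = f(z)(1 - f(z))
    s = f(z)
    return s * (1.0 - s)

# Eq. (20): delta^(L) = (a^(L) - y) .* f'(z^(L)), '.*' being the Hadamard product
z_L = np.array([0.2, -1.0, 0.5])    # pre-activations of the output layer (toy values)
a_L = f(z_L)                        # actual outputs a^(L)
y   = np.array([0.0, 1.0, 0.0])     # target outputs
delta_L = (a_L - y) * f_prime(z_L)
```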

The residual of the j-th node for the \((L-1)\)-th layer is

$$\begin{aligned} \delta ^{(L-1)}_{j}= & {} \frac{\partial J(W)}{\partial z^{(L-1)}_{j}}=\frac{\partial }{\partial z^{(L-1)}_{j}}\frac{1}{2}\sum ^{n_L}_{k=1}(a^{(L)}_{k}-y_k)^2 \\= & {} \frac{\partial }{\partial z^{(L-1)}_{j}}\frac{1}{2}\sum ^{n_L}_{k=1}(f(z^{(L)}_{k})-y_k)^2\\= & {} \frac{1}{2}\sum ^{n_L}_{k=1}\frac{\partial }{\partial z^{(L-1)}_{j}}(f(z^{(L)}_{k})-y_k)^2 \\= & {} \sum ^{n_L}_{k=1}(f(z^{(L)}_{k})-y_k)\cdot f^{'}(z^{(L)}_{k})\cdot \frac{\partial z^{(L)}_{k}}{\partial z^{(L-1)}_{j}}\\= & {} \sum ^{n_L}_{k=1}\delta ^{(L)}_{k}\cdot \frac{\partial z^{(L)}_{k}}{\partial z^{(L-1)}_{j}} \\= & {} \sum ^{n_L}_{k=1}\delta ^{(L)}_{k}\cdot \frac{\partial }{\partial z^{(L-1)}_{j}}\sum ^{n_{L-1}}_{s=1}a^{(L-1)}_{s}\cdot W^{(L-1)}_{sk}\\= & {} \sum ^{n_L}_{k=1}W^{(L-1)}_{jk}\delta ^{(L)}_{k} f^{'}(z^{(L-1)}_{j}) \end{aligned}$$

The residual of the j-th node for the l-th layer \((l=L-1,\ldots ,2,1)\) is \(\delta ^{(l)}_{j}=(\sum ^{n_{l+1}}_{k=1}W^{(l)}_{jk}\delta ^{(l+1)}_{k})f^{'}(z^{(l)}_{j})\), thus the vector form of the residual for the l-th layer is

$$\begin{aligned} \delta ^{(l)}=((\delta ^{(l+1)})\cdot ({\bar{W}}^{(l)})^{\mathrm{T}})_{\cdot }*f^{'}(z^{(l)}) \end{aligned}$$
(21)

where \(\cdot\) denotes the matrix product and \({\bar{W}}^{(l)}\) consists of the first \(n_l\) rows of \(W^{(l)}\). Let

$$\begin{aligned} \Delta W^{(l)}= & {} (a^{(l)})^{\mathrm{T}} \cdot \delta ^{(l+1)} \end{aligned}$$
(22)

in which \(\delta ^{(l+1)}\) is defined by (20)–(21) (\(l=L-1,\ldots ,2,1\)).

For the case of N samples, by (22) we have \(\Delta W^{(l)}_{q,J}=(a^{(l)}_{q})^{\mathrm{T}} \cdot \delta ^{(l+1)}_{q}\) for each sample \(q=1,2,\ldots ,N\). Thus, the update formula for the network parameters in matrix form is

$$\begin{aligned} W^{(l)}= & {} W^{(l)}-\eta _1 \cdot \frac{1}{N} \sum ^{N}_{q=1}\Delta W^{(l)}_{q,J} \end{aligned}$$
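
Putting (20)–(22) together, the following sketch (our own illustration, not the authors' code) performs one plain-BP update, with the residuals backpropagated layer by layer and the per-sample gradients averaged over the N samples; biases and the sparsity-strengthening terms of SRS-DNN are omitted, and sigmoid activations are assumed throughout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_update(weights, X, Y, eta1):
    """One traditional-BP update of the weight list [W^(1), ..., W^(L-1)].

    X : (N, n_1) input batch, Y : (N, n_L) targets, eta1 : learning rate eta_1.
    """
    fp = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # f'(z) for the sigmoid

    # forward pass: store activations a^(l) and pre-activations z^(l)
    a, zs = [X], []
    for W in weights:
        zs.append(a[-1] @ W)
        a.append(sigmoid(zs[-1]))

    # Eq. (20): residual of the output layer (Hadamard product)
    delta = (a[-1] - Y) * fp(zs[-1])

    N = X.shape[0]
    new_weights = list(weights)
    for l in range(len(weights) - 1, -1, -1):
        # Eq. (22), averaged over the N samples: (1/N) sum_q (a_q^(l))^T delta_q^(l+1)
        new_weights[l] = weights[l] - eta1 * (a[l].T @ delta) / N
        if l > 0:
            # Eq. (21): delta^(l) = (delta^(l+1) W^(l)^T) .* f'(z^(l))
            delta = (delta @ weights[l].T) * fp(zs[l - 1])
    return new_weights
```

In the SRS-DNN fine-tuning described in the paper, the gradients of the sparsity-strengthening regularization terms would be added to the averaged gradient before the weight update.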

Appendix 3: AR results of SRS-DNN with different parameter sets

Table 6 lists different sparse parameter sets for the KL and \(L_1\) sparsity penalty terms in the RBM and in BP, i.e., \(\lambda _2\), \(\lambda _3\), \(\tau _1\) and \(\tau _2\), together with the corresponding classification accuracy rates. Based on these results, in this paper we select \(\lambda _2\), \(\lambda _3\), \(\tau _1\) and \(\tau _2\) to be 0.005, 0.0001, 0.0001 and 0.0002, respectively.

Table 6 AR results of SRS-DNN with different parameter sets
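
The selection described above amounts to a grid search over the four penalty coefficients. A hedged sketch of such a search is given below; `train_and_evaluate` is a hypothetical routine standing in for training the SRS-DNN with a given parameter set and returning its classification accuracy rate (AR), and the candidate values are illustrative rather than the exact grid of Table 6.

```python
from itertools import product

lambda2_grid = [0.001, 0.005, 0.01]          # candidates for lambda_2 (illustrative)
lambda3_grid = [0.00005, 0.0001, 0.0005]     # candidates for lambda_3
tau1_grid    = [0.00005, 0.0001, 0.0005]     # candidates for tau_1
tau2_grid    = [0.0001, 0.0002, 0.0005]      # candidates for tau_2

best_ar, best_params = -1.0, None
for lam2, lam3, tau1, tau2 in product(lambda2_grid, lambda3_grid, tau1_grid, tau2_grid):
    ar = train_and_evaluate(lam2, lam3, tau1, tau2)   # hypothetical training/evaluation routine
    if ar > best_ar:
        best_ar, best_params = ar, (lam2, lam3, tau1, tau2)

print("best AR:", best_ar, "with (lambda_2, lambda_3, tau_1, tau_2) =", best_params)
```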


About this article


Cite this article

Qiao, C., Gao, B. & Shi, Y. SRS-DNN: a deep neural network with strengthening response sparsity. Neural Comput & Applic 32, 8127–8142 (2020). https://doi.org/10.1007/s00521-019-04309-3

