Abstract
Choosing a learning rate is a necessary part of any subgradient-method optimization. For deeper models such as convolutional neural networks for image classification, fine-tuning the learning rate quickly becomes tedious and does not always yield optimal convergence. In this work, we propose a variation of the subgradient method in which the learning rate is updated by a control step at each iteration of each epoch. The Stochastic Perturbation Subgradient Algorithm (SPSA) is our approach to image classification problems with deep neural networks, including convolutional neural networks. On the MNIST dataset, the numerical results show that our SPSA method converges faster than Stochastic Gradient Descent and its fixed-learning-rate variants. Moreover, combining SPSA with a convolutional neural network model improves image classification results in both loss and accuracy.
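The idea of a subgradient step combined with a stochastic perturbation can be sketched as follows. This is a minimal illustration only: the decay schedules for the step size and the perturbation, and the test function, are assumptions for demonstration, not the paper's actual control step.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_subgradient_step(w, subgrad, lr, sigma):
    """One stochastically perturbed subgradient step (illustrative sketch).

    `subgrad` returns a subgradient of the (possibly nonsmooth, nonconvex)
    loss at w; `sigma` scales the Gaussian perturbation, which helps the
    iterates escape non-optimal stationary points.
    """
    g = subgrad(w)
    xi = sigma * rng.standard_normal(w.shape)
    return w - lr * g + xi

# Toy example: minimize the nonsmooth function f(w) = |w - 1|,
# whose subgradient is sign(w - 1). Step size and perturbation
# are decayed over the iterations (an illustrative schedule).
w = np.array([0.0])
for k in range(200):
    w = perturbed_subgradient_step(
        w,
        subgrad=lambda v: np.sign(v - 1.0),
        lr=0.5 / (1 + k),
        sigma=0.05 / (1 + k),
    )
# w is now close to the minimizer w* = 1
```

The perturbation term distinguishes this from a plain subgradient step; with a decaying `sigma` the added noise vanishes as the iterates approach a minimizer.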
Notes
The multilayer perceptron is discussed in detail in Sect. 3.
In its most general form, a directed graph is an ordered pair \(G = (V,E)\), where V is a set of nodes and E is a set of edges linking the nodes: \((u,v) \in E\) denotes a directed edge from node u to node v. Given two units u and v in a network graph, a directed edge from u to v indicates that the output of unit u is used as input by unit v.
A one-hot vector v is a binary vector with a single non-zero component, which takes the value 1.
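For instance, a class label can be encoded as such a vector as follows (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def one_hot(label, num_classes):
    """Return the binary vector of length num_classes with a single 1
    at position `label` and zeros elsewhere."""
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

# MNIST digit 3 encoded over the 10 classes:
v = one_hot(3, 10)
```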
Weight decay is a term used to describe the \(\ell _2\)-regularization; see Bishop (1995) for more information.
For \(p = 1\), the norm \(\Vert \cdot \Vert _1\) is defined as \(\Vert \textbf{w}\Vert _1 =\sum _{\ell = 1}^{K} \sum _{i=1}^{\eta _\ell }\sum _{j=1}^{\eta _{\ell -1}}{\vert w_{ij}^{(\ell )} \vert }\).
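The triple sum in this definition simply accumulates the absolute values of every entry of every layer's weight matrix. A short sketch, assuming the weights are stored as one matrix of shape \((\eta _\ell , \eta _{\ell -1})\) per layer:

```python
import numpy as np

def l1_norm(layer_weights):
    """Sum |w_ij^(l)| over all layers l = 1..K and all matrix entries,
    i.e. the l1-norm of the full weight vector of the network."""
    return sum(np.abs(W).sum() for W in layer_weights)

# Illustrative two-layer network: eta_0 = 3 inputs, eta_1 = 2, eta_2 = 1.
rng = np.random.default_rng(1)
weights = [rng.standard_normal((2, 3)), rng.standard_normal((1, 2))]
val = l1_norm(weights)
```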
By averaging the predictions of different models, model averaging attempts to reduce inaccuracy (Hinton et al. 2012).
References
Bagirov AM, Jin L, Karmitsa N, Al Nuaimat A, Sultanova N (2013) Subgradient method for nonconvex nonsmooth optimization. J Optim Theory Appl 157:416–435
Bengio Y (2009) Learning deep architectures for AI. Found Trends Mach Learn 2(1):1–127
Bishop C (1995) Neural networks for pattern recognition. Clarendon Press, Oxford
Bishop C (2006) Pattern recognition and machine learning. Springer, New York
Bojarski M, Del Testa D, Dworakowski D, Firner B, Flepp B, Goyal P, Jackel LD, Monfort M, Muller U, Zhang J et al (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316
Botev A, Lever G, Barber D (2017) Nesterov’s accelerated gradient and momentum as approximations to regularised update descent. In: Neural networks (IJCNN) 2017 international joint conference on, pp 1899–1903
Ciresan DC, Meier U, Schmidhuber J (2012) Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745
Cui Y, He Z, Pang J (2020) Multicomposite nonconvex optimization for training deep neural networks. SIAM J Optim 30(2):1693–1723
Dem’yanov VF, Vasil’ev LV (1985) Nondifferentiable optimization. Optimization Software, Inc., Publications Division, New York
Duchi JC, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12:2121–2159
Duda R, Hart P, Stork D (2001) Pattern classification. Wiley, New York
El Jaafari I, Ellahyani A, Charfi S (2021) Parametric rectified nonlinear unit (PRenu) for convolution neural networks. J Signal Image Video Process (SIViP) 15:241–246
El Mouatasim A (2018) Implementation of reduced gradient with bisection algorithms for non-convex optimization problem via stochastic perturbation. J Numer Algorithms 78(1):41–62
El Mouatasim A (2019) Control proximal gradient algorithm for \(\ell _1\) regularization image. J Signal Image Video Process (SIViP) 13(6):1113–1121
El Mouatasim A (2020) Fast gradient descent algorithm for image classification with neural networks. J Signal Image Video Process (SIViP) 14:1565–1572
El Mouatasim A, Wakrim M (2015) Control subgradient algorithm for image regularization. J Signal Image Video Process (SIViP) 9:275–283
El Mouatasim A, Ellaia R, Souza de Cursi JE (2006) Random perturbation of variable metric method for unconstraint nonsmooth nonconvex optimization. Appl Math Comput Sci 16(4):463–474
El Mouatasim A, Ellaia R, Souza de Cursi JE (2011) Projected variable metric method for linear constrained nonsmooth global optimization via perturbation stochastic. Int J Appl Math Comput Sci 21(2):317–329
El Mouatasim A, Ellaia R, Souza de Cursi JE (2014) Stochastic perturbation of reduced gradient & GRG methods for nonconvex programming problems. J Appl Math Comput 226:198–211
Feng J, Lu S (2019) Performance analysis of various activation functions in artificial neural networks. J Phys Conf Ser. https://doi.org/10.1088/1742-6596/1237/2/022030
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: International conference on artificial intelligence and statistics, pp 249–256
Haykin S (2005) Neural networks a comprehensive foundation. Pearson Education, New Delhi
Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov R (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580
Huang K, Hussain A, Wang Q, Zhang R (2019) Deep learning: fundamentals, theory and applications. Springer, Berlin
Jarrett K, Kavukcuogl K, Ranzato M, LeCun Y (2009) What is the best multi-stage architecture for object recognition? In: International conference on computer vision, pp 2146–2153
Josef S (2022) A few samples from the MNIST test dataset. https://commons.wikimedia.org/wiki/File:MnistExamples.png. Accessed 12 Dec. Under Creative Commons Attribution-ShareAlike 4.0 International License
Khalij L, de Cursi ES (2021) Uncertainty quantification in data fitting neural and Hilbert networks. In: Proceedings of the 5th international symposium on uncertainty quantification and stochastic modelling, pp 222–241. https://doi.org/10.1007/978-3-030-53669-5_17
Kingma DP, Ba JL (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations, San Diego, CA
Konstantin E, Johannes S (2019) A comparison of deep networks with ReLU activation function and linear spline-type methods. Neural Netw 110:232–242
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25:1097–1105
Kutyniok G (2022) The mathematics of artificial intelligence. arXiv preprint arXiv:2203.08890
LeCun Y (1989) Generalization and network design strategies. Connect Perspect 19:143–155
LeCun Y, Cortes C (2010) MNIST handwritten digit database
LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Backpropagation applied to handwritten zip code recognition. Neural Comput 1(4):541–551
LeCun Y, Kavukvuoglu K, Farabet C (2010) Convolutional networks and applications in vision. In: International symposium on circuits and systems, vol 5, pp 253–256
Liu Z, Liu H (2019) An efficient gradient method with approximately optimal stepsize based on tensor model for unconstrained optimization. J Optim Theory Appl 181:608–633
Li J, Yang X (2020) A cyclical learning rate method in deep learning training. In: International conference on computer, information and telecommunication systems (CITS), pp 1–5
Minsky ML (1954) Theory of neural-analog reinforcement systems and its application to the brain-model problem. Ph.D. dissertation, Princeton University
Nakamura K, Derbel B, Won K-J, Hong B-W (2021) Learning-rate annealing methods for deep neural networks. Electronics 10:2029
Neutelings I (2022) Graphics with TikZ in LaTeX. Neural networks. https://tikz.net/neura_networks. Accessed 12 Dec. Under Creative Commons Attribution-ShareAlike 4.0 International License
Pelletier C, Webb GI, Petitjean F (2019) Temporal convolutional neural network for the classification of satellite image time series. Remote Sens 11(5):523
Pogu M, Souza de Cursi JE (1994) Global optimization by random perturbation of the gradient method with a fixed parameter. J Global Optim 5:159–180
Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 65(6):386–408. https://doi.org/10.1037/h0042519
Singh BK, Verma K, Thoke AS (2015) Adaptive gradient descent backpropagation for classification of breast tumors in ultrasound imaging. Procedia Comput Sci 46:1601–1609
Stutz D (2014) Understanding convolutional neural networks. Seminar report, Fakultät für Mathematik, Informatik und Naturwissenschaften
Szandała T (2021) Review and comparison of commonly used activation functions for deep neural networks. In: Bhoi A, Mallick P, Liu CM, Balas V (eds) Bio-inspired neurocomputing. Studies in computational intelligence, vol 903. Springer, Singapore. https://doi.org/10.1007/978-981-15-5495-7_11
Tuyen TT, Hang-Tuan N (2021) Backtracking gradient descent method and some applications in large scale optimisation. Part 2. Appl Math Optim 84:2557–2586
Uryas’ev SP (1991) New variable-metric algorithms for nondifferentiable optimization problems. J Optim Theory Appl 71(2):359–388
Wójcik B, Maziarka L, Tabor J (2018) Automatic learning rate in gradient descent. Schedae Inf 27:47–57
Xinhua L, Qian Y (2015) Face recognition based on deep neural network. Int J Signal Process Image Process Pattern Recogn 8(10):29–38
Zeiler MD, Fergus R (2013) Visualizing and understanding convolutional networks. arXiv preprint arXiv:1311.2901
Acknowledgements
We are indebted to the anonymous Reviewers and Editors for their many helpful recommendations and insightful remarks that helped us improve the original article.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare. All co-authors have seen and agree with the contents of the manuscript “Stochastic Perturbation of Subgradient Algorithm for Nonconvex Deep Neural Networks” and there is no financial interest to report. We certify that the submission is original work and is not under review at any other publication.
Additional information
Communicated by Antonio José Silva Neto.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
El Mouatasim, A., de Cursi, J.E.S. & Ellaia, R. Stochastic perturbation of subgradient algorithm for nonconvex deep neural networks. Comp. Appl. Math. 42, 167 (2023). https://doi.org/10.1007/s40314-023-02307-9
Received:
Revised:
Accepted:
Published:
Keywords
- Subgradient algorithm
- Nonconvex nonsmooth optimization
- Stochastic perturbation
- Learning rate
- Image classification
- Deep neural networks and CNN