DOI: 10.1145/3152494.3152498

Are saddles good enough for neural networks

Published: 11 January 2018

Abstract

Recent years have seen growing interest in understanding neural networks from an optimization perspective. It is now understood that converging to low-cost local minima is sufficient for such models to be effective in practice. In this work, however, we propose a new hypothesis, based on recent theoretical findings and empirical studies: that neural network models actually converge to saddle points with high degeneracy. Our findings are new and can have a significant impact on the development of gradient-descent-based methods for training neural networks. We validated our hypothesis through an extensive experimental evaluation on standard datasets such as MNIST and CIFAR-10, and also showed that recent efforts to escape saddles ultimately converge to saddles with high degeneracy, which we define as 'good saddles'. We also verified Wigner's semicircle law in our experimental results.
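
The abstract appeals to two quantities that are easy to illustrate even without the paper's code. The sketch below is a minimal illustration, not the authors' implementation: it samples a random symmetric (GOE) matrix as a stand-in for a loss Hessian, compares its eigenvalue histogram with Wigner's semicircle density, and measures degeneracy as the fraction of near-zero eigenvalues. The matrix size, the tolerance `tol`, and the GOE stand-in are all illustrative assumptions rather than details taken from the paper.

```python
# Minimal illustration (not the authors' code): Wigner's semicircle law and a
# simple degeneracy measure for a symmetric matrix standing in for a Hessian.
import numpy as np

def goe_matrix(n, rng):
    """Sample an n x n matrix from the Gaussian Orthogonal Ensemble,
    scaled so that the limiting spectrum lies in [-2, 2]."""
    a = rng.standard_normal((n, n))
    return (a + a.T) / np.sqrt(2.0 * n)

def semicircle_density(x, radius=2.0):
    """Wigner's semicircle density: 2/(pi r^2) * sqrt(r^2 - x^2) on [-r, r]."""
    inside = np.clip(radius ** 2 - np.asarray(x) ** 2, 0.0, None)
    return (2.0 / (np.pi * radius ** 2)) * np.sqrt(inside)

def degeneracy(eigvals, tol=1e-2):
    """Fraction of (numerically) zero eigenvalues -- one way to quantify how
    degenerate a critical point is (the tolerance is an assumption)."""
    return float(np.mean(np.abs(eigvals) < tol))

rng = np.random.default_rng(0)
H = goe_matrix(1000, rng)          # stand-in for a Hessian at a critical point
eigvals = np.linalg.eigvalsh(H)

# Compare the empirical spectrum with the semicircle prediction on a coarse grid.
hist, edges = np.histogram(eigvals, bins=40, range=(-2.0, 2.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
gap = np.max(np.abs(hist - semicircle_density(centers)))

print(f"max density gap to semicircle law: {gap:.3f}")
print(f"fraction of near-zero eigenvalues:  {degeneracy(eigvals):.4f}")
```

For a trained network one would instead apply `degeneracy` to the eigenvalues of the loss Hessian at the converged parameters (computed, for example, with an automatic-differentiation library); a saddle whose Hessian has a large fraction of near-zero eigenvalues is what the abstract terms a 'good saddle'. Note that for a pure GOE matrix the near-zero fraction is small, so a large value at convergence is what would distinguish a highly degenerate saddle from a generic random spectrum.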

Cited By

  • Combined methods for solving degenerate unconstrained optimization problems. Ukrains’kyi Matematychnyi Zhurnal 76:5 (2024), 695-718. DOI: 10.3842/umzh.v76i5.7395
  • Exploring nonlinear correlations among transition metal nanocluster properties using deep learning: a comparative analysis with LOO-CV method and cosine similarity. Nanotechnology 36:4 (2024), 045701. DOI: 10.1088/1361-6528/ad892c
  • Combined Methods for Solving Degenerate Unconstrained Optimization Problems. Ukrainian Mathematical Journal 76:5 (2024), 777-804. DOI: 10.1007/s11253-024-02353-4

Published In

CODS-COMAD '18: Proceedings of the ACM India Joint International Conference on Data Science and Management of Data
January 2018
379 pages
ISBN:9781450363419
DOI:10.1145/3152494

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. deep learning
  2. neural networks
  3. saddle points

Qualifiers

  • Research-article

Funding Sources

  • Intel Technology India Pvt Ltd
  • Ministry of Human Resource Development, Govt of India

Conference

CoDS-COMAD '18

Acceptance Rates

CODS-COMAD '18 Paper Acceptance Rate 50 of 150 submissions, 33%;
Overall Acceptance Rate 197 of 680 submissions, 29%
