Abstract
The technique known as “weight decay” in the literature on learning from data is investigated with tools from regularization theory. Weight-decay regularization is compared with Tikhonov regularization of the learning problem and with a mixed regularized learning technique. The accuracy of suboptimal solutions to weight-decay learning is estimated for connectionist models with an a priori fixed number of computational units.
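For orientation, a minimal sketch of the two regularized functionals being compared, in illustrative notation not taken from the article and assuming the usual least-squares loss on a sample \((x_1,y_1),\dots,(x_m,y_m)\): weight decay penalizes the parameter vector \(w\) of a connectionist model \(f_w\) with a fixed number of computational units,
\[
  \min_{w} \; \frac{1}{m}\sum_{i=1}^{m}\bigl(f_w(x_i)-y_i\bigr)^2 \;+\; \gamma\,\|w\|_2^2,
\]
whereas Tikhonov regularization penalizes the hypothesis \(f\) itself, typically through its norm in a reproducing kernel Hilbert space \(\mathcal{H}_K\),
\[
  \min_{f\in\mathcal{H}_K} \; \frac{1}{m}\sum_{i=1}^{m}\bigl(f(x_i)-y_i\bigr)^2 \;+\; \gamma\,\|f\|_{\mathcal{H}_K}^2,
\]
with \(\gamma>0\) the regularization parameter in both cases.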
Additional information
The authors were partially supported by a PRIN grant from the Italian Ministry for University and Research, project “Models and Algorithms for Robust Network Optimization”.
Cite this article
Gnecco, G., Sanguineti, M. The weight-decay technique in learning from data: an optimization point of view. Comput Manag Sci 6, 53–79 (2009). https://doi.org/10.1007/s10287-008-0072-5