Abstract
The technique known as “weight decay” in the literature on learning from data is investigated with tools from regularization theory. Weight-decay regularization is compared with Tikhonov regularization of the learning problem and with a mixed regularized learning technique. The accuracy of suboptimal solutions to weight-decay learning is estimated for connectionist models with an a priori fixed number of computational units.
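For orientation, a minimal sketch of the two regularized functionals being compared, in illustrative notation not taken from the article and assuming the usual least-squares loss on a sample \((x_1,y_1),\dots,(x_m,y_m)\): weight decay penalizes the parameter vector \(w\) of a connectionist model \(f_w\) with a fixed number of computational units,
\[
  \min_{w} \; \frac{1}{m}\sum_{i=1}^{m}\bigl(f_w(x_i)-y_i\bigr)^2 \;+\; \gamma\,\|w\|_2^2,
\]
whereas Tikhonov regularization penalizes the hypothesis \(f\) itself, typically through its norm in a reproducing kernel Hilbert space \(\mathcal{H}_K\),
\[
  \min_{f\in\mathcal{H}_K} \; \frac{1}{m}\sum_{i=1}^{m}\bigl(f(x_i)-y_i\bigr)^2 \;+\; \gamma\,\|f\|_{\mathcal{H}_K}^2,
\]
with \(\gamma>0\) the regularization parameter in both cases.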
Additional information
The authors were partially supported by a PRIN grant from the Italian Ministry for University and Research, project “Models and Algorithms for Robust Network Optimization”.
Cite this article
Gnecco, G., Sanguineti, M. The weight-decay technique in learning from data: an optimization point of view. Comput Manag Sci 6, 53–79 (2009). https://doi.org/10.1007/s10287-008-0072-5