
A Simple Trick for Estimating the Weight Decay Parameter

  • Chapter
Neural Networks: Tricks of the Trade

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 7700)

Abstract

We present a simple trick for obtaining an approximate estimate of the weight decay parameter λ. The method combines early stopping and weight decay into the estimate

\( \hat\lambda = \|\nabla E(W_{es})\| / (2\|W_{es}\|), \)

where \(W_{es}\) is the set of weights at the early stopping point, and \(E(W)\) is the training data fit error.
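As a minimal sketch of the estimate above (the helper name and the toy quadratic error are hypothetical, not from the chapter), \(\hat\lambda\) can be computed directly from the weight vector and the error gradient at the early stopping point:

```python
import numpy as np

def estimate_weight_decay(grad_E, W_es):
    """lambda-hat = ||grad E(W_es)|| / (2 ||W_es||)."""
    return np.linalg.norm(grad_E) / (2.0 * np.linalg.norm(W_es))

# Toy illustration: a quadratic fit error E(W) = 0.5 * ||W - W_star||^2,
# whose gradient is simply W - W_star.
W_star = np.array([1.0, -2.0, 0.5])   # hypothetical error minimum
W_es = np.array([0.8, -1.5, 0.4])     # hypothetical early-stopping weights
grad = W_es - W_star                  # gradient of E at W_es

lam_hat = estimate_weight_decay(grad, W_es)
```

In practice `grad` would be the back-propagated gradient of the training error at the early-stopping weights; no further training or validation runs are needed, which is why the estimate is so cheap compared with a cross-validation search over λ.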

The estimate is demonstrated and compared to the standard cross-validation procedure for λ selection on one synthetic and four real-life data sets. The result is that \(\hat\lambda\) is as good an estimator of the optimal weight decay parameter as the standard search estimate, but orders of magnitude quicker to compute.

The results also show that weight decay can produce solutions that are significantly superior to committees of networks trained with early stopping.

Previously published in: Orr, G.B. and Müller, K.-R. (Eds.): LNCS 1524, ISBN 978-3-540-65311-0 (1998).





Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Rögnvaldsson, T.S. (2012). A Simple Trick for Estimating the Weight Decay Parameter. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol. 7700. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35289-8_6


  • DOI: https://doi.org/10.1007/978-3-642-35289-8_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35288-1

  • Online ISBN: 978-3-642-35289-8

  • eBook Packages: Computer Science, Computer Science (R0)
