DOI: 10.1145/1553374.1553415 · ICML Conference Proceedings · research-article

A majorization-minimization algorithm for (multiple) hyperparameter learning

Published: 14 June 2009

ABSTRACT

We present a general Bayesian framework for hyperparameter tuning in L2-regularized supervised learning models. Paradoxically, our algorithm works by first analytically integrating out the hyperparameters from the model. We find a local optimum of the resulting non-convex optimization problem efficiently using a majorization-minimization (MM) algorithm, in which the non-convex problem is reduced to a series of convex L2-regularized parameter estimation tasks. The principal appeal of our method is its simplicity: the updates for choosing the L2-regularized subproblems in each step are trivial to implement (or even perform by hand), and each subproblem can be efficiently solved by adapting existing solvers. Empirical results on a variety of supervised learning models show that our algorithm is competitive with both grid-search and gradient-based algorithms, but is more efficient and far easier to implement.
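The MM recipe the abstract describes can be made concrete in its simplest setting, ridge regression with a single shared hyperparameter. The sketch below is an illustration under assumed details, not the paper's exact formulation: it posits a Gamma(a, b) hyperprior on the regularization precision, so integrating it out leaves a concave log penalty on ||w||^2, and each MM step majorizes that log by its tangent line, reducing the step to an ordinary closed-form ridge solve.

```python
# Minimal sketch of the MM idea, applied to ridge regression.
# The hyperprior parameters `a` and `b`, the single shared
# hyperparameter, and the closed-form subproblem solver are
# assumptions for illustration only.
import numpy as np

def mm_hyperparameter_learning(X, y, a=1.0, b=1.0, n_iters=50):
    """Learn the L2 hyperparameter by majorization-minimization.

    Integrating a Gamma(a, b) hyperprior out of a Gaussian prior on w
    leaves a penalty proportional to (a + d/2) * log(b + ||w||^2 / 2).
    Each MM step upper-bounds the concave log by its tangent, turning
    the step into standard ridge regression with an effective weight
    `lam` whose update is trivial to compute by hand.
    """
    n, d = X.shape
    lam = 1.0          # initial effective L2 weight
    w = np.zeros(d)
    for _ in range(n_iters):
        # Convex subproblem: ridge regression with the current weight.
        w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        # MM update: slope of the tangent majorizer at the current w.
        lam = (a + d / 2.0) / (b + 0.5 * np.dot(w, w))
    return w, lam

# Usage on synthetic data; the learned lam replaces a grid search
# over regularization strengths.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
w_true = rng.standard_normal(10)
y = X @ w_true + 0.1 * rng.standard_normal(100)
w_hat, lam_hat = mm_hyperparameter_learning(X, y)
```

Note how the sketch mirrors the abstract's claim of simplicity: the hyperparameter update is a one-line formula, and the inner subproblem is exactly the L2-regularized estimation task an existing solver already handles.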

Published in

ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning
June 2009, 1331 pages
ISBN: 9781605585161
DOI: 10.1145/1553374
Copyright © 2009 by the author(s)/owner(s).
Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 140 of 548 submissions, 26%
