ABSTRACT
We present a general Bayesian framework for hyperparameter tuning in L2-regularized supervised learning models. Paradoxically, our algorithm works by first analytically integrating the hyperparameters out of the model. We then find a local optimum of the resulting non-convex optimization problem efficiently using a majorization-minimization (MM) algorithm, which reduces the non-convex problem to a series of convex L2-regularized parameter estimation tasks. The principal appeal of our method is its simplicity: the updates that define the L2-regularized subproblem at each step are trivial to implement (or even perform by hand), and each subproblem can be solved efficiently by adapting existing solvers. Empirical results on a variety of supervised learning models show that our algorithm is competitive with both grid search and gradient-based methods, while being more efficient and far easier to implement.
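To make the MM recipe concrete, the sketch below shows one plausible instantiation for ridge-style regression: integrating a Gamma hyperprior over per-coefficient L2 hyperparameters leaves a log penalty of the form sum_j a * log(b + w_j^2 / 2), and each MM step majorizes every log term by its tangent line at the current iterate, so the surrogate problem is an ordinary weighted ridge regression. The function names, the hyperprior constants `a` and `b`, and the closed-form solve are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def weighted_ridge(X, y, lam):
    """Solve the convex subproblem min_w ||X w - y||^2 + sum_j lam_j * w_j^2
    in closed form; any existing L2 solver could be substituted here."""
    return np.linalg.solve(X.T @ X + np.diag(lam), X.T @ y)

def mm_hyperparameter_learning(X, y, a=1.0, b=1.0, n_iter=100, tol=1e-8):
    """MM sketch: with the hyperparameters integrated out, the regularizer is
    assumed to be sum_j a * log(b + w_j**2 / 2); each log term is majorized
    by its tangent line at the current iterate, which leaves a weighted
    ridge problem whose per-coefficient weights are the tangent slopes."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Trivial "hyperparameter" update: the tangent slope of each log term.
        lam = a / (b + 0.5 * w ** 2)
        w_next = weighted_ridge(X, y, lam)
        if np.max(np.abs(w_next - w)) < tol:
            return w_next
        w = w_next
    return w

# Tiny usage example on synthetic data (all values are illustrative).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))
    w_true = np.zeros(10)
    w_true[:3] = [2.0, -1.0, 0.5]
    y = X @ w_true + 0.1 * rng.standard_normal(200)
    print(mm_hyperparameter_learning(X, y))
```

Note how the hyperparameter update is a one-line formula, while all the real work happens inside a standard ridge solve; this is why existing L2 solvers can be reused essentially unchanged.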