ABSTRACT
Many learning algorithms rely on the curvature (in particular, strong convexity) of regularized objective functions to provide good theoretical performance guarantees. In practice, however, the regularization penalty that gives the best test-set performance may yield an objective with little or even no curvature. In these cases, algorithms designed specifically for strongly convex regularized objectives often either fail outright or require modifications that substantially compromise performance.
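For concreteness, the standard regularized risk objective these methods target has the form below; this is the usual formulation, not a display taken from the paper itself. The (λ/2)‖w‖² term is what supplies strong convexity of modulus λ, so guarantees that scale as O((G²/λ) log T), such as the logarithmic regret bounds of Hazan et al. (2007), degenerate as the best-performing penalty λ shrinks toward zero.

```latex
% Standard lambda-strongly convex regularized risk minimization:
% the (lambda/2)||w||^2 term contributes curvature lambda, so
% O((G^2/lambda) log T)-style guarantees blow up as lambda -> 0.
\[
  \min_{w} \; f(w)
  \;=\; \frac{\lambda}{2}\,\|w\|^{2}
  \;+\; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(w;\, x_i, y_i\bigr)
\]
```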
We present new online and batch algorithms for training a variety of supervised learning models (such as SVMs, logistic regression, structured prediction models, and CRFs) under conditions where the optimal choice of regularization parameter results in functions with low curvature. We employ a technique called proximal regularization, in which the original learning problem is solved via a sequence of modified optimization tasks whose objectives are chosen to have greater curvature than the original problem. Theoretically, our algorithms achieve low regret bounds in the online setting and fast convergence in the batch setting. Experimentally, our algorithms improve upon state-of-the-art techniques, including Pegasos and bundle methods, on medium- and large-scale SVM and structured learning tasks.
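To illustrate the core idea, here is a minimal sketch of an online subgradient method for a linear SVM in which each round optimizes the proximally regularized objective f_t(w) + (μ/2)‖w − w_t‖². This is not the authors' exact algorithm: the function name `proximal_online_svm`, the parameter `mu`, the (λ + μ)t step-size schedule, and the toy data are all illustrative assumptions.

```python
import numpy as np

def proximal_online_svm(X, Y, lam=1e-6, mu=1.0):
    """Sketch: at round t, take a subgradient step on the modified objective
        f_t(w) + (mu/2) * ||w - w_t||^2,
    where f_t(w) = (lam/2)*||w||^2 + max(0, 1 - Y[t] * <w, X[t]>).
    Even when lam is tiny (little curvature in the original problem),
    the added proximal term makes each subproblem (lam + mu)-strongly convex."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n + 1):
        x, y = X[t - 1], Y[t - 1]
        # Subgradient of f_t at the current iterate w_t. The proximal term
        # (mu/2)*||w - w_t||^2 has zero gradient at w = w_t, so it leaves the
        # step direction unchanged; its curvature only shrinks the step size.
        g = lam * w
        if y * w.dot(x) < 1.0:
            g = g - y * x
        # Step size driven by the accumulated curvature (lam + mu) * t, the
        # schedule suggested by strong-convexity analyses; with lam alone,
        # the 1/(lam * t) schedule would blow up as lam -> 0.
        eta = 1.0 / ((lam + mu) * t)
        w = w - eta * g
    return w

if __name__ == "__main__":
    # Toy usage on nearly separable synthetic data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    Y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))
    w = proximal_online_svm(X, Y)
    print("train accuracy:", np.mean(np.sign(X @ w) == Y))
```

Note the design point the sketch makes concrete: because the proximal term is centered at the current iterate, it alters only the effective step size, so the added curvature comes essentially for free in each update.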
REFERENCES
- Abernethy, J., Bartlett, P. L., Rakhlin, A., & Tewari, A. (2008). Optimal strategies and minimax lower bounds for online convex games. Proceedings of the 21st Annual Conference on Computational Learning Theory.
- Bartlett, P., Hazan, E., & Rakhlin, A. (2008). Adaptive online gradient descent. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in Neural Information Processing Systems 20, 65--72. MIT Press.
- Chapelle, O., Le, Q. V., & Smola, A. J. (2007). Large margin optimization of ranking measures. NIPS Workshop: Machine Learning for Web Search.
- Do, C. B., Woods, D. A., & Batzoglou, S. (2006). CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22, e90--e98.
- Hazan, E., Agarwal, A., & Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69, 169--192.
- Joachims, T. (2006). Training linear SVMs in linear time. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 217--226).
- Kiwiel, K. C. (1983). Proximity control in bundle methods for convex nondifferentiable minimization. Mathematical Programming, 27, 320--341.
- Lemaréchal, C., Nemirovskii, A., & Nesterov, Y. (1995). New variants of bundle methods. Mathematical Programming, 69, 111--147.
- Schramm, H., & Zowe, J. (1992). A version of the bundle idea for minimizing a nonsmooth function: conceptual idea, convergence analysis, numerical results. SIAM Journal on Optimization, 2, 121--152.
- Shalev-Shwartz, S., & Kakade, S. M. (2009). Mind the duality gap: Logarithmic regret algorithms for online optimization. In D. Koller, D. Schuurmans, Y. Bengio and L. Bottou (Eds.), Advances in Neural Information Processing Systems 21, 1457--1464. MIT Press.
- Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. Proceedings of the 24th Annual International Conference on Machine Learning (pp. 807--814).
- Smola, A., Vishwanathan, S. V. N., & Le, Q. (2008). Bundle methods for machine learning. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in Neural Information Processing Systems 20, 1377--1384. MIT Press.
- Teo, C. H., Smola, A., Vishwanathan, S. V., & Le, Q. V. (2007). A scalable modular convex solver for regularized risk minimization. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 727--736).
- Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. Proceedings of the 20th Annual International Conference on Machine Learning.
Index Terms
- Proximal regularization for online and batch learning