ABSTRACT
We show that several important Bayesian bounds studied in machine learning, in both the batch and online settings, arise from an application of a simple compression lemma. In particular, we use the compression lemma to derive (i) PAC-Bayesian bounds in the batch setting, (ii) Bayesian log-loss bounds in the online setting, and (iii) Bayesian bounded-loss bounds in the online setting. Although each setting has different semantics for the prior, the posterior, and the loss, we show that the core bound argument is the same. The paper simplifies our understanding of several important and apparently disparate results, and brings to light a powerful tool for developing similar arguments for other methods.
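For concreteness, the compression lemma invoked above is, in its standard form, the convex-duality (Donsker–Varadhan) characterization of relative entropy. The notation below (posterior $P$, prior $Q$, measurable function $\phi$ on a hypothesis space $\Theta$) is illustrative and not fixed by the abstract:

```latex
% Compression lemma (Donsker--Varadhan form): for any measurable
% \phi : \Theta \to \mathbb{R} and any probability distributions
% P, Q on \Theta,
\[
  \mathbb{E}_{\theta \sim P}\!\left[\phi(\theta)\right]
  \;-\;
  \log \mathbb{E}_{\theta \sim Q}\!\left[e^{\phi(\theta)}\right]
  \;\le\;
  \mathrm{KL}(P \,\|\, Q),
\]
% with equality when dP/dQ \propto e^{\phi}.
```

Roughly, the batch and online bounds then follow by instantiating $\phi$ with a suitably scaled loss (or log-loss) and bounding the log-moment term on the left.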