Metric entropy and minimax risk in classification

  • Pattern Matching and Learning
  • Chapter

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1261)

Abstract

We apply recent results on the minimax risk in density estimation to the related problem of pattern classification. The notion of loss we seek to minimize is an information theoretic measure of how well we can predict the classification of future examples, given the classification of previously seen examples. We give an asymptotic characterization of the minimax risk in terms of the metric entropy properties of the class of distributions that might be generating the examples. We then use these results to characterize the minimax risk in the special case of noisy two-valued classification problems in terms of the Assouad density and the Vapnik-Chervonenkis dimension.
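
To make the quantities in the abstract concrete, here is a minimal sketch, in standard notation rather than the chapter's own, of the cumulative log-loss (relative entropy) risk, its minimax value over the class of candidate distributions, and the metric entropy appearing in the characterization; the symbols Z_t, \hat{P}_t and the metric d are illustrative assumptions, not taken from the chapter.

% A sketch under assumed notation: Z_1, ..., Z_n are examples drawn i.i.d. from an
% unknown distribution P in a known class \mathcal{P}, and \hat{P}_t is the learner's
% predictive distribution for Z_t after observing Z_1, ..., Z_{t-1}.

% Cumulative log-loss (relative entropy) risk of a prediction strategy \hat{P} against P:
\[
  R_n(\hat{P}, P) \;=\; \mathbb{E}_P\!\left[ \sum_{t=1}^{n}
      \log \frac{P(Z_t)}{\hat{P}_t(Z_t \mid Z_1, \dots, Z_{t-1})} \right]
\]

% Minimax cumulative risk over the class:
\[
  r_n(\mathcal{P}) \;=\; \inf_{\hat{P}} \; \sup_{P \in \mathcal{P}} \; R_n(\hat{P}, P)
\]

% Metric entropy of \mathcal{P} at scale \epsilon under a metric d
% (Hellinger distance is the usual choice in this setting):
\[
  \mathcal{H}_d(\epsilon, \mathcal{P}) \;=\; \log N_d(\epsilon, \mathcal{P}),
  \qquad
  N_d(\epsilon, \mathcal{P}) \;=\; \min\bigl\{ |S| : S \subseteq \mathcal{P},\;
      \forall P \in \mathcal{P} \; \exists Q \in S,\; d(P, Q) \le \epsilon \bigr\}
\]

In this notation, the characterization announced above relates the growth of r_n(\mathcal{P}) in n to the growth of \mathcal{H}_d(\epsilon, \mathcal{P}) as \epsilon tends to 0.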




Editor information

Jan Mycielski, Grzegorz Rozenberg, Arto Salomaa


Copyright information

© 1997 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Haussler, D., Opper, M. (1997). Metric entropy and minimax risk in classification. In: Mycielski, J., Rozenberg, G., Salomaa, A. (eds) Structures in Logic and Computer Science. Lecture Notes in Computer Science, vol 1261. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-63246-8_13

  • DOI: https://doi.org/10.1007/3-540-63246-8_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-63246-7

  • Online ISBN: 978-3-540-69242-3

  • eBook Packages: Springer Book Archive
