Abstract
We apply recent results on the minimax risk in density estimation to the related problem of pattern classification. The notion of loss we seek to minimize is an information-theoretic measure of how well we can predict the classification of future examples, given the classification of previously seen examples. We give an asymptotic characterization of the minimax risk in terms of the metric entropy properties of the class of distributions that might be generating the examples. We then use these results to characterize the minimax risk in the special case of noisy two-valued classification problems in terms of the Assouad density and the Vapnik-Chervonenkis dimension.
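To make the notion of loss concrete, here is a minimal sketch in our own notation, assuming the loss is the cumulative log-loss (relative entropy) regret against the true distribution, as is standard in this line of work; the symbols $R_n$, $\hat P$, and $P_\theta$ are our labels, not the chapter's.

\[
R_n(\hat P, \theta) \;=\; \mathbb{E}\left[\,\sum_{t=1}^{n} \ln \frac{P_\theta(y_t \mid x_t)}{\hat P\bigl(y_t \mid x_t,\ (x_1,y_1),\dots,(x_{t-1},y_{t-1})\bigr)}\,\right],
\]

where the examples $(x_t, y_t)$ are drawn independently from a distribution indexed by an unknown parameter $\theta$, $\hat P$ is the learner's predictive distribution for the next classification given the examples seen so far, and the expectation is over the draw of the examples. The minimax risk would then be $\inf_{\hat P} \sup_{\theta} R_n(\hat P, \theta)$, whose asymptotics the abstract characterizes via metric entropy, and, for noisy two-valued classification, via the Assouad density and the Vapnik-Chervonenkis dimension.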
Copyright information
© 1997 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Haussler, D., Opper, M. (1997). Metric entropy and minimax risk in classification. In: Mycielski, J., Rozenberg, G., Salomaa, A. (eds) Structures in Logic and Computer Science. Lecture Notes in Computer Science, vol 1261. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-63246-8_13
DOI: https://doi.org/10.1007/3-540-63246-8_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63246-7
Online ISBN: 978-3-540-69242-3