Abstract
In this paper we are interested in discrete prediction problems in a decision-theoretic setting, where the task is to compute the predictive distribution over a finite set of possible alternatives. This question is first addressed in a general Bayesian framework, where we consider a set of probability distributions defined by some parametric model class. Given a prior distribution on the model parameters and a set of sample data, one possible approach for determining a predictive distribution is to fix the parameters to the instantiation with the maximum a posteriori probability. A more accurate predictive distribution can be obtained by computing the evidence (marginal likelihood), i.e., the integral over all the individual parameter instantiations. As an alternative to these two approaches, we demonstrate how to use Rissanen's new definition of stochastic complexity for determining predictive distributions, and show how the evidence predictive distribution with Jeffreys' prior approaches the new stochastic complexity predictive distribution in the limit as the amount of sample data increases. To compare the alternative approaches in practice, each of the predictive distributions discussed is instantiated in the Bayesian network model family case. In particular, to determine Jeffreys' prior for this model family, we show how to compute the (expected) Fisher information matrix for a fixed but arbitrary Bayesian network structure. In the empirical part of the paper the predictive distributions are compared using the simple tree-structured Naive Bayes model, which is adopted in the experiments for computational reasons. The experimentation with several public domain classification datasets suggests that the evidence approach produces the most accurate predictions in the log-score sense. The evidence-based methods are also quite robust in the sense that they predict surprisingly well even when only a small fraction of the full training set is used.
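To make the distinction between the MAP and evidence approaches concrete, the following is a minimal sketch for the simplest possible case: a single multinomial variable with a symmetric Dirichlet prior (the symmetric Dirichlet with hyperparameter 1/2 is Jeffreys' prior in this one-variable case). The function names are illustrative, not taken from the paper; the MAP predictive plugs in the mode of the Dirichlet posterior, while the evidence predictive is the ratio of marginal likelihoods, which for the Dirichlet-multinomial reduces to the posterior mean.

```python
import numpy as np

def map_predictive(counts, alpha):
    """Predictive distribution obtained by fixing the parameters to the
    MAP instantiation (mode of the Dirichlet posterior).
    Valid when every count + alpha >= 1."""
    counts = np.asarray(counts, dtype=float)
    k = counts.size
    return (counts + alpha - 1.0) / (counts.sum() + k * (alpha - 1.0))

def evidence_predictive(counts, alpha):
    """Predictive distribution obtained by integrating over all parameter
    instantiations; for Dirichlet-multinomial this is the posterior mean."""
    counts = np.asarray(counts, dtype=float)
    k = counts.size
    return (counts + alpha) / (counts.sum() + k * alpha)

def log_score(pred, outcome):
    """Log-loss of a predictive distribution on an observed outcome
    (the log-score criterion used in the paper's comparisons)."""
    return -np.log(pred[outcome])

# Example: 4 binary observations with counts [3, 1], Jeffreys' prior alpha = 1/2.
counts, alpha = [3, 1], 0.5
print(map_predictive(counts, alpha))       # plug-in MAP predictive
print(evidence_predictive(counts, alpha))  # marginal-likelihood predictive
```

With small samples the two differ noticeably (here the MAP predictive is more extreme than the evidence predictive); as the counts grow, both converge to the empirical frequencies.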
References
Baxter R. and Oliver J. 1994. MDL and MML: similarities and differences. Technical Report 207, Department of Computer Science, Monash University.
Berger J. 1985. Statistical Decision Theory and Bayesian Analysis. New York, Springer-Verlag.
Bernardo J. and Smith A. 1994. Bayesian theory. John Wiley.
Blake C., Keogh E., and Merz C. 1998. UCI repository of machine learning databases. URL: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Castillo E., Gutiérrez J., and Hadi A. 1997. Expert Systems and Probabilistic Network Models, Monographs in Computer Science. New York, NY, Springer-Verlag.
Clarke B. and Barron A. 1990. Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory 36(3): 453–471.
Clarke B. and Barron A. 1994. Jeffrey's Prior is asymptotically least favorable under entropy risk. Journal of Statistical Planning and Inference 41: 37–60.
Cooper G. 1990. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence 42(2–3): 393–405.
Cooper G. and Herskovits E. 1992. A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9: 309–347.
Cover T. and Thomas J. 1991. Elements of Information Theory. New York, NY, John Wiley & Sons.
DeGroot M. 1970. Optimal Statistical Decisions. McGraw-Hill.
Dom B. 1995. MDL estimation with small sample sizes including an application to the problem of segmenting binary strings using Bernoulli models. Technical Report RJ 9997 (89085), IBM Research Division, Almaden Research Center.
Friedman N. and Goldszmidt M. 1996. Building classifiers using Bayesian networks. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, Oregon, pp. 1277–1284.
Geiger D. and Heckerman D. 1994. A characterization of the Dirichlet distribution through global and local independence. Technical Report MSR-TR-94-16, Microsoft Research.
Grünwald P. 1998. The minimum description length principle and reasoning under uncertainty. Ph.D. Thesis, CWI, ILLC Dissertation Series 1998-03.
Grünwald P., Kontkanen P., Myllymäki P., Silander T., and Tirri H. 1998. Minimum encoding approaches for predictive modeling. In: Cooper G. and Moral S. (Eds.), Proceedings of the 14th International Conference on Uncertainty in Artificial Intelligence (UAI'98), Madison, WI, pp. 183–192.
Heckerman D., Geiger D., and Chickering D. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20(3): 197–243.
Jensen F. 1996. An Introduction to Bayesian Networks. London, UCL Press.
Kass R. and Vos P. 1997. Geometrical Foundations of Asymptotic Inference. Wiley Interscience.
Kontkanen P., Myllymäki P., Silander T., Tirri H., and Grünwald P. 1997. Comparing predictive inference methods for discrete domains. In: Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, Florida, pp. 311–318.
Kontkanen P., Myllymäki P., Silander T., Tirri H., and Valtonen K. 1999. Exploring the robustness of Bayesian and information-theoretic methods for predictive inference. In: Heckerman D. and Whittaker J. (Eds.), Proceedings of Uncertainty '99: The Seventh International Workshop on Artificial Intelligence and Statistics, Morgan Kaufmann Publishers, pp. 231–236.
Langley P. and Sage S. 1994 Induction of selective Bayesian classifiers. In: Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, Seattle, Oregon, pp. 399–406.
Michie D., Spiegelhalter D., and Taylor C. (Eds.), 1994. Machine Learning, Neural and Statistical Classification, London, Ellis Horwood.
Neapolitan R. 1990. Probabilistic Reasoning in Expert Systems. New York, NY, John Wiley & Sons.
Pearl J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA, Morgan Kaufmann Publishers.
Rissanen J. 1987. Stochastic complexity. Journal of the Royal Statistical Society 49(3): 223–239 and 252–265.
Rissanen J. 1989. Stochastic Complexity in Statistical Inquiry. New Jersey, World Scientific Publishing Company.
Rissanen J. 1996. Fisher information and stochastic complexity. IEEE Transactions on Information Theory 42(1): 40–47.
Shachter R. 1988. Probabilistic inference and influence diagrams. Operations Research 36(4): 589–604.
Takeuchi J. and Barron A. 1998. Asymptotically minimax regret by Bayes mixtures. In: 1998 IEEE International Symposium on Information Theory. Cambridge, MA, August 1998.
Thiesson B. 1995. Score and information for recursive exponential models with incomplete data. Technical Report R-95-2020, Aalborg University, Institute for Electronic Systems, Department of Mathematics and Computer Science.
Tirri H., Kontkanen P., and Myllymäki P. 1996. Probabilistic instance-based learning. In: Saitta L. (Ed.), Machine Learning: Proceedings of the Thirteenth International Conference (ICML'96), pp. 507–515.
Wallace C. and Boulton D. 1968. An information measure for classification. Computer Journal 11: 185–194.
Wallace C. and Freeman P. 1987. Estimation and inference by compact coding. Journal of the Royal Statistical Society 49(3): 240–265.
Wallace C., Korb K., and Dai H. 1996a. Causal discovery via MML. Technical Report 96/254, Department of Computer Science, Monash University.
Wallace C., Korb K., and Dai H. 1996b. Causal discovery via MML. In: Saitta L. (Ed.), Machine Learning: Proceedings of the Thirteenth International Conference (ICML'96), pp. 516–524.
Kontkanen, P., Myllymäki, P., Silander, T. et al. On predictive distributions and Bayesian networks. Statistics and Computing 10, 39–54 (2000). https://doi.org/10.1023/A:1008984400380