Statistical Themes and Lessons for Data Mining

Data Mining and Knowledge Discovery

Abstract

Data mining is on the interface of Computer Science and Statistics, utilizing advances in both disciplines to make progress in extracting information from large databases. It is an emerging field that has attracted much attention in a very short period of time. This article highlights some statistical themes and lessons that are directly relevant to data mining and attempts to identify opportunities where close cooperation between the statistical and computational communities might reasonably provide synergy for further progress in data analysis.

Cite this article

Glymour, C., Madigan, D., Pregibon, D. et al. Statistical Themes and Lessons for Data Mining. Data Mining and Knowledge Discovery 1, 11–28 (1997). https://doi.org/10.1023/A:1009773905005

  • DOI: https://doi.org/10.1023/A:1009773905005
