Abstract
Data mining is on the interface of Computer Science and Statistics, utilizing advances in both disciplines to make progress in extracting information from large databases. It is an emerging field that has attracted much attention in a very short period of time. This article highlights some statistical themes and lessons that are directly relevant to data mining and attempts to identify opportunities where close cooperation between the statistical and computational communities might reasonably provide synergy for further progress in data analysis.
Cite this article
Glymour, C., Madigan, D., Pregibon, D. et al. Statistical Themes and Lessons for Data Mining. Data Mining and Knowledge Discovery 1, 11–28 (1997). https://doi.org/10.1023/A:1009773905005
DOI: https://doi.org/10.1023/A:1009773905005