Statistical Themes and Lessons for Data Mining

Glymour, Clark; Madigan, David; Pregibon, Daryl; Smyth, Padhraic

doi:10.1023/A:1009773905005


Published: March 1997

Data Mining and Knowledge Discovery, Volume 1, pages 11–28 (1997)



Abstract

Data mining is on the interface of Computer Science and Statistics, utilizing advances in both disciplines to make progress in extracting information from large databases. It is an emerging field that has attracted much attention in a very short period of time. This article highlights some statistical themes and lessons that are directly relevant to data mining and attempts to identify opportunities where close cooperation between the statistical and computational communities might reasonably provide synergy for further progress in data analysis.


Author information

Authors and Affiliations

Clark Glymour: Department of Cognitive Psychology, Carnegie Mellon University, Pittsburgh, PA 15213
David Madigan: Department of Statistics, University of Washington, Box 354322, Seattle, WA 98195
Daryl Pregibon: AT&T Laboratories, Statistics Research, Murray Hill, NJ 07974
Padhraic Smyth: Information and Computer Science, University of California, Irvine, CA 92717


About this article

Cite this article

Glymour, C., Madigan, D., Pregibon, D. et al. Statistical Themes and Lessons for Data Mining. Data Mining and Knowledge Discovery 1, 11–28 (1997). https://doi.org/10.1023/A:1009773905005


Issue Date: March 1997
DOI: https://doi.org/10.1023/A:1009773905005
