Statistical Methods for Data Mining

Data Mining and Knowledge Discovery Handbook

Summary

The aim of this chapter is to present the main statistical issues in Data Mining (DM) and Knowledge Discovery in Databases (KDD), and to examine whether the traditional statistical approach and its methods differ substantially from the newer trend of KDD and DM. We address and emphasize some central issues of statistics that are highly relevant to DM and have much to offer it.

Author information

Corresponding author

Correspondence to Yoav Benjamini.

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Benjamini, Y., Leshno, M. (2009). Statistical Methods for Data Mining. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09823-4_25

  • DOI: https://doi.org/10.1007/978-0-387-09823-4_25

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-09822-7

  • Online ISBN: 978-0-387-09823-4

  • eBook Packages: Computer Science, Computer Science (R0)
