Statistical Methods for Data Mining

Benjamini, Yoav; Leshno, Moshe

doi:10.1007/978-0-387-09823-4_25

Summary

The aim of this chapter is to present the main statistical issues in Data Mining (DM) and Knowledge Data Discovery (KDD), and to examine whether traditional statistical approaches and methods differ substantially from the new trend of KDD and DM. We address and emphasize some central issues of statistics that are highly relevant to DM and have much to offer it.

Author information

Authors and Affiliations

Department of Statistics, School of Mathematical Sciences, Sackler Faculty for Exact Sciences, Tel Aviv University, Tel-Aviv, Israel
Yoav Benjamini
Faculty of Management and Sackler Faculty of Medicine, Tel Aviv University, Tel-Aviv, Israel
Moshe Leshno

Corresponding author

Correspondence to Yoav Benjamini.

Editor information

Editors and Affiliations

Dept. Industrial Engineering, Tel Aviv University, Ramat Aviv, 69978, Israel
Oded Maimon
Dept. Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel
Lior Rokach

About this chapter

Cite this chapter

Benjamini, Y., Leshno, M. (2009). Statistical Methods for Data Mining. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09823-4_25

DOI: https://doi.org/10.1007/978-0-387-09823-4_25
Published: 07 July 2010
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-09822-7
Online ISBN: 978-0-387-09823-4
eBook Packages: Computer Science, Computer Science (R0)
