Summary
The aim of this chapter is to present the main statistical issues in Data Mining (DM) and Knowledge Data Discovery (KDD) and to examine whether the traditional statistical approach and methods differ substantially from the new trend of KDD and DM. We address and emphasize some central issues of statistics that are highly relevant to DM and have much to offer it.
Copyright information
© 2009 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Benjamini, Y., Leshno, M. (2009). Statistical Methods for Data Mining. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09823-4_25
DOI: https://doi.org/10.1007/978-0-387-09823-4_25
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-09822-7
Online ISBN: 978-0-387-09823-4
eBook Packages: Computer Science, Computer Science (R0)