Abstract
Feature selection in high-dimensional biomedical data, such as gene expression arrays or biomedical spectra, constitutes an important step towards biomarker discovery. Controlling feature selection bias is considered a major issue for a realistic assessment of the feature selection process. We propose a theoretical, probabilistic framework for the analysis of selection bias. In particular, we derive the means of calculating the true selection error when the performance estimates of the feature subsets are mutually dependent and the distribution density of the true error is arbitrary. In an extensive series of experiments with real-world datasets, we demonstrate the utility of the theoretical derivations. We discuss the importance of understanding feature selection bias in the small sample size (n) / high dimensionality (p) situation typical of biomedical data (genomics, proteomics, spectroscopy).
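The effect described in the abstract can be illustrated with a small simulation. The sketch below is not the framework derived in the paper: it is a simplified toy case that assumes independent error estimates (the paper treats mutually dependent ones) and a fixed true error of 0.5 for every candidate feature subset, i.e. features carrying no class information. The function name `simulate_selection_bias` and all parameter values are hypothetical choices made for illustration.

```python
import random


def simulate_selection_bias(n_samples=40, n_subsets=200, n_trials=300,
                            true_error=0.5, seed=0):
    """Toy estimate of the optimistic bias caused by picking the best subset.

    Assumptions (illustrative only, not taken from the paper):
    - every candidate feature subset has the same true error `true_error`;
    - the error estimate of each subset is an independent binomial
      proportion computed on `n_samples` validation samples.
    """
    rng = random.Random(seed)
    apparent_errors = []
    for _ in range(n_trials):
        # Estimated error of each candidate subset on the validation set,
        # followed by selection of the subset that looks best.
        best_estimate = min(
            sum(rng.random() < true_error for _ in range(n_samples)) / n_samples
            for _ in range(n_subsets)
        )
        apparent_errors.append(best_estimate)
    mean_apparent = sum(apparent_errors) / n_trials
    # Bias = true error of the selected subset minus its apparent error.
    return mean_apparent, true_error - mean_apparent


mean_apparent, bias = simulate_selection_bias()
print(f"mean apparent error of the selected subset: {mean_apparent:.3f}")
print(f"optimistic selection bias: {bias:.3f}")
```

Under these assumptions the selected subset appears clearly better than chance although its true error is still 0.5, and the gap grows with the number of candidate subsets and shrinks with the sample size. Calling the function with `n_subsets=1` gives a bias close to zero, since no selection takes place.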
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Raudys, Š., Baumgartner, R., Somorjai, R. (2005). On Understanding and Assessing Feature Selection Bias. In: Miksch, S., Hunter, J., Keravnou, E.T. (eds) Artificial Intelligence in Medicine. AIME 2005. Lecture Notes in Computer Science, vol 3581. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11527770_63
DOI: https://doi.org/10.1007/11527770_63
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27831-3
Online ISBN: 978-3-540-31884-2
eBook Packages: Computer Science (R0)