On Understanding and Assessing Feature Selection Bias

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3581)

Abstract

Feature selection in high-dimensional biomedical data, such as gene expression arrays or biomedical spectra, constitutes an important step towards biomarker discovery. Controlling feature selection bias is considered a major issue for a realistic assessment of the feature selection process. We propose a theoretical, probabilistic framework for the analysis of selection bias. In particular, we derive the means of calculating the true selection error when the performance estimates of the feature subsets are mutually dependent and the distribution density of the true error is arbitrary. In an extensive series of experiments with real-world datasets, we demonstrate the utility of the theoretical derivations. We discuss the importance of understanding feature selection bias in the small sample size (n) / high dimensionality (p) situation typical of biomedical data (genomics, proteomics, spectroscopy).
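
To make the notion of selection bias concrete, here is a minimal simulation sketch in Python. It is not the framework proposed in the paper. It assumes that every candidate feature subset has the same true error, that the error estimates are mutually independent (the paper addresses the dependent case), and that each estimate is a binomial error count on n samples. The function name `simulate_selection_bias` and all parameter names are hypothetical and chosen only for illustration.

```python
import random


def simulate_selection_bias(n_samples=30, n_subsets=200, true_error=0.5,
                            n_trials=200, seed=0):
    """Estimate the optimistic bias of picking the best-looking feature subset.

    Assumptions (illustrative only, not taken from the paper):
      * every subset has the same true error `true_error`;
      * error estimates of different subsets are independent;
      * each estimate is the fraction of misclassified samples out of
        `n_samples`, i.e. a binomial proportion.
    """
    rng = random.Random(seed)
    total_apparent = 0.0
    for _ in range(n_trials):
        best_estimate = 1.0
        for _ in range(n_subsets):
            # Count misclassifications among n_samples independent test cases.
            n_errors = sum(rng.random() < true_error for _ in range(n_samples))
            best_estimate = min(best_estimate, n_errors / n_samples)
        # The selected subset is the one with the lowest estimated error.
        total_apparent += best_estimate
    mean_apparent = total_apparent / n_trials
    # Bias: how far the apparent error of the winner lies below its true error.
    return mean_apparent, true_error - mean_apparent


if __name__ == "__main__":
    for m in (1, 10, 200):
        apparent, bias = simulate_selection_bias(n_subsets=m)
        print(f"subsets={m:4d}  mean apparent error={apparent:.3f}  bias={bias:.3f}")
```

Under these assumptions, the gap between apparent and true error of the selected subset grows with the number of candidate subsets and shrinks as n increases, which is why the small n / large p setting is particularly exposed to this effect.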

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Raudys, Š., Baumgartner, R., Somorjai, R. (2005). On Understanding and Assessing Feature Selection Bias. In: Miksch, S., Hunter, J., Keravnou, E.T. (eds) Artificial Intelligence in Medicine. AIME 2005. Lecture Notes in Computer Science, vol 3581. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11527770_63

  • DOI: https://doi.org/10.1007/11527770_63

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-27831-3

  • Online ISBN: 978-3-540-31884-2

  • eBook Packages: Computer Science, Computer Science (R0)
