Ensemble feature selection for high dimensional data: a new method and a comparative study

Ben Brahim, Afef; Limam, Mohamed

doi:10.1007/s11634-017-0285-y

Regular Article
Published: 24 April 2017

Volume 12, pages 937–952, (2018)

Advances in Data Analysis and Classification

Afef Ben Brahim¹ &
Mohamed Limam²

Abstract

The curse of dimensionality is based on the fact that high dimensional data is often difficult to work with. A large number of features can increase the noise of the data and thus the error of a learning algorithm. Feature selection is a solution for such problems where there is a need to reduce the data dimensionality. Different feature selection algorithms may yield feature subsets that can be considered local optima in the space of feature subsets. Ensemble feature selection combines independent feature subsets and might give a better approximation to the optimal subset of features. We propose an ensemble feature selection approach based on feature selectors’ reliability assessment. It aims at providing a unique and stable feature selection without ignoring the predictive accuracy aspect. A classification algorithm is used as an evaluator to assign a confidence to features selected by ensemble members based on their associated classification performance. We compare our proposed approach to several existing techniques and to individual feature selection algorithms. Results show that our approach often improves classification performance and feature selection stability for high dimensional data sets.
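
The aggregation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' exact procedure: it assumes each ensemble member has already produced a feature subset, and that a classifier trained on that subset has yielded an accuracy that serves as the member's reliability. The function name `aggregate_feature_subsets`, the feature indices, and the reliability values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def aggregate_feature_subsets(subsets, reliabilities, k):
    """Combine feature subsets by a reliability-weighted vote.

    subsets       -- one list of feature indices per ensemble member
    reliabilities -- one score per member (assumed here to be the accuracy
                     of a classifier trained on that member's subset)
    k             -- number of features to keep in the final selection
    """
    if len(subsets) != len(reliabilities):
        raise ValueError("need exactly one reliability score per subset")

    # Each feature accumulates the reliability of every member that chose it.
    scores = defaultdict(float)
    for subset, weight in zip(subsets, reliabilities):
        for feature in set(subset):  # count a feature once per member
            scores[feature] += weight

    # Highest total confidence first; ties broken by feature index so the
    # result is deterministic.
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return ranked[:k]


# Hypothetical example: three selectors, each with an assumed accuracy.
subsets = [[0, 3, 7], [3, 7, 9], [1, 3, 8]]
reliabilities = [0.90, 0.75, 0.60]
selected = aggregate_feature_subsets(subsets, reliabilities, k=3)
print(selected)
```

In this toy input, feature 3 is chosen by all three selectors and feature 7 by the two most reliable ones, so both outrank features that only a single selector picked. Weighting votes by classification performance is one simple way to keep predictive accuracy in view while still producing a single feature subset.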

Author information

Authors and Affiliations

Université de Tunis, Tunis Business School, LARODEC, BP 65, 2059, BIR El Kassaa, Tunisia
Afef Ben Brahim
Dhofar University, Salalah, Sultanate of Oman
Mohamed Limam

Corresponding author

Correspondence to Afef Ben Brahim.

About this article

Cite this article

Ben Brahim, A., Limam, M. Ensemble feature selection for high dimensional data: a new method and a comparative study. Adv Data Anal Classif 12, 937–952 (2018). https://doi.org/10.1007/s11634-017-0285-y

Received: 15 January 2015
Revised: 30 October 2016
Accepted: 17 April 2017
Published: 24 April 2017
Issue Date: December 2018
DOI: https://doi.org/10.1007/s11634-017-0285-y
