Abstract
In recent years, driven by the explosion in data dimensionality, machine learning researchers have become interested not only in accuracy but also in scalability. Although the scalability of learning methods is a trending issue, the scalability of feature selection methods has not received the same amount of attention. This research analyzes the scalability of state-of-the-art feature selection methods belonging to the filter, embedded and wrapper approaches. For this purpose, several new measures are presented, based not only on accuracy but also on execution time and stability. Results on seven classical artificial datasets are presented and discussed, together with two case studies analyzing the particularities of microarray data and the effect of redundancy. To check whether the results can be generalized, we also include experiments on two real datasets. As expected, filters are the most scalable feature selection approach, with INTERACT, ReliefF and mRMR being the most accurate methods.
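The abstract refers to measures based on stability, i.e., how consistently a selector picks the same features across perturbed versions of the data. As a minimal illustrative sketch (not the paper's own measures), stability can be quantified as the average pairwise Jaccard similarity between the feature subsets selected on different data samples:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two sets of selected feature indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def selection_stability(subsets):
    """Average pairwise Jaccard similarity over the feature subsets
    selected on different data samples (1.0 = perfectly stable)."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical example: feature indices chosen in three runs
# of a selector over resampled training data.
runs = [[0, 2, 5, 7], [0, 2, 5, 9], [0, 3, 5, 7]]
print(round(selection_stability(runs), 3))  # → 0.511
```

Set-overlap measures like this suit selectors that return subsets; rankers (e.g., ReliefF or mRMR output orderings) are instead compared with rank correlations such as Spearman's or Kendall's, applied to the top-ranked features.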
Notes
The Colon Cancer dataset is available at http://datam.i2r.a-star.edu.sg/datasets/krbd.
The KDD Cup 99 dataset is available at http://kdd.ics.uci.edu/kddcup99/kddcup99.html.
Acknowledgements
This research has been financially supported in part by the Ministerio de Economía y Competitividad of the Spanish Government (Research Project TIN2015-65069-C2-1-R), by European Union FEDER funds and by the Consellería de Industria of the Xunta de Galicia (Research Project GRC2014/035). Financial support from the Xunta de Galicia (Centro singular de investigación de Galicia accreditation 2016–2019) and the European Union (European Regional Development Fund, ERDF) is gratefully acknowledged (Research Project ED431G/01).
Cite this article
Bolón-Canedo, V., Rego-Fernández, D., Peteiro-Barral, D. et al. On the scalability of feature selection methods on high-dimensional data. Knowl Inf Syst 56, 395–442 (2018). https://doi.org/10.1007/s10115-017-1140-3