Abstract
In recent years, driven by the explosion in data dimensionality, machine learning researchers have become interested not only in accuracy but also in scalability. Although the scalability of learning methods is a trending issue, the scalability of feature selection methods has not received the same amount of attention. This research analyzes the scalability of state-of-the-art feature selection methods belonging to the filter, embedded and wrapper approaches. For this purpose, several new measures are presented, based not only on accuracy but also on execution time and stability. Results on seven classical artificial datasets are presented and discussed, together with two case studies analyzing the particularities of microarray data and the effect of redundancy. To check whether the results can be generalized, we also include experiments on two real datasets. As expected, filters are the most scalable feature selection approach, with INTERACT, ReliefF and mRMR being the most accurate methods.
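The abstract refers to measures based on stability, i.e., how consistently a selector picks the same features across perturbed versions of the data. As a minimal illustrative sketch (not the paper's own measures), stability can be quantified as the average pairwise Jaccard similarity between the feature subsets selected on different data samples:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two sets of selected feature indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def selection_stability(subsets):
    """Average pairwise Jaccard similarity over the feature subsets
    selected on different data samples (1.0 = perfectly stable)."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical example: feature indices chosen in three runs
# of a selector over resampled training data.
runs = [[0, 2, 5, 7], [0, 2, 5, 9], [0, 3, 5, 7]]
print(round(selection_stability(runs), 3))  # → 0.511
```

Set-overlap measures like this suit selectors that return subsets; rankers (e.g., ReliefF or mRMR output orderings) are instead compared with rank correlations such as Spearman's or Kendall's, applied to the top-ranked features.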
Notes
The Colon Cancer dataset is available at http://datam.i2r.a-star.edu.sg/datasets/krbd.
The KDD Cup 99 dataset is available at http://kdd.ics.uci.edu/kddcup99/kddcup99.html.
Acknowledgements
This research has been financially supported in part by the Ministerio de Economía y Competitividad of the Spanish Government (Research Project TIN2015-65069-C2-1-R), by European Union FEDER funds and by the Consellería de Industria of the Xunta de Galicia (Research Project GRC2014/035). Financial support from the Xunta de Galicia (Centro singular de investigación de Galicia accreditation 2016–2019) and the European Union (European Regional Development Fund, ERDF) is gratefully acknowledged (Research Project ED431G/01).
Cite this article
Bolón-Canedo, V., Rego-Fernández, D., Peteiro-Barral, D. et al. On the scalability of feature selection methods on high-dimensional data. Knowl Inf Syst 56, 395–442 (2018). https://doi.org/10.1007/s10115-017-1140-3