ABSTRACT
High dimensionality is one of the data quality problems that degrades the performance of machine learning models. Feature selection, which aims to identify and remove as many redundant and irrelevant features as possible, boosts the overall performance of models while reducing computational cost. However, choosing an appropriate feature selection method remains a major challenge, as no single selection criterion fits all datasets. It is therefore essential to comparatively analyze the performance of feature selection criteria across high-dimensional datasets with different characteristics, particularly large financial datasets whose features are highly correlated and redundant. In this paper, we explore nine feature selection criteria, typically categorized into two classes: (i) information-theoretical criteria and (ii) similarity-based criteria, over seven public financial datasets. To the best of our knowledge, no previous comprehensive empirical investigation has demonstrated the positive effects of feature selection criteria on financial data. Experimental results indicate that the information-theoretical methods suffer from high computation time on high-dimensional data (i.e., many features), while the similarity-based methods require significant computation on high-volume data (i.e., many samples).
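As an illustrative sketch (not taken from the paper), an information-theoretical filter criterion ranks each feature by its mutual information with the class label and keeps the top-scoring features. A minimal pure-Python version for discrete features might look like this; the helper names and the toy dataset are hypothetical:

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Estimate I(X;Y) for two discrete sequences from empirical frequencies."""
    n = len(x)
    px = Counter(x)
    py = Counter(y)
    pxy = Counter(zip(x, y))
    mi = 0.0
    for (xv, yv), c in pxy.items():
        p_joint = c / n
        # p_joint / (p(x) * p(y)) = (c/n) / ((px/n) * (py/n)) = c*n / (px*py)
        mi += p_joint * math.log2(p_joint * n * n / (px[xv] * py[yv]))
    return mi

def rank_features(X, y):
    """Score each feature column by mutual information with the label and
    return feature indices sorted from most to least relevant."""
    n_features = len(X[0])
    scores = [mutual_information([row[j] for row in X], y)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

# Toy binary dataset: feature 0 copies the label, feature 1 is uninformative.
X = [[0, 1], [0, 0], [1, 1], [1, 0], [0, 1], [1, 0]]
y = [0, 0, 1, 1, 0, 1]
print(rank_features(X, y))  # → [0, 1]: feature 0 ranks first
```

Filter criteria of this kind score features independently of any downstream classifier, which is what makes their cost profile (features vs. samples) the relevant comparison axis in the study above.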