DOI: 10.1145/3568562.3568604
Research article

Empirical Analysis of Filter Feature Selection Criteria on Financial Datasets

Published: 01 December 2022

ABSTRACT

High dimensionality is one of the data quality problems that affect the performance of machine learning models. Feature selection, which aims to identify and remove as many redundant and irrelevant features as possible, boosts the overall performance of models while reducing computational cost. However, choosing an appropriate feature selection method remains a major challenge, as no single selection criterion fits all datasets. It is therefore essential to comparatively analyze the performance of feature selection criteria across high-dimensional datasets with different characteristics, particularly large financial datasets whose features are highly correlated and redundant. In this paper, we explore nine feature selection criteria, typically categorized into two classes: (i) information-theoretic criteria and (ii) similarity-based criteria, over seven public financial datasets. To the best of our knowledge, no previous comprehensive empirical investigation has demonstrated the effects of feature selection criteria on financial data. Experimental results indicate that the information-theoretic methods suffer from high computation time on high-dimensional data (i.e., a high number of features), while the similarity-based methods require significant computation on high-volume datasets (i.e., a high number of samples).
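The information-theoretic criteria discussed above typically rank features by their mutual information with the class label and keep the top-scoring ones. A minimal sketch of such a filter-style ranking for discrete features is shown below; this is an illustrative implementation, not the authors' code or any specific criterion from the paper:

```python
# Filter-style feature ranking by mutual information with the label.
# Illustrative sketch for discrete features (assumption: categorical data).
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        p_indep = (px[x] / n) * (py[y] / n)
        mi += p_joint * math.log(p_joint / p_indep)
    return mi

def rank_features(columns, y):
    """Return feature indices sorted by decreasing MI with the label."""
    scores = [mutual_information(col, y) for col in columns]
    return sorted(range(len(columns)), key=lambda i: -scores[i])

# Toy example: feature 0 copies the label, feature 1 is unrelated noise.
y     = [0, 0, 1, 1, 0, 1, 0, 1]
feat0 = [0, 0, 1, 1, 0, 1, 0, 1]   # perfectly informative
feat1 = [0, 1, 0, 1, 1, 0, 1, 0]   # uninformative
ranking = rank_features([feat0, feat1], y)
print(ranking)  # → [0, 1]: the informative feature ranks first
```

The quadratic cost noted in the paper's results arises because criteria such as mRMR also score redundancy between every pair of candidate features, which this univariate sketch omits.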


Published in: SoICT '22: Proceedings of the 11th International Symposium on Information and Communication Technology, December 2022, 474 pages
ISBN: 9781450397254
DOI: 10.1145/3568562
Copyright © 2022 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance rate: 147 of 318 submissions, 46%
