Analysis of impact of balanced level on MI-based and non-MI-based feature selection methods

The Journal of Supercomputing

Abstract

Advancements in high-speed computer technology play an ever-increasing role in analyzing data of various types and massive size. However, handling big high-dimensional data sets is a challenge in terms of computational capacity and storage. Feature selection methods reduce data dimensionality by eliminating uninformative variables, allowing for more extensive analysis. In this paper, data sets are classified into two types based on the class ratio of the response: balanced (almost the same ratio for each class) and partially balanced (a majority class and minority classes, with virtually the same ratio across the minority classes). Performance comparisons of various feature selection methods on balanced and partially balanced data are provided. This approach helps in selecting a sampling strategy and feature selection methods that perform well while utilizing appropriate resources for high-dimensional data.
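To make the two ideas in the abstract concrete, the sketch below is a minimal illustration only, not the authors' experimental setup: it builds a synthetic partially balanced data set, measures the balance level as the class ratio of the response, and performs an MI-based filter ranking with scikit-learn's mutual_info_classif. The data set, parameters, and top-k cutoff are all hypothetical choices for the example.

```python
# Minimal sketch (assumptions: scikit-learn available; toy synthetic data,
# not the data sets or method list studied in the paper).
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Toy high-dimensional data: 20 features, only 5 informative, with a
# 90/10 majority/minority split, i.e. "partially balanced" in the
# paper's terminology.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    n_redundant=5, weights=[0.9, 0.1], random_state=0,
)

# Balance level: the ratio of each class in the response.
counts = Counter(y)
ratios = {cls: n / len(y) for cls, n in counts.items()}
print("class ratios:", ratios)

# MI-based filter ranking: estimate the mutual information between each
# feature and the class label, then keep the top-k features.
mi = mutual_info_classif(X, y, random_state=0)
top_k = np.argsort(mi)[::-1][:5]
print("top-5 features by MI:", top_k, mi[top_k].round(3))
```

A non-MI-based filter (e.g., a Fisher score or ReliefF ranking) would slot into the same pipeline in place of the MI estimate, which is what makes the balance-level comparison across method families straightforward.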



Author information


Corresponding author

Correspondence to Nasser Fard.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Li, K., Fard, N. Analysis of impact of balanced level on MI-based and non-MI-based feature selection methods. J Supercomput 78, 16485–16497 (2022). https://doi.org/10.1007/s11227-022-04504-5
