Abstract
Advances in high-speed computing play an ever-increasing role in analyzing data of many types and massive sizes. However, handling big, high-dimensional data sets is challenging in terms of both storage and computational capacity. Feature selection methods reduce data dimensionality by eliminating dummy (uninformative) variables, allowing more extensive analysis. In this paper, data sets are classified by the class ratio of the response into two types: balanced (almost the same ratio for each class) and partially balanced (a majority class and several minority classes, with virtually the same ratio among the minority classes). Performance comparisons of various feature selection methods on balanced and partially balanced data are provided. This approach helps in selecting a sampling strategy and feature selection methods that perform well while using appropriate resources for high-dimensional data.
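The paper does not include code, but the MI-based filter approach it compares can be illustrated with a minimal sketch: score each discrete feature by its empirical mutual information with the class labels and keep the top-ranked ones. The function names (`mutual_information`, `select_top_k`) and the toy data are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Empirical MI (in nats) between a discrete feature column and the labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), with counts c, px[x], py[y]
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

def select_top_k(X, y, k):
    """Rank feature columns of X by MI with y; return indices of the top k."""
    scores = [mutual_information(col, y) for col in zip(*X)]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

# Toy example: feature 0 perfectly predicts y, feature 1 is noise.
X = [(0, 1), (0, 0), (1, 1), (1, 0), (0, 1), (1, 0)]
y = [0, 0, 1, 1, 0, 1]
print(select_top_k(X, y, 1))  # -> [0]: feature 0 carries all label information
```

For continuous features, estimators such as the nearest-neighbor method of Ross (2014) replace the plain contingency counts above; the ranking step is unchanged.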



Li, K., Fard, N. Analysis of impact of balanced level on MI-based and non-MI-based feature selection methods. J Supercomput 78, 16485–16497 (2022). https://doi.org/10.1007/s11227-022-04504-5