Abstract
Big data have recently attracted attention in many fields, including machine learning, pattern recognition, medicine, finance, and transportation. Data analysis is crucial to converting raw data into the specific information fed to decision-making systems, and as datasets grow more diverse and complex, knowledge discovery becomes more difficult. One solution is feature subset selection, a preprocessing step that reduces this complexity so that computation and analysis become tractable, producing a reliable and suitable input for any data-mining algorithm. Effective feature selection can improve a model's performance and help reveal the characteristics and underlying structure of complex data. This study introduces a novel hybrid, cloud-based feature selection model for imbalanced data based on the k-nearest-neighbor (kNN) algorithm. The proposed model combines the firefly distance metric with the Euclidean distance used in kNN. Experimental results showed gains in both running time and learned feature weights compared with the weighted nearest neighbor, along with a 12% improvement in classification accuracy over the weighted nearest neighbor algorithm. Moreover, the cloud-distributed model reduced processing time by up to 30%, which is substantial compared with recent state-of-the-art methods.
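To make the hybrid distance idea concrete, the following is a minimal illustrative sketch (not the authors' published implementation): a kNN classifier whose votes are scaled by a firefly-style attractiveness term, beta * exp(-gamma * r^2), applied on top of a feature-weighted Euclidean distance. The function name, the feature-weight vector `w`, and the `beta`/`gamma` parameters are all assumptions introduced here for illustration.

```python
import numpy as np

def hybrid_knn_predict(X_train, y_train, x, k=3, w=None, beta=1.0, gamma=0.5):
    """Predict a label for x via kNN with a firefly-style attractiveness vote.

    w       -- per-feature weights standing in for a feature-selection step
    beta,
    gamma   -- firefly attractiveness parameters (illustrative defaults)
    """
    X_train = np.asarray(X_train, dtype=float)
    x = np.asarray(x, dtype=float)
    if w is None:
        w = np.ones(X_train.shape[1])
    # Feature-weighted Euclidean distance from x to every training point
    d = np.sqrt((((X_train - x) ** 2) * w).sum(axis=1))
    idx = np.argsort(d)[:k]  # indices of the k nearest neighbors
    # Firefly-style attractiveness: nearer ("brighter") neighbors vote more
    attract = beta * np.exp(-gamma * d[idx] ** 2)
    votes = {}
    for label, a in zip(np.asarray(y_train)[idx], attract):
        votes[label] = votes.get(label, 0.0) + a
    return max(votes, key=votes.get)

X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
y = [0, 0, 1, 1]
print(hybrid_knn_predict(X, y, [0.05, 0.1], k=3))  # query point near class 0
```

In this sketch the exponential decay means a distant third neighbor contributes far less than two nearby ones, so the vote is dominated by close points, which is the intuition behind weighting kNN votes by attractiveness rather than counting them equally.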
Shehab, N., Badawy, M. & Ali, H.A. Toward feature selection in big data preprocessing based on hybrid cloud-based model. J Supercomput 78, 3226–3265 (2022). https://doi.org/10.1007/s11227-021-03970-7