
Toward feature selection in big data preprocessing based on hybrid cloud-based model

Abstract

Big data now pervade fields such as machine learning, pattern recognition, medicine, finance, and transportation. Data analysis is crucial for converting raw data into the specific information fed to decision-making systems, yet as datasets grow more diverse and complex, knowledge discovery becomes harder. One remedy is feature subset selection during preprocessing: it reduces this complexity so that computation and analysis become tractable, and it yields a reliable, well-conditioned input for any data-mining algorithm. Effective feature selection can improve a model's performance and expose the characteristics and underlying structure of complex data. This study introduces a novel hybrid, cloud-based feature selection model for imbalanced data built on the k-nearest-neighbor algorithm; the model combines the firefly distance metric with the Euclidean distance used in k-nearest-neighbor search. In experiments, the proposed model outperformed the simple weighted nearest neighbor in both running time and the quality of the learned feature weights, and it improved classification accuracy by 12% over the weighted nearest neighbor algorithm. Distributing the model across the cloud further reduced processing time by up to 30%, a substantial gain relative to recent state-of-the-art methods.
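
The full algorithm is only available in the subscription text, but the core idea the abstract describes, blending a firefly-style attractiveness term with the Euclidean distance inside a weighted k-nearest-neighbor classifier, can be sketched as follows. This is a minimal illustration under stated assumptions: the blending rule, the attractiveness form beta0·exp(−gamma·r²), the inverse-distance voting, and all names and parameters are ours, not the authors'.

```python
# Illustrative sketch only: a feature-weighted k-NN whose distance blends
# the classical firefly attractiveness beta(r) = beta0 * exp(-gamma * r^2)
# with the Euclidean distance. The blending rule, parameters, and voting
# scheme below are assumptions, not the authors' published algorithm.
import numpy as np


def weighted_euclidean(a, b, w):
    """Euclidean distance with per-feature weights w."""
    return np.sqrt(np.sum(w * (a - b) ** 2))


def attractiveness(r, beta0=1.0, gamma=1.0):
    """Firefly attractiveness: bright (close) points score near beta0."""
    return beta0 * np.exp(-gamma * r ** 2)


def hybrid_distance(a, b, w, alpha=0.5):
    """Assumed hybrid metric: mix the raw distance with the 'dimness'
    1 - beta(r), so an unattractive neighbor looks farther away."""
    r = weighted_euclidean(a, b, w)
    return alpha * r + (1.0 - alpha) * (1.0 - attractiveness(r))


def knn_predict(X_train, y_train, x, w, k=5):
    """Classify x by an inverse-distance-weighted vote of its k nearest
    training points under the hybrid metric (the weighting of votes is
    one common way to soften class imbalance)."""
    dists = np.array([hybrid_distance(x, xi, w) for xi in X_train])
    votes = {}
    for i in np.argsort(dists)[:k]:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (dists[i] + 1e-9)
    return max(votes, key=votes.get)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = (X[:, 0] > 0.8).astype(int)   # deliberately imbalanced labels
    w = np.ones(X.shape[1])           # uniform feature weights to start
    print(knn_predict(X[:180], y[:180], X[190], w))
```

In the paper, the feature weights w would presumably be learned (the abstract reports improved feature weights relative to the weighted nearest neighbor); they are fixed at 1 here purely to keep the sketch self-contained.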


Author information

Corresponding author

Correspondence to Noha Shehab.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Shehab, N., Badawy, M. & Ali, H.A. Toward feature selection in big data preprocessing based on hybrid cloud-based model. J Supercomput 78, 3226–3265 (2022). https://doi.org/10.1007/s11227-021-03970-7
