
Toward feature selection in big data preprocessing based on hybrid cloud-based model

Abstract

Big data now pervade fields such as machine learning, pattern recognition, medicine, finance, and transportation. Data analysis is crucial for converting raw data into the specific information fed to decision-making systems, yet as datasets grow more diverse and complex, knowledge discovery becomes harder. One remedy is feature subset selection during preprocessing: it reduces this complexity so that computation and analysis become tractable, and it yields a reliable, well-conditioned input for any data-mining algorithm. Effective feature selection can improve a model's performance and expose the characteristics and underlying structure of complex data. This study introduces a novel hybrid, cloud-based feature selection model for imbalanced data built on the k-nearest-neighbor algorithm; the model combines the firefly distance metric with the Euclidean distance used in k-nearest-neighbor search. In experiments, the proposed model outperformed the simple weighted nearest neighbor in both running time and the quality of the learned feature weights, and it improved classification accuracy by 12% over the weighted nearest neighbor algorithm. Distributing the model across the cloud further reduced processing time by up to 30%, a substantial gain relative to recent state-of-the-art methods.
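
The full algorithm is only available in the subscription text, but the core idea the abstract describes, blending a firefly-style attractiveness term with the Euclidean distance inside a weighted k-nearest-neighbor classifier, can be sketched as follows. This is a minimal illustration under stated assumptions: the blending rule, the attractiveness form beta0·exp(−gamma·r²), the inverse-distance voting, and all names and parameters are ours, not the authors'.

```python
# Illustrative sketch only: a feature-weighted k-NN whose distance blends
# the classical firefly attractiveness beta(r) = beta0 * exp(-gamma * r^2)
# with the Euclidean distance. The blending rule, parameters, and voting
# scheme below are assumptions, not the authors' published algorithm.
import numpy as np


def weighted_euclidean(a, b, w):
    """Euclidean distance with per-feature weights w."""
    return np.sqrt(np.sum(w * (a - b) ** 2))


def attractiveness(r, beta0=1.0, gamma=1.0):
    """Firefly attractiveness: bright (close) points score near beta0."""
    return beta0 * np.exp(-gamma * r ** 2)


def hybrid_distance(a, b, w, alpha=0.5):
    """Assumed hybrid metric: mix the raw distance with the 'dimness'
    1 - beta(r), so an unattractive neighbor looks farther away."""
    r = weighted_euclidean(a, b, w)
    return alpha * r + (1.0 - alpha) * (1.0 - attractiveness(r))


def knn_predict(X_train, y_train, x, w, k=5):
    """Classify x by an inverse-distance-weighted vote of its k nearest
    training points under the hybrid metric (the weighting of votes is
    one common way to soften class imbalance)."""
    dists = np.array([hybrid_distance(x, xi, w) for xi in X_train])
    votes = {}
    for i in np.argsort(dists)[:k]:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (dists[i] + 1e-9)
    return max(votes, key=votes.get)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = (X[:, 0] > 0.8).astype(int)   # deliberately imbalanced labels
    w = np.ones(X.shape[1])           # uniform feature weights to start
    print(knn_predict(X[:180], y[:180], X[190], w))
```

In the paper, the feature weights w would presumably be learned (the abstract reports improved feature weights relative to the weighted nearest neighbor); they are fixed at 1 here purely to keep the sketch self-contained.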


Author information

Corresponding author

Correspondence to Noha Shehab.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Shehab, N., Badawy, M. & Ali, H.A. Toward feature selection in big data preprocessing based on hybrid cloud-based model. J Supercomput 78, 3226–3265 (2022). https://doi.org/10.1007/s11227-021-03970-7
