
A nifty collaborative analysis to predicting a novel tool (DRFLLS) for missing values estimation

  • Methodologies and Application
  • Published in: Soft Computing

Abstract

A growing concern in intelligent data analysis is data preprocessing, which faces problems familiar from data mining: high-dimensional data, missing-value imputation, and data integration. One of the challenges in missing-value estimation is how to select the optimal number of nearest neighbors of those values. This paper investigates building a novel tool, called developed random forest and local least squares (DRFLLS), to estimate missing values in various datasets. By developing the random forest algorithm, seven categories of similarity measures are defined: the Pearson similarity coefficient, simple similarity, and five fuzzy similarities (M1, M2, M3, M4 and M5). These are sufficient to estimate the optimal number of neighbors of the missing values in this application. Local least squares (LLS) is then used to estimate the missing values themselves. Imputation accuracy is measured in two ways, Pearson correlation (PC) and normalized root-mean-square error (NRMSE); the optimal number of neighbors is the one associated with the highest PC and the smallest NRMSE. Experiments were carried out on six datasets from different disciplines. For a dataset with a small rate of missing values, DRFLLS gave the best estimate of the number of nearest neighbors with DRFPC and, to a second degree, with DRFFSM1 when r = 4; for a dataset with a high rate of missing values, the best estimate came from DRFFSM5 and, to a second degree, from DRFFSM3. The missing values were then estimated by LLS, and the accuracy of the results was measured by NRMSE and Pearson correlation. For a given dataset, the DRF correlation function yielding the smallest NRMSE, and likewise the one yielding the highest PC, is the better function for that dataset.
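The pipeline the abstract describes (choose k nearest neighbors, estimate each missing entry with local least squares, score accuracy with NRMSE and Pearson correlation) can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the paper's DRFLLS implementation: DRFLLS selects k via random-forest-derived similarity measures, whereas the sketch below takes k as a given parameter and ranks candidate neighbors by plain Pearson correlation. The function names are illustrative.

```python
import numpy as np

def lls_impute(X, i, j, k):
    """Estimate the missing entry X[i, j] with local least squares:
    regress row i's observed values on its k most similar complete rows,
    then apply the learned weights at column j."""
    complete = np.where(~np.isnan(X).any(axis=1))[0]   # rows with no gaps
    obs = [c for c in range(X.shape[1]) if c != j and not np.isnan(X[i, c])]
    target = X[i, obs]
    # rank candidate rows by |Pearson correlation| with the observed part
    sims = np.array([abs(np.corrcoef(X[r, obs], target)[0, 1])
                     for r in complete])
    nn = complete[np.argsort(sims)[-k:]]               # k nearest neighbors
    A = X[nn][:, obs].T                                # (len(obs), k) design
    w, *_ = np.linalg.lstsq(A, target, rcond=None)     # least-squares weights
    return float(X[nn, j] @ w)

def nrmse(truth, est):
    """Normalized root-mean-square error of the imputed entries."""
    truth, est = np.asarray(truth, float), np.asarray(est, float)
    return float(np.sqrt(np.mean((truth - est) ** 2)) / np.std(truth))
```

Pearson correlation between imputed and true values can be read off with `np.corrcoef(truth, est)[0, 1]`; following the abstract's criterion, the configuration with the highest PC and smallest NRMSE would be retained.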



Author information


Corresponding author

Correspondence to Samaher Al-Janabi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Al-Janabi, S., Alkaim, A.F. A nifty collaborative analysis to predicting a novel tool (DRFLLS) for missing values estimation. Soft Comput 24, 555–569 (2020). https://doi.org/10.1007/s00500-019-03972-x

