
Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 1153))

Abstract

This paper studies the problem of class imbalance in medical datasets. Modern machine learning techniques are increasingly popular for this type of problem, with examples in the areas of health and medicine. One of the major difficulties with these techniques is that the datasets handled are often highly imbalanced. Under-sampling and over-sampling techniques are used to work around this problem. In this paper, we apply random forests, which are combinations of decision trees fitted to subsamples of the data, built using under-sampling and over-sampling. We then compare the fit metrics obtained across the various model specifications tested and evaluate their results both in and out of sample. We observed that random forests using imbalanced sub-samples smaller than the original sample presented the best performance among the random forests tested, and an improvement over results previously obtained on the medical dataset.
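The combination the abstract describes can be illustrated concretely. The sketch below is a minimal, hedged example (not the authors' exact method): it builds a random forest on a randomly under-sampled training subset of a synthetic imbalanced dataset, assuming scikit-learn is available; the dataset, class ratio, and forest parameters are all illustrative choices.

```python
# Sketch: random forest trained on an under-sampled subset of an
# imbalanced binary dataset. Illustrative only, assuming scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 5% minority (positive) class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          random_state=0)

# Random under-sampling: keep every minority row and draw an equal
# number of majority rows, giving a balanced (and smaller) subset.
rng = np.random.default_rng(0)
minority = np.flatnonzero(y_tr == 1)
majority = np.flatnonzero(y_tr == 0)
keep = np.concatenate(
    [minority, rng.choice(majority, size=minority.size, replace=False)])

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr[keep], y_tr[keep])

# Minority-class recall is the kind of metric under-sampling targets;
# plain accuracy is misleading at a 95/5 class ratio.
print(round(recall_score(y_te, clf.predict(X_te)), 2))
```

Over-sampling would instead enlarge the minority side (e.g. by duplication, or synthetically as in SMOTE [24]); both approaches rebalance what each tree in the forest sees during fitting.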



Corresponding author: Engy El-shafeiy.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

El-shafeiy, E., Abohany, A. (2020). Medical Imbalanced Data Classification Based on Random Forests. In: Hassanien, AE., Azar, A., Gaber, T., Oliva, D., Tolba, F. (eds) Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV2020). AICV 2020. Advances in Intelligent Systems and Computing, vol 1153. Springer, Cham. https://doi.org/10.1007/978-3-030-44289-7_8
