Abstract
Despite recent significant advances in big data analytics, there is substantial evidence that machine learning techniques can perform poorly when building prediction models. This research investigated the performance and effectiveness of several machine learning techniques, namely Naive Bayes (NB), PART, Random Forest (RF), Support Vector Machine (SVM), Adaboost, and Bagging, in order to advance the existing understanding of model behavior with big data. A large hospital-based breast cancer dataset with diagnostic information, drawn from the SEER data file and covering 2005 to 2014, was used. To address outlier and class-imbalance issues, we used C4.5 and the Synthetic Minority Oversampling TEchnique (SMOTE) to eliminate outliers and balance the dataset, respectively. Stratified 10-fold cross-validation was used to divide the dataset in order to reduce the bias and variance of the experimental results. Accuracy, G-mean (G), F-measure, and the Matthews correlation coefficient (MCC) were employed as criteria for the overall performance of the models, while sensitivity, specificity, and precision were used as criteria that give more detailed insight into their performance. The experimental results indicate that RF is superior to NB, PART, SVM, Adaboost, and Bagging on all criteria. In addition, models built on datasets with fewer outliers and balanced classes outperform models built on the original dataset in terms of both the detailed and the overall criteria.
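The evaluation criteria named in the abstract can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch, not the authors' implementation: it assumes the common definitions (G-mean as the geometric mean of sensitivity and specificity, F-measure as the balanced F1 score), and the function name `classification_metrics` and the example counts are hypothetical.

```python
import math


def _safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Compute the seven criteria from binary confusion-matrix counts.

    Assumed definitions (not confirmed by the abstract):
    G-mean = sqrt(sensitivity * specificity), F-measure = F1.
    """
    sensitivity = _safe_div(tp, tp + fn)   # true positive rate (recall)
    specificity = _safe_div(tn, tn + fp)   # true negative rate
    precision = _safe_div(tp, tp + fp)     # positive predictive value
    accuracy = _safe_div(tp + tn, tp + fp + tn + fn)
    g_mean = math.sqrt(sensitivity * specificity)
    f_measure = _safe_div(2 * precision * sensitivity, precision + sensitivity)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _safe_div(tp * tn - fp * fn, mcc_denominator)
    return {
        "accuracy": accuracy,
        "g_mean": g_mean,
        "f_measure": f_measure,
        "mcc": mcc,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the calculation.
    for name, value in classification_metrics(tp=80, fp=10, tn=90, fn=20).items():
        print(f"{name}: {value:.4f}")
```

On an imbalanced dataset, accuracy alone can look high even when the minority class is poorly predicted, which is one reason G-mean and MCC are commonly reported alongside it.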
Acknowledgments
This research was financially supported by the Faculty of Informatics, Mahasarakham University (Grant year 2019). The researchers would like to thank the SEER website for providing the data used for analyzing the survival model of patients with breast cancer.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Thomgkam, J., Sukmak, V., Klangnok, P. (2021). Application of Machine Learning Techniques to Predict Breast Cancer Survival. In: Chomphuwiset, P., Kim, J., Pawara, P. (eds) Multi-disciplinary Trends in Artificial Intelligence. MIWAI 2021. Lecture Notes in Computer Science, vol 12832. Springer, Cham. https://doi.org/10.1007/978-3-030-80253-0_13
DOI: https://doi.org/10.1007/978-3-030-80253-0_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-80252-3
Online ISBN: 978-3-030-80253-0
eBook Packages: Computer Science, Computer Science (R0)