Abstract
Computer-aided diagnosis (CAD) systems based on machine learning (ML) techniques have transformed medical research. Breast cancer classification is one of many areas where such models are deployed and where accuracy is the primary concern. CAD systems aim to match the performance of trained clinicians in identifying breast cancer at its early stages, thereby improving outcomes for breast cancer patients while reducing the cost of treatment. This paper presents a supervised ML CAD system for breast cancer classification based on feature selection, principal component analysis (PCA), grid search for hyperparameter tuning, and cross-validation. The system draws on seven ML classifiers: ANN, k-NN, SVM, DT, RF, XGBoost, and AdaBoost. Two ensemble models were developed by combining the predictions of the individual ML models, one using majority voting and the other using stacking with logistic regression (S-LR) for the final prediction. The system's performance is evaluated with several metrics, mainly accuracy, specificity, precision, recall, Matthews correlation coefficient, Jaccard index, and F1-score. The datasets used are the Wisconsin (WBCD) and Mammographic Mass datasets. The results indicate that the XGBoost model achieved the highest recall, over 96%, on the Mammographic Mass dataset, while on the WBCD both the AdaBoost and S-LR models outperformed the others with a recall of 95.35%. The S-LR ensemble obtained the highest accuracies: 93.37% on the Mammographic Mass dataset and 97.37% on the WBCD. Accordingly, the proposed model can assist decision-making in classifying breast cancer tumors. To that end, a Flask application using the S-LR model was developed.
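The pipeline described in the abstract (PCA, grid search with cross-validation, and the two ensembles) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes scikit-learn and uses its bundled Wisconsin Diagnostic dataset as a stand-in for the paper's data. The function name `build_and_evaluate`, the choice of 10 principal components, the SVM grid values, and the 80/20 split are all hypothetical. The ANN and XGBoost base learners are omitted to keep the sketch short and free of extra dependencies.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, jaccard_score,
                             matthews_corrcoef, precision_score, recall_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_and_evaluate(random_state=0):
    """Fit a voting and a stacking ensemble; return test metrics per ensemble."""
    X, y = load_breast_cancer(return_X_y=True)
    y = 1 - y  # scikit-learn encodes malignant as 0; flip so 1 = malignant
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state)

    def reduced(model):
        # Scale, project onto 10 principal components (assumed value), classify.
        return make_pipeline(StandardScaler(),
                             PCA(n_components=10, random_state=random_state),
                             model)

    # Grid search with 5-fold cross-validation, shown for one base learner only.
    svm_search = GridSearchCV(reduced(SVC()),
                              param_grid={"svc__C": [0.1, 1, 10]},
                              cv=5, scoring="recall")
    svm_search.fit(X_train, y_train)

    base_learners = [
        ("svm", svm_search.best_estimator_),
        ("knn", reduced(KNeighborsClassifier())),
        ("dt", reduced(DecisionTreeClassifier(random_state=random_state))),
        ("rf", reduced(RandomForestClassifier(random_state=random_state))),
        ("ada", reduced(AdaBoostClassifier(random_state=random_state))),
    ]
    ensembles = {
        # Hard voting: each base learner casts one vote per sample.
        "majority_voting": VotingClassifier(base_learners, voting="hard"),
        # Stacking: logistic regression learns from out-of-fold base outputs.
        "stacking_lr": StackingClassifier(
            base_learners, final_estimator=LogisticRegression(), cv=5),
    }

    results = {}
    for name, model in ensembles.items():
        y_pred = model.fit(X_train, y_train).predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            # Specificity is the recall of the negative (benign) class.
            "specificity": recall_score(y_test, y_pred, pos_label=0),
            "precision": precision_score(y_test, y_pred),
            "recall": recall_score(y_test, y_pred),
            "mcc": matthews_corrcoef(y_test, y_pred),
            "jaccard": jaccard_score(y_test, y_pred),
            "f1": f1_score(y_test, y_pred),
        }
    return results


if __name__ == "__main__":
    for name, metrics in build_and_evaluate().items():
        print(name, {k: round(v, 4) for k, v in metrics.items()})
```

The numbers this sketch prints will not match those reported in the abstract, since the dataset variant, preprocessing, base learners, and hyperparameter grids all differ from the paper's setup.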
Data availability
The data used in this paper are available on request from the corresponding author.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Laghmati, S., Hamida, S., Hicham, K. et al. An improved breast cancer disease prediction system using ML and PCA. Multimed Tools Appl 83, 33785–33821 (2024). https://doi.org/10.1007/s11042-023-16874-w
DOI: https://doi.org/10.1007/s11042-023-16874-w