An improved breast cancer disease prediction system using ML and PCA

Multimedia Tools and Applications

Abstract

Computer-aided diagnosis (CAD) systems based on machine learning (ML) techniques have changed the field of medical research. Breast cancer classification is one of many areas where such models are deployed and where accuracy is the primary concern. CAD systems aim to match the performance of trained clinicians in identifying breast cancer at an early stage, improving outcomes for patients while reducing the cost of treatment. This paper presents a supervised ML-based CAD system for breast cancer classification that combines feature selection, principal component analysis (PCA), grid search for hyperparameter tuning, and cross-validation. The system draws on seven ML classifiers: ANN, k-NN, SVM, DT, RF, XGBoost, and AdaBoost. Two ensemble models were developed by combining the predictions of the individual ML models, one using majority voting and the other using stacking with logistic regression (S-LR) for the final prediction. The system's performance is evaluated with several metrics, mainly accuracy, specificity, precision, recall, Matthews correlation coefficient, Jaccard index, and F1-score. The datasets used are the Wisconsin breast cancer dataset (WBCD) and the Mammographic Mass dataset. The results indicate that the XGBoost model achieved the highest recall, over 96%, on the Mammographic Mass dataset, while on the WBCD both the AdaBoost and S-LR models outperformed the others with a recall of 95.35%. The S-LR ensemble obtained the highest accuracies: 93.37% on the Mammographic Mass dataset and 97.37% on the WBCD. The proposed model can therefore be suggested as an aid to decision-making when classifying breast cancer tumors, and a Flask application using the S-LR model was developed.
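
To make the majority-voting ensemble and the evaluation metrics named in the abstract concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names `majority_vote` and `binary_metrics`, the three illustrative models (the paper uses seven), and all label values are assumptions made up for this example.

```python
from collections import Counter
from math import sqrt


def majority_vote(predictions_per_model):
    """Hard voting: for each sample, keep the label predicted by most models."""
    combined = []
    for sample_votes in zip(*predictions_per_model):
        # With an odd number of binary classifiers there are no ties.
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


def binary_metrics(y_true, y_pred, positive=1):
    """Compute the metrics listed in the abstract from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "jaccard": ratio(tp, tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Made-up labels (1 = malignant, 0 = benign) and made-up model outputs.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
model_a = [1, 1, 0, 0, 0, 1, 1, 0]
model_b = [1, 0, 1, 0, 0, 0, 1, 0]
model_c = [1, 1, 1, 0, 1, 1, 0, 0]

ensemble = majority_vote([model_a, model_b, model_c])
metrics = binary_metrics(y_true, ensemble)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy example the vote corrects errors that individual models make on single samples, which is the motivation for combining classifiers. The stacking variant (S-LR) differs in that a logistic regression model is trained on the base models' predictions instead of counting votes.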

Data availability

The data used in this paper are available on request from the corresponding author.

Notes

  1. https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic).

  2. http://archive.ics.uci.edu/ml/datasets/mammographic+mass.

Author information

Corresponding author

Correspondence to Bouchaib Cherradi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Laghmati, S., Hamida, S., Hicham, K. et al. An improved breast cancer disease prediction system using ML and PCA. Multimed Tools Appl 83, 33785–33821 (2024). https://doi.org/10.1007/s11042-023-16874-w

  • DOI: https://doi.org/10.1007/s11042-023-16874-w
