Determination of Relevant Risk Factors for Breast Cancer Using Feature Selection

Programming and Computer Software

Abstract

Breast cancer is a serious threat to women’s health worldwide. Although its exact causes remain unknown, the incidence of breast cancer is known to be associated with risk factors. A risk factor is any genetic, reproductive, hormonal, physical, biological, or lifestyle-related condition that increases the likelihood of developing the disease. This research aims to identify the most relevant risk factors in patients with breast cancer in a dataset by following the Knowledge Discovery in Databases process. To determine the relevance of the risk factors, two feature selection methods are applied, the Chi-Squared test and Mutual Information, and seven classifiers are used to validate the results. Our results show that the most relevant risk factors are the age of the patient, her menopausal status, whether she had undergone hormonal therapy, and her type of menopause.
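As an illustration of the two feature selection scores named in the abstract, the following is a minimal sketch in plain Python. It is not the authors' implementation (the notes indicate that RapidMiner was used). The function names, the toy columns `informative` and `noise`, and the toy labels are all hypothetical and chosen only to show how each score separates a feature that tracks the class from one that does not.

```python
from collections import Counter
from math import log


def chi_squared(feature, label):
    """Pearson chi-squared statistic for two categorical sequences."""
    n = len(feature)
    joint = Counter(zip(feature, label))   # observed cell counts
    f_counts = Counter(feature)            # row totals
    l_counts = Counter(label)              # column totals
    stat = 0.0
    for f, fc in f_counts.items():
        for l, lc in l_counts.items():
            expected = fc * lc / n         # count expected under independence
            observed = joint.get((f, l), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def mutual_information(feature, label):
    """Mutual information (in nats) between two categorical sequences."""
    n = len(feature)
    joint = Counter(zip(feature, label))
    f_counts = Counter(feature)
    l_counts = Counter(label)
    mi = 0.0
    for (f, l), c in joint.items():
        # p(f, l) * log( p(f, l) / (p(f) * p(l)) ), written with raw counts
        mi += (c / n) * log(c * n / (f_counts[f] * l_counts[l]))
    return mi


def rank_features(columns, label, score):
    """Return feature names ordered from highest to lowest score."""
    return sorted(columns, key=lambda name: score(columns[name], label),
                  reverse=True)


# Hypothetical toy data: 8 records, binary class label.
toy_label = [1, 1, 1, 1, 0, 0, 0, 0]
toy_columns = {
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],        # independent of the label
    "informative": [1, 1, 1, 1, 0, 0, 0, 0],  # identical to the label
}

chi_ranking = rank_features(toy_columns, toy_label, chi_squared)
mi_ranking = rank_features(toy_columns, toy_label, mutual_information)
```

On this toy data both scores place `informative` ahead of `noise`. On real data the two rankings can differ, which is one reason a validation step with classifiers, as the abstract describes, is useful.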

Notes

  1. Breast Cancer Surveillance Consortium page: https://www.bcsc-research.org

  2. Data collection and sharing was supported by the National Cancer Institute-funded Breast Cancer Surveillance Consortium (HHSN261201100031C). http://www.bcsc-research.org/.

  3. Using another more complete dataset could be another option. However, it is difficult to find publicly available breast cancer datasets, particularly those related to risk factors.

  4. Rapid Miner page: https://rapidminer.com

Funding

This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.

Author information

Corresponding authors

Correspondence to Zazil Ibarra-Cuevas, Jose Nunez-Varela, Alberto Nunez-Varela, Francisco E. Martinez-Perez, Sandra E. Nava-Muñoz, Cesar A. Ramirez-Gamez or Hector G. Perez-Gonzalez.

Ethics declarations

The authors declare that they have no conflicts of interest.

Additional information

Publisher’s Note.

Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ibarra-Cuevas, Z., Nunez-Varela, J., Nunez-Varela, A. et al. Determination of Relevant Risk Factors for Breast Cancer Using Feature Selection. Program Comput Soft 49, 671–681 (2023). https://doi.org/10.1134/S0361768823080091

  • DOI: https://doi.org/10.1134/S0361768823080091
