Abstract
Breast cancer is a serious threat to women’s health worldwide. Although its exact causes remain unknown, its incidence is known to be associated with risk factors. Risk factors are any genetic, reproductive, hormonal, physical, biological, or lifestyle-related conditions that increase the likelihood of developing the disease. This research aims to identify the most relevant risk factors in patients with breast cancer in a dataset by following the Knowledge Discovery in Databases process. To determine the relevance of risk factors, this research applies two feature selection methods, the Chi-Squared test and Mutual Information, and uses seven classifiers to validate the results. Our results show that the risk factors identified as the most relevant are related to the age of the patient, her menopausal status, whether she had undergone hormonal therapy, and her type of menopause.
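The two feature selection scores named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes categorical risk factors and a binary outcome, and the data, the feature names (`age_group`, `hormone_therapy`), and the helper `rank_features` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def chi_squared(xs, ys):
    """Pearson chi-squared statistic for two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts, y_counts = Counter(xs), Counter(ys)
    stat = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n          # count expected under independence
            observed = joint.get((x, y), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def mutual_information(xs, ys):
    """Mutual information (in nats) between two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts, y_counts = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), nxy in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (nxy / n) * math.log(nxy * n / (x_counts[x] * y_counts[y]))
    return mi


def rank_features(table, labels, score):
    """Return feature names ordered from highest to lowest score."""
    scores = {name: score(column, labels) for name, column in table.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: two encoded risk factors and a binary outcome.
toy = {
    "age_group":       [0, 0, 0, 0, 1, 1, 1, 1],
    "hormone_therapy": [0, 1, 0, 1, 0, 1, 0, 1],
}
cancer = [0, 0, 0, 1, 1, 1, 1, 1]

print(rank_features(toy, cancer, chi_squared))
print(rank_features(toy, cancer, mutual_information))
```

Both scores measure how far a feature and the outcome depart from statistical independence, so a higher score marks a more relevant feature. The classifier-based validation step mentioned in the abstract is not covered by this sketch.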
Notes
Breast Cancer Surveillance Consortium page: https://www.bcsc-research.org
Data collection and sharing was supported by the National Cancer Institute-funded Breast Cancer Surveillance Consortium (HHSN261201100031C). http://www.bcsc-research.org/.
A more complete dataset could be used instead. However, publicly available breast cancer datasets, particularly those covering risk factors, are difficult to find.
RapidMiner page: https://rapidminer.com
Funding
This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.
Ethics declarations
The authors declare that they have no conflicts of interest.
Additional information
Publisher’s Note.
Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Ibarra-Cuevas, Z., Nunez-Varela, J., Nunez-Varela, A. et al. Determination of Relevant Risk Factors for Breast Cancer Using Feature Selection. Program Comput Soft 49, 671–681 (2023). https://doi.org/10.1134/S0361768823080091
DOI: https://doi.org/10.1134/S0361768823080091