Persistence of data-driven knowledge to predict breast cancer survival
Introduction
Breast cancer (BC) is one of the most common diseases affecting women worldwide. In 2012, more than 1.5 million new BC cases were detected, and the disease caused more than half a million deaths [1]. In recent years, early detection and advances in treatment have improved the cure rate and life expectancy of patients with BC [2], [3], for a total increment of 39% from 1990 to 2015 in the USA [4]. One of the quality indicators used in BC treatment is the five-year survival rate after diagnosis [5], [6]. Five-year survival depends on the stage of the BC at the time of diagnosis [4].
Many studies use data-driven machine learning technology to construct computer-based predictive models that estimate the survival expectancy of newly diagnosed patients with BC [7], [8], [9], [10], [11], [12]. In a recent work, Kate and Nadig [6] confirmed that the construction of such models should be conditioned on the stage of the cancer at the time of diagnosis. Their analysis considered four possible stages, which they called summary stages: in-situ, localized, regional, and distant. The in-situ summary stage defines noninvasive neoplasms. The localized summary stage describes invasive neoplasms confined entirely to the organ of origin. The regional stage represents neoplasms that have extended beyond the limits of the organ of origin, either directly into surrounding organs or tissues, or into regional lymph nodes by way of the lymphatic system, or by a combination of extension and regional lymph nodes. Distant-stage BC comprises neoplasms that have spread to parts of the body remote from the primary tumor.
According to the authors, this was the first publication to propose a stage-specific approach to the intelligent data analysis of BC survival prediction. Their work improved on previous approaches and reached some interesting conclusions, some of which are summarized in Table 1. We grouped them into four types depending on whether they refer to survival rates, model performance, learning facility, or the predictive relevance of the clinical parameters. Some of these conclusions depend on the machine learning technologies applied (e.g., naïve Bayes [15], logistic regression [16], or decision trees [17]) or on the predictive models obtained (i.e., joint models, designed to predict the survival of patients in any stage, or summary stage-specific models, conceived to predict the survival of patients in one particular stage).
However, despite the clear advantages of a stage-specific predictive analysis of BC data as reported in [6], it is still unclear whether the conclusions reached are temporary or permanent and, if temporary, for how long a predictive finding remains valid. For example, Kate and Nadig [6] found that the site where BC surgery is performed, the size of the lymph node chains detected, and the size of the tumor are the three clinical features that provide the most information to predict five-year survival of patients with distant-stage BC. To arrive at this conclusion, they used 2682 BC incidences that occurred between 2004 and 2008. The challenge is to determine whether these clinical parameters remain among the most informative in the following years and, if so, how the amount of information provided by each clinical parameter progresses over the years and how long it takes for a clinical parameter to become irrelevant for BC survival prediction. These questions can be answered with intelligent data analysis technology, provided that a database is available that contains a significant number of representative cases covering a large number of years.
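The question of how the information provided by a clinical parameter evolves over time can be illustrated with a small sketch. It assumes that "information" is measured as information gain (the reduction in label entropy after splitting on a feature); the text above does not fix the measure, so this is one plausible reading. The field names `year`, `survived`, and `tumor_size`, and the toy cases, are hypothetical and not taken from SEER.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus their entropy conditioned on a categorical feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_by_year(cases, feature):
    """Information gain of `feature` computed separately for each diagnosis year."""
    by_year = {}
    for case in cases:
        by_year.setdefault(case["year"], []).append(case)
    return {
        year: information_gain(
            [c[feature] for c in group], [c["survived"] for c in group]
        )
        for year, group in sorted(by_year.items())
    }


# Hypothetical toy cases: the feature separates the classes in 2004 but not in 2005.
toy_cases = [
    {"year": 2004, "tumor_size": "small", "survived": 1},
    {"year": 2004, "tumor_size": "small", "survived": 1},
    {"year": 2004, "tumor_size": "large", "survived": 0},
    {"year": 2004, "tumor_size": "large", "survived": 0},
    {"year": 2005, "tumor_size": "small", "survived": 1},
    {"year": 2005, "tumor_size": "small", "survived": 0},
    {"year": 2005, "tumor_size": "large", "survived": 1},
    {"year": 2005, "tumor_size": "large", "survived": 0},
]
gains = gain_by_year(toy_cases, "tumor_size")
```

Plotting such per-year values for each clinical parameter would show whether a parameter stays informative or drifts toward irrelevance; continuous parameters would first need to be discretized.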
The Surveillance, Epidemiology, and End Results (SEER) database [13] of the National Cancer Institute collects data about cancer diagnoses, treatment, and survival for approximately 30% of the US population. Among these data, SEER contains information about 798,624 incidences of BC between 1973 and 2015. Consequently, it is a suitable database for the study that we propose. Specifically, in this paper we use the information about BC cases contained in the SEER database to analyze the validity of the conclusions reached by Kate and Nadig [6] when we project them across the years. Our analysis comprises the identification of the number of years and cases that machine learning methods require to train solid predictive models of BC survival, the evolution of the importance of relevant clinical parameters for predicting BC survival, and the decline in the quality of predictive models as time passes.
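One way to organize such a temporal analysis is to train on a window of consecutive diagnosis years and test on cases diagnosed some years after the window closes. The sketch below only enumerates those train/test year combinations; it is an illustrative assumption about the experimental layout, not the authors' protocol, and the function name and parameters (`train_span`, `max_gap`) are hypothetical.

```python
def temporal_splits(first_year, last_year, train_span, max_gap):
    """Yield (train_years, test_year, gap) for sliding training windows.

    train_span: number of consecutive diagnosis years used for training.
    max_gap:    largest distance, in years, between the end of the training
                window and the test year.
    """
    for start in range(first_year, last_year - train_span + 1):
        train_years = list(range(start, start + train_span))
        train_end = train_years[-1]
        for gap in range(1, max_gap + 1):
            test_year = train_end + gap
            if test_year > last_year:
                break  # no data available beyond the last year
            yield train_years, test_year, gap


# Example: five-year training windows over 2004-2010, tested up to 3 years later.
splits = list(temporal_splits(2004, 2010, train_span=5, max_gap=3))
for train_years, test_year, gap in splits:
    print(f"train {train_years[0]}-{train_years[-1]}, test {test_year} (gap {gap})")
```

Varying `train_span` probes how many years of data a model needs, while comparing performance across values of `gap` probes how quickly a trained model degrades.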
Material and methods
Following the study by Kate and Nadig [6], and in order to determine the validity of their conclusions over the years, we used the same framework. For the dataset, we considered the same features and applied the same data selection and preprocessing (see details in Appendix A). For the study, we considered the same classification of BC stages. For the technologies, we used the same machine learning modeling algorithms. For the evaluation, we used the same measure of prediction quality.
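The difference between stage-specific and joint models can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework of [6]: it uses a tiny hand-written categorical naïve Bayes classifier, treats the stage as an ordinary input feature of the joint model, and runs on hypothetical toy records of the form `(stage, features, survived)`.

```python
import math
from collections import Counter, defaultdict


class CategoricalNB:
    """Tiny categorical naive Bayes with Laplace smoothing (illustrative only)."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        # counts[(feature_index, class)][value] = number of training rows
        self.counts = defaultdict(Counter)
        self.values = [set() for _ in rows[0]]
        for row, y in zip(rows, labels):
            for j, v in enumerate(row):
                self.counts[(j, y)][v] += 1
                self.values[j].add(v)
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for c in self.classes:
            score = math.log(self.class_counts[c] / total)
            for j, v in enumerate(row):
                num = self.counts[(j, c)][v] + 1
                den = self.class_counts[c] + len(self.values[j])
                score += math.log(num / den)
            if score > best_score:
                best, best_score = c, score
        return best


def fit_stage_models(records):
    """Fit one model per summary stage plus one joint model on all records."""
    by_stage = defaultdict(list)
    for stage, features, survived in records:
        by_stage[stage].append((features, survived))
    models = {}
    for stage, items in by_stage.items():
        rows, labels = zip(*items)
        models[stage] = CategoricalNB().fit(rows, labels)
    # Assumption: the joint model sees the stage as just another feature.
    joint_rows = [(stage,) + tuple(f) for stage, f, _ in records]
    joint_labels = [s for _, _, s in records]
    models["joint"] = CategoricalNB().fit(joint_rows, joint_labels)
    return models


# Hypothetical toy records: (stage, (tumor size, grade), five-year survival).
records = [
    ("localized", ("small", "low"), 1),
    ("localized", ("small", "high"), 1),
    ("localized", ("large", "high"), 0),
    ("distant", ("large", "high"), 0),
    ("distant", ("large", "low"), 0),
    ("distant", ("small", "low"), 1),
]
models = fit_stage_models(records)
```

A new case would then be scored either by the model of its own stage, for example `models["distant"].predict(("large", "high"))`, or by the joint model with the stage prepended to the features.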
Results and discussion
A total number of 312,446 BC incidences remained after the selection process described, among which 264,348 (84.61%) correspond to 5-year survival cases. The distribution of in-situ, localized, regional, and distant BC stage-specific incidences and their survival rates can be observed in Table 3.
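The reported survival percentage follows directly from the two counts above, as this short check shows. The variable names are ours; only the two counts come from the text.

```python
total_cases = 312_446   # BC incidences remaining after selection
survivors = 264_348     # cases with 5-year survival

survival_rate = 100 * survivors / total_cases
non_survivors = total_cases - survivors

print(f"{survival_rate:.2f}% survived, {non_survivors} did not")
# prints "84.61% survived, 48098 did not"
```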
These figures represent a 79.03% increment in the number of cases with respect to [6] and a different distribution of stage-specific cases which, for in-situ and distant BC incidences, grew from 5.79% and
Conclusions
As new applications of data-driven machine learning methods to predict the survival of BC patients appear and new knowledge is generated, it becomes more necessary to distinguish findings that remain valid over time from those that are only temporarily valid and, for the latter, to determine how long they remain valid. Population-based cancer databases exist for data analysis [24]. One of the most recent works using the SEER database [6] arrived at fifteen relevant conclusions
Authors’ contributions
DR conceived the idea, prepared the dataset, and ran the initial experiments. RK implemented the final algorithms, which were double-checked by DR, and obtained the results. Both authors participated in the analysis of the results and in the writing of the document.
Authors statement
The authors state that there are no competing interests to declare.
Conflicts of interests
None.
Acknowledgment
This work was supported by the RETOS P-BreasTreat project (DPI2016-77415-R) of the Spanish Ministerio de Economia y Competitividad.
References (25)
- et al. Why have breast cancer mortality rates declined? J. Cancer Policy (2015)
- et al. Stage-specific predictive models for breast cancer survivability. Int. J. Med. Inf. (2017)
- et al. Predicting breast cancer survivability: a comparison of three data mining methods. Artif. Intell. Med. (2005)
- et al. Robust predictive model for evaluating breast cancer survivability. Eng. Appl. Artif. Intell. (2013)
- et al. Breast cancer data analysis for survivability studies and prediction. Comput. Methods Programs Biomed. (2018)
- et al. Development of a tool for comprehensive evaluation of population-based cancer registries. J. Med. Inf. (2018)
- et al. Incidence and mortality and epidemiology of breast cancer in the world. Asian Pac. J. Cancer Prev. (2016)
- et al. Effect of screening and adjuvant therapy on mortality from breast cancer. N. Engl. J. Med. (2005)
- Breast Cancer Facts & Figures 2017–2018 (2017)
- et al. Adherence to quality indicators and survival in patients with breast cancer. Med. Care (2009)
- Applications of machine learning in cancer prediction and prognosis. Cancer Inf.
- Machine learning applications in cancer prognosis and prediction. Comput. Struct. Biotechnol. J.