Persistence of data-driven knowledge to predict breast cancer survival

https://doi.org/10.1016/j.ijmedinf.2019.06.018

Highlights

  • Data-driven breast cancer survival findings can be temporal.

  • Previously accepted knowledge on breast cancer survival was not always true over time.

  • Breast cancer survival prediction can be improved even for mild summary stages.

  • Knowledge derived from machine learning must be thoroughly validated over time before it is accepted.

Abstract

Background

Machine learning predictive models for breast cancer survival can improve if they are made specific to the stage of the cancer at the time of diagnosis. However, the relevance of the clinical parameters in that prediction, and the predictive quality of these models may change over time.

Objective

To determine whether findings on the influence of clinical parameters and on the performance of machine learning models in the prediction of breast cancer survival should be considered temporary or permanent and, if temporary, what the period of validity of the newly generated knowledge is.

Methods

Fifteen recently published relevant conclusions on the application of machine learning methods to predict breast cancer survival were identified. Then, the data on breast cancer in the SEER database were used to construct several data-driven models over time to predict five-year survival of breast cancer. Three different machine learning methods were used. Stage-specific models and joint models for all the stages were considered. The predictive quality of the models and the importance of clinical parameters were subjected to a persistence analysis over time in order to determine the validity and durability of these fifteen conclusions.

Results and conclusions

Only 53% of the conclusions were true for the SEER cases in 1988–2009, and only 75% of these were true over time. Relevant conclusions, such as the impossibility of improving survival prediction for the most frequent stages with more data, or the importance of the grade of the cancer in predicting breast cancer survival of patients with distant metastasis, turned out to be false when subjected to a temporal analysis. Our study concludes that data-driven knowledge obtained with machine learning methods must be validated over time before it can be clinically and professionally applied.
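The two percentages above can be turned into counts with a short calculation. This is only a sketch of the arithmetic: it assumes that both percentages refer to whole conclusions out of the fifteen identified, which the abstract does not state explicitly.

```python
# Assumption: the abstract's percentages refer to whole conclusions.
TOTAL_CONCLUSIONS = 15

# 53% of 15 is closest to 8 conclusions (8 / 15 is about 53.3%).
true_in_reference_period = round(0.53 * TOTAL_CONCLUSIONS)

# 75% of those conclusions also held over time.
true_over_time = round(0.75 * true_in_reference_period)

# Share of the original fifteen conclusions that persisted.
share_persistent = true_over_time / TOTAL_CONCLUSIONS

print(true_in_reference_period, true_over_time, f"{share_persistent:.0%}")
```

Under that reading, fewer than half of the original conclusions would hold both in the reference period and over time.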

Introduction

Breast cancer (BC) is one of the most common diseases that affect women around the world. In 2012, more than 1.5 million new BC cases were detected, and the disease caused more than half a million deaths [1]. In recent years, early detection and new advances in the treatment of BC have improved the cure rate and life expectancy of patients with BC [2], [3], with a total increment of 39% from 1990 to 2015 in the USA [4]. One of the quality indicators used in BC treatments is the five-year survival rate after diagnosis [5], [6]. Five-year survival depends on the stage of the BC at the time of diagnosis [4].

Many studies use data-driven machine learning technology to construct computer-based predictive models that estimate the survival expectancy of newly diagnosed patients with BC [7], [8], [9], [10], [11], [12]. In a recent work, Kate and Nadig [6] confirmed that the construction of such models should be conditioned on the stage of the cancer at the time of diagnosis. In their analysis they considered four possible stages, which they called summary stages: in-situ, localized, regional, and distant. The in-situ summary stage defines noninvasive neoplasms. Localized summary stages describe invasive neoplasms confined entirely to the organ of origin. Regional stages represent neoplasms that have extended beyond the limits of the organ of origin, either directly into surrounding organs or tissues, or into regional lymph nodes by way of the lymphatic system, or by a combination of extension and regional lymph nodes. Distant BC are neoplasms that have spread to parts of the body remote from the primary tumor.

According to the authors, this was the first publication that proposed a stage-specific approach to the intelligent data analysis of BC survival prediction. Their work improved on previous approaches and reached some interesting conclusions, some of which are summarized in Table 1. We grouped them into four different types depending on whether they refer to survival rates, model performance, learning facility, or the predictive relevance of the clinical parameters. Some of these conclusions depend on the machine learning technologies applied (e.g., naïve Bayes [15], logistic regression [16], or decision trees [17]) or on the predictive models obtained (i.e., joint models when they are designed to predict the survival of patients in any stage, or summary stage-specific models when they are conceived to predict the survival of patients in one particular stage).

However, despite the clear advantages of a stage-specific predictive analysis of BC data as reported in [6], it is still unclear whether the conclusions reached are temporary or permanent and, if temporary, for how long a predictive finding remains valid. For example, Kate and Nadig [6] found that the site where BC surgery is performed, the size of the lymph node chains detected, and the size of the tumor are the three clinical features that provide the most information to predict five-year survival of patients with distant-stage BC. To arrive at this conclusion, they used 2682 BC incidences that occurred between 2004 and 2008. The challenge is to determine whether these clinical parameters are still among the most informative ones in the following years and, if so, how the amount of information provided by each clinical parameter progresses over the years and how long it takes for a clinical parameter to become irrelevant for BC survival prediction. These questions can be answered with intelligent data analysis technology if a database is available that contains a significant number of representative cases covering a large number of years.
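To make the notion of "information provided by a clinical feature" concrete, the following minimal sketch computes information gain, a common entropy-based measure. It is an assumption that this is the measure in question, since the text only speaks of features that "provide more information". The feature names and the eight toy cases are invented for illustration and are not SEER data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on one feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each feature value.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy cases: 1 = survived five years, 0 = did not.
survived = [1, 1, 1, 1, 0, 0, 0, 0]
tumor_size = ["small"] * 4 + ["large"] * 4   # separates the outcomes perfectly
laterality = ["left", "right"] * 4           # carries no information here

print(information_gain(tumor_size, survived))
print(information_gain(laterality, survived))
```

Recomputing such a score on the cases of each successive period would give the temporal progression of a feature's relevance that the paragraph asks about.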

The Surveillance, Epidemiology, and End Results (SEER) database [13] of the National Cancer Institute collects data about cancer diagnoses, treatment, and survival of approximately 30% of the US population. Among these data, SEER contains information about 798,624 incidences of BC between the years 1973 and 2015. Consequently, this is a suitable database to carry out the study that we propose. Specifically, in this paper we use the information about BC cases contained in the SEER database to analyze the validity of the conclusions reached by Kate and Nadig [6] when we project them across the years. Our analysis comprises the identification of the number of years and cases required for machine learning methods to train and generate solid predictive models of BC survival, the evolution of the importance of relevant clinical parameters in predicting BC survival, and the decline in the quality of predictive models as time passes.
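One simple way to quantify the "decline of the quality of predictive models as time passes" is to count how many years a model's score stays close to its initial value. The sketch below is a hypothetical illustration: the function name, the tolerance of 0.05, and the yearly scores are all assumptions made here, not values or procedures from the study.

```python
def validity_years(score_by_year, training_end_year, tolerance=0.05):
    """Count consecutive post-training years in which the score stays
    within `tolerance` of the score in the first evaluated year."""
    years = sorted(y for y in score_by_year if y > training_end_year)
    if not years:
        return 0
    baseline = score_by_year[years[0]]
    valid = 0
    for year in years:
        if baseline - score_by_year[year] > tolerance:
            break  # the model is considered outdated from this year on
        valid += 1
    return valid


# Invented yearly scores of a model trained on cases up to 2004.
scores = {2005: 0.80, 2006: 0.79, 2007: 0.78, 2008: 0.72, 2009: 0.77}
print(validity_years(scores, training_end_year=2004))  # prints 3
```

The count stops at the first year that falls below the tolerance, so a later recovery (2009 in the toy data) does not extend the validity period.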

Material and methods

Following the study by Kate and Nadig [6], and in order to determine the validity of their conclusions over the years, we used the same framework. For the dataset, we considered the same features and applied the same data selection and preprocessing (see details in Appendix A). For the study, we considered the same classification of BC stages. For the technologies, we used the same machine learning modeling algorithms. For the evaluation, we used the same measure of predictive quality.
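A persistence analysis over time requires splitting the cases by year of diagnosis into training and evaluation periods. The sketch below shows one possible layout, assuming fixed-length training windows that slide one year at a time and are evaluated on all later years. The function name and the five-year window length are assumptions made for illustration; only the 1988–2009 range comes from the abstract, and this is not presented as the authors' protocol.

```python
def rolling_windows(first_year, last_year, train_span):
    """Return (train_years, test_years) pairs for windows that slide
    forward one diagnosis year at a time."""
    windows = []
    for start in range(first_year, last_year - train_span + 1):
        train_years = list(range(start, start + train_span))
        # Evaluate on every diagnosis year after the training window.
        test_years = list(range(start + train_span, last_year + 1))
        windows.append((train_years, test_years))
    return windows


windows = rolling_windows(1988, 2009, train_span=5)
print(len(windows))   # number of training windows
print(windows[0])     # earliest window and its evaluation years
print(windows[-1])    # latest window, evaluated on the final year only
```

A model would then be trained on the cases diagnosed in each `train_years` window and scored separately on each year in `test_years`.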

Results and discussion

A total number of 312,446 BC incidences remained after the selection process described, among which 264,348 (84.61%) correspond to 5-year survival cases. The distribution of in-situ, localized, regional, and distant BC stage-specific incidences and their survival rates can be observed in Table 3.
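The reported survival percentage can be checked directly from the two counts given above. The variable names are chosen here for illustration.

```python
total_cases = 312_446   # BC incidences remaining after selection
survivors = 264_348     # cases with five-year survival

survival_rate = 100 * survivors / total_cases
non_survivors = total_cases - survivors

print(f"{survival_rate:.2f}%", non_survivors)
```

The result agrees with the 84.61% stated in the text and implies 48,098 non-survival cases, a class imbalance worth keeping in mind when reading any prediction quality figure.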

These figures represent a 79.03% increment in the number of cases with respect to [6] and a different distribution of stage-specific cases which, for in-situ and distant BC incidences, grew from 5.79% and

Conclusions

As new applications of data-driven machine learning methods to predict survival of BC patients appear and new knowledge is generated, it becomes more necessary to distinguish findings that are valid over time from those that are only temporarily valid and, in the latter case, to determine their periods of validity. Population-based cancer databases exist for data analysis [24]. One of the most recent works using the SEER database [6] arrived at fifteen relevant conclusions

Authors’ contributions

DR conceived the idea, prepared the dataset, and ran the initial experiments. RK implemented the final algorithms, which were double-checked by DR, and obtained the results. Both authors participated in the analysis of the results and in the writing of the document.

Authors statement

The authors state that there are no competing interests to declare.

Conflicts of interests

None.

Acknowledgment

This work was supported by the RETOS P-BreasTreat project (DPI2016-77415-R) of the Spanish Ministerio de Economia y Competitividad.

References (25)

  • J.A. Cruz et al., Applications of machine learning in cancer prediction and prognosis, Cancer Inf. (2007)

  • K. Kourou et al., Machine learning applications in cancer prognosis and prediction, Comput. Struct. Biotechnol. J. (2014)