Persistence of data-driven knowledge to predict breast cancer survival

doi:10.1016/j.ijmedinf.2019.06.018

International Journal of Medical Informatics

Volume 129, September 2019, Pages 303-311

Highlights

• Data-driven breast cancer survival findings can be temporal.
• Previously accepted knowledge on breast cancer survival did not always hold over time.
• Breast cancer survival prediction can be improved even for mild summary stages.
• Knowledge derived from machine learning must be thoroughly validated over time before being accepted.

Abstract

Background

Machine learning predictive models for breast cancer survival can improve if they are made specific to the stage of the cancer at the time of diagnosis. However, the relevance of the clinical parameters in that prediction, and the predictive quality of these models may change over time.

Objective

To determine whether findings on the influence of clinical parameters and on the performance of machine learning models in the prediction of breast cancer survival should be considered temporary or permanent and, if temporary, for how long the newly generated knowledge remains valid.

Methods

Fifteen recently published relevant conclusions on the application of machine learning methods to predict breast cancer survival were identified. Then, the data on breast cancer in the SEER database were used to construct several data-driven models over time to predict five-year survival of breast cancer. Three different machine learning methods were used. Stage-specific models and joint models for all the stages were considered. The predictive quality of the models and the importance of clinical parameters were subjected to a persistence analysis over time in order to determine the validity and durability of these fifteen conclusions.
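The persistence analysis described above can be pictured as training a model on one window of diagnosis years and scoring it on each later year. The following is a minimal sketch under stated assumptions, not the authors' implementation: the record fields (`year`, `stage`, `survived`), the majority-vote stand-in model, and the use of plain accuracy as the quality measure are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def train_majority_by_stage(records):
    """Hypothetical stand-in model: predict the majority outcome seen per stage."""
    votes = defaultdict(Counter)
    for r in records:
        votes[r["stage"]][r["survived"]] += 1
    # Fallback for stages never seen during training.
    overall = Counter(r["survived"] for r in records).most_common(1)[0][0]
    table = {stage: c.most_common(1)[0][0] for stage, c in votes.items()}
    return lambda r: table.get(r["stage"], overall)


def accuracy(model, records):
    """Fraction of records whose outcome the model predicts correctly."""
    return sum(model(r) == r["survived"] for r in records) / len(records)


def persistence_curve(records, train_years, test_years,
                      train=train_majority_by_stage, metric=accuracy):
    """Train once on train_years, then score the frozen model on each later year."""
    model = train([r for r in records if r["year"] in train_years])
    curve = {}
    for year in test_years:
        test_set = [r for r in records if r["year"] == year]
        if test_set:  # skip years with no cases
            curve[year] = metric(model, test_set)
    return curve


# Toy records (invented for illustration, not SEER data).
toy = [
    {"year": 2004, "stage": "localized", "survived": True},
    {"year": 2004, "stage": "distant", "survived": False},
    {"year": 2005, "stage": "localized", "survived": True},
    {"year": 2005, "stage": "distant", "survived": False},
    {"year": 2006, "stage": "localized", "survived": True},
    {"year": 2006, "stage": "distant", "survived": False},
    {"year": 2007, "stage": "localized", "survived": True},
    {"year": 2007, "stage": "distant", "survived": True},
]
curve = persistence_curve(toy, train_years=range(2004, 2006),
                          test_years=range(2006, 2008))
print(curve)  # {2006: 1.0, 2007: 0.5}
```

In this toy run the model trained on 2004 and 2005 still fits 2006 but degrades in 2007, which is the kind of decay a persistence analysis is meant to expose. Any real learner and quality measure could be passed in through the `train` and `metric` parameters.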

Results and conclusions

Only 53% of the conclusions were true for the SEER cases in 1988–2009, and only 75% of these remained true over time. Relevant conclusions, such as the impossibility of improving survival prediction for the most frequent stages with more data, or the importance of cancer grade for predicting the survival of patients with distant metastasis, turned out to be false when subjected to a temporal analysis. Our study concludes that data-driven knowledge obtained with machine learning methods must be validated over time before it can be applied clinically and professionally.
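As a reading aid, the two percentages can be turned back into counts. The sketch below assumes that 53% refers to the fifteen conclusions and that both figures were rounded to the nearest integer; neither assumption is stated in the abstract, so the counts it produces are an inference rather than reported values.

```python
TOTAL_CONCLUSIONS = 15  # stated in the Methods paragraph

# Counts k of true conclusions whose share of 15 rounds to 53%.
true_counts = [k for k in range(1, TOTAL_CONCLUSIONS + 1)
               if round(100 * k / TOTAL_CONCLUSIONS) == 53]

# For each such k, counts j of persistent conclusions whose share of k rounds to 75%.
persistent = [(k, j) for k in true_counts
              for j in range(k + 1)
              if round(100 * j / k) == 75]

print(true_counts)  # [8]
print(persistent)   # [(8, 6)]
```

Under these assumptions, eight of the fifteen conclusions held for 1988–2009 and six of those eight persisted, that is, 40% of the original fifteen.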

Introduction

Breast cancer (BC) is one of the most common diseases affecting women around the world. In 2012, more than 1.5 million new BC cases were detected, and the disease caused more than half a million deaths [1]. In recent years, early detection and advances in the treatment of BC have improved the cure rate and life expectancy of patients with BC [2], [3], for a total increase of 39% from 1990 to 2015 in the USA [4]. One of the quality indicators used in BC treatments is the five-year survival rate after diagnosis [5], [6]. Five-year survival depends on the stage of the BC at the time of diagnosis [4].

Many studies use data-driven machine learning technology to construct computer-based predictive models that estimate the survival expectancy of newly diagnosed patients with BC [7], [8], [9], [10], [11], [12]. In a recent work, Kate and Nadig [6] confirmed that the construction of such models should be conditioned on the stage of the cancer at the time of diagnosis. Their analysis considered four possible stages, which they called summary stages: in-situ, localized, regional, and distant. The in-situ summary stage covers noninvasive neoplasms. The localized summary stage describes invasive neoplasms confined entirely to the organ of origin. The regional stage represents neoplasms that have extended beyond the limits of the organ of origin, either directly into surrounding organs or tissues, or into regional lymph nodes by way of the lymphatic system, or by a combination of both. The distant stage covers neoplasms that have spread to parts of the body remote from the primary tumor.

According to the authors, this was the first publication to propose a stage-specific approach to the intelligent data analysis of BC survival prediction. Their work improved on previous approaches and reached some interesting conclusions, some of which are summarized in Table 1. We grouped them into four types depending on whether they refer to survival rates, model performance, learning facility, or the predictive relevance of the clinical parameters. Some of these conclusions depend on the machine learning technologies applied (e.g., naïve Bayes [15], logistic regression [16], or decision trees [17]) or on the predictive models obtained (i.e., joint models, designed to predict the survival of patients in any stage, or summary stage-specific models, conceived to predict the survival of patients in one particular stage).

However, despite the clear advantages of a stage-specific predictive analysis of BC data as reported in [6], it is still unclear whether the conclusions reached are temporary or permanent and, if temporary, for how long a predictive finding remains valid. For example, Kate and Nadig [6] found that the site where BC surgery is performed, the size of the lymph node chains detected, and the size of the tumor are the three clinical features that provide the most information for predicting five-year survival of patients with distant-stage BC. To arrive at this conclusion, they used 2682 BC incidences that occurred between 2004 and 2008. The challenge is to determine whether these clinical parameters remain among the most informative in the following years and, if so, how the amount of information provided by each clinical parameter evolves over the years and how long it takes for a clinical parameter to become irrelevant for BC survival prediction. These questions can be answered with intelligent data analysis technology, provided that a database is available that contains a significant number of representative cases covering a large number of years.
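One way to track "the amount of information provided by each clinical parameter over the years" is to compute an information measure per diagnosis year. The sketch below uses information gain (the reduction in outcome entropy once the feature value is known); this particular measure is an assumption for illustration, and the field names `year`, `tumor_size`, and `survived` are hypothetical rather than SEER variable names.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature values."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_by_year(records, feature, label="survived"):
    """Information gain of one categorical feature, computed separately per year."""
    by_year = {}
    for r in records:
        by_year.setdefault(r["year"], []).append(r)
    return {
        year: information_gain([r[feature] for r in rows], [r[label] for r in rows])
        for year, rows in sorted(by_year.items())
    }


# Toy records (invented): the feature separates outcomes in 2004 but not in 2010.
toy = [
    {"year": 2004, "tumor_size": "small", "survived": True},
    {"year": 2004, "tumor_size": "small", "survived": True},
    {"year": 2004, "tumor_size": "large", "survived": False},
    {"year": 2004, "tumor_size": "large", "survived": False},
    {"year": 2010, "tumor_size": "small", "survived": True},
    {"year": 2010, "tumor_size": "large", "survived": True},
    {"year": 2010, "tumor_size": "small", "survived": False},
    {"year": 2010, "tumor_size": "large", "survived": False},
]
gains = gain_by_year(toy, "tumor_size")
print(gains)
```

In the toy data the gain is one bit in 2004 and zero in 2010, mimicking a parameter that loses its predictive relevance. Continuous parameters would need to be discretized before being passed to this function.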

The Surveillance, Epidemiology, and End Results (SEER) database [13] of the National Cancer Institute collects data about cancer diagnoses, treatment, and survival of approximately 30% of the US population. Among these data, SEER contains information about 798,624 incidences of BC between the years 1973 and 2015. Consequently, it is a suitable database for the study that we propose. Specifically, in this paper we use the information about BC cases contained in the SEER database to analyze the validity of the conclusions reached by Kate and Nadig [6] when they are projected across the years. Our analysis comprises the identification of the number of years and cases that machine learning methods require to train and generate solid predictive models of BC survival, the evolution of the importance of relevant clinical parameters for predicting BC survival, and the decline in the quality of predictive models as time passes.

Material and methods

Following the study by Kate and Nadig [6], and in order to determine the validity of their conclusions over the years, we used the same framework. For the dataset, we considered the same features and applied the same data selection and preprocessing (see details in Appendix A). For the study, we considered the same classification of BC stages. For the technologies, we used the same machine learning modeling algorithms. For the evaluation, we used the same measure of prediction quality.

Results and discussion

A total of 312,446 BC incidences remained after the selection process described, of which 264,348 (84.61%) correspond to 5-year survival cases. The distribution of in-situ, localized, regional, and distant BC stage-specific incidences and their survival rates can be observed in Table 3.
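The survival share quoted above follows directly from the two counts. The short check below reproduces it; the count of cases not recorded as five-year survivors is derived here by subtraction and does not appear in the text.

```python
total_cases = 312_446  # BC incidences remaining after selection
survivors = 264_348    # 5-year survival cases

survival_rate = 100 * survivors / total_cases
non_survivors = total_cases - survivors  # derived by subtraction, not a reported figure

print(f"{survival_rate:.2f}% survived; {non_survivors:,} did not")
# 84.61% survived; 48,098 did not
```

One consequence of this imbalance is that a trivial predictor that always answers "survives" would already be right for about 84.61% of these cases, so plain accuracy alone would say little about model quality on such data.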

These figures represent a 79.03% increase in the number of cases with respect to [6] and a different distribution of stage-specific cases which, for in-situ and distant BC incidences, grew from 5.79% and

Conclusions

As new applications of data-driven machine learning methods to predict the survival of BC patients appear and new knowledge is generated, it becomes more necessary to distinguish findings that are valid over time from those that are only temporarily valid and, for the latter, to determine their periods of validity. Population-based cancer databases exist for data analysis [24]. One of the most recent works using the SEER database [6] arrived at fifteen relevant conclusions

Authors’ contributions

DR conceived the idea, prepared the dataset, and ran the initial experiments. RK implemented the final algorithms, which were double-checked by DR, and obtained the results. Both authors participated in the analysis of the results and in the writing of the document.

Authors statement

The authors state that there are no competing interests to declare.

Conflicts of interests

None.

Acknowledgment

This work was supported by the RETOS P-BreasTreat project (DPI2016-77415-R) of the Spanish Ministerio de Economia y Competitividad.

References (25)

S.A. Narod et al., Why have breast cancer mortality rates declined?, J. Cancer Policy (2015)
R.J. Kate et al., Stage-specific predictive models for breast cancer survivability, Int. J. Med. Inf. (2017)
D. Delen et al., Predicting breast cancer survivability: a comparison of three data mining methods, Artif. Intell. Med. (2005)
K. Park et al., Robust predictive model for evaluating breast cancer survivability, Eng. Appl. Artif. Intell. (2013)
N. Shukla et al., Breast cancer data analysis for survivability studies and prediction, Comput. Methods Programs Biomed. (2018)
A. Sheikhtaheri et al., Development of a tool for comprehensive evaluation of population-based cancer registries, J. Med. Inf. (2018)
M. Ghoncheh et al., Incidence and mortality and epidemiology of breast cancer in the world, Asian Pac. J. Cancer Prev. (2016)
D.A. Berry et al., Effect of screening and adjuvant therapy on mortality from breast cancer, N. Engl. J. Med. (2005)
American Cancer Society, Breast Cancer Facts & Figures 2017–2018 (2017)
S.H. Cheng et al., Adherence to quality indicators and survival in patients with breast cancer, Med. Care (2009)
J.A. Cruz et al., Applications of machine learning in cancer prediction and prognosis, Cancer Inf. (2007)
K. Kourou et al., Machine learning applications in cancer prognosis and prediction, Comput. Struct. Biotechnol. J. (2014)
