A deep learning-based multi-model ensemble method for cancer prediction

doi:10.1016/j.cmpb.2017.09.005

Computer Methods and Programs in Biomedicine

Volume 153, January 2018, Pages 1-9

https://doi.org/10.1016/j.cmpb.2017.09.005

Highlights

• An ensemble of multiple machine learning models outperforms single classifiers.
• We propose a deep learning-based ensemble method for cancer prediction.
• We select differentially expressed genes from gene expression data.
• We present prediction results on lung, stomach and breast cancer data.

Abstract

Background and Objective: Cancer is a complex worldwide health problem associated with high mortality. With the rapid development of high-throughput sequencing technology and the emergence of various machine learning methods in recent years, cancer prediction based on gene expression has made steady progress, providing insight into effective and accurate treatment decision making. Developing machine learning methods that can successfully distinguish cancer patients from healthy persons is therefore of great current interest. However, among the classification methods applied to cancer prediction so far, no single method outperforms all the others.

Methods: In this paper, we demonstrate a new strategy, which applies deep learning to an ensemble approach that incorporates multiple different machine learning models. We supply informative gene data selected by differential gene expression analysis to five different classification models. Then, a deep learning method is employed to ensemble the outputs of the five classifiers.

Results: The proposed deep learning-based multi-model ensemble method was tested on three public RNA-seq data sets of three kinds of cancers, Lung Adenocarcinoma, Stomach Adenocarcinoma and Breast Invasive Carcinoma. The test results indicate that it increases the prediction accuracy of cancer for all the tested RNA-seq data sets as compared to using a single classifier or the majority voting algorithm.

Conclusions: By taking full advantage of different classifiers, the proposed deep learning-based multi-model ensemble method is shown to be accurate and effective for cancer prediction.

Introduction

Cancer has been characterized as a collection of related diseases involving abnormal cell growth with the potential to divide without stopping and spread into surrounding tissues [1]. According to the GLOBOCAN project [2], in 2012 alone about 14.1 million new cases of cancer occurred globally (not including skin cancer other than melanoma), and cancer caused about 14.6% of all deaths. Since cancer is a major cause of morbidity and mortality, diagnosing and detecting it at an early stage is of great importance for its cure. Cancer research has evolved continuously over the past decades [3]. Among the diverse methods and techniques developed for cancer prediction, the use of gene expression levels is one of the research hotspots in this field. Analysis of gene expression data has greatly facilitated cancer diagnosis and treatment. Accurate prediction of cancer is one of the most critical and urgent tasks for physicians [4].

With the rapid development of computer-aided techniques in recent years, machine learning methods are playing an increasingly important role in cancer diagnosis, and researchers continue to explore various prediction algorithms. Sayed et al. [5] conducted a comparative study on feature selection and classification using data collected from the central database of the National Cancer Registry Program of Egypt, applying three classifiers: support vector machines (SVMs), k-nearest neighbour (kNN) and Naive Bayes (NB). The results showed that SVMs with polynomial kernel functions yielded higher classification accuracy than kNN and NB. Statnikov et al. [6] carried out a comprehensive comparison of random forests (RFs) and SVMs for cancer diagnosis. Using the full set of genes, SVMs outperformed RFs in fifteen data sets, RFs outperformed SVMs in four data sets, and the two algorithms performed the same in three data sets. Similar results were obtained when gene selection was applied. The large body of literature on cancer prediction shows that none of these machine learning methods is fully accurate, and each may fall short in a different aspect of the classification procedure. For instance, it is difficult to choose an appropriate kernel function for SVMs, and although RFs address the over-fitting of decision trees (DTs), they may bias the classification result toward the category with more samples.

Since each machine learning method may outperform others or show defects in different cases, it is natural to expect that a method that takes advantage of multiple machine learning methods would lead to superior performance. To this end, several studies in the literature aim to integrate models to increase prediction accuracy. For example, Breiman [7] introduced Bagging, which combines outputs from decision trees generated from several randomly selected subsets of the training data and votes for the final outcome. Freund and Schapire [8] introduced Boosting, which updates the weights of training samples after each iteration of training and combines the classification outputs by weighted votes. Wolpert [9] proposed using linear regression to combine the outputs of neural networks, an approach later known as Stacking. Tan and Gilbert [10] applied Bagging and Boosting to cancer microarray data for cancer classification. Cho and Won [11] applied the majority voting algorithm to combine four classifiers using three benchmark cancer data sets. Both Stacking and majority voting take advantage of different machine learning methods. Although majority voting is the most common combination strategy in classification tasks, it is too simple to discover complex information from different classifiers. Stacking, which uses a learning method in the combination stage, is a much more powerful ensemble technique. Deep learning has become a strong learning method with many advantages, and the small number of deep learning studies in biomedicine to date have shown success with it [12]. Unlike majority voting, which only considers linear relationships among classifiers and requires manual participation, deep learning has the ability to “learn” intricate structures, especially nonlinear structures, from the original large data sets automatically. Thus, in order to better describe the unknown relationships among different classifiers, we adopt deep learning in the Stacking-based ensemble learning of multiple classifiers.
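For reference, the majority voting baseline discussed above can be sketched in a few lines. This is an illustrative sketch only, not code from the paper: the function name `majority_vote` and the toy predictions are hypothetical, and labels are assumed to be 1 for tumor and 0 for normal.

```python
from collections import Counter


def majority_vote(per_classifier_labels):
    """Combine label predictions by unweighted majority vote.

    per_classifier_labels: one list of predicted labels per classifier,
    all of the same length (one entry per sample).
    """
    combined = []
    # zip(*...) regroups the predictions sample by sample.
    for votes in zip(*per_classifier_labels):
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical outputs of five classifiers on three samples (1 = tumor, 0 = normal).
predictions = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
]
print(majority_vote(predictions))  # [1, 0, 1]
```

Every classifier receives the same weight regardless of how reliable it is, which is the limitation that a learned combination stage is meant to address.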

In this paper, we use deep neural networks to ensemble five classification models, namely kNN, SVMs, DTs, RFs and gradient boosting decision trees (GBDTs), constructing a multi-model ensemble that distinguishes tumor samples from normal ones. To avoid over-fitting, we employ differential gene expression analysis to select important and informative genes. The selected genes are then supplied to the five classification models. After that, a deep neural network is used to ensemble the outputs of the five classification models to obtain the final prediction result. We evaluate the proposed method on three public RNA-seq data sets from lung tissues, stomach tissues and breast tissues, respectively. The final results indicate that the proposed deep learning-based multi-model ensemble method makes more effective use of the information in the limited clinical data and generates more accurate predictions than single classifiers or the majority voting algorithm.
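The second stage described above, in which a neural network learns how to combine the outputs of the five classifiers, might look roughly like the following sketch. It rests on several assumptions: the network here has a single small hidden layer and is written in plain Python, so it is far simpler than a deep network; the class name `TinyMetaNet`, the hyperparameters and the toy first-stage outputs are all hypothetical; and the architecture actually used in the paper is not given in this excerpt.

```python
import math
import random


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


class TinyMetaNet:
    """One-hidden-layer network mapping base-classifier outputs to a tumor probability."""

    def __init__(self, n_inputs, n_hidden=4, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        out = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, out

    def predict_proba(self, x):
        return self._forward(x)[1]

    def fit(self, X, y, epochs=500, lr=0.5):
        # Plain stochastic gradient descent on the cross-entropy loss.
        for _ in range(epochs):
            for x, target in zip(X, y):
                hidden, out = self._forward(x)
                d_out = out - target  # gradient w.r.t. the output pre-activation
                for j, h in enumerate(hidden):
                    # Back-propagate through hidden unit j before updating w2[j].
                    d_hidden = d_out * self.w2[j] * h * (1.0 - h)
                    self.w2[j] -= lr * d_out * h
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * d_hidden * xi
                    self.b1[j] -= lr * d_hidden
                self.b2 -= lr * d_out
        return self


# Hypothetical first-stage outputs: each row holds the labels predicted by
# five classifiers for one sample (1 = tumor, 0 = normal).
meta_X = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0],
]
meta_y = [1, 1, 1, 0, 0, 0, 1, 0]

net = TinyMetaNet(n_inputs=5).fit(meta_X, meta_y)
predicted = [int(net.predict_proba(x) >= 0.5) for x in meta_X]
print(predicted)
```

In this toy data the first classifier is always correct while the other four are noisy, so an unweighted majority vote would misclassify three of the eight rows. A learned combiner can instead discover which classifiers to trust. In practice the meta-learner should be trained on predictions for samples the first-stage models did not see during their own training.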

Section snippets

Methods

The flowchart of the proposed deep learning-based ensemble strategy is shown in Fig. 1. Initially, differential expression analysis is used to select the significantly differentially expressed genes, namely the most informative features, which are then fed to the following classification process. Then, we employ the technique of S-fold cross validation to divide the initial data into S groups of training and testing data sets. After that, multiple classifiers (first-stage models) are learned
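The first two steps named above, gene selection followed by S-fold partitioning, can be illustrated with a small sketch. The selection rule used here (an absolute log2 fold change between group means, with a pseudocount) is a simplified stand-in chosen for illustration, since this excerpt does not state which differential expression test was used. The function names, the threshold and the toy expression values are likewise hypothetical.

```python
import math
import random
from statistics import mean


def select_genes(expr, labels, min_abs_log2fc=1.0, pseudocount=1.0):
    """Return indices of genes whose mean expression differs between groups.

    expr: list of samples, each a list of per-gene expression values.
    labels: 1 for tumor, 0 for normal.
    """
    tumor = [row for row, lab in zip(expr, labels) if lab == 1]
    normal = [row for row, lab in zip(expr, labels) if lab == 0]
    selected = []
    for g in range(len(expr[0])):
        # The pseudocount keeps the ratio finite for genes with zero counts.
        tumor_mean = mean(row[g] for row in tumor) + pseudocount
        normal_mean = mean(row[g] for row in normal) + pseudocount
        log2fc = math.log2(tumor_mean / normal_mean)
        if abs(log2fc) >= min_abs_log2fc:
            selected.append(g)
    return selected


def s_fold_splits(n_samples, s, seed=0):
    """Yield (train_indices, test_indices) for each of the S folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into S folds of near-equal size.
    folds = [indices[k::s] for k in range(s)]
    for k in range(s):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


# Toy expression matrix: four samples (two tumor, two normal) by three genes.
expr = [
    [100, 50, 10],
    [120, 55, 12],
    [20, 52, 11],
    [25, 48, 50],
]
labels = [1, 1, 0, 0]
print(select_genes(expr, labels))  # [0, 2]

for train, test in s_fold_splits(n_samples=10, s=5):
    print(train, test)
```

Each sample appears in exactly one test fold, so first-stage predictions collected across the S folds cover the whole data set without any model predicting a sample it was trained on.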

Data collection

We evaluated the proposed method on three RNA-seq data sets of three kinds of cancers, including Lung Adenocarcinoma (LUAD), Stomach Adenocarcinoma (STAD) and Breast Invasive Carcinoma (BRCA). The gene expression data were obtained from the TCGA project web page [19]. These data sets, which include all stages of cancers, were collected from subjects of various clinical conditions and different ages, genders and races. As described in the profile [20], the tumor tissues from patients not treated

Discussion

Based on the results, we observe that the proposed deep learning-based multi-model ensemble method yields satisfactory results that are superior to single classifiers and the majority voting algorithm in cancer prediction. Due to the complexity and high mortality of cancer, timely and accurate diagnosis is critical. Thus, improving the prediction accuracy by applying computer-aided techniques is of great help to cancer treatment.

In the study, we made a comparison between the multi-model

Conclusions

Cancer is a major health problem worldwide. Although machine learning methods have been increasingly widely used in cancer prediction, no single method outperforms all the others. In this paper, we presented a deep learning-based multi-model ensemble approach to the prediction of cancer. Specifically, we analyzed gene expression data obtained from three kinds of tissues: lung, stomach and breast. In order to avoid over-fitting in classification, we identified differentially expressed gene

Conflict of interest

The authors have no financial or personal relationships with other people or organizations that could inappropriately influence (bias) their work.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under grant No. 31270210. The authors would like to thank the reviewers in advance for their comments and suggestions.

References (21)

D. Hanahan et al., Hallmarks of cancer: the next generation, Cell (2011)
K. Kourou et al., Machine learning applications in cancer prognosis and prediction, Comput. Struct. Biotechnol. J. (2015)
D.H. Wolpert, Stacked generalization, Neural Netw. (1992)
About Cancer, 2015, (National Cancer...
GLOBOCAN 2012: Estimated Cancer Incidence, Mortality and Prevalence Worldwide in 2012....
E. Sayed et al., Feature selection for cancer classification: an SVM based approach, Int. J. Comput. Appl. (2012)
A. Statnikov et al., A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification, BMC Bioinform. (2008)
L. Breiman, Bagging predictors, Mach. Learn. (1996)
Y. Freund et al., Experiments with a new boosting algorithm, Proceedings of International Conference on Machine Learning (1996)
A.C. Tan et al., Ensemble machine learning on gene expression data for cancer classification, Appl. Bioinform. (2003)

There are more references available in the full text version of this article.
