A deep learning-based multi-model ensemble method for cancer prediction
Introduction
Cancer has been characterized as a collection of related diseases involving abnormal cell growth, in which cells can divide without stopping and spread into surrounding tissues [1]. According to the GLOBOCAN project [2], about 14.1 million new cases of cancer occurred globally in 2012 alone (not including skin cancer other than melanoma), and cancer accounted for about 14.6% of all deaths. Since cancer is a major cause of morbidity and mortality, diagnosing and detecting it at an early stage is of great importance for its cure. Cancer research has evolved continuously over the past decades [3]. Among the diverse methods and techniques developed for cancer prediction, the use of gene expression levels is one of the most active research topics in this field. Analysis of gene expression data has facilitated cancer diagnosis and treatment to a great extent. Accurate prediction of cancer is one of the most critical and urgent tasks for physicians [4].
With the rapid development of computer-aided techniques in recent years, machine learning methods are playing an increasingly important role in cancer diagnosis, and researchers continue to explore various prediction algorithms. Sayed et al. [5] conducted a comparative study on feature selection and classification using data collected from the central database of the National Cancer Registry Program of Egypt. Three classifiers were applied: support vector machines (SVMs), k-nearest neighbour (kNN) and Naive Bayes (NBs). The results showed that SVMs with polynomial kernel functions yielded higher classification accuracy than kNN and NBs. Statnikov et al. [6] carried out a comprehensive comparison of random forests (RFs) and SVMs for cancer diagnosis. They found that SVMs outperformed RFs on fifteen data sets, RFs outperformed SVMs on four data sets, and the two algorithms performed the same on three data sets. These results were obtained using the full set of genes; similar results were obtained when gene selection was applied. The large body of literature on cancer prediction shows that none of these machine learning methods is fully accurate, and each may fall short in a different aspect of the classification procedure. For instance, it is difficult to choose an appropriate kernel function for SVMs, and although RFs address the over-fitting of decision trees (DTs), they may bias the classification result toward the category with more samples.
Since each machine learning method may outperform others or show defects in different cases, it is natural to expect that a method that takes advantage of multiple machine learning methods would lead to superior performance. To this end, several studies in the literature aim to integrate models to increase prediction accuracy. For example, Breiman [7] introduced Bagging, which combines outputs from decision trees generated from several randomly selected subsets of the training data and votes for the final outcome. Freund and Schapire [8] introduced Boosting, which updates the weights of training samples after each iteration of training and combines the classification outputs by weighted votes. Wolpert [9] proposed using linear regression to combine the outputs of neural networks, an approach later known as Stacking. Tan and Gilbert [10] applied Bagging and Boosting to cancerous microarray data for cancer classification. Cho and Won [11] applied the majority voting algorithm to combine four classifiers using three benchmark cancer data sets. Both Stacking and majority voting take advantage of different machine learning methods. Although majority voting is the most common combination strategy in classification tasks, it is too simple to discover complex information from different classifiers. Stacking, which uses a learning method in the combination stage, is a much more powerful ensemble technique. Although only a small number of deep learning studies have been reported in biomedicine, they have shown success [12], and deep learning has become a strong learning method with many advantages. Unlike majority voting, which considers only linear relationships among classifiers and requires manual participation, deep learning can automatically “learn” intricate structures, especially nonlinear ones, from large original data sets. Thus, in order to better describe the unknown relationships among different classifiers, we adopt deep learning in the Stacking-based ensemble learning of multiple classifiers.
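To make the contrast concrete, the following is a minimal Python sketch of plain majority voting over the label outputs of several classifiers. The function name `majority_vote` and the toy predictions are illustrative assumptions and do not come from the study. Every classifier receives the same fixed weight, so the rule cannot learn that one classifier is more reliable than another; a Stacking combiner replaces this fixed rule with a trained model.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one voted label per sample.

    predictions: list of lists, one inner list of labels per classifier,
    all of the same length (one label per sample).
    """
    voted = []
    # zip(*predictions) regroups the labels sample by sample.
    for sample_votes in zip(*predictions):
        # most_common(1) returns the label with the highest count; with an
        # odd number of binary classifiers there can be no tie.
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted


# Hypothetical outputs of five classifiers on three samples
# (1 = tumor, 0 = normal); these numbers are made up for illustration.
toy_predictions = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
print(majority_vote(toy_predictions))  # prints [1, 0, 1]
```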
In this paper, we use deep neural networks to ensemble five classification models, namely kNN, SVMs, DTs, RFs and gradient boosting decision trees (GBDTs), and thereby construct a multi-model ensemble that predicts cancer by distinguishing tumor from normal conditions. To avoid over-fitting, we employ differential gene expression analysis to select important and informative genes. The selected genes are then supplied to the five classification models. After that, a deep neural network is used to ensemble the outputs of the five classification models and obtain the final prediction result. We evaluate the proposed method on three public RNA-seq data sets from lung, stomach and breast tissues, respectively. The final results indicate that the proposed deep learning-based multi-model ensemble method makes more effective use of the limited clinical data and generates more accurate predictions than single classifiers or the majority voting algorithm.
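As a rough illustration of the gene-selection step, the sketch below keeps genes whose mean expression differs strongly between tumor and normal samples. It is a simplified stand-in: it applies only a log2 fold-change filter, whereas a full differential expression analysis would also test statistical significance. The function name, the threshold of 1.0, the pseudocount and the toy gene names are all assumptions made for this example, not details from the study.

```python
import math


def select_genes_by_fold_change(expr, labels, min_abs_log2fc=1.0, pseudocount=1.0):
    """Return {gene: log2 fold change} for genes passing the threshold.

    expr:   dict mapping gene name -> list of expression values, one per sample
    labels: list of sample labels, 1 = tumor, 0 = normal
    """
    selected = {}
    for gene, values in expr.items():
        tumor = [v for v, label in zip(values, labels) if label == 1]
        normal = [v for v, label in zip(values, labels) if label == 0]
        mean_tumor = sum(tumor) / len(tumor)
        mean_normal = sum(normal) / len(normal)
        # The pseudocount avoids division by zero for unexpressed genes.
        log2fc = math.log2((mean_tumor + pseudocount) / (mean_normal + pseudocount))
        if abs(log2fc) >= min_abs_log2fc:
            selected[gene] = log2fc
    return selected


# Toy expression table with hypothetical gene names (two tumor, two normal samples).
toy_labels = [1, 1, 0, 0]
toy_expr = {
    "GENE_A": [7.0, 7.0, 1.0, 1.0],  # higher in tumor
    "GENE_B": [3.0, 3.0, 3.0, 3.0],  # unchanged
    "GENE_C": [0.0, 0.0, 3.0, 3.0],  # lower in tumor
}
selected = select_genes_by_fold_change(toy_expr, toy_labels)
print(selected)  # prints {'GENE_A': 2.0, 'GENE_C': -2.0}
```

Only the selected genes would then be passed on as features to the first-stage classifiers, which reduces the number of inputs relative to the number of samples.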
Methods
The flowchart of the proposed deep learning-based ensemble strategy is shown in Fig. 1. Initially, differential expression analysis is used to select the significantly differentially expressed genes, namely the most informative features, which are then fed to the following classification process. Then, we employ the technique of S-fold cross validation to divide the initial data into S groups of training and testing data sets. After that, multiple classifiers (first-stage models) are learned
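One common way to prepare first-stage outputs for a Stacking combiner is to collect out-of-fold predictions, so that each sample is predicted only by models that never saw it during training. The sketch below illustrates this under stated assumptions: `s_fold_indices`, `ThresholdStump` and `out_of_fold_predictions` are hypothetical names, the stump is a toy stand-in for real first-stage classifiers such as kNN or SVMs, and the second-stage deep neural network is not shown.

```python
import random


def s_fold_indices(n_samples, s, seed=0):
    """Shuffle the sample indices and deal them into s disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::s] for i in range(s)]


class ThresholdStump:
    """Toy first-stage model (a stand-in for kNN, SVMs, etc.).

    It thresholds a single feature at the midpoint of the two class means.
    """

    def __init__(self, feature):
        self.feature = feature
        self.threshold = 0.0
        self.sign = 1

    def fit(self, X, y):
        pos = [row[self.feature] for row, label in zip(X, y) if label == 1]
        neg = [row[self.feature] for row, label in zip(X, y) if label == 0]
        mean_pos = sum(pos) / len(pos)
        mean_neg = sum(neg) / len(neg)
        self.threshold = (mean_pos + mean_neg) / 2
        # Predict 1 on whichever side of the threshold the positive mean lies.
        self.sign = 1 if mean_pos >= mean_neg else -1
        return self

    def predict(self, X):
        return [1 if self.sign * (row[self.feature] - self.threshold) > 0 else 0
                for row in X]


def out_of_fold_predictions(model_factories, X, y, s=5, seed=0):
    """Return an n_samples x n_models table of held-out predictions."""
    n = len(X)
    meta = [[None] * len(model_factories) for _ in range(n)]
    for test_idx in s_fold_indices(n, s, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_test = [X[i] for i in test_idx]
        for j, make_model in enumerate(model_factories):
            # A fresh model is trained for every fold and every model type.
            model = make_model().fit(X_train, y_train)
            for i, pred in zip(test_idx, model.predict(X_test)):
                meta[i][j] = pred
    return meta


# Toy data: feature 0 separates the classes, feature 1 is uninformative.
y = [k % 2 for k in range(20)]
X = [[4.0 * y[k] + 0.01 * k, float(k % 3)] for k in range(20)]
factories = [lambda: ThresholdStump(0), lambda: ThresholdStump(1)]
meta_features = out_of_fold_predictions(factories, X, y, s=5)
print(meta_features[:4])
```

Each row of `meta_features` holds one prediction per first-stage model for one sample; in a Stacking setup, rows like these, paired with the true labels, would form the training set of the second-stage combiner.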
Data collection
We evaluated the proposed method on RNA-seq data sets for three kinds of cancer: Lung Adenocarcinoma (LUAD), Stomach Adenocarcinoma (STAD) and Breast Invasive Carcinoma (BRCA). The gene expression data were obtained from the TCGA project web page [19]. These data sets, which include all stages of cancer, were collected from subjects with various clinical conditions and of different ages, genders and races. As described in the profile [20], the tumor tissues from patients not treated
Discussion
Based on the results, we observe that the proposed deep learning-based multi-model ensemble method yields satisfactory results that are superior to single classifiers and the majority voting algorithm in cancer prediction. Due to the complexity and high mortality of cancer, timely and accurate diagnosis is critical. Thus, improving the prediction accuracy by applying computer-aided techniques is of great help to cancer treatment.
In the study, we made a comparison between the multi-model
Conclusions
Cancer is a major health problem worldwide. Although machine learning methods are increasingly used in cancer prediction, no single method outperforms all the others. In this paper, we presented a deep learning-based multi-model ensemble approach to the prediction of cancer. Specifically, we analyzed gene expression data obtained from three kinds of tissues: lung, stomach and breast. In order to avoid over-fitting in classification, we identified differentially expressed gene
Conflict of interest
The authors have no financial or personal relationships with other people or organizations that could inappropriately influence (bias) their work.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under grant No. 31270210. The authors would like to thank the reviewers in advance for their comments and suggestions.
References (21)
- et al., Hallmarks of cancer: the next generation, Cell (2011)
- et al., Machine learning applications in cancer prognosis and prediction, Comput. Struct. Biotechnol. J. (2015)
- Stacked generalization, Neural Netw. (1992)
- About Cancer, 2015, (National Cancer...
- GLOBOCAN 2012: Estimated Cancer Incidence, Mortality and Prevalence Worldwide in 2012....
- et al., Feature selection for cancer classification: an SVM based approach, Int. J. Comput. Appl. (2012)
- et al., A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification, BMC Bioinform. (2008)
- Bagging predictors, Mach. Learn. (1996)
- et al., Experiments with a new boosting algorithm, Proceedings of International Conference on Machine Learning (1996)
- et al., Ensemble machine learning on gene expression data for cancer classification, Appl. Bioinform. (2003)