A deep learning-based multi-model ensemble method for cancer prediction

https://doi.org/10.1016/j.cmpb.2017.09.005

Highlights

  • An ensemble of multiple machine learning models outperforms single classifiers.

  • We propose a deep learning-based ensemble method for cancer prediction.

  • We select differentially expressed genes from gene expression data.

  • We present prediction results on lung, stomach and breast cancer data.

Abstract

Background and Objective: Cancer is a complex worldwide health problem associated with high mortality. With the rapid development of high-throughput sequencing technology and the emergence of various machine learning methods in recent years, cancer prediction based on gene expression has made steady progress, providing insight into effective and accurate treatment decision making. Thus, developing machine learning methods that can successfully distinguish cancer patients from healthy persons is of great current interest. However, among the classification methods applied to cancer prediction so far, no single method outperforms all the others.

Methods: In this paper, we demonstrate a new strategy, which applies deep learning to an ensemble approach that incorporates multiple different machine learning models. We supply informative gene data selected by differential gene expression analysis to five different classification models. Then, a deep learning method is employed to ensemble the outputs of the five classifiers.

Results: The proposed deep learning-based multi-model ensemble method was tested on three public RNA-seq data sets covering three kinds of cancer: Lung Adenocarcinoma, Stomach Adenocarcinoma and Breast Invasive Carcinoma. The test results indicate that it achieves higher prediction accuracy on all the tested RNA-seq data sets than a single classifier or the majority voting algorithm.

Conclusions: By taking full advantage of different classifiers, the proposed deep learning-based multi-model ensemble method is shown to be accurate and effective for cancer prediction.

Introduction

Cancer has been characterized as a collection of related diseases involving abnormal cell growth with the potential to divide without stopping and spread into surrounding tissues [1]. According to the GLOBOCAN project [2], about 14.1 million new cases of cancer occurred globally in 2012 alone (not including skin cancer other than melanoma), and cancer caused about 14.6% of deaths. Since cancer is a major cause of morbidity and mortality, diagnosing and detecting it at an early stage is of great importance for its cure. Over the past decades, cancer research has evolved continuously [3]. Among the diverse methods and techniques developed for cancer prediction, the use of gene expression levels is one of the research hotspots in this field. Analysis of gene expression data has greatly facilitated cancer diagnosis and treatment. Accurate prediction of cancer is one of the most critical and urgent tasks for physicians [4].

With the rapid development of computer-aided techniques in recent years, machine learning methods are playing an increasingly important role in cancer diagnosis, and researchers continue to explore various prediction algorithms. Sayed et al. [5] conducted a comparative study on feature selection and classification using data collected from the central database of the National Cancer Registry Program of Egypt, applying three classifiers: support vector machines (SVMs), k-nearest neighbour (kNN) and Naive Bayes (NB). The results showed that SVMs with polynomial kernel functions yielded higher classification accuracy than kNN and NB. Statnikov et al. [6] carried out a comprehensive comparison of random forests (RFs) and SVMs for cancer diagnosis. They found that SVMs outperformed RFs in fifteen data sets, RFs outperformed SVMs in four data sets, and the two algorithms performed the same in three data sets. These results were obtained using the full set of genes, and similar results were obtained when gene selection was applied. The large body of literature on cancer prediction shows that none of these machine learning methods is fully accurate, and each may fall short in different aspects of the classification procedure. For instance, it is difficult to choose an appropriate kernel function for SVMs, and although RFs address the over-fitting of decision trees (DTs), they may bias the classification result toward the category with more samples.

Since each machine learning method may outperform the others or show defects in different cases, it is natural to expect that a method that takes advantage of multiple machine learning methods would lead to superior performance. To this end, several studies in the literature aim to integrate models to increase prediction accuracy. For example, Breiman [7] introduced Bagging, which combines outputs from decision trees generated by several randomly selected subsets of the training data and votes for the final outcome. Freund and Schapire [8] introduced Boosting, which updates the weights of training samples after each iteration of training and combines the classification outputs by weighted votes. Wolpert [9] proposed using linear regression to combine the outputs of neural networks, an approach later known as Stacking. Tan and Gilbert [10] applied Bagging and Boosting to cancerous microarray data for cancer classification. Cho and Won [11] applied the majority voting algorithm to combine four classifiers using three benchmark cancer data sets. Stacking and majority voting both take advantage of different machine learning methods. Although the majority voting algorithm is the most common in classification tasks, it is too simple a combination strategy to discover complex information from different classifiers. Stacking, which uses a learning method in the combination stage, is a much more powerful ensemble technique. The small number of deep learning studies in biomedicine to date have shown success [12], and deep learning has become a strong learning method with many advantages. Unlike majority voting, which considers only linear relationships among classifiers and requires manual intervention, deep learning can automatically "learn" intricate structures, especially nonlinear ones, from large original data sets. Thus, in order to better describe the unknown relationships among different classifiers, we adopt deep learning in the Stacking-based ensemble learning of multiple classifiers.
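
As a point of reference for the combination strategies discussed above, the majority voting baseline can be sketched in a few lines of Python. This is a minimal illustration only: the function name and the 0/1 votes (0 = normal, 1 = tumor) are made up for the example and do not come from the paper.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of base classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical 0/1 outputs (0 = normal, 1 = tumor) of five base classifiers
# for three samples; the numbers are invented for illustration.
base_outputs = [
    [1, 1, 0, 1, 0],  # three of five vote "tumor"
    [0, 0, 1, 0, 0],  # four of five vote "normal"
    [0, 1, 0, 0, 1],  # two of five vote "tumor"
]

voted = [majority_vote(row) for row in base_outputs]
print(voted)  # → [1, 0, 0]
```

Every classifier carries the same fixed weight here, whatever its reliability on a given kind of sample. Stacking replaces this fixed rule with a combiner that is itself learned from the classifiers' outputs.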

In this paper, we use deep neural networks to ensemble five classification models, namely kNN, SVMs, DTs, RFs and gradient boosting decision trees (GBDTs), and thereby construct a multi-model ensemble that distinguishes normal from tumor samples. To avoid over-fitting, we employ differential gene expression analysis to select important and informative genes. The selected genes are then supplied to the five classification models. After that, a deep neural network is used to ensemble the outputs of the five classification models to obtain the final prediction result. We evaluate the proposed method on three public RNA-seq data sets from lung, stomach and breast tissues, respectively. The final results indicate that the proposed deep learning-based multi-model ensemble method makes more effective use of the information in the limited clinical data and generates more accurate predictions than single classifiers or the majority voting algorithm.
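
The second stage of such a pipeline can be illustrated with a small sketch in plain Python: a neural network with one hidden layer that takes the five first-stage outputs for a sample and returns a combined tumor probability. Everything specific here is an assumption made for illustration, not the authors' configuration: the architecture (four tanh units, one sigmoid output), the learning rate, the number of epochs, and the made-up probabilities standing in for the outputs of kNN, SVM, DT, RF and GBDT.

```python
import math
import random

def forward(params, x):
    """One hidden tanh layer followed by a single sigmoid output unit."""
    W1, b1, w2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    z = sum(w * hj for w, hj in zip(w2, h)) + b2
    return h, 1.0 / (1.0 + math.exp(-z))

def mean_loss(params, X, y):
    """Average binary cross-entropy over the samples."""
    total = 0.0
    for x, t in zip(X, y):
        _, p = forward(params, x)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(X)

def train(X, y, hidden=4, lr=0.2, epochs=2000, seed=0):
    """Full-batch gradient descent; hyperparameters are illustrative guesses."""
    rng = random.Random(seed)
    n_in, n = len(X[0]), len(X)
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    for _ in range(epochs):
        gW1 = [[0.0] * n_in for _ in range(hidden)]
        gb1, gw2, gb2 = [0.0] * hidden, [0.0] * hidden, 0.0
        for x, t in zip(X, y):
            h, p = forward((W1, b1, w2, b2), x)
            d_out = p - t  # d(loss)/d(output pre-activation)
            gb2 += d_out
            for j in range(hidden):
                gw2[j] += d_out * h[j]
                d_h = d_out * w2[j] * (1.0 - h[j] ** 2)  # back through tanh
                gb1[j] += d_h
                for i in range(n_in):
                    gW1[j][i] += d_h * x[i]
        for j in range(hidden):
            w2[j] -= lr * gw2[j] / n
            b1[j] -= lr * gb1[j] / n
            for i in range(n_in):
                W1[j][i] -= lr * gW1[j][i] / n
        b2 -= lr * gb2 / n
    return W1, b1, w2, b2

# Hypothetical first-stage outputs: each row holds the tumor probabilities
# reported by five base classifiers for one sample (made-up numbers).
X = [
    [0.9, 0.8, 0.7, 0.9, 0.8], [0.8, 0.4, 0.6, 0.9, 0.7],
    [0.6, 0.3, 0.4, 0.8, 0.9], [0.7, 0.9, 0.3, 0.6, 0.8],
    [0.1, 0.2, 0.3, 0.1, 0.2], [0.3, 0.6, 0.4, 0.2, 0.1],
    [0.4, 0.7, 0.6, 0.3, 0.2], [0.2, 0.1, 0.5, 0.4, 0.3],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

initial = train(X, y, epochs=0)  # same seed, so these are the starting weights
trained = train(X, y)
loss_before = mean_loss(initial, X, y)
loss_after = mean_loss(trained, X, y)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")

_, p_new = forward(trained, [0.7, 0.2, 0.5, 0.9, 0.8])
print(f"stacked tumor probability for a new sample: {p_new:.3f}")
```

In a real setting the rows of X would be predictions that the first-stage models made on samples they were not trained on, so that the combiner does not simply learn to trust over-fitted outputs.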

Methods

The flowchart of the proposed deep learning-based ensemble strategy is shown in Fig. 1. Initially, differential expression analysis is used to select the significantly differentially expressed genes, namely the most informative features, which are then fed to the following classification process. Then, we employ the technique of S-fold cross validation to divide the initial data into S groups of training and testing data sets. After that, multiple classifiers (first-stage models) are learned
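
The first two steps named above, gene selection followed by S-fold splitting, can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' procedure: the excerpt does not say which differential expression test is used, so Welch's t statistic serves as a simple stand-in, and the gene names, expression values and function names are hypothetical.

```python
import math
import random
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two groups of expression values."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def top_genes(expr, labels, k):
    """Rank genes by |t| between tumor (1) and normal (0) samples; keep the top k."""
    scores = {}
    for gene, values in expr.items():
        tumor = [v for v, lab in zip(values, labels) if lab == 1]
        normal = [v for v, lab in zip(values, labels) if lab == 0]
        scores[gene] = abs(welch_t(tumor, normal))
    return sorted(scores, key=scores.get, reverse=True)[:k]

def s_fold_splits(n_samples, s, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::s] for k in range(s)]
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

# Made-up expression values for ten samples: five tumor, then five normal.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
expr = {
    "geneA": [8.1, 7.9, 8.3, 8.0, 8.2, 5.0, 5.2, 4.9, 5.1, 4.8],  # strongly different
    "geneB": [3.0, 3.2, 2.9, 3.1, 3.0, 3.1, 2.9, 3.0, 3.2, 3.0],  # no difference
    "geneC": [6.0, 6.5, 5.8, 6.2, 6.1, 5.5, 5.9, 5.6, 5.7, 5.4],  # moderately different
}

selected = top_genes(expr, labels, k=2)
print(selected)  # → ['geneA', 'geneC']

splits = list(s_fold_splits(n_samples=len(labels), s=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

Because every sample lands in exactly one test fold, collecting the first-stage predictions made on each test fold gives one held-out prediction per sample.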

Data collection

We evaluated the proposed method on three RNA-seq data sets of three kinds of cancers, including Lung Adenocarcinoma (LUAD), Stomach Adenocarcinoma (STAD) and Breast Invasive Carcinoma (BRCA). The gene expression data were obtained from the TCGA project web page [19]. These data sets, which include all stages of cancers, were collected from subjects of various clinical conditions and different ages, genders and races. As described in the profile [20], the tumor tissues from patients not treated

Discussion

Based on the results, we observe that the proposed deep learning-based multi-model ensemble method yields satisfactory results that are superior to single classifiers and the majority voting algorithm in cancer prediction. Due to the complexity and high mortality of cancer, timely and accurate diagnosis is critical. Thus, improving the prediction accuracy by applying computer-aided techniques is of great help to cancer treatment.

In the study, we made a comparison between the multi-model

Conclusions

Cancer is a major health problem worldwide. Although machine learning methods have been increasingly widely used in cancer prediction, no single method outperforms all the others. In this paper, we presented a deep learning-based multi-model ensemble approach to the prediction of cancer. Specifically, we analyzed gene expression data obtained from three kinds of tissues: lung, stomach and breast. In order to avoid over-fitting in classification, we identified differentially expressed gene

Conflict of interest

The authors have no financial or personal relationships with other people or organizations that could inappropriately influence (bias) their work.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under grant No. 31270210. The authors would like to thank the reviewers in advance for their comments and suggestions.

References (21)

  • D. Hanahan et al. Hallmarks of cancer: the next generation. Cell (2011)
  • K. Kourou et al. Machine learning applications in cancer prognosis and prediction. Comput. Struct. Biotechnol. J. (2015)
  • D.H. Wolpert. Stacked generalization. Neural Netw. (1992)
  • About Cancer, 2015, (National Cancer...
  • GLOBOCAN 2012: Estimated Cancer Incidence, Mortality and Prevalence Worldwide in 2012....
  • E. Sayed et al. Feature selection for cancer classification: an SVM based approach. Int. J. Comput. Appl. (2012)
  • A. Statnikov et al. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinform. (2008)
  • L. Breiman. Bagging predictors. Mach. Learn. (1996)
  • Y. Freund et al. Experiments with a new boosting algorithm. Proceedings of International Conference on Machine Learning (1996)
  • A.C. Tan et al. Ensemble machine learning on gene expression data for cancer classification. Appl. Bioinform. (2003)
There are more references available in the full text version of this article.
