An Alzheimer’s disease related genes identification method based on multiple classifier integration

https://doi.org/10.1016/j.cmpb.2017.08.006

Abstract

Background and Objective: Alzheimer’s disease (AD) is a fatal neurodegenerative disease with an insidious onset. The AD-related genes (ADGs) are not yet fully understood. The National Center for Biotechnology Information (NCBI) provides an AD dataset of 22,283 genes, of which 71 have been identified as ADGs. The remaining 22,212 genes may still contain ADGs that have not yet been identified. This paper aims to identify additional ADGs using machine learning techniques.

Methods: To improve the accuracy of ADG identification, we propose a gene identification method through multiple classifier integration. First, a feature selection algorithm is applied to select the most relevant attributes. Second, a two-stage cascading classifier is developed to identify ADGs. The first stage classification task is based on the relevance vector machine and, in the second stage, the results of three classifiers, support vector machine, random forest and extreme learning machine, are combined through voting.

Results: According to our results, feature selection improves accuracy and reduces training time, and the voting-based classifier reduces classification errors. The proposed ADG identification system achieves accuracy, sensitivity and specificity of 78.77%, 83.10% and 74.67%, respectively. Using the proposed method, potential additional ADGs are identified and the top 13 genes (predicted ADGs) are presented.

Conclusions: This paper presents an ADG identification method that combines feature selection, a cascading classifier and majority voting. The proposed method leads to higher specificity and significantly increases the accuracy and sensitivity of ADG identification. Potential new ADGs are identified.

Introduction

Alzheimer’s disease (AD) has been studied intensively over the past decades. AD is a chronic neurodegenerative disease that usually starts slowly and worsens over time [1], [2]. It is the most common cause of dementia in older adults and involves loss of cognitive functions and memory [3], [4]. The cause of most Alzheimer’s cases remains unknown, except that three genes have been firmly implicated in the pathophysiology of early-onset AD (EOAD; onset < 65 years), which accounts for only 1–5% of all cases [5]. About 70% of the risk of Alzheimer’s disease is estimated to be attributable to genetics [6]. The genes associated with AD are called AD-related genes (ADGs).

Several ADGs have been obtained in clinical trials. However, because of the complexity of AD, not all ADGs have been identified [7]. To address this issue, some studies apply methods from mathematics and computer science to mine meaningful genes from large amounts of gene data, providing directions and recommendations for future clinical studies [8], [9], [10].

Gene microarrays [8], [9], [10] provide new tools for addressing the complexity of AD since they allow views of the simultaneous activities of multiple cellular pathways. Recently, advances in gene microarray technologies have enabled biologists to measure the expression levels of many genes simultaneously in one experiment, which provides an opportunity for machine learning methods to be used to extract valuable biological information from these large datasets. Researchers have employed some machine learning approaches to analyze these high-throughput microarray gene databases.

Random forest (RF) is a classification algorithm well suited for microarray data. It shows excellent performance even when most predictive variables are noise. It can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes. It also returns measures of variable importance. Uriarte et al. [8] investigated the use of random forest for classification of microarray data (including multi-class problems) and proposed a new method of gene selection in classification problems based on random forest. The results show that the new gene selection procedure yields very small sets of genes (often smaller than those of alternative methods) while preserving predictive accuracy.

Tang et al. [9] applied gene selection approaches based on the least squares support vector machine (LS-SVM) to two microarray datasets and compared them with other well-known gene selection methods. The results indicate that the proposed gene selection approaches can provide gene subsets leading to more accurate classification results, while their computational complexity is comparable to that of existing methods.

Zhang et al. [10] evaluated the performance of the extreme learning machine (ELM) algorithm on three multi-category microarray gene expression data sets for cancer diagnosis. The results indicate that ELM produces comparable or better classification accuracies with reduced training time and implementation complexity compared to other state-of-the-art methods.

Using AD microarray data, researchers have developed various methods for exploring the genes associated with the disease. For example, Blalock et al. [11] proposed an ADG identification algorithm that combines gene microarray technology, which permits measurement of the expression of many thousands of genes simultaneously, with statistical correlation analyses. Zhang et al. [12] proposed an ADG identification algorithm that combines principal component analysis and K-means clustering to obtain the genes correlated with AD. Eight ADGs were identified, but the identification sensitivity is very low. Yang et al. [13] proposed an ADG identification approach that applies the Fisher score to evaluate how well the expression value of a gene distinguishes normal individuals from AD patients.

Similar to [11], [12], [13], we investigate the ADG identification issue. Specifically, we address the problem of the low sensitivity of ADG identification. In this paper, we propose an ADG identification approach utilizing AD microarray data. First, the AD microarray data are pre-processed for normalization and standardization. Then the feature selection algorithm ReliefF [14] is employed to determine features for distinguishing ADGs from normal genes. Finally, a two-stage cascading classifier is built to identify the ADGs. The first-stage classification task is based on the relevance vector machine (RVM) classifier, and the second-stage classification task is based on the majority voting of SVM, RF and ELM classifiers. We compare the performance of our approach with existing methods. Results show that our proposed method significantly outperforms prior methods in ADG identification.
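The two-stage idea can be illustrated with a minimal sketch. It assumes that a gene rejected by the first stage is labelled as a non-ADG and that only genes accepted by the first stage reach the second-stage vote; this gating rule is an assumption, since the text here does not specify how the stages are connected. The names `majority_vote`, `cascade_predict` and the rule-based stand-in classifiers are hypothetical placeholders for trained RVM, SVM, RF and ELM models.

```python
from typing import Callable, Sequence

# A classifier maps a feature vector to 1 (ADG) or 0 (non-ADG).
Classifier = Callable[[Sequence[float]], int]


def majority_vote(classifiers: Sequence[Classifier], x: Sequence[float]) -> int:
    """Return 1 if more than half of the classifiers predict 1, else 0."""
    votes = [clf(x) for clf in classifiers]
    return 1 if 2 * sum(votes) > len(votes) else 0


def cascade_predict(stage1: Classifier,
                    stage2: Sequence[Classifier],
                    x: Sequence[float]) -> int:
    """Two-stage cascade with an assumed gating rule (not taken from the paper)."""
    if stage1(x) == 0:
        return 0  # rejected by the first stage
    return majority_vote(stage2, x)  # confirmed or rejected by the vote


# Hypothetical stand-ins for trained models: simple rules on the feature values.
def mean_rule(threshold: float) -> Classifier:
    return lambda x: 1 if sum(x) / len(x) > threshold else 0


rvm_like = mean_rule(0.0)                      # placeholder for the RVM stage
svm_like = mean_rule(0.5)                      # placeholder for SVM
rf_like = lambda x: 1 if max(x) > 1.0 else 0   # placeholder for RF
elm_like = lambda x: 1 if x[0] > 0.0 else 0    # placeholder for ELM

gene = [0.9, 1.4, 0.2]  # illustrative feature vector for one gene
label = cascade_predict(rvm_like, [svm_like, rf_like, elm_like], gene)
```

With three binary voters a tie cannot occur, which is one practical reason to combine an odd number of classifiers.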

To summarize, in this paper we have the following contributions:

  • (1) Supervised learning based on known ADGs.

  • (2) Use of feature selection to improve identification performance.

  • (3) Development of a two-stage cascaded classifier to improve identification performance.

The rest of this paper is organized as follows. Section 2 presents details of the methods employed in this study. Section 3 gives the experimental results on a dataset of 31 patients reflecting different levels of AD (Control 9, Incipient 7, Moderate 8, Severe 7). Section 4 presents the discussion and analysis. Finally, Section 5 concludes the paper.

Methods

In our proposed ADG identification algorithm, there are three main stages: pre-processing, feature selection and classification. The flowchart of the algorithm is shown in Fig. 1.
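To give a feel for the feature selection stage, the following sketch implements basic two-class Relief, the simpler predecessor of ReliefF. It is not the ReliefF variant used in the paper (ReliefF averages over k nearest neighbours and handles multiple classes and missing values). The function name `relief_weights`, its parameters and the toy data are illustrative assumptions.

```python
import random
from typing import List, Sequence


def relief_weights(X: Sequence[Sequence[float]],
                   y: Sequence[int],
                   n_iter: int = 100,
                   seed: int = 0) -> List[float]:
    """Basic two-class Relief feature weights (higher means more relevant)."""
    n_samples, n_features = len(X), len(X[0])

    # Range of each feature, used to scale differences into [0, 1].
    spans = []
    for j in range(n_features):
        col = [row[j] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def diff(j: int, a: Sequence[float], b: Sequence[float]) -> float:
        return abs(a[j] - b[j]) / spans[j]

    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(diff(j, a, b) for j in range(n_features))

    rng = random.Random(seed)
    weights = [0.0] * n_features
    for _ in range(n_iter):
        i = rng.randrange(n_samples)
        hits = [k for k in range(n_samples) if k != i and y[k] == y[i]]
        misses = [k for k in range(n_samples) if y[k] != y[i]]
        if not hits or not misses:
            continue
        # Nearest neighbour of the same class and of the other class.
        hit = min(hits, key=lambda k: distance(X[i], X[k]))
        miss = min(misses, key=lambda k: distance(X[i], X[k]))
        # Reward features that differ across classes, penalise those
        # that differ within a class.
        for j in range(n_features):
            weights[j] += (diff(j, X[i], X[miss]) - diff(j, X[i], X[hit])) / n_iter
    return weights


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y, n_iter=50)
```

On this toy data the weight of feature 0 exceeds that of feature 1, so ranking by weight and keeping the top features would retain the informative one.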

Experiment data

The National Center for Biotechnology Information (NCBI) [11] provides a dataset including 22,283 genes acquired from 9 normal patients, 7 incipient patients, 8 moderate patients, and 7 severe patients. The blood of each of the 31 patients was dripped onto a gene chip under the same test environment, which produced a data matrix consisting of 22,283 rows and 31 columns. Each column corresponds to one gene chip (one patient). In other words, all the 22,283 gene data were integrated into

Discussions

In contrast to clustering-based methods (e.g., [12], [25]), our approach uses supervised learning algorithms to identify ADGs. In [12], 8 ADGs were identified and another 30 candidate genes were predicted to be related to AD. However, only 11.26% of ADGs were identified correctly [12]. The results in Table 3 show that our procedure significantly improves the accuracy, sensitivity and specificity of ADG identification.
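For reference, the three reported measures follow from the confusion matrix of a binary classifier. The sketch below uses illustrative counts that are an assumption, chosen only because they reproduce the percentages reported in the abstract; the actual confusion matrix is not given in this text.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)              # share of true ADGs that are found
    specificity = tn / (tn + fp)              # share of non-ADGs correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


# Hypothetical counts (not taken from the paper): 71 positives, 75 negatives.
acc, sens, spec = classification_metrics(tp=59, fn=12, tn=56, fp=19)
print(f"accuracy={acc:.2%} sensitivity={sens:.2%} specificity={spec:.2%}")
# prints: accuracy=78.77% sensitivity=83.10% specificity=74.67%
```

Sensitivity is the measure most relevant to the stated goal, since it counts how many known ADGs the method recovers.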

Recently, several approaches (e.g., [8], [9], [10], [13]) that select the

Conclusions

In this paper, an ADG identification method is presented. A feature selection algorithm, ReliefF, is utilized to select the most relevant features. A two-stage cascading classifier is trained to identify and predict ADGs. The results of this study demonstrate that the proposed ADG identification method, which combines feature selection, a cascading classifier and majority voting, has higher sensitivity and specificity and significantly increases the accuracy of ADG identification.

References (25)

  • E.K. Tang et al. Gene selection algorithms for microarray data based on least squares support vector machine. BMC Bioinform. (2006)

  • R.X. Zhang et al. Multi-category classification using an extreme learning machine for microarray gene expression cancer diagnosis. IEEE/ACM Trans. Comput. Biol. Bioinf. (2007)