An Alzheimer's disease-related gene identification method based on multiple classifier integration
Introduction
Alzheimer’s disease (AD) has been intensely studied during the past decades. AD is a chronic neurodegenerative disease that usually starts slowly and worsens over time [1], [2]. It is the most common cause of dementia in older adults, marked by loss of cognitive function and memory [3], [4]. The cause of most Alzheimer’s cases remains unknown, although three genes have been firmly implicated in the pathophysiology of early-onset AD (EOAD; onset < 65 years), which accounts for only 1–5% of all cases [5]. The proportion of Alzheimer’s disease risk attributable to genetics is estimated at around 70% [6]. Genes correlated with the disease are called AD-related genes (ADGs).
Several ADGs have been identified in clinical trials. However, AD research has been unable to identify all the ADGs because of the complexity of the disease [7]. To address this issue, researchers have attempted to use mathematical and computational methods to find ADGs, providing directions and recommendations for future clinical studies by mining meaningful genes from large amounts of gene data [8], [9], [10].
Gene microarrays [8], [9], [10] provide new tools for addressing the complexity of AD since they allow views of the simultaneous activities of multiple cellular pathways. Recently, advances in gene microarray technologies have enabled biologists to measure the expression levels of many genes simultaneously in one experiment, which provides an opportunity for machine learning methods to be used to extract valuable biological information from these large datasets. Researchers have employed some machine learning approaches to analyze these high-throughput microarray gene databases.
Random forest (RF) is a classification algorithm well suited to microarray data. It performs well even when most predictive variables are noise, can be applied when the number of variables is much larger than the number of observations, handles problems involving more than two classes, and returns measures of variable importance. Díaz-Uriarte et al. [8] investigated the use of random forest for the classification of microarray data (including multi-class problems) and proposed a new method of gene selection in classification problems based on random forest. Their results show that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.
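The variable-importance ranking described above can be sketched as follows with scikit-learn; the data here are toy values (not the actual microarray set), and the selection rule is a simplified stand-in for the procedure in [8].

```python
# Sketch of random-forest gene ranking in the spirit of [8].
# Toy data: 31 samples x 200 genes; the first 5 genes are informative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_genes = 31, 200
X = rng.normal(size=(n_samples, n_genes))
y = rng.integers(0, 2, size=n_samples)   # 0 = control, 1 = AD
X[y == 1, :5] += 2.0                     # shift the informative genes

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
# Rank genes by importance; keep a small candidate subset.
ranked = np.argsort(rf.feature_importances_)[::-1]
top_genes = ranked[:10]
```

In the full procedure of [8], the forest is refit iteratively while the least important genes are dropped; the one-shot ranking above illustrates only the importance measure that drives that loop.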
Tang et al. [9] applied the least squares support vector machine (LS-SVM) to two microarray datasets and compared with other well-known gene selection methods. The results indicate that the proposed gene selection approaches can provide gene subsets leading to more accurate classification results, while their computational complexity is comparable to existing methods.
Zhang et al. [10] evaluated the performance of the extreme learning machine (ELM) algorithm for three multi-category microarray gene expression data sets for cancer diagnosis. The results indicate that ELM produces comparable or better classification accuracies with reduced training time and implementation complexity compared to other state-of-art methods.
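A minimal single-hidden-layer ELM can be written in a few lines of NumPy: the input weights and biases are random, and only the output weights are solved, by least squares. This is a generic sketch of the algorithm evaluated in [10], with toy data rather than the cancer microarray sets.

```python
# Minimal ELM sketch: random hidden layer, output weights via pseudoinverse.
import numpy as np

def elm_train(X, y, n_hidden=50, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights
    b = rng.normal(size=n_hidden)                # random biases
    H = np.tanh(X @ W + b)                       # hidden-layer outputs
    T = np.eye(y.max() + 1)[y]                   # one-hot targets
    beta = np.linalg.pinv(H) @ T                 # least-squares solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)

# Toy two-class data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (40, 5)), rng.normal(1, 1, (40, 5))])
y = np.array([0] * 40 + [1] * 40)
W, b, beta = elm_train(X, y)
acc = (elm_predict(X, W, b, beta) == y).mean()
```

Because training reduces to one pseudoinverse, there is no iterative weight tuning, which is what gives ELM its speed advantage reported in [10].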
Using AD microarray data, researchers have developed various methods for exploring the genes associated with the disease. For example, Blalock et al. [11] proposed an ADG identification algorithm combining gene microarray technology, which permits measurement of the expression of many thousands of genes simultaneously, with statistical correlation analyses. Zhang et al. [12] proposed an ADG identification algorithm that combines principal component analysis and K-means clustering to obtain genes correlated with AD; eight ADGs were identified, but the identification sensitivity was very low. Yang et al. [13] proposed an ADG identification approach that applies the Fisher score to evaluate how well a gene's expression value distinguishes normal individuals from AD patients.
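The Fisher score used in [13] rates each gene by the ratio of between-class to within-class variance of its expression. The following is an illustrative sketch on toy data, not the authors' implementation:

```python
# Fisher-score ranking sketch in the spirit of [13]: score each gene by
# between-class vs. within-class variance of its expression values.
import numpy as np

def fisher_score(X, y):
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mu) ** 2  # between-class
        den += len(Xc) * Xc.var(axis=0)               # within-class
    return num / den

# Toy data: 31 samples x 100 genes; gene 0 separates the classes.
rng = np.random.default_rng(2)
X = rng.normal(size=(31, 100))
y = np.array([0] * 16 + [1] * 15)
X[y == 1, 0] += 3.0
scores = fisher_score(X, y)
best = int(np.argmax(scores))
```

Genes with high scores differ strongly between groups relative to their within-group spread, making them good candidates for distinguishing AD patients from controls.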
Similar to [11], [12], [13], we investigate the ADG identification problem, specifically addressing the low sensitivity of ADG identification. In this paper, we propose an ADG identification approach utilizing AD microarray data. First, the AD microarray data are pre-processed for normalization and standardization. Then a feature selection algorithm, ReliefF [14], is employed to determine features that distinguish ADGs from normal genes. Finally, a two-stage cascading classifier is produced to identify the ADGs: the first-stage classification is based on the relevance vector machine (RVM) classifier, and the second-stage classification is based on majority voting of SVM, RF, and ELM classifiers. We compare the performance of our approach with existing methods; the results show that the proposed method significantly outperforms prior methods in ADG identification.
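The two-stage cascade can be sketched as follows. Note the stand-ins: scikit-learn provides no RVM, so a probabilistic logistic-regression filter substitutes for stage one, and `MLPClassifier` substitutes for ELM; the data are toy values. This is an assumption-laden illustration of the cascade structure, not the authors' implementation.

```python
# Sketch of a two-stage cascade: a probabilistic first-stage filter
# (stand-in for RVM) followed by majority voting of three classifiers.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier  # stand-in for ELM

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1, (50, 10)), rng.normal(1, 1, (50, 10))])
y = np.array([0] * 50 + [1] * 50)

# Stage 1: pass only samples the filter deems candidate ADGs.
stage1 = LogisticRegression().fit(X, y)
candidates = stage1.predict_proba(X)[:, 1] > 0.5

# Stage 2: majority vote of three classifiers on the candidates.
clfs = [SVC().fit(X, y),
        RandomForestClassifier(random_state=0).fit(X, y),
        MLPClassifier(max_iter=1000, random_state=0).fit(X, y)]
votes = np.stack([c.predict(X[candidates]) for c in clfs])
pred = (votes.sum(axis=0) >= 2).astype(int)   # at least 2 of 3 agree
acc = (pred == y[candidates]).mean()
```

Cascading lets a cheap first stage discard clear negatives so the voting ensemble concentrates on harder candidates, which is one plausible route to the sensitivity gains reported later.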
To summarize, in this paper we have the following contributions:
- (1) Supervised learning based on known ADGs.
- (2) Use of feature selection to improve identification performance.
- (3) Development of a two-stage cascaded classifier to improve identification performance.
The rest of this paper is organized as follows. Section 2 presents details of the methods employed in this study. Section 3 gives the experimental results using a dataset of 31 patients reflecting different levels of AD (Control 9, Incipient 7, Moderate 8, Severe 7). Section 4 presents the discussion and analysis. Finally, Section 5 concludes the paper.
Methods
In our proposed ADG identification algorithm, there are three main stages: pre-processing, feature selection and classification. The flowchart of the algorithm is shown in Fig. 1.
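The feature selection stage uses ReliefF [14], which weights each feature by how well it separates an instance from its nearest neighbor of the other class versus its nearest neighbor of the same class. The sketch below is a simplified binary-class variant with a single nearest hit and miss (the full ReliefF uses k nearest neighbors and handles multiple classes); it is illustrative only.

```python
# Simplified ReliefF weight-update sketch (binary classes, k = 1,
# features assumed scaled to [0, 1]).
import numpy as np

def relieff(X, y, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        i = rng.integers(len(X))
        same, other = (y == y[i]).copy(), (y != y[i])
        same[i] = False                           # exclude the instance itself
        dists = np.abs(X - X[i]).sum(axis=1)      # L1 distances
        hit = X[same][np.argmin(dists[same])]     # nearest same-class sample
        miss = X[other][np.argmin(dists[other])]  # nearest other-class sample
        # Reward features that differ from the miss, penalize those that
        # differ from the hit.
        w += (np.abs(X[i] - miss) - np.abs(X[i] - hit)) / n_iter
    return w

# Toy data: feature 3 determines the class label.
rng = np.random.default_rng(4)
X = rng.random((60, 20))
y = (X[:, 3] > 0.5).astype(int)
w = relieff(X, y)
best = int(np.argmax(w))
```

Features whose weights come out highest are retained as the discriminative inputs to the classification stage.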
Experiment data
The National Center for Biotechnology Information (NCBI) [11] provides a dataset of 22,283 genes acquired from 9 normal, 7 incipient, 8 moderate, and 7 severe patients. Blood samples from the 31 patients were applied to gene chips under the same test conditions, producing a data matrix of 31 rows and 22,283 columns, where each row corresponds to one gene chip (one patient) and each column to one gene. In other words, all the 22,283 gene data were integrated into a single data matrix.
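The pre-processing step normalizes and standardizes this matrix. The exact transforms are not specified in this excerpt; a common choice, shown here as an assumption, is per-gene z-score standardization:

```python
# Pre-processing sketch: per-gene z-score standardization of the
# samples-by-genes expression matrix (toy sizes stand in for 31 x 22,283).
import numpy as np

rng = np.random.default_rng(5)
expr = rng.normal(loc=8.0, scale=2.0, size=(31, 100))  # rows = patients

# Standardize each gene (column) to zero mean and unit variance so that
# downstream feature scores compare genes on a common scale.
z = (expr - expr.mean(axis=0)) / expr.std(axis=0)
```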
Discussions
In contrast to other cluster-based methods (e.g., [12], [25]), our approach uses supervised learning algorithms to identify ADGs. In [12], eight ADGs were identified and another 30 candidate genes were predicted to be related to AD; however, only 11.26% of ADGs were identified correctly [12]. The results in Table 3 show that our procedure significantly improves the accuracy, sensitivity, and specificity of ADG identification.
Recently, several approaches (e.g., [8], [9], [10], [13]) that select the
Conclusions
In this paper, an ADG identification method is presented. A feature selection algorithm, ReliefF, is utilized to select the most relevant features, and a two-stage cascading classifier is trained to identify and predict ADGs. The results of this study demonstrate that the proposed ADG identification method combining feature selection, a cascading classifier, and majority voting has higher sensitivity and specificity and significantly increases the accuracy of ADG identification.
References (25)
- et al., Alzheimer's disease, Lancet (2006)
- et al., Alzheimer disease: epidemiology, diagnostic criteria, risk factors and biomarkers, Biochem. Pharmacol. (2014)
- et al., Alzheimer's disease, Lancet (2011)
- et al., LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. (2007)
- et al., Extreme learning machine: theory and applications, Neurocomputing (2006)
- et al., Alzheimer's disease, The BMJ (2009)
- et al., Alzheimer's disease, N. Engl. J. Med. (2010)
- Alzheimer's disease: genes, proteins, and therapy, Physiol. Rev. (2001)
- et al., Apolipoprotein E: high-avidity binding to beta-amyloid and increased frequency of type 4 allele in late-onset familial Alzheimer disease, PNAS (1993)
- et al., Gene selection and classification of microarray data using random forest, BMC Bioinform. (2006)
- Gene selection algorithms for microarray data based on least squares support vector machine, BMC Bioinform.
- Multi-category classification using an extreme learning machine for microarray gene expression cancer diagnosis, IEEE/ACM Trans. Comput. Biol. Bioinf.