An Alzheimer’s disease related genes identification method based on multiple classifier integration

doi:10.1016/j.cmpb.2017.08.006

Computer Methods and Programs in Biomedicine

Volume 150, October 2017, Pages 107-115

https://doi.org/10.1016/j.cmpb.2017.08.006

Abstract

Background and Objective: Alzheimer’s disease (AD) is a fatal neurodegenerative disease with an insidious onset. The set of AD-related genes (ADGs) is not yet fully understood. The National Center for Biotechnology Information (NCBI) provides an AD dataset of 22,283 genes, of which 71 have been identified as ADGs. The remaining 22,212 genes may still contain ADGs that have not yet been identified. This paper aims to identify additional ADGs using machine learning techniques.

Methods: To improve the accuracy of ADG identification, we propose a gene identification method based on multiple classifier integration. First, a feature selection algorithm is applied to select the most relevant attributes. Second, a two-stage cascading classifier is developed to identify ADGs. The first stage is based on the relevance vector machine; in the second stage, the results of three classifiers (support vector machine, random forest and extreme learning machine) are combined through voting.

Results: According to our results, feature selection improves accuracy and reduces training time, and the voting-based classifier reduces classification errors. The proposed ADG identification system achieves accuracy, sensitivity and specificity of 78.77%, 83.10% and 74.67%, respectively. Using the proposed method, potential additional ADGs are identified and the top 13 genes (predicted ADGs) are presented.

Conclusions: This paper presents an ADG identification method. The proposed method, which combines feature selection, a cascading classifier and majority voting, leads to higher specificity and significantly increases the accuracy and sensitivity of ADG identification. Potentially new ADGs are identified.

Introduction

Alzheimer’s disease (AD) has been studied intensively over the past decades. AD is a chronic neurodegenerative disease that usually starts slowly and gets worse over time [1], [2]. It is the most common cause of dementia in older adults, who suffer loss of cognitive functions and memory [3], [4]. The cause of most Alzheimer’s cases is largely unknown, except that three genes have been firmly implicated in the pathophysiology of early-onset AD (EOAD; onset < 65 years), which accounts for only 1–5% of all cases [5]. The proportion of Alzheimer’s disease risk attributable to genetics is estimated to be around 70% [6]. Genes correlated with AD are called AD-related genes (ADGs).

Several ADGs have been identified in clinical trials. However, the complexity of AD means that not all ADGs have been found [7]. To address this issue, some studies use mathematical and computational methods to find ADGs, mining meaningful genes from large amounts of gene data in order to provide directions and recommendations for future clinical studies [8], [9], [10].

Gene microarrays [8], [9], [10] provide new tools for addressing the complexity of AD since they allow views of the simultaneous activities of multiple cellular pathways. Recently, advances in gene microarray technologies have enabled biologists to measure the expression levels of many genes simultaneously in one experiment, which provides an opportunity for machine learning methods to be used to extract valuable biological information from these large datasets. Researchers have employed some machine learning approaches to analyze these high-throughput microarray gene databases.

Random forest (RF) is a classification algorithm well suited to microarray data. It performs well even when most of the predictive variables in the microarray data are noise. It can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes. It also returns measures of variable importance. Uriarte et al. [8] therefore investigated the use of random forest for classification of microarray data (including multi-class problems) and proposed a new gene selection method for classification problems based on random forest. Their results show that the new gene selection procedure yields very small sets of genes (often smaller than those of alternative methods) while preserving predictive accuracy.

Tang et al. [9] applied the least squares support vector machine (LS-SVM) to two microarray datasets and compared it with other well-known gene selection methods. The results indicate that the proposed gene selection approaches can provide gene subsets leading to more accurate classification results, while their computational complexity is comparable to that of existing methods.

Zhang et al. [10] evaluated the performance of the extreme learning machine (ELM) algorithm on three multi-category microarray gene expression datasets for cancer diagnosis. The results indicate that ELM produces comparable or better classification accuracies with reduced training time and implementation complexity compared to other state-of-the-art methods.

Using AD microarray data, researchers have developed various methods for exploring the genes associated with the disease. For example, Blalock et al. [11] proposed an ADG identification algorithm that combines gene microarray technology, which permits measurement of the expression of many thousands of genes simultaneously, with statistical correlation analyses. Zhang et al. [12] proposed an ADG identification algorithm that combines principal component analysis and K-means clustering to obtain the genes correlated with AD. It identified 8 ADGs, but its identification sensitivity is very low. Yang et al. [13] proposed an ADG identification approach that applies the Fisher score to evaluate how well a gene's expression value distinguishes normal individuals from AD patients.

Similar to [11], [12], [13], we investigate the ADG identification issue. Specifically, we address the problem of the low sensitivity of ADG identification. In this paper, we propose an ADG identification approach utilizing AD microarray data. First, the AD microarray data are pre-processed by normalization and standardization. Then the feature selection algorithm ReliefF [14] is employed to determine features for distinguishing ADGs from normal genes. Finally, a two-stage cascading classifier is built to identify the ADGs. The first-stage classification task is based on the relevance vector machine (RVM) classifier, and the second-stage classification task is based on the majority voting of SVM, RF and ELM classifiers. We compare the performance of our approach with existing methods. Results show that our proposed method significantly outperforms prior methods in ADG identification.
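The two-stage structure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the four classifiers are toy threshold functions standing in for trained RVM, SVM, RF and ELM models, and the hand-off rule between the stages (a gene rejected by the first stage is labeled non-ADG, and only accepted genes reach the vote) is an assumption, since the text here does not specify it. All names (`majority_vote`, `cascade_predict`, `rvm_like`, and so on) are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence

# A classifier is any callable mapping a feature vector to a label:
# 1 for "ADG", 0 for "non-ADG". Real use would plug in trained models.
Classifier = Callable[[Sequence[float]], int]


def majority_vote(voters: Sequence[Classifier], x: Sequence[float]) -> int:
    """Return the label chosen by most voters (an odd voter count avoids ties)."""
    votes = Counter(clf(x) for clf in voters)
    return votes.most_common(1)[0][0]


def cascade_predict(stage1: Classifier, voters: Sequence[Classifier],
                    x: Sequence[float]) -> int:
    """Two-stage cascade.

    ASSUMPTION: a gene rejected by stage 1 is labeled non-ADG immediately;
    only genes accepted by stage 1 are passed to the stage-2 vote.
    """
    if stage1(x) == 0:
        return 0
    return majority_vote(voters, x)


# Toy stand-ins with hypothetical thresholds (not trained models).
rvm_like = lambda x: int(sum(x) / len(x) > 0.2)
svm_like = lambda x: int(x[0] > 0.5)
rf_like = lambda x: int(x[1] > 0.5)
elm_like = lambda x: int(x[2] > 0.5)

voters = [svm_like, rf_like, elm_like]
genes = {
    "gene_a": [0.9, 0.8, 0.1],  # passes stage 1, two of three voters say ADG
    "gene_b": [0.9, 0.1, 0.1],  # passes stage 1, only one voter says ADG
    "gene_c": [0.0, 0.1, 0.2],  # rejected by stage 1
}
predictions = {name: cascade_predict(rvm_like, voters, x)
               for name, x in genes.items()}
print(predictions)
```

Using three voters in the second stage means a binary vote can never tie, which is one practical reason to combine an odd number of classifiers.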

To summarize, in this paper we have the following contributions:

(1) Supervised learning based on known ADGs.
(2) Use of feature selection to improve identification performance.
(3) Development of a two-stage cascaded classifier to improve identification performance.

The rest of this paper is organized as follows. Section 2 presents details of the methods employed in this study. Section 3 gives the experiment results using datasets with 31 patients reflecting different levels of AD (Control 9, Incipient 7, Moderate 8, Severe 7). Section 4 presents the discussion and analysis. Finally, Section 5 concludes the paper.

Methods

In our proposed ADG identification algorithm, there are three main stages: pre-processing, feature selection and classification. The flowchart of the algorithm is shown in Fig. 1.
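To make the feature selection stage more concrete, the sketch below implements basic Relief, the two-class, single-neighbour precursor of ReliefF. It is a simplified stand-in, not ReliefF itself and not the authors' code: it visits every sample once instead of sampling, uses one nearest hit and one nearest miss, and assumes each class has at least two samples. The function name `relief_weights` and the toy data are hypothetical.

```python
from typing import List, Sequence


def relief_weights(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[float]:
    """Basic two-class Relief: one nearest hit and one nearest miss per sample."""
    n, d = len(X), len(X[0])
    # Per-feature value range, used to scale differences into [0, 1].
    spans = []
    for f in range(d):
        column = [row[f] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(f: int, a: Sequence[float], b: Sequence[float]) -> float:
        return abs(a[f] - b[f]) / spans[f]

    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(diff(f, a, b) for f in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(d):
            # Reward features that differ across classes and penalize
            # features that differ within a class.
            weights[f] += (diff(f, X[i], X[miss]) - diff(f, X[i], X[hit])) / n
    return weights


# Toy data: feature 0 tracks the label, feature 1 is noise.
X = [[0.1, 0.9], [0.2, 0.1], [0.0, 0.5], [0.9, 0.8], [1.0, 0.2], [0.8, 0.4]]
y = [0, 0, 0, 1, 1, 1]
w = relief_weights(X, y)
ranking = sorted(range(len(w)), key=lambda f: w[f], reverse=True)
print(w, ranking)
```

Features with the highest weights would be kept for the classification stage. ReliefF extends this idea by averaging over k nearest hits and misses, which makes the weights more robust to noisy samples.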

Experiment data

The National Center for Biotechnology Information (NCBI) [11] provides a dataset of 22,283 genes acquired from 9 normal subjects, 7 incipient patients, 8 moderate patients, and 7 severe patients. Blood samples from the 31 patients were applied to gene chips in the same test environment, producing a data matrix of 22,283 columns and 31 rows. Each column corresponds to one gene chip (one patient). In other words, all the 22,283 gene data were integrated into

Discussions

In contrast to cluster-based methods (e.g., [12], [25]), our approach uses supervised learning algorithms to identify ADGs. In [12], 8 ADGs were identified and another 30 candidate genes were predicted to be related to AD. However, only 11.26% of ADGs were identified correctly [12]. The results in Table 3 show that our procedure significantly improves the accuracy, sensitivity and specificity of ADG identification.
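For readers less familiar with the three measures, the following sketch shows how accuracy, sensitivity and specificity follow from a confusion matrix. The counts are hypothetical and are not taken from the paper: they assume 71 positive genes (the number of known ADGs) and 75 negative genes, and were chosen only because they reproduce the reported 78.77%, 83.10% and 74.67%. The actual evaluation set and confusion matrix may differ.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Return accuracy, sensitivity and specificity as percentages (2 decimals)."""
    sensitivity = tp / (tp + fn)          # share of true ADGs that are found
    specificity = tn / (tn + fp)          # share of non-ADGs correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "accuracy": round(100 * accuracy, 2),
        "sensitivity": round(100 * sensitivity, 2),
        "specificity": round(100 * specificity, 2),
    }


# HYPOTHETICAL counts: 59 of 71 positives and 56 of 75 negatives correct.
metrics = classification_metrics(tp=59, fn=12, tn=56, fp=19)
print(metrics)
```

Sensitivity is the measure most directly tied to the low-sensitivity problem raised for [12], because it counts how many of the known ADGs a method recovers.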

Recently, several approaches (e.g., [8], [9], [10], [13]) that select the

Conclusions

In this paper, an ADG identification method is presented. A feature selection algorithm, ReliefF, is utilized to select the most relevant features. A two-stage cascading classifier is trained to identify and predict ADGs. The results of this study demonstrate that the proposed ADG identification method, which combines feature selection, a cascading classifier and majority voting, has higher sensitivity and specificity and significantly increases the accuracy of ADG identification.

References (25)

K. Blennow et al. Alzheimer’s disease. Lancet (2006)
C. Reitz et al. Alzheimer disease: epidemiology, diagnostic criteria, risk factors and biomarkers. Biochem. Pharmacol. (2014)
C. Ballard et al. Alzheimer’s disease. Lancet (2011)
C.C. Chang et al. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (2007)
G.B. Huang et al. Extreme learning machine: theory and applications. Neurocomputing (2006)
A. Burns et al. Alzheimer’s disease. The BMJ (2009)
H.W. Querfurth et al. Alzheimer’s disease. N. Engl. J. Med. (2010)
D.J. Selkoe. Alzheimer’s disease: genes, proteins, and therapy. Physiol. Rev. (2001)
W.J. Strittmatter et al. Apolipoprotein E: high-avidity binding to beta-amyloid and increased frequency of type 4 allele in late-onset familial Alzheimer disease. PNAS (1993)
R.D. Uriarte et al. Gene selection and classification of microarray data using random forest. BMC Bioinform. (2006)
E.K. Tang et al. Gene selection algorithms for microarray data based on least squares support vector machine. BMC Bioinform. (2006)
R.X. Zhang et al. Multi-category classification using an extreme learning machine for microarray gene expression cancer diagnosis. IEEE/ACM Trans. Comput. Biol. Bioinf. (2007)
