Evaluating switching neural networks through artificial and real gene expression data
Introduction
DNA microarrays provide the expression level of thousands of genes in a given tissue, thus making it possible to understand the mechanisms that regulate biological processes, such as the onset of a disease or the effects of a drug [1], [2], [3], [4], [5]. Nevertheless, handling such a huge amount of data requires appropriate statistical and information analysis tools. An important challenge in this analysis is to determine the subset of genes involved in the biological process under examination. This problem is generally referred to as gene selection, and several statistical and machine learning techniques have been proposed in the literature to address it [6], [7], [8], [9].
Golub et al. [10] obtained interesting results in discriminating between two kinds of leukemia by adopting a simple univariate statistical method based on a measure of the signal-to-noise ratio (S2N). This technique has recently been recognized as a particular member of a more general family of feature selection algorithms, denoted as BAHSIC (based on the Hilbert–Schmidt independence criterion) [11]. Other interesting gene selection methods belong to the same family, such as the PAM technique [12] and algorithms relying on Pearson’s correlation [13], the t-test [14], and B-statistics [15].
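As an illustration of the univariate ranking, the sketch below computes a signal-to-noise score per gene as the difference of the class means divided by the sum of the class standard deviations, which is the form usually attributed to [10]. The function names and the toy data are hypothetical and serve only to show the mechanics.

```python
from statistics import mean, stdev

def s2n_scores(samples, labels):
    """Return one signal-to-noise score per gene.

    samples: list of expression vectors, one per tissue
    labels:  list of 0/1 class labels aligned with samples
    """
    pos = [x for x, y in zip(samples, labels) if y == 1]
    neg = [x for x, y in zip(samples, labels) if y == 0]
    scores = []
    for j in range(len(samples[0])):
        a = [x[j] for x in pos]
        b = [x[j] for x in neg]
        spread = stdev(a) + stdev(b)
        # Guard against genes that are constant in both classes.
        scores.append((mean(a) - mean(b)) / spread if spread > 0 else 0.0)
    return scores

def rank_genes(scores):
    """Gene indices sorted by decreasing absolute S2N score."""
    return sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)

# Toy data (made up): gene 0 separates the two classes, gene 1 does not.
toy_samples = [[5.0, 1.0], [6.0, 2.0], [5.5, 1.5],
               [1.0, 1.2], [1.5, 2.1], [0.5, 1.4]]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = s2n_scores(toy_samples, toy_labels)
toy_ranking = rank_genes(toy_scores)
```

Because each gene is scored independently, this kind of ranking cannot account for interactions among genes, which is one motivation for the multivariate approaches discussed next.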
An alternative approach has been proposed by Guyon et al. [16]: it is based on an iterative procedure, called recursive feature elimination (RFE), which successively removes the genes marked as least relevant by a specific classifier. To this end, Guyon et al. employed linear support vector machines (SVM), whose quality has been demonstrated both theoretically and experimentally; the resulting gene selection procedure is usually referred to as SVM-RFE. Further refinements and modifications of SVM-RFE have recently been proposed [17], [18], [19]; moreover, the RFE approach has been adopted to rank the relevant genes produced by other classification algorithms, such as the maximum margin criterion (MMC) [20].
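The elimination loop can be sketched as follows. This is a minimal illustration and not the SVM-RFE implementation of [16]: a plain perceptron stands in for the linear SVM, the squared weight of each feature is assumed as the relevance criterion, and all function names and toy data are hypothetical.

```python
def train_perceptron(samples, labels, features, epochs=50):
    """Train a plain perceptron restricted to the given feature indices.

    Stand-in for a linear SVM; labels must be +1 or -1.
    Returns one weight per entry of `features`.
    """
    w = [0.0] * len(features)
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = b + sum(wk * x[j] for wk, j in zip(w, features))
            if y * activation <= 0:  # misclassified: move the hyperplane
                w = [wk + y * x[j] for wk, j in zip(w, features)]
                b += y
    return w

def recursive_feature_elimination(samples, labels, n_keep):
    """Drop the feature with the smallest squared weight until n_keep remain.

    Returns (surviving features, eliminated features in removal order).
    """
    surviving = list(range(len(samples[0])))
    eliminated = []
    while len(surviving) > n_keep:
        # Retrain on the surviving features before every removal.
        w = train_perceptron(samples, labels, surviving)
        worst = min(range(len(surviving)), key=lambda k: w[k] ** 2)
        eliminated.append(surviving.pop(worst))
    return surviving, eliminated

# Toy data (made up): only feature 0 carries the class information.
rfe_samples = [[2.0, 0.1, 0.3], [3.0, -0.2, 0.1],
               [-2.0, 0.2, 0.2], [-3.0, -0.1, 0.4]]
rfe_labels = [1, 1, -1, -1]
rfe_surviving, rfe_eliminated = recursive_feature_elimination(
    rfe_samples, rfe_labels, n_keep=1)
```

Retraining after each removal is what distinguishes RFE from a single ranking pass: the weight assigned to a gene can change once other genes are discarded.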
Another promising class of machine learning techniques for gene selection consists of rule generation methods, which solve a classification problem by generating a collection of intelligible rules in if-then form. In particular, switching neural networks (SNN) [21] have been shown to achieve excellent accuracy when applied to real-world problems derived from DNA microarray experiments. This paper proposes to employ SNN for gene selection by adopting the opposite approach with respect to RFE: it successively adds the features considered most relevant by a suitable classifier. Since this approach is called recursive feature addition (RFA) [22], the proposed gene selection method will be denoted as SNN-RFA.
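The addition strategy can be sketched in the same spirit. The code below is not the SNN-RFA procedure: a nearest-centroid classifier stands in for the switching neural network, and relevance is assumed to be measured by leave-one-out accuracy, since the details of the actual criterion are not given here. Function names and toy data are hypothetical.

```python
def loo_accuracy(samples, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier
    restricted to the given feature indices (stand-in for an SNN)."""
    correct = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        best_label, best_dist = None, float("inf")
        for c in sorted(set(labels)):
            # Class members, excluding the held-out sample.
            members = [s for k, (s, l) in enumerate(zip(samples, labels))
                       if l == c and k != i]
            if not members:
                continue
            dist = sum((x[j] - sum(s[j] for s in members) / len(members)) ** 2
                       for j in features)
            if dist < best_dist:
                best_label, best_dist = c, dist
        correct += best_label == y
    return correct / len(samples)

def recursive_feature_addition(samples, labels, n_select):
    """Greedily add the feature that gives the best leave-one-out accuracy."""
    selected = []
    candidates = list(range(len(samples[0])))
    while candidates and len(selected) < n_select:
        best = max(candidates,
                   key=lambda j: loo_accuracy(samples, labels, selected + [j]))
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data (made up): only feature 2 separates the two classes.
rfa_samples = [[0.2, 1.0, 5.0], [0.9, 0.0, 6.0], [0.4, 1.0, 5.5],
               [0.3, 0.0, 1.0], [0.8, 1.0, 0.5], [0.5, 0.0, 1.5]]
rfa_labels = [1, 1, 1, 0, 0, 0]
rfa_selected = recursive_feature_addition(rfa_samples, rfa_labels, n_select=1)
```

Starting from an empty set and growing it keeps each intermediate classifier small, in contrast with RFE, which must train on the full gene set at the first iteration.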
Unfortunately, real data cannot be used to evaluate objectively the quality of a gene selection method, such as S2N, SVM-RFE, or SNN-RFA. In fact, the whole set of genes actually involved in a biological process is not known: the medical and biological literature provides at most partial knowledge about it.
To overcome this problem, the subset of genes found by the method at hand is usually used to construct a classifier, whose accuracy provides a measure of the validity of the gene selection. Indeed, when redundant input variables are ignored, a better solution to a classification problem can be attained; therefore, identifying a good subset of genes is expected to improve the generalization ability of classifiers relying on that subset.
However, this approach to evaluating gene selection methods is affected by the technique adopted to construct the classifier, as pointed out by results presented in many papers, among them [19]. A valid alternative consists in using the biologically plausible mathematical model described in [23], which generates artificial expression data with the same statistical behavior as data from DNA microarray experiments. In this case, the whole set of artificial genes involved in the construction of the examples is known, thus allowing a fair evaluation of different gene selection techniques.
The quality of SNN-RFA is analyzed by considering three real problems involving microarray experiments, described in [10], [24], [25], together with three artificial datasets with similar statistical behavior. In particular, the results obtained by SNN-RFA are compared with those produced by S2N and SVM-RFE, by evaluating the subsets of common genes retrieved and by assessing the number of correct genes detected in the artificial cases.
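The two comparison criteria mentioned above reduce to simple set operations. The sketch below assumes each method returns a list of gene indices and, for the artificial case, that the set of truly relevant genes is known. The method labels and gene indices are made up for illustration and do not reproduce any result of this paper.

```python
def common_genes(*selections):
    """Genes that appear in every selection."""
    sets = [set(s) for s in selections]
    return set.intersection(*sets) if sets else set()

def correct_count(selected, truly_relevant):
    """Number of selected genes that belong to the known relevant set."""
    return len(set(selected) & set(truly_relevant))

# Hypothetical gene indices, for illustration only.
truly_relevant = {3, 7, 11, 20}
picked = {
    "method_a": [3, 7, 5, 9],
    "method_b": [3, 11, 7, 2],
    "method_c": [7, 3, 20, 11],
}
shared = common_genes(*picked.values())
hits = {name: correct_count(genes, truly_relevant)
        for name, genes in picked.items()}
```

The first criterion applies to both real and artificial data, whereas the count of correct genes is available only when the relevant set is known by construction.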
Mathematical model for gene expression data
To derive a mathematical model for artificial data, we suppose that the relationship between gene expression values and the functional state of the tissue is deterministic, i.e., no labeling error occurs during the execution of DNA microarray experiments. Since this cannot be assumed in a real situation, the proposed model will be composed of a deterministic part described through a function , where m is the number of analyzed genes, and of a random term e corresponding to the probability
Considered gene selection methods
When analyzing a gene expression dataset consisting of n vectors, associated with as many tissues in two different functional states, the main target is to retrieve the subset of genes that are differentially expressed in the two states. A possible way of achieving this goal is to employ a feature selection technique, which aims to derive, in a general classification problem, the minimal subset of inputs involved in any optimal decision function solving the problem at hand.
Nevertheless, a
Results
To evaluate the results obtained by S2N, SVM-RFE and SNN-RFA when performing gene selection on real-world problems, three datasets containing gene expression levels produced by DNA microarrays have been considered:
- Leukemia dataset [10]: it examines the problem of discriminating between two types of leukemia: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The data pertain to 72 tissues, 47 with ALL and 25 with AML. Each experiment analyzes the expression level of 7129 genes. The
Conclusions
The problem of identifying the subset of genes involved in the onset of a given pathological or physiological state is crucial in current biomedical research. A possible approach to its solution is offered by the availability of advanced instruments, such as DNA microarrays, capable of determining the expression levels of thousands of genes for a given tissue. However, the huge quantity of data produced and the uncertainty in the acquisition process make it difficult to derive the desired
Acknowledgement
This work was partially supported by the Italian MIUR project “Laboratory of Interdisciplinary Technologies in Bioinformatics (LITBIO)”.
References (28)
- et al. DNA microarrays and gene expression (2002)
- et al. Analysis of variance components in gene expression data. Bioinformatics (2004)
- et al. Defining transcription modules using large-scale gene expression data. Bioinformatics (2004)
- Analysis of microarray gene expression data (2004)
- Computational analysis of microarray data. Nature Reviews Genetics (2001)
- Data analysis tools for DNA microarrays (2003)
- et al. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics (2001)
- et al. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics (2004)
- Xuan J, Wang Y, Dong Y, Feng Y, Wang B, Khan J, et al. Gene selection for multiclass prediction by weighted Fisher...
- et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999)
- Gene selection via the BAHSIC family of algorithms. Bioinformatics
- Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences
- Gene expression profiling predicts clinical outcome of breast cancer. Nature
- Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences