Evaluating switching neural networks through artificial and real gene expression data

doi:10.1016/j.artmed.2008.08.002

Artificial Intelligence in Medicine

Volume 45, Issues 2–3, February–March 2009, Pages 163-171

https://doi.org/10.1016/j.artmed.2008.08.002

Summary

Objective

DNA microarrays make it possible to analyze the expression levels of thousands of genes in a specific tissue. An important goal of this analysis is to derive the subset of genes involved in a biological process of interest. Here, a new method for gene selection is proposed, which shows a good level of accuracy and reliability.

Methods and materials

The proposed technique adopts switching neural networks (SNN), a particular kind of connectionist model, to assign a relevance value to each gene, and then employs recursive feature addition (RFA) to derive the final list of relevant genes. To evaluate the new approach, called SNN-RFA, fairly, it has been applied to three real and three artificial gene expression datasets; the artificial ones are generated according to a mathematical model that possesses biological and statistical plausibility. In particular, a comparison with two other widely used gene selection methods, namely the signal to noise ratio (S2N) and support vector machines with recursive feature elimination (SVM-RFE), has been performed.

Results

In all the considered cases SNN-RFA achieves the best performance, retrieving the whole collection of relevant genes in one of the three artificial datasets. The S2N method exhibits a quality similar to that of SNN-RFA, whereas SVM-RFE shows the worst behavior.

Conclusion

The quality of the proposed method SNN-RFA has been established together with the usefulness of the mathematical model adopted to generate the artificial datasets of gene expression levels.

Introduction

DNA microarrays provide the gene expression levels of thousands of genes pertaining to a given tissue, thus making it possible to understand the mechanisms regulating biological processes, such as the onset of a disease or the effects of a drug [1], [2], [3], [4], [5]. Nevertheless, treating such a huge amount of data requires appropriate statistical and information analysis tools. An important challenge in this analysis is to determine the subset of genes involved in the biological process under examination. This problem is generally referred to as gene selection, and several statistical and machine learning techniques have been proposed in the literature to address it [6], [7], [8], [9].

Golub et al. [10] have obtained interesting results in discriminating two different kinds of leukemia by adopting a simple univariate statistical method, involving a measure of the signal to noise ratio (S2N). This technique has recently been recognized as a particular member of a more general family of feature selection algorithms, denoted as BAHSIC (based on the Hilbert–Schmidt independence criterion) [11]. Other interesting gene selection methods belong to the same family, such as the PAM technique [12] and algorithms relying on Pearson’s correlation [13], the t-test [14], and B-statistics [15].
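As an illustration of the S2N criterion, the following minimal Python sketch scores each gene by $(\mu_{1} - \mu_{2}) / (\sigma_{1} + \sigma_{2})$, the difference of its class means divided by the sum of its class standard deviations, and ranks genes by the absolute value of that score. The function names, the toy data, and the use of the sample standard deviation are assumptions made for this example only.

```python
from statistics import mean, stdev

def s2n_scores(samples, labels):
    """Return one signal-to-noise score per gene.

    samples: list of expression vectors, one per tissue
    labels:  list of 0/1 class labels, same length as samples
    """
    class0 = [x for x, y in zip(samples, labels) if y == 0]
    class1 = [x for x, y in zip(samples, labels) if y == 1]
    scores = []
    for g in range(len(samples[0])):
        a = [x[g] for x in class0]
        b = [x[g] for x in class1]
        denom = stdev(a) + stdev(b)  # sample standard deviations (assumption)
        # Guard against genes that are constant within both classes.
        scores.append((mean(a) - mean(b)) / denom if denom > 0 else 0.0)
    return scores

def rank_genes(scores):
    """Gene indices sorted by decreasing absolute S2N score."""
    return sorted(range(len(scores)), key=lambda g: abs(scores[g]), reverse=True)

# Toy data (made up): gene 0 separates the two classes, gene 1 does not.
samples = [[1.0, 5.0], [2.0, 4.0], [8.0, 5.0], [9.0, 4.0]]
labels = [0, 0, 1, 1]
scores = s2n_scores(samples, labels)
ranking = rank_genes(scores)
```

Since each gene is scored independently of the others, the method is univariate: correlations among genes play no role in the ranking.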

An alternative approach has been proposed by Guyon et al. [16]: it is based on an iterative procedure, called recursive feature elimination (RFE), which successively removes the genes marked as less relevant by a specific classifier. To this end, Guyon et al. decided to employ linear support vector machines (SVM), whose quality has been theoretically and experimentally demonstrated; the resulting gene selection procedure is usually referred to as SVM-RFE. Further refinements and modifications of SVM-RFE have recently been proposed [17], [18], [19]; moreover, the RFE approach has been adopted to rank relevant genes produced with other classification algorithms, such as the maximum margin criterion (MMC) [20].

Another promising class of machine learning techniques for gene selection is that of rule generation methods, which solve a classification problem by generating a collection of intelligible rules in if-then form. In particular, switching neural networks (SNN) [21] have been shown to achieve excellent accuracy when applied to real-world problems derived from DNA microarrays. This paper proposes to employ SNN for gene selection by adopting the opposite approach to RFE: it successively adds the features considered as more relevant by a suitable classifier. Since this approach is called recursive feature addition (RFA) [22], the proposed gene selection method will be denoted as SNN-RFA.
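The general shape of a feature addition loop can be sketched in Python as follows. This is not the SNN-RFA procedure itself, whose details are not given in this excerpt: the relevance of a candidate gene is measured here by the leave-one-out accuracy of a simple nearest-centroid classifier, used as a hypothetical stand-in for the SNN, and all names and data are invented for the example.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def loo_accuracy(samples, labels, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier on `genes`."""
    hits = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        rest = [(s, l) for j, (s, l) in enumerate(zip(samples, labels)) if j != i]
        dist = {}
        for c in (0, 1):
            rows = [[s[g] for g in genes] for s, l in rest if l == c]
            cent = centroid(rows)
            dist[c] = sum((x[g] - m) ** 2 for g, m in zip(genes, cent))
        hits += min(dist, key=dist.get) == y
    return hits / len(samples)

def recursive_feature_addition(samples, labels, n_select):
    """Grow the gene subset one gene at a time, starting from the empty set."""
    selected = []
    remaining = list(range(len(samples[0])))
    while remaining and len(selected) < n_select:
        # Relevance of a candidate: accuracy reached when it joins the subset.
        best = max(remaining,
                   key=lambda g: loo_accuracy(samples, labels, selected + [g]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (made up): only gene 2 separates the two classes.
samples = [[0.1, 5.0, 1.0], [0.9, 4.0, 1.2], [0.5, 6.0, 0.8],
           [0.2, 5.5, 3.0], [0.8, 4.5, 3.2], [0.4, 5.2, 2.9]]
labels = [0, 0, 0, 1, 1, 1]
selected = recursive_feature_addition(samples, labels, n_select=2)
```

RFE runs the same kind of loop in the opposite direction, starting from all genes and discarding the least relevant ones at each step.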

Unfortunately, real data cannot be adopted to evaluate in an objective way the quality of a gene selection method, such as S2N, SVM-RFE, or SNN-RFA. In fact, the whole set of genes really involved in a biological process is not known: the medical and biological literature provides at most a partial knowledge about it.

To overcome this problem, the subset of genes found by the method at hand is usually considered for the construction of a classifier, whose accuracy provides a measure of the validity of the gene selection task. In fact, when redundant input variables are ignored, a better solution of a classification problem can be attained; therefore, the identification of a good subset of genes must lead to an improvement in the generalization ability of classifiers relying on that subset.

However, this approach to evaluating gene selection methods is affected by the technique adopted for the construction of the classifier, as pointed out by results presented in many papers, including [19]. A valid alternative consists in using the biologically plausible mathematical model described in [23], which is able to generate artificial expression data that present the same statistical behavior as those deriving from DNA microarray experiments. In this case, the whole set of artificial genes involved in the construction of the examples is known, thus allowing a fair evaluation of different techniques for gene selection.
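The idea of a benchmark with known ground truth can be illustrated with a toy generator. The sketch below is not the model of [23], which is not specified in this excerpt: it simply assumes Gaussian expression values, a state determined by the sign of the sum of a few chosen genes, and a label flipped with a fixed probability. All names and parameters are hypothetical.

```python
import random

def make_artificial_dataset(n_samples, n_genes, relevant, flip_prob, seed=0):
    """Toy generator: only the genes listed in `relevant` determine the state."""
    rng = random.Random(seed)
    samples, labels = [], []
    for _ in range(n_samples):
        # Expression values drawn independently (assumption of this sketch).
        x = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        # Deterministic part: state 1 when the relevant genes sum to a positive value.
        y = 1 if sum(x[g] for g in relevant) > 0 else 0
        # Random part: the observed label is wrong with probability flip_prob.
        if rng.random() < flip_prob:
            y = 1 - y
        samples.append(x)
        labels.append(y)
    return samples, labels

# 40 artificial tissues, 100 genes, of which genes 3, 17 and 42 are relevant.
samples, labels = make_artificial_dataset(
    n_samples=40, n_genes=100, relevant=[3, 17, 42], flip_prob=0.05)
```

Because the list `relevant` is chosen by the experimenter, the output of any gene selection method on such data can be checked directly against it.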

The quality of SNN-RFA is analyzed by considering three real problems involving microarray experiments, described in [10], [24], [25], together with three artificial datasets possessing similar statistical behavior. In particular, the results obtained by SNN-RFA are compared with those produced by S2N and SVM-RFE, by evaluating the subsets of common genes retrieved and by assessing the number of correct genes detected in the artificial cases.
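The two comparison criteria mentioned above reduce to set operations on gene lists, as in the following Python sketch. The gene indices are invented for illustration and do not come from the paper.

```python
def common_genes(*gene_lists):
    """Genes that appear in every one of the given lists."""
    common = set(gene_lists[0])
    for genes in gene_lists[1:]:
        common &= set(genes)
    return common

def correct_genes(selected, truly_relevant):
    """Selected genes that belong to the known relevant set (artificial data only)."""
    return set(selected) & set(truly_relevant)

# Hypothetical output of three methods on the same artificial dataset.
method_a = [3, 17, 42, 8]
method_b = [3, 17, 5, 8]
method_c = [3, 99, 42, 8]
truly_relevant = [3, 17, 42]

shared = common_genes(method_a, method_b, method_c)
n_correct = {name: len(correct_genes(genes, truly_relevant))
             for name, genes in [("a", method_a), ("b", method_b), ("c", method_c)]}
```

On real datasets only the first criterion is available, since the set of truly relevant genes is unknown.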

Mathematical model for gene expression data

To derive a mathematical model for artificial data we suppose that the relationship between gene expression values and functional state of the tissue is deterministic, i.e., no labeling error occurs during the execution of DNA microarray experiments. Since in a real situation this cannot be assumed, the proposed model will be composed of a deterministic part described through a function $f : \mathbb{R}^{m} \to \{0, 1\}$, where $m$ is the number of analyzed genes, and of a random term $e$ corresponding to the probability

Considered gene selection methods

When analyzing a gene expression dataset consisting of $n$ vectors $x_{j}$, associated with as many tissues in two different functional states $S_{1}$ and $S_{2}$, the main target is to retrieve the subset of genes that are differentially expressed in $S_{1}$ and $S_{2}$. A possible way of achieving this goal is to employ a feature selection technique which, in a general classification problem, aims to derive the minimal subset of inputs involved in any optimal decision function solving the problem at hand.

Nevertheless, a

Results

To evaluate the results obtained by S2N, SVM-RFE and SNN-RFA when performing gene selection on real-world problems, three datasets containing gene expression levels produced by DNA microarrays have been considered:

• Leukemia dataset [10]: it examines the problem of discriminating two types of leukemia: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The data pertain to 72 tissues, 47 with ALL and 25 with AML. Each experiment analyzes the expression level of 7129 genes. The

Conclusions

The problem of identifying the subset of genes involved in the onset of a given pathological or physiological state is crucial in current biomedical research. A possible approach to its solution is offered by the availability of advanced instruments, such as DNA microarrays, capable of determining the expression levels of thousands of genes for a given tissue. However, the huge quantity of data produced and the uncertainty in the acquisition process make it difficult to derive the desired

Acknowledgement

This work was partially supported by the Italian MIUR project “Laboratory of Interdisciplinary Technologies in Bioinformatics (LITBIO)”.

References (28)

P. Baldi et al. DNA microarrays and gene expression (2002)
J.J. Chen et al. Analysis of variance components in gene expression data. Bioinformatics (2004)
J. Ihmels et al. Defining transcription modules using large-scale gene expression data. Bioinformatics (2004)
M.L.T. Lee. Analysis of microarray gene expression data (2004)
J. Quackenbush. Computational analysis of microarray data. Nature Reviews Genetics (2001)
S. Draghici. Data analysis tools for DNA microarrays (2003)
L. Li et al. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics (2001)
T. Li et al. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics (2004)
Xuan J, Wang Y, Dong Y, Feng Y, Wang B, Khan J, et al. Gene selection for multiclass prediction by weighted Fisher...
T.R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999)
L. Song et al. Gene selection via the BAHSIC family of algorithms. Bioinformatics (2007)
R. Tibshirani et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences (2002)
L.J. van’t Veer et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature (2002)
V.G. Tusher et al. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences (2001)
