Evaluating switching neural networks through artificial and real gene expression data

https://doi.org/10.1016/j.artmed.2008.08.002

Summary

Objective

DNA microarrays make it possible to analyze the expression levels of thousands of genes in a specific tissue. An important goal of this analysis is to identify the subset of genes involved in a biological process of interest. Here, a promising new method for gene selection is proposed, which shows a good level of accuracy and reliability.

Methods and materials

The proposed technique adopts switching neural networks (SNN), a particular kind of connectionist model, to assign a relevance value to each gene, and then employs recursive feature addition (RFA) to derive the final list of relevant genes. To evaluate the quality of the new approach, called SNN-RFA, fairly, it has been applied to three real and three artificial gene expression datasets; the artificial datasets were generated according to a mathematical model that possesses biological and statistical plausibility. In particular, SNN-RFA has been compared with two other widely used gene selection methods, namely the signal to noise ratio (S2N) and support vector machines with recursive feature elimination (SVM-RFE).

Results

In all the considered cases SNN-RFA achieves the best performance, identifying the whole collection of relevant genes in one of the three artificial datasets. The S2N method exhibits a quality similar to that of SNN-RFA, whereas SVM-RFE shows the worst behavior.

Conclusion

The quality of the proposed method SNN-RFA has been established together with the usefulness of the mathematical model adopted to generate the artificial datasets of gene expression levels.

Introduction

DNA microarrays provide the expression levels of thousands of genes pertaining to a given tissue, thus helping to explain the mechanisms that regulate biological processes, such as the onset of a disease or the effects of a drug [1], [2], [3], [4], [5]. Nevertheless, treating such a huge amount of data requires appropriate statistical and information analysis tools. An important challenge in this analysis is to determine the subset of genes involved in the biological process under examination. This problem is generally referred to as gene selection, and several statistical and machine learning techniques have been proposed in the literature to address it [6], [7], [8], [9].

Golub et al. [10] obtained interesting results in discriminating two different kinds of leukemia by adopting a simple univariate statistical method involving a measure of the signal to noise ratio (S2N). This technique has recently been recognized as a particular member of a more general family of feature selection algorithms, denoted as BAHSIC (based on the Hilbert–Schmidt independence criterion) [11]. Other interesting gene selection methods belong to the same family, such as the PAM technique [12] and algorithms relying on Pearson’s correlation [13], the T-test [14], and B-statistics [15].
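As an illustration of the univariate S2N criterion, the following minimal sketch scores each gene as the difference of the class means divided by the sum of the class standard deviations, and ranks genes by the absolute value of that score. The function names, the toy data, and the use of the sample standard deviation are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean, stdev


def s2n_scores(samples, labels):
    """Return one signal-to-noise score per gene.

    samples: list of expression vectors (one list of floats per tissue)
    labels:  list of 0/1 class labels, same length as samples
    """
    class0 = [x for x, y in zip(samples, labels) if y == 0]
    class1 = [x for x, y in zip(samples, labels) if y == 1]
    n_genes = len(samples[0])
    scores = []
    for g in range(n_genes):
        a = [x[g] for x in class0]
        b = [x[g] for x in class1]
        # Assumption: sample standard deviation (needs >= 2 tissues per class).
        spread = stdev(a) + stdev(b)
        # Guard against genes that are constant in both classes.
        scores.append((mean(a) - mean(b)) / spread if spread > 0 else 0.0)
    return scores


def rank_genes(scores):
    """Gene indices sorted by decreasing absolute S2N score."""
    return sorted(range(len(scores)), key=lambda g: abs(scores[g]), reverse=True)


# Hypothetical toy data: 4 tissues, 3 genes.
# Gene 0 separates the two classes, gene 1 barely does, gene 2 is constant.
toy_samples = [
    [1.0, 5.0, 2.0],
    [1.2, 4.0, 2.0],
    [3.0, 4.5, 2.0],
    [3.2, 5.5, 2.0],
]
toy_labels = [0, 0, 1, 1]
toy_scores = s2n_scores(toy_samples, toy_labels)
toy_ranking = rank_genes(toy_scores)
```

Because each gene is scored independently of the others, a ranking of this kind cannot account for interactions between genes, which is one motivation for the classifier-based approaches discussed next.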

An alternative approach has been proposed by Guyon et al. [16]: it is based on an iterative procedure, called recursive feature elimination (RFE), which successively removes the genes marked as less relevant by a specific classifier. To this end, Guyon et al. employed linear support vector machines (SVM), whose quality has been theoretically and experimentally demonstrated; the resulting gene selection procedure is usually referred to as SVM-RFE. Further refinements and modifications of SVM-RFE have recently been proposed [17], [18], [19]; moreover, the RFE approach has been adopted to rank relevant genes produced with other classification algorithms, such as the maximum margin criterion (MMC) [20].

Another promising class of machine learning techniques for gene selection is rule generation methods, which solve a classification problem by generating a collection of intelligible rules in if-then form. In particular, switching neural networks (SNN) [21] have been shown to achieve excellent accuracy when applied to real-world problems deriving from DNA microarrays. This paper proposes to employ SNN for gene selection by adopting the opposite approach with respect to RFE: it successively adds the features considered as more relevant by a proper classifier. Since this approach is called recursive feature addition (RFA) [22], the proposed gene selection method will be denoted as SNN-RFA.
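The contrast between the two iterative schemes can be sketched as follows. This is only a schematic illustration: the `relevance` callback is a hypothetical stand-in for whatever classifier supplies the gene scores (a linear SVM in SVM-RFE, an SNN in SNN-RFA), and the `step` parameter, the stopping rule, and the toy scores are assumptions, since this excerpt does not state how the paper scores genes or how many genes are handled per iteration.

```python
def recursive_feature_elimination(genes, relevance, n_keep, step=1):
    """Repeatedly drop the least relevant genes until n_keep remain.

    relevance(active) must return a dict mapping each gene in `active`
    to a score; it is re-evaluated on the surviving genes at every round.
    """
    active = list(genes)
    while len(active) > n_keep:
        scores = relevance(active)
        active.sort(key=lambda g: scores[g], reverse=True)
        n_drop = min(step, len(active) - n_keep)
        active = active[:len(active) - n_drop]
    return active


def recursive_feature_addition(genes, relevance, n_keep, step=1):
    """Repeatedly move the most relevant remaining genes into the selection."""
    selected, remaining = [], list(genes)
    while len(selected) < n_keep and remaining:
        scores = relevance(remaining)
        remaining.sort(key=lambda g: scores[g], reverse=True)
        n_add = min(step, n_keep - len(selected))
        selected.extend(remaining[:n_add])
        remaining = remaining[n_add:]
    return selected


# Hypothetical fixed scores standing in for a trained classifier.
fixed_scores = {"g1": 0.1, "g2": 0.9, "g3": 0.5, "g4": 0.7}


def toy_relevance(active):
    return {g: fixed_scores[g] for g in active}


rfe_result = recursive_feature_elimination(list(fixed_scores), toy_relevance, n_keep=2)
rfa_result = recursive_feature_addition(list(fixed_scores), toy_relevance, n_keep=2)
```

With scores that do not depend on the current gene set, as in this toy case, both procedures return the same genes. They differ in practice only because a real classifier is retrained at each round, so the scores change as genes are removed or added.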

Unfortunately, real data cannot be used to evaluate objectively the quality of a gene selection method, such as S2N, SVM-RFE, or SNN-RFA. In fact, the whole set of genes really involved in a biological process is not known: the medical and biological literature provides at most partial knowledge about it.

To overcome this problem, the subset of genes found by the method at hand is usually considered for the construction of a classifier, whose accuracy provides a measure of the validity of the gene selection task. In fact, when redundant input variables are ignored, a better solution of a classification problem can be attained; therefore, the identification of a good subset of genes must lead to an improvement in the generalization ability of classifiers relying on that subset.

However, this approach to evaluating gene selection methods is affected by the technique adopted for the construction of the classifier, as pointed out by the results presented in many papers, among them [19]. A valid alternative consists in using the biologically plausible mathematical model described in [23], which is able to generate artificial expression data that present the same statistical behavior as those deriving from DNA microarray experiments. In this case, the whole set of artificial genes involved in the construction of the examples is known, thus allowing a fair evaluation of different techniques for gene selection.

The quality of SNN-RFA is analyzed by considering three real problems involving microarray experiments, described in [10], [24], [25], together with three artificial datasets possessing similar statistical behavior. In particular, the results obtained by SNN-RFA are compared with those produced by S2N and SVM-RFE, by evaluating the subsets of common genes retrieved and by assessing the number of correct genes detected in the artificial cases.
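The two comparison criteria just mentioned, genes retrieved in common by the methods and correct genes detected when the relevant set is known, reduce to simple set operations. The sketch below uses hypothetical gene identifiers and selections; none of the values are results from the paper.

```python
def common_genes(*selections):
    """Genes that appear in every method's selected list."""
    sets = [set(s) for s in selections]
    return set.intersection(*sets) if sets else set()


def correct_genes(selected, truly_relevant):
    """Number of selected genes that belong to the known relevant set."""
    return len(set(selected) & set(truly_relevant))


# Hypothetical artificial dataset in which the relevant genes are known.
truly_relevant = {"gA", "gB", "gC"}

# Hypothetical top-3 selections from three methods.
s2n_top = ["gA", "gB", "gX"]
svm_rfe_top = ["gA", "gY", "gZ"]
snn_rfa_top = ["gA", "gB", "gC"]

shared = common_genes(s2n_top, svm_rfe_top, snn_rfa_top)
hits = {
    "S2N": correct_genes(s2n_top, truly_relevant),
    "SVM-RFE": correct_genes(svm_rfe_top, truly_relevant),
    "SNN-RFA": correct_genes(snn_rfa_top, truly_relevant),
}
```

On real data only the overlap between methods can be computed, since the true relevant set is unknown; the count of correct genes is available only for the artificial datasets.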

Section snippets

Mathematical model for gene expression data

To derive a mathematical model for artificial data, we suppose that the relationship between gene expression values and the functional state of the tissue is deterministic, i.e., no labeling error occurs during the execution of DNA microarray experiments. Since this cannot be assumed in a real situation, the proposed model is composed of a deterministic part, described through a function $f: \mathbb{R}^m \to \{0,1\}$, where $m$ is the number of analyzed genes, and of a random term $e$ corresponding to the probability

Considered gene selection methods

When analyzing a gene expression dataset consisting of $n$ vectors $x_j$, associated with as many tissues in two different functional states $S_1$ and $S_2$, the main target is to retrieve the subset of genes that are differentially expressed in $S_1$ and $S_2$. A possible way of achieving this goal is to employ a feature selection technique, which aims to derive in a general classification problem the minimal subset of inputs involved in any optimal decision function solving the problem at hand.

Nevertheless, a

Results

To evaluate the results obtained by S2N, SVM-RFE and SNN-RFA when performing gene selection on real world problems, three datasets containing gene expression levels produced by DNA microarrays have been considered:

  • Leukemia dataset [10]: it examines the problem of discriminating two types of leukemia: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The data pertain to 72 tissues, 47 with ALL and 25 with AML. Each experiment analyzes the expression level of 7129 genes. The

Conclusions

The problem of identifying the subset of genes involved in the onset of a given pathological or physiological state is crucial in current biomedical research. A possible approach to its solution is offered by the availability of advanced instruments, such as DNA microarrays, capable of determining the expression levels of thousands of genes for a given tissue. However, the huge quantity of data produced and the uncertainty in the acquisition process make it difficult to derive the desired

Acknowledgement

This work was partially supported by the Italian MIUR project “Laboratory of Interdisciplinary Technologies in Bioinformatics (LITBIO)”.

References (28)

  • P. Baldi et al.

    DNA microarrays and gene expression

    (2002)
  • J.J. Chen et al.

    Analysis of variance components in gene expression data

    Bioinformatics

    (2004)
  • J. Ihmels et al.

    Defining transcription modules using large-scale gene expression data

    Bioinformatics

    (2004)
  • M.L.T. Lee

    Analysis of microarray gene expression data

    (2004)
  • J. Quackenbush

    Computational analysis of microarray data

    Nature Reviews Genetics

    (2001)
  • S. Draghici

    Data analysis tools for DNA microarrays

    (2003)
  • L. Li et al.

    Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method

    Bioinformatics

    (2001)
  • T. Li et al.

    A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression

    Bioinformatics

    (2004)
  • Xuan J, Wang Y, Dong Y, Feng Y, Wang B, Khan J, et al. Gene selection for multiclass prediction by weighted Fisher...
  • T.R. Golub et al.

    Molecular classification of cancer: class discovery and class prediction by gene expression monitoring

    Science

    (1999)
  • L. Song et al.

    Gene selection via the BAHSIC family of algorithms

    Bioinformatics

    (2007)
  • R. Tibshirani et al.

    Diagnosis of multiple cancer types by shrunken centroids of gene expression

    Proceedings of the National Academy of Sciences

    (2002)
  • L.J. van’t Veer et al.

    Gene expression profiling predicts clinical outcome of breast cancer

    Nature

    (2002)
  • V.G. Tusher et al.

    Significance analysis of microarrays applied to the ionizing radiation response

    Proceedings of the National Academy of Sciences

    (2001)