Tumor classification by combining PNN classifier ensemble with neighborhood rough set based gene reduction

https://doi.org/10.1016/j.compbiomed.2009.11.014

Abstract

Since Golub et al. applied gene expression profiles (GEP) to the molecular classification of tumor subtypes for more accurate and reliable clinical diagnosis, many studies on GEP-based tumor classification have been carried out. However, the challenges posed by the high dimensionality and small sample size of tumor datasets remain. This paper presents a new tumor classification approach based on an ensemble of probabilistic neural networks (PNN) and gene reduction based on the neighborhood rough set model. Informative genes were initially selected by gene ranking based on an iterative search margin algorithm and were then further refined by gene reduction to select many minimum gene subsets. Finally, the candidate base PNN classifiers, each trained on one of the selected gene subsets, were integrated by a majority voting strategy to construct an ensemble classifier. Experiments on tumor datasets showed that this approach obtains both high and stable classification performance, is not very sensitive to the number of initially selected genes, and is competitive with most existing methods. Additionally, the classification results can be cross-verified in a single biomedical experiment using the selected gene subsets, and biological experimental results also confirmed that the genes included in the selected gene subsets are functionally related to carcinogenesis, indicating that the performance obtained by the proposed method is convincing.

Introduction

Tumors are regarded as systems biology diseases [1], and the mechanism of tumor development is still not thoroughly understood. Because treatment of late-stage cancers is often not therapeutically effective, medical experts agree that early diagnosis greatly benefits successful tumor therapy. However, it is difficult for traditional tumor mass detection techniques, such as X-ray imaging, to detect tumors early. Over the past 10 years, molecular diagnosis of tumors based on gene expression profiles (GEP) has attracted a great number of medical researchers and computer scientists aiming at precise and early tumor diagnosis [2], [3], [4], [5], [6]. However, the curse of dimensionality caused by the high dimensionality and small sample size of tumor datasets seriously challenges tumor classification. Selecting important gene subsets from the thousands of genes in a GEP dataset, and thereby drastically reducing the dimensionality of the tumor dataset, is therefore the first key step in addressing this problem. Usually, the prediction performance of the selected gene subsets is evaluated by a classifier. Commonly used classifiers, including support vector machines (SVM) [7], [8], [9], [10], k-nearest neighbor (k-NN) [11], [12], C4.5 [13], artificial neural networks (ANN) [14], [15], self-organizing maps (SOM) [16], the self-organizing tree algorithm (SOTA) [17], and probabilistic neural networks (PNN) [18], [19], [20], have been extensively applied to the molecular classification of tumor subtypes for more accurate and reliable clinical diagnosis. From those experimental results, we can conclude that gene selection, such as selecting informative genes by regulation probability [21] or by independent component analysis [22], plays an important role in tumor classification.

Finding minimum tumor-related gene subsets can improve the predictive performance of a classification model because too many redundant or irrelevant genes may degrade classification accuracy [23]. In addition to removing noise in GEP, the selected gene subsets also have important biomedical meaning and may be applied to the discovery of drug targets. Generally speaking, gene selection methods fall into two groups [24]: Wrapper methods, which combine gene selection with a classifier, and Filter methods, in which the gene selection procedure is independent of any classifier. In most cases, Wrapper methods are superior to Filter methods in improving classification accuracy [25]. However, Wrapper methods that adopt different classifiers usually obtain different optimal gene subsets, which indicates that Wrapper methods are to some extent unstable in gene selection because the obtained accuracy is sensitive to the selected gene subsets. Another drawback is their high computational cost. These are intrinsic drawbacks of most existing Wrapper methods when facing the curse of dimensionality and the variety of uncertainties in tumor datasets (the process of gathering microarray data, including fabrication, hybridization and image processing, always adds various sources of noise) [26], [27]. Moreover, traditional intelligent methods are prone to over-fitting when classifying tumor datasets because of the lack of training samples [28]. In fact, a tumor dataset contains numerous optimal gene subsets with very high classification accuracy [29], [30], mainly because of gene co-expression and the functional similarity of many genes, so how to obtain convincing classification accuracy from these optimal gene subsets remains an important problem.

Solutions to the above problem include various ensemble schemes [31], [32], [33], [34], [35]. These studies suggested that ensemble machine learning methods consistently perform better [13], in that a powerful classifier can be constructed from an ensemble of many base classifiers even though the individual base classifiers are weak in making decisions [36]. For example, Peng [27] proposed a robust ensemble approach to tumor classification that generates a pool of candidate base classifiers by gene sub-sampling and then selects a set of appropriate base classifiers by classifier clustering to construct a high-performance classification committee. Both theoretical and experimental studies have shown that integrating a set of diverse and accurate base classifiers leads to a powerful ensemble classifier. Diversity of the base classifiers is a prerequisite for an ensemble that outperforms each base classifier [37], because, intuitively, combining a set of identical classifiers yields no improvement. However, most conventional ensemble methods applied to tumor classification, such as those based on sample or gene re-sampling, are so random that their biological meaning is difficult to interpret. Therefore, the diversity and accuracy of base classifiers should be considered simultaneously in designing an ensemble classifier. In this study, we propose a novel ensemble method that combines base PNN classifiers with gene reduction based on the neighborhood rough set model. Experiments on three well-known tumor datasets show that the proposed method not only achieves a higher classification accuracy but is also more stable in classification performance.

The remainder of this paper is organized as follows. Section 2 introduces the neighborhood rough set model for gene reduction, the framework of the PNN ensemble algorithm, and two gene pre-selection methods: an iterative search margin based algorithm and a weighted feature score criterion. Section 3 describes our four experimental methods and reports their results on three well-known tumor datasets, together with the biomedical interpretation of some selected genes. A brief comparison with related work is also given in that section. Finally, Section 4 presents the conclusions.


Neighborhood rough set model

How to generate diverse base classifiers is a critical problem in ensemble machine learning. In our ensemble method, diverse base classifiers were produced from diverse gene subsets obtained by gene reduction based on the neighborhood rough set model (NRSM) [38], [39]. The principle of NRSM is briefly introduced as follows.

Let $G=\{g_1,\ldots,g_n\}$ be a set of genes and $S=\{s_1,\ldots,s_m\}$ be a set of samples. The corresponding gene expression matrix can be represented as $X=(x_{i,j})_{m\times n}$, where $x_{i,j}$ is the expression
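The definition above is cut off in this excerpt, so the following is only a minimal sketch of how a neighborhood-based dependency measure is commonly computed, not the authors' implementation. It assumes a Euclidean distance, a user-chosen threshold `delta`, and the usual neighborhood rough set notion that a sample is "consistent" when every sample in its neighborhood shares its class. The function names `neighborhood` and `dependency` and the toy values are hypothetical.

```python
import math

def neighborhood(X, i, genes, delta):
    """Indices of samples within distance delta of sample i, using only `genes`.

    Assumption: Euclidean distance over the chosen gene columns.
    """
    return [
        k for k in range(len(X))
        if math.sqrt(sum((X[i][g] - X[k][g]) ** 2 for g in genes)) <= delta
    ]

def dependency(X, y, genes, delta):
    """Fraction of samples whose delta-neighborhood contains a single class."""
    consistent = 0
    for i in range(len(X)):
        labels = {y[k] for k in neighborhood(X, i, genes, delta)}
        if len(labels) == 1:
            consistent += 1
    return consistent / len(X)

# Toy expression matrix: 4 samples (rows) x 2 genes (columns); values are made up.
X = [[0.1, 0.5], [0.2, 0.4], [0.9, 0.5], [0.8, 0.6]]
y = ["A", "A", "B", "B"]

print(dependency(X, y, [0], 0.2))  # gene 0 separates the two classes
print(dependency(X, y, [1], 0.2))  # gene 1 mixes the two classes
```

Under these assumptions, a gene subset with a higher dependency value separates the classes more cleanly, which is the kind of quantity a gene reduction procedure can use to decide which genes to keep.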

Sample datasets

The proposed method is applied to three published tumor datasets: the leukemia dataset [16], the colon tumor dataset [50] and the small round blue cell tumor (SRBCT) dataset [51]. The leukemia and colon tumor datasets each contain only two subclasses, as shown in Table 1. We downloaded the SRBCT dataset from the web site http://research.nhgri.nih.gov/microarray/Supplement; it contains 88 samples with 2,308 genes per sample, as shown in Table 2. According to Ref. [51], there are 63 training

Conclusions

Finding tumor-related genes is helpful for personalized medicine and earlier tumor diagnosis [35]. In this paper, we designed a new ensemble method for tumor classification. The method began with gene selection by gene ranking, using the Simba algorithm or the WFSC criterion, and then applied FARNeM-based gene reduction to obtain 100 gene subsets, with which 100 base PNN classifiers were trained, respectively. Finally, the top 25 optimal base PNN classifiers were integrated by majority voting
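The voting step summarized above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes a basic Gaussian-kernel (Parzen window) PNN with a single smoothing parameter `sigma`, and it omits the step that selects the top 25 base classifiers, because the selection criterion is not given in this excerpt. The names `pnn_predict` and `ensemble_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def pnn_predict(train_X, train_y, x, genes, sigma=0.5):
    """Classify x with a Parzen-window PNN restricted to the given gene indices."""
    sums, counts = Counter(), Counter()
    for row, label in zip(train_X, train_y):
        d2 = sum((row[g] - x[g]) ** 2 for g in genes)
        sums[label] += math.exp(-d2 / (2 * sigma ** 2))
        counts[label] += 1
    # Pick the class with the largest average kernel response.
    return max(sums, key=lambda c: sums[c] / counts[c])

def ensemble_predict(train_X, train_y, x, gene_subsets, sigma=0.5):
    """Majority vote over one base PNN per gene subset."""
    votes = Counter(
        pnn_predict(train_X, train_y, x, genes, sigma) for genes in gene_subsets
    )
    return votes.most_common(1)[0][0]

# Toy data: 4 training samples x 3 genes; gene 2 is deliberately uninformative.
train_X = [[0.1, 0.2, 0.9], [0.2, 0.1, 0.1], [0.9, 0.8, 0.2], [0.8, 0.9, 0.8]]
train_y = ["A", "A", "B", "B"]
gene_subsets = [[0], [1], [2]]  # one base classifier per subset
x_new = [0.15, 0.15, 0.85]

print(ensemble_predict(train_X, train_y, x_new, gene_subsets))
```

In this toy case the base classifier built on gene 2 alone votes for the wrong class, but the two other voters outvote it. Using an odd number of voters avoids ties in a two-class problem.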

Conflict of interest statement

None declared.

Acknowledgments

This work was supported by the National Science Foundation of China (Grant nos. 60973153 and 30700161), the Guide Project of Innovative Base of Chinese Academy of Sciences (Grant no. KSCX1-YW-R-30), the Knowledge Innovation Program of the Chinese Academy of Sciences (0823A16121), and the China Postdoctoral Science Foundation (Grant no. 20090450707).

References (103)

  • Q.H. Hu et al.

    Neighborhood classifiers

    Expert Systems with Applications

    (2008)
  • S.R. Vavricka et al.

    hPepT1 transports muramyl dipeptide, activating NF-kappaB and stimulating IL-8 secretion in human colonic Caco2/bbe cells

    Gastroenterology

    (2004)
  • K.W. Suh et al.

    Thymidylate synthase gene polymorphism as a prognostic factor for colon cancer

    Journal of Gastrointestinal Surgery

    (2005)
  • E.M. Reyes-Reyes et al.

    Cell-surface nucleolin is a signal transducing P-selectin binding protein for human colon carcinoma cells

    Experimental Cell Research

    (2008)
  • S.J. Orr et al.

    CD33 responses are blocked by SOCS3 through accelerated proteasomal-mediated turnover

    Blood

    (2007)
  • M.V. Shah et al.

    Molecular profiling of LGL leukemia reveals role of sphingolipid signaling in survival of cytotoxic lymphocytes

    Blood

    (2008)
  • T. Macalma et al.

    Molecular characterization of human Zyxin

    The Journal of Biological Chemistry

    (1996)
  • Y.Z. Wu et al.

    Identification of a S100 calcium-binding protein expressed in HL-60 cells treated with all-trans retinoic acid by two-dimensional electrophoresis and mass spectrometry

    Leukemia Research

    (2004)
  • D. Steinbach et al.

    Clinical implications of PRAME gene expression in childhood

    Cancer Genet Cytogenet

    (2002)
  • J. Roman-Gomez et al.

    Promoter hypermethylation of cancer-related genes: a strong independent prognostic factor in acute lymphoblastic leukemia

    Blood

    (2004)
  • C.P. Minniti et al.

    The insulin-like growth factor II (IGF-II)/mannose 6-phosphate receptor mediates IGF-II-induced motility in human rhabdomyosarcoma cells

    Journal of Biological Chemistry

    (1992)
  • J. Hulit et al.

    The cyclin D1 gene is transcriptionally repressed by caveolin-1

    Journal of Biological Chemistry

    (2000)
  • M. Dettling

    BagBoosting for tumor classification with gene expression data

    Bioinformatics

    (2004)
  • A.M. Bagirov et al.

    New algorithm for multi-class cancer diagnosis using tumor gene expression signatures

    Bioinformatics

    (2003)
  • A.C. Tan et al.

    Simple decision rules for classifying human cancers from gene expression profiles

    Bioinformatics

    (2005)
  • J.J. Dai et al.

    Dimension reduction for classification with gene expression microarray data

    Statistical Applications in Genetics and Molecular Biology

    (2006)
  • M. Mramor et al.

    Visualization-based cancer microarray data classification analysis

    Bioinformatics

    (2007)
  • I. Guyon et al.

    Gene selection for cancer classification using support vector machine

    Machine Learning

    (2002)
  • K.B. Duan et al.

    Multiple SVM-RFE for gene selection in cancer classification with expression data

    IEEE Transactions on NanoBioscience

    (2005)
  • A. Blanco, M. Martín-Merino, J.D.L. Rivas, Ensemble of support vector machines to improve the cancer class prediction...
  • L. Li et al.

    Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method

    Combinatorial Chemistry & High Throughput Screening

    (2001)
  • A.C. Tan et al.

    Ensemble machine learning on gene expression data for cancer classification

    Applied Bioinformatics

    (2003)
  • J. Ryu, S.B. Cho, Gene expression classification using optimal feature/classifier ensemble with negative correlation,...
  • T.R. Golub et al.

    Molecular classification of cancer: class discovery and class prediction by gene expression monitoring

    Science

    (1999)
  • A. Mateos, J. Herrero, J. Tamames, J. Dopazo, Supervised neural networks for clustering conditions in DNA array data...
  • R. Xu, D.C. Wunsch, Probabilistic neural networks for multi-class tissue discrimination with gene expression data, in:...
  • C.J. Huang, W.C. Liao, A comparative study of feature selection methods for probabilistic neural networks in cancer...
  • H.Q. Wang et al.

    Regulation probability method for gene selection

    Pattern Recognition Letters

    (2006)
  • C.H. Zheng et al.

    Tumor clustering using non-negative matrix factorization with gene selection

    IEEE Transactions on Information Technology in Biomedicine

    (2009)
  • G.H. John, R. Kohavi, K. Pfleger, Irrelevant features and the subset selection problem, in: Proceedings of the Eleventh...
  • R. Kohavi et al.

    Wrappers for feature subset selection

    Artificial Intelligence

    (1997)
  • X. Wang et al.

    Quantitative quality control in microarray experiments and the application in data filtering, normalization and false positive rate prediction

    Bioinformatics

    (2003)
  • S.L. Wang, H.W. Chen, S.T. Li, Gene selection using neighborhood rough set from gene expression profiles, in:...
  • S.L. Wang et al.

    Heuristic breadth-first search algorithm for informative gene selection based on gene expression profiles

    Chinese Journal of Computers

    (2008)
  • H.H. Won, S.B. Cho, Neural network ensemble with negatively correlated features for cancer classification, in: Joint...
  • S.B. Cho et al.

    Cancer classification using ensemble of neural networks with multiple significant gene subsets

    Applied Intelligence

    (2007)
  • O. Okun et al.

    Ensembles of nearest neighbors for gene expression based cancer classification

    Studies in Computational Intelligence

    (2008)
  • T. Hastie et al.

    The Elements of Statistical Learning: Data Mining, Inference, and Prediction

    (2001)
  • Y. Zhao, Y. Chen, X.Q. Zhang, A novel ensemble approach for cancer data classification, in: Fourth International...
  • Q.H. Hu et al.

    Numerical attribute reduction based on neighborhood granulation and rough approximation

    Journal of Software

    (2008)

Shulin Wang was born in Sichuan, China. He is currently a postdoctoral researcher at the Intelligent Computing Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, China. He obtained his Ph.D. degree from the National University of Defense Technology, China. He received his M.Sc. degree in Computer Application from the National University of Defense Technology, China, in 1997, and his B.Sc. degree in Computer Application from China University of Geosciences in 1989. He also worked at Hunan University from 2000 to 2007.

Research interests: bioinformatics, software engineering, and complex systems.
